Improving the Jaccard Distance Similarity Measurement Constriction

Noor Aznimah Abdul Aziz, Siti Salwa Salleh, Daud Mohamad and Megawati Omar of UiTM, Shah Alam, Malaysia, studied how well the conventional Jaccard Distance recognizes shapes with and without preprocessing tasks.

Their study identified the preprocessing tasks that should be conducted to improve recognition performance.

In their experiment, the conventional Jaccard Distance’s ability to recognise the handwriting of university students was tested by calculating the percentage of similarity of the shapes (between the handwriting and the Arial Font print). The program was developed using C# (C Sharp).

Ten university students (5 male and 5 female), aged 23 to 26, were instructed to write the numbers from 1 to 10. Seven of them were right-handed and the rest left-handed. Their handwriting was treated as the unsupervised data of the experiment. The students, who had no training in using the pen, wrote the numbers on a graphics tablet in the laboratory. There were 4 sets of handwriting, each containing 9 numbers (1, 2, 3, 4, 5, 6, 7, 8, 9), which together produced 400 samples of handwriting. The samples were later divided randomly into two groups of 200 each. All samples were binary images of size 454 by 596.

Here, the handwriting was converted from greyscale into binary images with a normalized size of 16 by 16 pixels. Because the original handwriting canvas was large, the program took some time to process it. All handwriting samples were compared against reference images of the digits 0 to 9. The reference images were digits produced in a word processor using the Arial font, and they were treated as the ideal digit shapes.
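The conversion step can be sketched as follows. This is a minimal illustration, not the authors' C# program: the threshold of 128, the nearest-neighbour resampling, the reading of "454 by 596" as width by height, and the function names `binarize` and `resize_nearest` are all assumptions made for the example.

```python
def binarize(gray, threshold=128):
    """Return a 0/1 image: 1 = ink (dark pixel), 0 = background.
    The threshold value is an assumption for this sketch."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def resize_nearest(img, out_h=16, out_w=16):
    """Nearest-neighbour resample of a 2-D list to out_h x out_w."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Hypothetical greyscale canvas (596 rows, 454 columns), white with one
# dark vertical stroke standing in for a handwritten "1".
gray = [[255] * 454 for _ in range(596)]
for r in range(100, 500):
    for c in range(200, 260):
        gray[r][c] = 0

small = resize_nearest(binarize(gray))
print(len(small), len(small[0]))  # 16 16
```

Nearest-neighbour sampling is the simplest choice but can drop thin strokes entirely, which is one way broken strokes might arise at this resolution.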

On the first run, the first half (50%) of the students' handwriting was compared against the reference images without preprocessing. On the second run, preprocessing tasks were applied to the other half (50%) of the handwriting. After the preprocessing tasks were completed, the similarity of the handwriting to the reference objects was computed.

The tasks began with region of interest (ROI) extraction, which extracted each handwritten number and isolated it from the others. Next, we rotated the object to straighten it, then translated it to centre it on the canvas. Finally, we scaled the object to approximately the same size as the reference object, which normalized the size of the objects.
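The ROI extraction, centring and scaling steps might look like the sketch below. It is an illustration under stated assumptions rather than the authors' method: the ROI is taken to be the bounding box of the foreground pixels, scaling keeps the aspect ratio and uses nearest-neighbour sampling, and the rotation step is omitted. The names `crop_to_roi` and `scale_and_center` are hypothetical.

```python
def crop_to_roi(img):
    """Crop a 0/1 image to the bounding box of its foreground pixels."""
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    if not rows:
        return [[0]]  # empty image: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def scale_and_center(roi, size=16):
    """Scale the ROI to fit a size x size canvas and centre it."""
    h, w = len(roi), len(roi[0])
    factor = size / max(h, w)  # the longer side fills the canvas
    new_h = max(1, round(h * factor))
    new_w = max(1, round(w * factor))
    scaled = [
        [roi[r * h // new_h][c * w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas = [[0] * size for _ in range(size)]
    for r in range(new_h):
        for c in range(new_w):
            canvas[top + r][left + c] = scaled[r][c]
    return canvas


# Hypothetical 40 x 40 image with a 20 x 10 block drawn off-centre.
img = [[0] * 40 for _ in range(40)]
for r in range(2, 22):
    for c in range(25, 35):
        img[r][c] = 1

canvas = scale_and_center(crop_to_roi(img))
for row in canvas:
    print("".join("#" if px else "." for px in row))
```

After this step the block occupies the full canvas height and sits in the horizontal centre, regardless of where it was drawn on the original image.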

The distance between each handwritten object and its reference object was calculated. A result close to 1 meant that the two objects were dissimilar, while a result of 0 meant that the handwriting matched the reference image. The results of the experiment are shown in Table 1.
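For reference, the standard Jaccard distance between two binary images is one minus the ratio of shared foreground pixels to the pixels that are foreground in either image. The sketch below assumes this textbook definition and equally sized 0/1 images; the article does not give the exact formula used in the authors' program, and the function name is illustrative.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equally sized 0/1 images.
    0.0 = identical foreground, 1.0 = no foreground pixel in common."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += pa & pb  # foreground in both images
            union += pa | pb  # foreground in at least one image
    if union == 0:
        return 0.0  # convention: two empty images count as identical
    return 1.0 - inter / union


# Tiny hypothetical example: a vertical stroke and a copy with one pixel displaced.
reference = [[0, 1, 0],
             [0, 1, 0],
             [0, 1, 0]]
sample = [[0, 1, 0],
          [0, 1, 0],
          [0, 0, 1]]

print(jaccard_distance(reference, reference))  # 0.0
print(jaccard_distance(reference, sample))     # 0.5
```

In the second call a single displaced pixel out of three already yields a distance of 0.5, because it both removes a pixel from the intersection and adds one to the union.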


Table 1

    Without ROITRS                          With ROITRS
    (50% of respondents' handwriting)       (50% of respondents' handwriting)
T   Mean distance  Std dev  Similarity (%)  Mean distance  Std dev  Similarity (%)
1   0.69           0.05     31.0            0.49           0.04     50.6
2   0.70           0.05     30.1            0.48           0.04     52.0
3   0.69           0.05     30.9            0.49           0.04     51.0
4   0.68           0.05     32.5            0.49           0.05     51.5
5   0.68           0.05     32.2            0.46           0.05     53.7
6   0.69           0.06     31.1            0.47           0.03     52.7
7   0.69           0.05     30.8            0.48           0.04     51.9
8   0.68           0.05     31.8            0.48           0.04     52.3
9   0.71           0.06     29.2            0.48           0.05     51.7

Without preprocessing, the mean distance was between 0.68 and 0.71. In other words, the similarity measure for objects without preprocessing was only between 29.2% and 32.5%, which is a very low similarity for the conventional Jaccard Distance function to report. This result could be due to the sensitivity of the distance computation to image variations such as broken or distorted strokes. The ratio used in the distance measurement means that the Jaccard Distance is computed only from overlapping (intersecting) points. It therefore returned low similarity when there was variance in coordinate space, namely translation and rotation of the image. As a result, almost none of the handwriting was recognized as similar to the reference object. We believe that noise and image inconsistency also lowered the precision of the similarity measurement. The standard deviation of the distance measurement, which was between 0.05 and 0.06, showed that most of the computed distances clustered around the mean.
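The similarity percentages in Table 1 appear to be the complement of the distance, that is, similarity = (1 - distance) x 100. This relation is inferred from the reported numbers rather than stated by the authors, and the function name below is illustrative. Applying it to the rounded mean distances gives values close to, but not exactly, the reported percentages, which is what rounding of the means would produce.

```python
def similarity_percent(distance):
    """Convert a Jaccard distance in [0, 1] to a similarity percentage.
    Assumes similarity = (1 - distance) * 100, inferred from Table 1."""
    return (1.0 - distance) * 100.0


# Lowest and highest mean distances reported without preprocessing.
for mean_distance in (0.68, 0.71):
    print(mean_distance, round(similarity_percent(mean_distance), 1))
```

The two results, about 32% and 29%, bracket the 29.2% to 32.5% range quoted in the text.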

On the other hand, the similarity improved to between 50.6% and 53.7% after preprocessing tasks were applied to the handwriting. The preprocessing tasks involved region of interest (ROI) extraction, translation, rotation and scale normalization of the images (ROITRS for short). Columns five onwards in Table 1 show the results after these transformations were applied to the handwritten objects. Both results were compared and the differences are shown in Figure 3. The improvement in the similarity measure was approximately 20 percentage points.

As anticipated, the unsupervised data without preprocessing yielded a low similarity measure. It was also found that even after the preprocessing tasks were done, the similarity measured by the conventional Jaccard Distance was still low.

Overall, the distance calculations showed that the conventional Jaccard Distance function handles essential properties of shape features, such as image transformations and noise, poorly. During the computation, we found that the Jaccard Distance worked rigidly on the overlapping points between two images. If the translated object was not at the same coordinates as the reference, only a minimal part of the foreground overlapped and the result was low.
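The sensitivity to translation is easy to reproduce with a toy shape. The sketch below assumes the standard Jaccard distance and, for brevity, represents each image as a set of foreground (row, column) coordinates; the bar shape and the helper `shift` are hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of foreground pixel coordinates."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def shift(shape, d_row, d_col):
    """Translate every foreground pixel by (d_row, d_col)."""
    return {(r + d_row, c + d_col) for r, c in shape}


# A vertical bar two pixels wide, standing in for a handwritten "1".
bar = {(r, c) for r in range(2, 14) for c in (7, 8)}

for d_col in (0, 1, 2):
    print(d_col, round(jaccard_distance(bar, shift(bar, 0, d_col)), 3))
# 0 0.0
# 1 0.667
# 2 1.0
```

A shift of one pixel already costs two thirds of the similarity, and a shift equal to the stroke width makes an identical shape look completely dissimilar.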

In addition, if the respondents' handwriting differed in image scale or stroke (polyline) thickness, or had distorted or broken strokes, the similarity measure was also low. The experiment showed that even samples segmented by ROI and brought to the same size still returned a low result without thorough noise filtering and object enhancement. Thus it was discovered that the occurrence of noise affected the calculation of similar and dissimilar images. The Jaccard Distance performs a binary computation that searches for corresponding points between two images and takes into account the number of pixels in the foreground and background of the images. Therefore, if there were occurrences of noise not belonging to the shape of the object, it would increase the total number of background pixels.
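The effect of stray pixels and stroke thickness can be illustrated in the same way. This sketch again assumes the standard Jaccard distance over sets of foreground coordinates; the shapes and the positions of the noise pixels are invented for the example.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of foreground pixel coordinates."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


# Reference: a vertical bar two pixels wide (24 pixels).
bar = {(r, c) for r in range(2, 14) for c in (7, 8)}

# Same bar plus 8 stray pixels along the top edge (hypothetical noise).
noisy = bar | {(0, c) for c in range(8)}

# Same bar drawn with a stroke twice as thick (48 pixels).
thick = {(r, c) for r in range(2, 14) for c in range(6, 10)}

print(jaccard_distance(bar, noisy))  # 0.25
print(jaccard_distance(bar, thick))  # 0.5
```

In both cases the reference shape is fully covered by the sample, yet the extra pixels enlarge the union and push the distance up, which is consistent with the low scores reported for unfiltered handwriting.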

With that, we see that translation, rotation and scaling are required to improve the similarity measurement. A careful ROITRS process is essential to improve the common essential properties of the shape features.

Since the results showed low similarity measures for images without transformations and noise filtering, we propose that a careful ROITRS process be carried out to improve the similarity measurement and recognition performance.

The outcome of this study helped us embark on our next step: detailing the preprocessing tasks and obtaining measures that would improve the similarity measurement. In future work, we will develop automatic preprocessing tasks for unsupervised pen input, so that the similarity between two-dimensional images of uncontrolled handwriting can be recognized.

Noor Aznimah Abdul Aziz and Siti Salwa Salleh
Department of Computer Science
Faculty of Computer and Mathematical Sciences
University Technology MARA
Shah Alam, Malaysia

Daud Mohamad
Department of Mathematics
Faculty of Computer and Mathematical Sciences
University Technology MARA
Shah Alam, Malaysia

Megawati Omar
Research Management Institute
University Technology MARA
Shah Alam, Malaysia

Published: 20 May 2010

Contact details:

Chief Information Officer (CIO)

Institute of Research, Development and Commercialisation (IRDC), Universiti Teknologi MARA (UiTM) Shah Alam, 50450 Shah Alam, Selangor, Malaysia
