Talk:Jaro–Winkler distance

Latest comment: 1 year ago by Showeropera in topic Computational complexity

This seems to be backwards

edit

The lower the Jaro–Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 0 equates to no similarity and 1 is an exact match. The Jaro–Winkler similarity is the inversion, (1 − Jaro–Winkler distance).

Multiple matching characters

edit

The definition as currently provided does not seem to specify what to do in case of multiple possibilities for character matching. For example the S's in the pair   can be matched as   or as  . The first option yields   while the second option yields  . The definition given by William Winkler in the paper "String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage" (1990) also does not seem to provide a choice between greedy or optimal matching. Should this be mentioned on the current page? 132.229.128.62 (talk) 16:11, 27 January 2014 (UTC)Reply

adjwt

edit

This makes no mention of the adjacent-weighting array (adjwt in Winkler's strcomp95.c). Is this additional function simply a fudge, or a fundamental part of the algorithm that should be mentioned? 82.2.93.1 (talk) 23:35, 9 December 2007 (UTC)Reply

I've taken the bull by the other foot, and added a note to the article. 82.2.93.1 (talk) 23:36, 9 December 2007 (UTC)Reply
As now clarified in the article, the adjwt array and string length compensation are not a part of the published versions of this algorithm, therefore are not reflected in the examples, even though they are in the reference code. 66.220.23.57 (talk) 16:38, 11 December 2007 (UTC)Reply

Number of transpositions

edit

I dont understand why t values are zero in (DWAYNE, DUANE) and (DIXON, DICKSONX) examples. I think that part is not clear. The character N and E in DWAYNE are found in DUANE, but in different positions. But it is written that t is zero not 2. If someone can clarify this, I will apreciate it. Bitkidoku (talk) 20:39, 13 May 2008 (UTC)Reply

The definition of transposition given in the article is very unclear. In comparing CRATE with TRACE we have that all the letters match, but they are in a different order. By swapping the C and the T, we can turn CRATE into TRACE. Such a swap of two elements is called a transposition. In DwAyNE versus DuANE the matching letters are already in the same order D-A-N-E, so no transpositions are needed.  --Lambiam 02:19, 19 May 2008 (UTC)Reply
I agree. The definition of transposition confused me also. I have found that (to the best of my memory) when I wrote an algorithm to compute this, I took t to be the count of letters which where within the max distance allowed for 'matching characters' but characters that did not match exactly. E.g. in the words CAT and ACT, T is a perfect match while A and C are matching but each require a transportation, so t = 2 here. Then, when it comes to actually computing the Jaro distance (or its extensions), simply divide t by 2 before using it. E.g. Set t = t/2. This achieves the desired result but 1) makes any algorithm easier to compute 2) makes it easier to understand. Hope this helps. 131.181.251.10 (talk) 01:59, 25 August 2011 (UTC)Reply

Question

edit

One thing is not clear for me : Is Jaro distance (not Jaro-Winkler distance) a metric in the mathematical sense of that term, or not ? --Mrówkotermit (talk) 08:55, 8 June 2009 (UTC)Reply

Recommend Geometric Mean When Normalizing

edit

In my experience, the geometric mean gives a much more informative result than the arithmetic mean when normalizing (heterogeneous) numbers. I recommend a variation of the Jaro-Winkler distance, call it the Jaro-Winkler-Young method, if you will:

 


Consider the strings DESSERTS and STRESSED,   and   respectively. Because the match distance is  ,   and   in this case.

The Jaro-Winkler method gives the result

 

In my opinion, this is a rather high number (in the range of [0,1]) considering the dissimilarity of the strings.

The Jaro-Winkler-Young method gives the result

 


MarkMYoung (talk) 23:40, 2 March 2010 (UTC)Reply

Mark, I like your addition to this concept. It makes sense. However, I think m would be 6 in this example, not 5 because the first 'S' in 'STRESSED' would match with the first 'S' in 'DESSERTS', and likewise with the second and third S's. So all three S's match plus the two E's and the R. The code I've written which works for all the other examples says its 6 and I've followed it through by hand, but correct me if I'm wrong. 203.9.185.137 (talk) 01:37, 9 August 2011 (UTC)Reply
I have implemented your formula Mark and tested it on many strings side-by-side to the Jaro-Winkler distance measure and found that your measure tends to penalise very heavily making it very hard to determine a good cut-off for deciding what is considered a close match and what is not. I have found that for the Jaro-Winkler score, a value of about 0.92 tends to be a very good margin - any pair of strings with 0.92 <= JW value < 1 are usually close matches while JW values <0.92 tend to not be similar. (I ran this test on about 600 first+last names which I knew had not been cleaned and contained 'duplicates' as a result of typos, etc.) 121.222.138.79 (talk) 00:15, 22 August 2011 (UTC)Reply

Why is the floor function in the CRATE-TRACE example giving a non-integer?

edit

In the CRATE-TRACE example, it says "...in comparing CRATE with TRACE, only 'R' 'A' 'E' are the matching characters, i.e, m=3. Although 'C', 'T' appear in both strings, they are farther than 1.5, i.e., (5/2)-1=1.5. Therefore, t=0." Am I misinterpretting the square brackets with tops missing (at least they are missing on my screen) as floor functions, because this should give a value of 1, not 1.5. That is, Floor(5/2)-1 = 2-1 = 1. Can anyone give me an explanation please? 203.9.185.137 (talk) 04:11, 2 August 2011 (UTC)Reply

In the C source on census.gov the search range value is an integer, so 1.5 is impossible.
But I have a different question... [s]Maybe I implemented the code wrong, but it appears that the C source will give you different scores based on the order in which you pass the two strings. This seems to be because the step where you find the matching letters is only done from one string's perspective. Anyone agree? [/s] Nevermind I had an error in my code. 67.91.189.250 (talk) 03:02, 4 August 2011 (UTC)Reply

Formula flawed for m = 0

edit

I'm writing my own code so that my colleagues can run this in Excel. The problem I'm getting is when two strings have nothing in common given a value m = 0 which causes the code to fail. Looking at the formula given for the Jaro score, division by m when m is zero seems to be an oversight. I'm guessing this should equate to a Jaro score of zero. This should perhaps be mentioned somewhere in the article, because as it stands the formula is flawed. 203.9.185.137 (talk) 23:29, 8 August 2011 (UTC)Reply

You're right. It is flawed the way it is written. I guess it should be written as a split-definition function where its equal to the given formula for m >= 1 and 0 otherwise. If I was more comfortable with LaTeX code I would fix this but alas... 121.222.138.79 (talk) 00:08, 22 August 2011 (UTC)Reply
done. Syneil (talk) 12:45, 17 July 2012 (UTC)Reply
It does not make any sense. If m=0, then stings are not similar at all distance should be 1. Alexei Kopylov (talk) 02:54, 24 March 2017 (UTC)Reply

Notability

edit

I added a journal article that is a survey of string distance metrics (Cohen, Ravikumar, & Fienberg, 2003). Jaro-Winkler is discussed in the second page of the article. Between that and Winkler's survey article from 2006, is this sufficient to remove the Notability warning template? Showeropera (talk) 21:13, 27 August 2014 (UTC)Reply

I'm not sure why the Notability warning template is still there. Google it. It's notable already... I use it every day. I wish I felt empowered to remove a warning template. Who is bold enough to do it? Joe (talk) 17:27, 6 February 2015 (UTC)Reply
I came here because this is used in Solr; seems quite notable and useful. I removed the tag. @Joedeshon: Feel free to remove such tags in the future. At worst someone else will put it back. 8) -- Beland (talk) 17:48, 17 July 2015 (UTC)Reply

Named distance, described as similarity

edit

The article clearly presents it as a string distance:

> The lower the Jaro–Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 1 equates to no similarity and 0 is an exact match. The Jaro–Winkler similarity is given by 1 − Jaro–Winkler distance.

That looks plain untrue tome. The formulas clearly say the opposite: 1 equates to exact match, 0 equates to no similarity. What formulas describe here is a similarity measure, not a distance measure.

Cornuz (talk) 13:29, 5 October 2017 (UTC)Reply

Jaro-Winkler not satisfying the identity axiom

edit

It is claimed that "the Jaro–Winkler distance [also] does not satisfy the identity axiom  .". Is this generally true, or true only if   (the article also limits   from being greater than 4)? If that's the case, then I think it is worth qualifying that claim with this bit of information.

I see the Jaro-Wrinkler distance here the first, so I didn't want to edit it myself.

— manual signature: comment added by ThoAppelsin (talkcontribs) 14:45, 28 July 2020 (UTC)Reply

I came here to mention this also. If   then   and hence either   or  . The restriction on   should really be that it is strictly less than 0.25 (or  ), as mentioned in the reference listed, and under this restriction the Jaro-Winkler distance does satisfy the identity axiom. 203.161.95.162 (talk) 20:22, 18 September 2023 (UTC)Sacha RoscoeReply

Calculation of common characters

edit

The paper Overview of Record Linkage and Current Research Directions by the author of the Jaro-Winkler similarity William E. Winkler defines common characters on page 11 as:

   The definition of common is that the agreeing character must be within half the length of the shorter string

This definition differs from the one used in the article. Does someone know why there are different definitions? — Preceding unsigned comment added by Wynk (talkcontribs) 12:19, 10 December 2020 (UTC)Reply

Website to play with Jaro Winkler

edit

This Website is nice to play with different algorithms like Jaro Winkler: https://asecuritysite.com/forensics/simstring

46.126.16.87 (talk) 09:36, 3 September 2022 (UTC) LandevReply

Odd number for transpositions

edit

I wish an example is provided where the number of transpositions before halving is odd. Some implementations do integer division, others do float division. Vitcra (talk) 14:29, 21 June 2023 (UTC)Reply

Computational complexity

edit

I came to this page hoping to find the computational complexity (i.e., "big O") but didn't see it. I think adding the computational complexity would be a very helpful addition. Some of the other string similarity algorithm pages list complexity (like Smith–Waterman algorithm), which I appreciated. Showeropera (talk) 14:23, 12 December 2023 (UTC)Reply