Talk:Damerau–Levenshtein distance

Latest comment: 2 months ago by 85.186.15.66 in topic OSA Bug?

Technical Tag

edit

I added the Technical tag and the context tags to this. The reason is that this is highly technical. That is not necessarily a problem but at least the introduction should say something less technical. If I read it correctly, then the introduction should say something like: Damerau-Levenshtein distance is a distance calculated for human misspellings. The distance calculated is useful for computer software spelling checkers. Note that this is useful only for languages that use alphabetic symbols and not those that use logographic or syllabic symbols. Fanra 01:06, 23 April 2007 (UTC)Reply


Question: are the tranpositions restricted to consecutive characters or any two characters within the strings? —Preceding unsigned comment added by 141.225.10.68 (talk) 15:59, 24 April 2008 (UTC)Reply

Consecutive characters in this case. I believe one of the other algorithms on this page allows for arbitrary transpositions.

Plikarish (talk) 21:04, 4 February 2010 (UTC)Reply


There is an error in this algorithm. String indices start at 0, but the algorithm starts string comparision at index = 1. The first character will not be tested and when i = lenStr1, str1[i] will be out of range. —Preceding unsigned comment added by 68.15.100.248 (talkcontribs) 28 June 2006

There's no error — string indices start at 1, and are properly swept by the main loop. It's the d array which is indexed from zeros. The 0-th row and column are filled by two separate loops, so they can be accessed with d[i-1, j] and d[i, j-1] expressions in main loop. --CiaPan 14:33, 29 March 2007 (UTC)Reply
for j from 1 to lenStr2 should be for j from 0 to lenStr2 See [1] —Preceding unsigned comment added by 173.21.33.192 (talk) 01:28, 3 February 2010 (UTC)Reply
Which one? There are two 'for j' loops... --CiaPan (talk) 11:13, 3 February 2010 (UTC)Reply

I implemented the algo and it took me awhile to figure out what I was doing wrong. It was due to the fact that the algo assumes strings are indexed starting at 1. I think it should be explicitly stated for the reader's benefit that string indices start at 1. (Many (most?) languages treat strings as char arrays which will be 0 indexed. It's a simple fix either to change the algo or to include it as a comment. I've added a comment to this effect on the page. Plikarish (talk) 21:08, 4 February 2010 (UTC)Reply

The routine header states explicitly: char str1[1..lenStr1] and similar for str2. This defines the input, so it is an important part of the whole algorithm definition. I can't help if you read just a random part of the definition, omitting necessary prerequisities. Do read it whole, and you'll understand the algorithm properly. Then you will be able to implement it correctly. --CiaPan (talk) 10:02, 5 February 2010 (UTC)Reply

The last sentence:

An extension of the edit distance algorithm, that does satisfy a triangle inequality is described in the paper: F.J. Damerau. A technique for computer detection and correction of spelling errors, Communications of the ACM, 1964

actually links to this paper:

Source Journal of the ACM (JACM) archive
Volume 22 , Issue 2 (April 1975) table of contents
Pages: 177 - 183
Year of Publication: 1975
ISSN:0004-5411
Authors:
Robert A. Wagner Department of Systems and Information Sciences, Vanderbilt University, Nashville, TN
Roy Lowrance 255 West Squire Drive, Rochester, NY and Vanderbilt University, Nashville, Tennessee
Publisher ACM Press New York, NY, USA

I suspect that this is in fact the intended paper and that the sentence just needs to be re-written. —Preceding unsigned comment added by 151.193.220.27 (talkcontribs) 9 March 2007


It seems to me that the comment

   // note that d is zero-indexed, while a and b are one-indexed.

is wrong: the 2 nested loops index `a` and `b` from 0 to their length-1 included.

   for i := 1 to length(a) inclusive do
       for j := 1 to length(b) inclusive do
           if a[i-1] = b[j-1] then

-- Ambrevar (talk) 11:18, 15 February 2017 (UTC)Reply

@Ambrevar: That error was a result of edit by Bawalou on 25 July 2016 (Special:Diff/731459024). I have reverted it. --CiaPan (talk) 12:28, 15 February 2017 (UTC)Reply

The algorithm and the pseudo code contain an error in the if condition for the transposition operation. Instead of

 

it should be:

 

There's a Java implementation where you only get the expected distance with the above change. Pekoli (talk) 10:08, 27 August 2020 (UTC)Reply

@Pekoli: Nope. Java uses strings indexed from 0, whilst the example pseudocode uses indices starting from 1. --CiaPan (talk) 16:53, 27 August 2020 (UTC)Reply

Original research

edit

Please someone check these edits

http://en.wiki.x.io/w/index.php?title=Damerau%E2%80%93Levenshtein_distance&diff=285208890&oldid=284007942

--CiaPan (talk) 18:16, 23 April 2009 (UTC)Reply

I don't think that's what Wikipedia is for, is it? The part I deleted is pasted here for posterity. If it's original research, it doesn't belong in the article, but I'm struggling even to see the relevance of an allegedly superior algorithm, except perhaps as a link in the 'See Also' section. SeanCollins (talk) 05:26, 11 December 2009 (UTC)Reply

Transposition using Lawrence logic:

    for (int swapcheck = 2; swapcheck <= Math.Max(string1.Length, string2.Length); swapcheck++) 
    {
        if ((sNewIdx > swapcheck)
             && (sOldIdx > swapcheck - 1)
             && (sNew[sNewIdx - 1] == sOld[sOldIdx - swapcheck])
             && (sNew[sNewIdx - swapcheck] == sOld[sOldIdx - 1]))
        {
            matrix[sNewIdx, sOldIdx] = (matrix[sNewIdx - 1, sOldIdx - 1] - 1);
        }
    }

where:

  • sNewIdx is the index value of string 1
  • sOldIdx is the index value of string 2

Kindly check out with the matrix format Levenshtein code link available below The above method is the result of experimentation for about one month, it gives lesser edit distance cost for string manipulation.

Please don't delete code. Read the paper.

edit

If no one checked on the edits, that doesn't mean you should delete. Please read the paper. I have just reverted. Having actual code on the Wikipedia page is very useful. If you don't believe me, look at the 10 links at the bottom -- everybody is saying that they implemented the code based on the Wikipedia page. Translating code does not count as original research. It is equivalent to adding to an English Wikipedia article based on a French reference article. Otherwise, my code was a almost-verbatim translation from the paper pseudocode into ActionScript 3.0. If you're not happy with the language, then translate the code into a language you prefer, although you should have a good reason why your language of choice make the article better in some way. Please do not delete the code again. I will revert, and try to get an editor to moderate.Gabiteodoru (talk) 02:39, 14 October 2010 (UTC)Reply

You got it is exactly vice versa. If they implemented code based on wikipedia, this is not a valid reference. Only sources from recognized experts are allowable, not from random geeks and hackers. Adding an English article based on an unreferenced French article is against the rule of wikipedia about verifiability and may be rightfullty turned into a verifiable stub. Please find a rule in wikipedia which says that translation of code is not original research. By the way, you don't provide reference where it is translated from. Last Lost (talk) 23:41, 25 October 2010 (UTC)Reply
The reference for the code is clearly given in reference #4. Please read the article carefully and the references before making deletions.
You are wrong about the translation policy; WP:NONENG clearly states: "Because this is the English Wikipedia, English-language sources are preferred over non-English ones, unless no English sources of equal quality and relevance are available."
When I cited others' implementation of the code, I did it as evidence of relevance, not as a reference; keep in mind that Service to the Public is also a Key Wikimedia Value. Please read my comments carefully.
I am unable to find a rule about code translation; please provide a link to the rule if it exists, or do not use that argument.
I will wait for your reply, or 3 days, whichever comes first, before reverting.Gabiteodoru (talk) 03:39, 26 October 2010 (UTC)Reply

Code

edit

Unsourced text is disallowed in wikipedia. There is no way to prove that the code is correct. And it is not a job of wikipedia to do that. Proofs of correctness of any wikipedia statements are only through references. Please don't restore the deleted text without providing references. Last Lost (talk) 23:36, 25 October 2010 (UTC)Reply

  • I used this code myself to learn about the algorithm. I completely fail to see the gain by removing it on principle, and consider its removal a net loss for Wikipedia and a degradation of the end-user experience. So maybe you're winning on principle here, but everyone else is losing. Why not put a "citation needed" in there instead of axing it? Beej71 (talk) 02:10, 27 October 2010 (UTC)Reply
I do have citations for both pieces of code. So I'm still trying to figure out why they were removed :) Hopefully will get it back up soon. You can check the history in the mean time (the one last edited by me) Gabiteodoru (talk) 02:35, 27 October 2010 (UTC)Reply
Sorry. I didn't notice the citations, since they were well above the code itself. Last Lost (talk) 22:45, 3 November 2010 (UTC)Reply

I'm in the process of implementing a Java version of the C# code provided, and I have to say it's awful. No comments to explain the algorithm, and useless tiny variable names (sd for the SortedDictionary - I already know it's a SortedDictionary, tell me what's in it! similarly DB, i1, and j1 are just as useless). By reference to the base Levenshtein pseudo-code one can figure out most of it, but the page would be greatly improved by having a better implementation or better pseudo-code (note that the current pseudo-code is labeled as being for OSA, not the full Damerau-Levenshtein). — Preceding unsigned comment added by 193.34.186.98 (talk) 13:41, 29 July 2011 (UTC)Reply

I know what you mean. I'm currently trying to implement it in C#, i.e. not even translating it to some other language. The problem is, I don't want to just copy the code, but actually understand it. I'll report back after figuring out what the various parts do and what the variables represent. --Wormbo 18:41, 6 January 2012 (UTC)Reply

I think all the code should be deleted. Way too long for an encyclopedia. I believe there are other Wikimedia projects where it would be a better fit. 5.12.181.134 (talk)

Pseudo-code of function damerauLevenshteinDistance

edit

Who can help me with this specific piece of (pseudo-)code?

DA[a[i-1]] = i;

What I understand:

  • a is an array of characters.
  • a[i-1] is the character at position i-1.
  • that is, DA is indexed by characters? But a few lines above, DA is initialized with integer indices:

for(var d:uint = 0; d<C; ++d) DA[d]=0; — Preceding unsigned comment added by 188.154.155.131 (talk) 10:04, 4 June 2011 (UTC)Reply

uh it's a really confusing algorithm; characters are represented by integers; read the paper if you want to understand itGabi Teodoru 22:11, 23 June 2011 (UTC) — Preceding unsigned comment added by Gabiteodoru (talkcontribs)


Definition vs Pseudo Code (Conflict)

edit

The definition has clearly specified that Damerau Levenshtein Distance allows transpose, whereas the pseudocode in the next section presents more restricted version of the same. The definition and the pseudo code both should point to same thing. This is conflicting. — Preceding unsigned comment added by 2620:10D:C082:1054:2ACF:E9FF:FE1E:7E97 (talk) 21:35, 7 May 2015 (UTC)Reply

Missing ref?

edit

What source is "{{ref|itman}}" at the bottom of the "Algorithm" section referring to? That note doesn't exist in the "References" section. - dcljr (talk) 23:59, 8 February 2012 (UTC)Reply

Yes, the note exists at the very end of "External links" as 'Site devoted to fuzzy searching and information retrieval'. When browsing the article's source, look for {{note|itman}}. --CiaPan (talk) 21:06, 11 February 2012 (UTC)Reply

The terms optimal string alignment and restricted edit distance

edit

I wonder whether the terms optimal string alignment and restricted edit distance originate from scientific literature. Both terms are used in the Wikipedia article for describing the first algorithm since the first version of the article from 2005, but I cannot find older literature which uses these terms. In particular, I cannot find a hint on this terminology in the cited paper of Oommen and Loke, which uses this algorithm for handling generalized transpositions. Only some publications from recent years adopted the unverified terminology from Wikipedia.

Moreover, the concept of alignment does not use transpositions at all in my understanding. In fact, optimal string alignment redirects to Levenshtein distance, and the article on sequence alignment does not consider transpositions.

The terms optimal string alignment and restricted edit distance should be removed (and maybe replace by an original name from scientific literature), if they cannot be verified. A more general question is whether the according algorithm has some purpose where it is more suitable than the “true” Damerau–Levenshtein distance or whether it is just a wrong algorithm for the the Damerau–Levenshtein distance.

-- Chwinter (talk) 20:05, 1 May 2013 (UTC)Reply

Restricted edit sequence and restricted edit distance are used by Ukkonen in his 1985 paper (Algorithms for Approximate String Matching) where he considers generic editing operation sets and generic cost functions (Sect. 4). Probably Ukkonen was the first one who introduced these terms. Hence I am fine with using these terms in the Wikipedia article. But note that Ukkonen used the terms in a more general context. Maybe it is best to use the phrase restricted version of the Damerau–Levenshtein distance together with a reference to Ukkonen's paper.

There is only one problem which arises in Ukkonen's discussion of restricted edit distances: Although he knows the Lowrance–Wagner algorithm, he states that the restricted and the unrestricted version coincide under some natural conditions, which are satisfied in the Damerau–Levenshtein case. This statement (basically Theorem 6) is disproved by the counter-example in the Wikipedia article. However, I would prefer to cite a scientific paper which clarifies this error. Is there such a paper?

I still do not see literature references for optimal string alignment. Hence I still vote for removing this term.

-- Chwinter (talk) 13:33, 12 May 2013 (UTC)Reply

Ukkonen's paper from 1985 states that “the restricted edit distance is a natural measure of similarity” “in some applications in error correction and information retrieval” (p. 115). Hence we have at least a small citable statement on the utility of the restricted edit distance. This can be inserted into the section Applications.

-- Chwinter (talk) 19:59, 4 June 2013 (UTC)Reply

Inline references

edit

Any objection to converting the reference style to a more conventional <ref>....</ref>? That way arbitrary tags do not need to be assigned. Glrx (talk) 04:48, 5 June 2013 (UTC)Reply

Where does the name "Damerau–Levenshtein distance" originate?

edit

Having read quite a bit of string algorithms literature, I've got the idea that the name "Damerau–Levenshtein distance" is a Wikipedia invention. The idea of amending Levenshtein distance by allowing transpositions appears (without reference to Damerau, or any reference for that matter) in, e.g., Ukkonen 1983, but I've never seen the particular name used by this article in any peer-reviewed article. QVVERTYVS (hm?) 21:17, 25 November 2013 (UTC)Reply

I don't know, but Wolfram seems to have adopted the term. http://reference.wolfram.com/mathematica/ref/DamerauLevenshteinDistance.html It may not have an academic source, but it may be the common name. Glrx (talk) 01:36, 27 November 2013 (UTC)Reply

Here are some examples of the use of the term from the string algorithms literature:

  • "Previous error models have all been based on Damerau-Levenshtein distance measures (Damerau 1964; Levenshtein 1966)" Brill and Moore (2000). "An Improved Error Model for Noisy Channel Spelling Correction" (PDF). Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. pp. 286–293. doi:10.3115/1075218.1075255.
  • "We show how a dictionary can be used with the Damerau-Levenshtein string-edit distance metric to construct a case-insensitive pass-phrase system." Bard (2007). "Spelling-error tolerant, order-independent pass-phrases via the Damerau-Levenshtein string-edit distance metric". Proceedings of the fifth Australasian symposium on ACSW frontiers. Vol. 68. pp. 117–124.
  • "This direction ever evolves from simple Damerau-Levenshtein distance (Damerau, 1964; Levenshtein, 1966) to probabilistic models..." Li; Zhu; Zhang; Zhou (2006). "Exploring distributional similarity based models for query spelling correction" (PDF). Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. pp. 1025–1032. doi:10.3115/1220175.1220304.

Accordingly, I removed the "citation needed". Gdr 11:57, 11 December 2013 (UTC)Reply

edit

Rostepher added an external link to some code he put up on github today. A related link was previously removed by Stasek Lem.[2]

I have just spent a lot of time reviewing what happened and looking for information that this article once had.

Rostepher's code runs up against WP:COI, WP:OR and WP:SYN. I'm willing to let that slide on code implementations (WP:CALC), but the linked code is broken and does not show an understanding of what is going on. See, for example, the lines

// still deciphering the alogrithm from the cryptic vairable names...
unsigned db; // no idea what this is

The purpose is the same as in the OSA version. A value for db is calculated but not used (which should trigger a compiler warning). The code itself is not strictly C. The code claims origins from some stackoverflow.com postings, so there may be copyright issues. I see lots of trouble

Consequently, I removed the link to Rostepher's code.

Also troubling is that DL transposition code on the web often references an ActionScript implementation in this WP article. That implementation that was removed from the article in November 2013 as original research.[3] I have restored the ActionScript implementation but not the Csharp transliteration. The code appears to be developed from the original paper. I haven't gone further back in the history to find the original author.

I'm tempted to blow away most if not all of the ELs because the current article already has code implementations and would not contain implementations in more than one language (WP is not a code repository). I'm not going to take the time to go through them now.

I'm also concerned about a WP:COPYLINK violation with the URL for Levenshtein, reference 5. The URL appears to be a home directory rather than Doklady or Springer or author posting.

Glrx (talk) 05:30, 1 July 2014 (UTC) (modified Glrx (talk) 14:10, 1 July 2014 (UTC))Reply

I just removed all the ELs. People looking for implementations should just search for them. WP cannot vouch for the quality of such links. We also have to rewrite the ActionScript code into pseudocode, or just remove it altogether. QVVERTYVS (hm?) 08:16, 1 July 2014 (UTC)Reply
I made the distance matrix allocation implicit rather than explicit. The allocation was implicit in the OSA algorithm.
I reordered some lines (initialize DA first), made some single line for loops multiline, and added some comments.
I'm leaving it there for now, but I plan some other changes later:
make distance matrix use H[-1..a.length][-1..b.length] (match paper)
only calculate transpose distance if i1 != 0 and j1 != 0 (parallel the OSA algorithm)
Glrx (talk) 05:13, 4 July 2014 (UTC)Reply

OSA psuedocode superfluous transpositions

edit

The OSA pseudocode has one issue in that it will carry out superfluous transpositions when the source and destination contain two identical characters next to each other such as "Kitten" and "Sitting" which the algorithm as written will return a distance of 4, when the actual distance is 3. Amending the code line

  if(i > 1 and j > 1 and str1[i] = str2[j-1] and str1[i-1] = str2[j]) then

to

  if(i > 1 and j > 1 and str1[i] = str2[j-1] and str1[i-1] = str2[j] and str1[i-1] <> str1[i]) then

prevents such superfluous transpositions at the cost of an additional comparison. — Preceding unsigned comment added by PaulBBSRC (talkcontribs) 23:59, 3 November 2014 (UTC)Reply

Algortihm computes the cost of the transposition but then takes the min. The change is not needed. Glrx (talk) 16:58, 8 November 2014 (UTC)Reply

A real distance called OSA

edit

I would just want to stress that there exists in the literature a real distance for sequences of symbols called OSA (for Optimal Symbol Alignment). It is a true metric (satisfying the triangular inequality), in contrast to the OSA measure that is described in this Wikipedia entry. The reference to this paper is:

J. Herranz, J. Nin and M. Solé. Optimal Symbol Alignment distance: a new distance for sequences of symbols. IEEE Transactions on Knowledge and Data Engineering, Volume 23, Issue 10, pp. 1541-1554, 2011.
( see: http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2010.190 )

Maybe it would be a good idea to add this reference and some related sentence to this Wikipedia entry. Who can do this, and how ?147.83.104.30 (talk) 13:34, 24 March 2015 (UTC)Reply

Is this reference about Damerau–Levenshtein distance, or some other distance? Because if it's not about the same distance, then it's not clear to me why it should be mentioned. And as a somewhat obscure topic (a primary source only cited 11 times according to Google scholar) it's also not clear that it warrants a separate article. —David Eppstein (talk) 05:41, 25 March 2015 (UTC)Reply

External link doesn't work for me (HTTP 404). These work better:

--CiaPan (talk) 08:11, 25 March 2015 (UTC)Reply

Just use {{cite doi}}: Herranz, J.; Nin, J.; Sole, M. (2011). "Optimal Symbol Alignment Distance: A New Distance for Sequences of Symbols". IEEE Transactions on Knowledge and Data Engineering. 23 (10): 1541. doi:10.1109/TKDE.2010.190.. QVVERTYVS (hm?) 14:45, 25 March 2015 (UTC)Reply

i and j are not defined in the article text

edit

The two variables   and   are not explained in the article text. From the source code I conclude that they are the string length, but this should be already explained under Definition. --87.139.233.163 (talk) 07:00, 30 November 2015 (UTC)Reply

It's right there: "The Damerau–Levenshtein distance between two strings   and   is given by  ". Then d is defined for arbitrary i, j. QVVERTYVS (hm?) 09:36, 30 November 2015 (UTC)Reply
Hope this edit [4] clarifies the function (and it parameters) meaning. --CiaPan (talk) 10:00, 30 November 2015 (UTC)Reply

Code for distance with adjacent transpositions

edit

I have implemented the algorithm and found that line 7 should be da[i] := 0, not da[i] := i to give plausible results. As I have not proven the correctness otherwise, I'm requesting a verification here. Moreover I would add the comment transposition to the fourth parameter of minimum(). MasterFaS (talk) 01:54, 4 February 2016 (UTC)Reply

I added your changes. Glrx (talk) 21:11, 6 February 2016 (UTC)Reply
edit

Hello fellow Wikipedians,

I have just modified 2 external links on Damerau–Levenshtein distance. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 13:27, 5 December 2016 (UTC)Reply

Wondering if this is correct

edit

Under the Algorithm section, it currently says:

"The Damerau–Levenshtein distance LD(CA,ABC) = 2 because CA → AC → ABC"

But I'm not sure how this 2nd step can actually be done using this distance: AC -> ABC?

Hungh3 (talk) 02:25, 27 April 2021 (UTC)Reply

Insert a B. —David Eppstein (talk) 04:45, 27 April 2021 (UTC)Reply

OSA Bug?

edit

For transpositions, the OSA implementation adds 1 to the transposition case in both code snippets but the definition above suggests that this should actually be the variable cost (which will be 1 or 0, depending on whether the current symbols match). Is this a necessary change? 85.186.15.66 (talk) 01:48, 27 September 2024 (UTC)Reply

Swapping equal symbols is a void operation which doesn't actually change the sequence. Similarly, 'replacing' a symbol with itself. Both operations should not be counted because skipping them does not affect the result, and the distance is defined as the minimum number of operations necessary to make a required transformation. Void operations are not necessary, hence they should not count. --CiaPan (talk) 15:20, 27 September 2024 (UTC)Reply
Well, in that case the code is indeed incorrect. You will notice that there is no check anywhere for a[i] = b[j].
if i > 1 and j > 1 and a[i] = b[j-1] and a[i-1] = b[j] then
   d[i, j] := minimum(d[i, j],
                      d[i-2, j-2] + 1)  // transposition
We really should be doing d[i, j] = minimum(d[i, j], d[i - 2, j - 2] + cost). The definition in the article also agrees with this. 85.186.15.66 (talk) 17:51, 28 September 2024 (UTC)Reply
I wish some people replied to this so I know whether the change should be made or not. 85.186.15.66 (talk) 01:04, 20 October 2024 (UTC)Reply