Talk:FM-index
This article is rated Stub-class on Wikipedia's content assessment scale.
What is an FM Index?
I'm not convinced that this article addresses the "what" question. The last sentence of the section "FM-index data structure" reads:
- The FM-index itself is a compression of the string L together with C and Occ in some form, as well as information that maps a selection of indices in L to positions in the original string T.
This really needs expansion. It seems strange to talk about the LF mapping in so much detail and then not describe how the index is structured. gringer (talk) 06:14, 16 June 2013 (UTC)
- Agreed. Of particular note - how does the compression work? Per the article's description, L is the same size as the original text, so we need to be compressing L in order for the FM-index as a whole to be smaller than the original text, but there's no indication of how that compression is done - or of how the count and locate operations access data within the compressed data structure. ExplodingCabbage (talk) 12:54, 16 March 2022 (UTC)
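- To make the compression question concrete, here is a rough sketch of why L tends to be compressible at all. It builds L for the article's example string and run-length encodes it. Run-length encoding is only an illustrative stand-in chosen for this comment, not necessarily the scheme the article or any particular FM-index implementation uses, and the helper names `bwt_last_column` and `run_length_encode` are made up here.

```python
from itertools import groupby


def bwt_last_column(text):
    """Return the last column L of the sorted rotation matrix of text."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def run_length_encode(s):
    """Collapse each run of equal characters into a (character, length) pair."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


L = bwt_last_column("abracadabra$")
runs = run_length_encode(L)
print(L)     # the last column of the sorted rotations
print(runs)  # equal characters cluster into runs, e.g. the four 'a's
```

On a 12-character toy string the saving is negligible; the point is only that the transform groups equal characters together, which is what a compressor can then exploit. How count and locate read from a compressed L is a separate question that this sketch does not address.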
Poor description of "FM-index data structure"
The article is confusing. For example, it says: "For any row i of the matrix, the character in the last column L[i] precedes the character in the first column F[i] also in T." What does it mean? It is clearly not true: if we let i be the first row of the matrix, then L[i] is the last letter in the string T, so it must come after everything else in T. Also, if i represents the first row, F[i] must be the letter in T which comes first in alphabetic ordering ($ in the example). So no other letter can come before it.
Perhaps after "the matrix" we should insert "M". Perhaps before "T" we should insert "the original string". 128.16.7.220 (talk) 15:47, 7 June 2014 (UTC) Bill
- The particular sentence you quote is correct. You say that
- "If we let i be the first row of the matrix, then L[i] is the last letter in the string T"
- but this is not true. Note that when T = "abracadabra$", L[i] is 'a' and F[i] is '$' for the first row. ExplodingCabbage (talk) 13:06, 16 March 2022 (UTC)
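- The claim can be checked mechanically. The sketch below sorts the rotations of T, reads off F and L for the first row, and tests the quoted property for every row, assuming "precedes" is meant cyclically (so the row whose L is '$' wraps around to the start of T). The helper name `precedes_in_T` is made up for this comment.

```python
T = "abracadabra$"

# Rows of the matrix M: all rotations of T in lexicographic order.
rotations = sorted(T[i:] + T[:i] for i in range(len(T)))

first_row = rotations[0]
F0, L0 = first_row[0], first_row[-1]
print(first_row, F0, L0)


def precedes_in_T(row):
    """True if L[i] immediately precedes F[i] in T, treating T as cyclic."""
    return (row[-1] + row[0]) in (T + T[0])


holds_for_all_rows = all(precedes_in_T(row) for row in rotations)
print(holds_for_all_rows)
```

The first row is the rotation starting at '$', so its last character is the character just before '$' in T, not the last character of T.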
Count list item 1.
Surely suffix "bra" ends with a. It starts with b. 128.16.7.220 (talk) 16:05, 7 June 2014 (UTC) Bill
Count example off by one error?
Could someone else check the example in the "Count" section? In most places in the article the indexes start at one. Assuming that, the start and end values given in the example appear to be out by one in most cases. 128.16.7.220 (talk) 16:58, 7 June 2014 (UTC) Bill
- The start and end values all look correct to me. Can you give an example of one you think is incorrect and walk through why? ExplodingCabbage (talk) 14:23, 16 March 2022 (UTC)
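- For anyone who wants to check the ranges themselves, here is a small backward-search sketch. It assumes 1-based, inclusive row ranges, C[c] defined as the number of characters in T smaller than c, and Occ(c, k) defined as the number of occurrences of c in L[1..k]. Occ is computed naively by counting, which is fine for checking values but is not how a real index would do it. The function name `backward_search` is made up for this comment.

```python
def backward_search(pattern, text):
    """Return (count, ranges), one 1-based inclusive range per pattern character."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    L = "".join(row[-1] for row in rotations)
    # C[c]: number of characters in text that are lexicographically smaller than c.
    C = {c: sum(1 for x in text if x < c) for c in set(text)}

    def occ(c, k):
        """Occurrences of c in L[1..k] (naive O(k) count)."""
        return L[:k].count(c)

    start, end = 1, len(L)
    ranges = []
    for c in reversed(pattern):  # process the pattern from its last character
        if c not in C:
            return 0, ranges
        start = C[c] + occ(c, start - 1) + 1
        end = C[c] + occ(c, end)
        ranges.append((start, end))
        if start > end:  # empty range: the pattern does not occur
            break
    return max(0, end - start + 1), ranges


count, ranges = backward_search("bra", "abracadabra$")
print(count, ranges)
```

Under those conventions the ranges for "a", "ra" and "bra" come out as [2..6], [11..12] and [7..8]. With 0-based rows every value would be one lower, which may be the source of the apparent off-by-one.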
Locate section needs examples. Existing example needs checking
Existing applications of FM-index use it for looking up strings, rather than counting how many times the string occurs. Hence it is more important that the "Locate" section makes sense than the "Count" section. Would it therefore be possible to add more explanation and/or examples to the "Locate" section?
The existing text, "For instance locate(7) = 8", appears to be wrong. "locate(7)" appears to mean the 7th character in L, which is "c". "c" occurs only once in "abracadabra" and that is position 5 not position 7.
128.16.7.220 (talk) 17:29, 7 June 2014 (UTC) Bill
"locate(7)" appears to mean the 7th character in L, which is "c"
- No, it's "a", not "c". The example looks correct to me. ExplodingCabbage (talk) 14:22, 16 March 2022 (UTC)
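- A quick way to settle this is to compute it. The sketch below assumes that rows of L and positions in T are both 1-based, and that locate(i) means the position in T of the character L[i]. It uses a full list of rotation start positions instead of the sampled positions a real FM-index would store, so it only checks the values, not the algorithm. The names `order` and `locate` are chosen for this comment.

```python
T = "abracadabra$"
n = len(T)

# 0-based start position in T of each sorted rotation (a plain suffix array).
order = sorted(range(n), key=lambda i: T[i:] + T[:i])

# L[i] is the character cyclically preceding the start of row i's rotation.
L = "".join(T[i - 1] for i in order)


def locate(i):
    """1-based position in T of the character L[i], with i 1-based."""
    start0 = order[i - 1]      # 0-based start of row i's rotation
    pos0 = (start0 - 1) % n    # the character just before it, cyclically
    return pos0 + 1


print(L)
print(L[7 - 1], locate(7))
print(L[6 - 1], locate(6))
```

Under these assumptions the 7th character of L is 'a' and locate(7) is 8, while 'c' is the 6th character of L and locate(6) is 5, which matches the position of 'c' in "abracadabra".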
What is ε?
The "Locate" section uses formulae such as O(p + occ log^ε u) without ever defining ε.
What is ?
The "locate" section uses the formula without ever defining , which as far as I can spot isn't defined anywhere else in the article either.
How does Occ work? How can it POSSIBLY be computed in constant time?
The function Occ(c, k) is a necessary part of both the "count" and "locate" algorithms yet is skipped over with no explanation. Indeed, the article makes it sound rather magical - apparently "it is possible to compute Occ(c, k) in constant time" despite the fact that the obvious dumb approach to implementing Occ (iterate over the first k characters and count the ones lexically smaller than c) of course takes O(k) time. Nowhere are we told what algorithm is used to achieve this magic, nor what data structures within the FM-index it uses.
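- One way this can be done, offered as a sketch and not as a description of what the article has in mind: store a table of per-character counts at every `step`-th position of L, then answer a query by reading the nearest earlier checkpoint and scanning at most `step - 1` further characters. For a fixed `step` that work does not grow with k. This assumes Occ(c, k) means the number of occurrences of c in the first k characters of L; the class name `CheckpointedOcc` and the parameter `step` are invented for this comment.

```python
class CheckpointedOcc:
    """Answer Occ(c, k) from sampled counts plus a short bounded scan of L."""

    def __init__(self, L, step=4):
        self.L = L
        self.step = step
        counts = {c: 0 for c in set(L)}
        # checkpoints[j][c] = occurrences of c in the first j * step characters.
        self.checkpoints = [dict(counts)]
        for i, ch in enumerate(L, start=1):
            counts[ch] += 1
            if i % step == 0:
                self.checkpoints.append(dict(counts))

    def occ(self, c, k):
        """Occurrences of c in L[1..k] (1-based, inclusive); k may be 0."""
        if c not in self.checkpoints[0]:
            return 0
        block = k // self.step
        count = self.checkpoints[block][c]
        # Scan only the characters after the checkpoint: fewer than `step`.
        for ch in self.L[block * self.step:k]:
            if ch == c:
                count += 1
        return count


index = CheckpointedOcc("ard$rcaaaabb", step=4)
print(index.occ("a", 6), index.occ("a", 8), index.occ("r", 6))
```

The trade-off is space against time: a smaller `step` means more checkpoint rows and a shorter scan. This sketch keeps L uncompressed, so it says nothing about how the same query would be answered over a compressed L.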