RESOLVED FIXED 23051
web page searching should use ICU's search so it can ignore diacritical differences
https://bugs.webkit.org/show_bug.cgi?id=23051
Summary web page searching should use ICU's search so it can ignore diacritical diffe...
Darin Adler
Reported 2008-12-30 23:37:30 PST
Right now if you search for "test" you won't find "tëst", but that should work. A good way to fix that is to use ICU's usearch.
Attachments
patch (70.35 KB, patch)
2008-12-31 00:20 PST, Darin Adler
mitz: review+
Darin Adler
Comment 1 2008-12-31 00:20:00 PST
Darin Adler
Comment 2 2008-12-31 00:20:55 PST
mitz
Comment 3 2008-12-31 16:40:04 PST
Comment on attachment 26332 [details] patch > +#if HAVE(ICU_SEARCH) > + > +private: > + String m_target; > + Vector<UChar> m_buffer; > + size_t m_overlap; > + bool m_atBreak; > + > +#else [...] > + > + bool m_atBreak; > + > +#endif Maybe m_atBreak should be common. > + Node* startContainer = r->startContainer(); > + int startOffset = r->startOffset(); > + Node* endContainer = r->endContainer(); > + int endOffset = r->endOffset(); > + if (!startContainer) > return; You can move the return right after calling startContainer(). > + Node* startNode = r->startContainer(); > + Node* endNode = r->endContainer(); > + int startOffset = r->startOffset(); > + int endOffset = r->endOffset(); > + if (!startNode) > return; Here too. I guess it makes little difference in practice. Everything else looks good, but I can't r+ this patch because I don't understand the logic here: > + // Matches that are in the overlap area are only tentative. > + // The same match may appear later, matching more characters, > + // possibly including a combining character that's not yet in the buffer. > + if (!m_atBreak && matchStart + matchLength + m_overlap >= size) { > + memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap); > + m_buffer.shrink(m_overlap); > + return 0; > + }
Darin Adler
Comment 4 2009-01-01 10:46:36 PST
(In reply to comment #3) > > +#if HAVE(ICU_SEARCH) > > + > > +private: > > + String m_target; > > + Vector<UChar> m_buffer; > > + size_t m_overlap; > > + bool m_atBreak; > > + > > +#else > [...] > > + > > + bool m_atBreak; > > + > > +#endif > > Maybe m_atBreak should be common. I'm not sure it should be. I see it as a "coincidence" that they are currently the same, and it could change easily with future revision to either side's algorithm. > > + Node* startContainer = r->startContainer(); > > + int startOffset = r->startOffset(); > > + Node* endContainer = r->endContainer(); > > + int endOffset = r->endOffset(); > > + if (!startContainer) > > return; > > You can move the return right after calling startContainer(). I'll do it. > Everything else looks good, but I can't r+ this patch because I don't > understand the logic here: > > > + // Matches that are in the overlap area are only tentative. > > + // The same match may appear later, matching more characters, > > + // possibly including a combining character that's not yet in the buffer. > > + if (!m_atBreak && matchStart + matchLength + m_overlap >= size) { > > + memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap); > > + m_buffer.shrink(m_overlap); > > + return 0; > > + } I believe the idea is that if you have the following text: abcde<combing-umlaut> Then you can match abcde, but you can also match abcde<combining-umlaut>. And if the very last characters in the buffer are "abcde", then you should not accept the match until you know if the subsequent character is <combining-umlaut>. So a match where the match touches the end of the buffer is only tentative if you might be receiving additional characters in the future, and thus needs to be ignored. To detect that you're in such a situation, I suppose we could detect that the match is at the very end of the buffer. My check for "anywhere in the overlap area" is massive overkill. But "right at the end of the buffer" might also be to narrow. Could character show up even later that could make a longer match if the match wasn't at the end of the buffer? I don't know.
mitz
Comment 5 2009-01-01 12:49:16 PST
Comment on attachment 26332 [details] patch (In reply to comment #4) > > Everything else looks good, but I can't r+ this patch because I don't > > understand the logic here: > > > > > + // Matches that are in the overlap area are only tentative. > > > + // The same match may appear later, matching more characters, > > > + // possibly including a combining character that's not yet in the buffer. > > > + if (!m_atBreak && matchStart + matchLength + m_overlap >= size) { > > > + memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap); > > > + m_buffer.shrink(m_overlap); > > > + return 0; > > > + } > > I believe the idea is that if you have the following text: > > abcde<combing-umlaut> > > Then you can match abcde, but you can also match abcde<combining-umlaut>. And > if the very last characters in the buffer are "abcde", then you should not > accept the match until you know if the subsequent character is > <combining-umlaut>. So a match where the match touches the end of the buffer is > only tentative if you might be receiving additional characters in the future, > and thus needs to be ignored. It looks like the code checks if the match ends in the overlap area, but then it assumes that in that case, it also starts in the overlap area. I think that's what confused me, but it makes sense because the overlap area is guaranteed to be fairly large.
Darin Adler
Comment 6 2009-01-01 13:19:28 PST
(In reply to comment #5) > It looks like the code checks if the match ends in the overlap area, but then > it assumes that in that case, it also starts in the overlap area. Oops! That was a mistake. I changed it to only check if the match starts in the overlap area as I originally intended.
Darin Adler
Comment 7 2009-01-01 13:30:50 PST
Simon Fraser (smfr)
Comment 8 2009-01-01 19:51:48 PST
This seems to have caused 12-13 new test failures, due to whitespace changes.
Simon Fraser (smfr)
Comment 9 2009-01-01 19:54:00 PST
Fixed bug 23071 in the test failures.
Note You need to log in before you can comment on or make changes to this bug.