Right now if you search for "test" you won't find "tëst", but that should work. A good way to fix that is to use ICU's usearch.
<rdar://problem/3574497>
Created attachment 26332 [details] patch
Comment on attachment 26332 [details] patch > +#if HAVE(ICU_SEARCH) > + > +private: > + String m_target; > + Vector<UChar> m_buffer; > + size_t m_overlap; > + bool m_atBreak; > + > +#else [...] > + > + bool m_atBreak; > + > +#endif Maybe m_atBreak should be common. > + Node* startContainer = r->startContainer(); > + int startOffset = r->startOffset(); > + Node* endContainer = r->endContainer(); > + int endOffset = r->endOffset(); > + if (!startContainer) > return; You can move the return right after calling startContainer(). > + Node* startNode = r->startContainer(); > + Node* endNode = r->endContainer(); > + int startOffset = r->startOffset(); > + int endOffset = r->endOffset(); > + if (!startNode) > return; Here too. I guess it makes little difference in practice. Everything else looks good, but I can't r+ this patch because I don't understand the logic here: > + // Matches that are in the overlap area are only tentative. > + // The same match may appear later, matching more characters, > + // possibly including a combining character that's not yet in the buffer. > + if (!m_atBreak && matchStart + matchLength + m_overlap >= size) { > + memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap); > + m_buffer.shrink(m_overlap); > + return 0; > + }
(In reply to comment #3) > > +#if HAVE(ICU_SEARCH) > > + > > +private: > > + String m_target; > > + Vector<UChar> m_buffer; > > + size_t m_overlap; > > + bool m_atBreak; > > + > > +#else > [...] > > + > > + bool m_atBreak; > > + > > +#endif > > Maybe m_atBreak should be common. I'm not sure it should be. I see it as a "coincidence" that they are currently the same, and it could change easily with future revision to either side's algorithm. > > + Node* startContainer = r->startContainer(); > > + int startOffset = r->startOffset(); > > + Node* endContainer = r->endContainer(); > > + int endOffset = r->endOffset(); > > + if (!startContainer) > > return; > > You can move the return right after calling startContainer(). I'll do it. > Everything else looks good, but I can't r+ this patch because I don't > understand the logic here: > > > + // Matches that are in the overlap area are only tentative. > > + // The same match may appear later, matching more characters, > > + // possibly including a combining character that's not yet in the buffer. > > + if (!m_atBreak && matchStart + matchLength + m_overlap >= size) { > > + memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap); > > + m_buffer.shrink(m_overlap); > > + return 0; > > + } I believe the idea is that if you have the following text: abcde<combing-umlaut> Then you can match abcde, but you can also match abcde<combining-umlaut>. And if the very last characters in the buffer are "abcde", then you should not accept the match until you know if the subsequent character is <combining-umlaut>. So a match where the match touches the end of the buffer is only tentative if you might be receiving additional characters in the future, and thus needs to be ignored. To detect that you're in such a situation, I suppose we could detect that the match is at the very end of the buffer. My check for "anywhere in the overlap area" is massive overkill. But "right at the end of the buffer" might also be to narrow. Could character show up even later that could make a longer match if the match wasn't at the end of the buffer? I don't know.
Comment on attachment 26332 [details] patch (In reply to comment #4) > > Everything else looks good, but I can't r+ this patch because I don't > > understand the logic here: > > > > > + // Matches that are in the overlap area are only tentative. > > > + // The same match may appear later, matching more characters, > > > + // possibly including a combining character that's not yet in the buffer. > > > + if (!m_atBreak && matchStart + matchLength + m_overlap >= size) { > > > + memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap); > > > + m_buffer.shrink(m_overlap); > > > + return 0; > > > + } > > I believe the idea is that if you have the following text: > > abcde<combing-umlaut> > > Then you can match abcde, but you can also match abcde<combining-umlaut>. And > if the very last characters in the buffer are "abcde", then you should not > accept the match until you know if the subsequent character is > <combining-umlaut>. So a match where the match touches the end of the buffer is > only tentative if you might be receiving additional characters in the future, > and thus needs to be ignored. It looks like the code checks if the match ends in the overlap area, but then it assumes that in that case, it also starts in the overlap area. I think that's what confused me, but it makes sense because the overlap area is guaranteed to be fairly large.
(In reply to comment #5) > It looks like the code checks if the match ends in the overlap area, but then > it assumes that in that case, it also starts in the overlap area. Oops! That was a mistake. I changed it to only check if the match starts in the overlap area as I originally intended.
http://trac.webkit.org/changeset/39536
This seems to have caused 12-13 new test failures, due to whitespace changes.
Fixed bug 23071 in the test failures.