23051 – web page searching should use ICU's search so it can ignore diacritical differences

RESOLVED FIXED 23051

web page searching should use ICU's search so it can ignore diacritical differences

https://bugs.webkit.org/show_bug.cgi?id=23051

Summary web page searching should use ICU's search so it can ignore diacritical diffe...

Darin Adler

Reported 2008-12-30 23:37:30 PST

Right now if you search for "test" you won't find "tëst", but that should work. A good way to fix that is to use ICU's usearch.

Attachments
patch (70.35 KB, patch) 2008-12-31 00:20 PST, Darin Adler	mitz: review+	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

Darin Adler

Comment 1 2008-12-31 00:20:00 PST

<rdar://problem/3574497>

Darin Adler

Comment 2 2008-12-31 00:20:55 PST

Created attachment 26332 [details] patch

mitz

Comment 3 2008-12-31 16:40:04 PST

Comment on attachment 26332 [details] patch > +#if HAVE(ICU_SEARCH) > + > +private: > + String m_target; > + Vector<UChar> m_buffer; > + size_t m_overlap; > + bool m_atBreak; > + > +#else [...] > + > + bool m_atBreak; > + > +#endif Maybe m_atBreak should be common. > + Node* startContainer = r->startContainer(); > + int startOffset = r->startOffset(); > + Node* endContainer = r->endContainer(); > + int endOffset = r->endOffset(); > + if (!startContainer) > return; You can move the return right after calling startContainer(). > + Node* startNode = r->startContainer(); > + Node* endNode = r->endContainer(); > + int startOffset = r->startOffset(); > + int endOffset = r->endOffset(); > + if (!startNode) > return; Here too. I guess it makes little difference in practice. Everything else looks good, but I can't r+ this patch because I don't understand the logic here: > + // Matches that are in the overlap area are only tentative. > + // The same match may appear later, matching more characters, > + // possibly including a combining character that's not yet in the buffer. > + if (!m_atBreak && matchStart + matchLength + m_overlap >= size) { > + memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap); > + m_buffer.shrink(m_overlap); > + return 0; > + }

Darin Adler

Comment 4 2009-01-01 10:46:36 PST

(In reply to comment #3) > > +#if HAVE(ICU_SEARCH) > > + > > +private: > > + String m_target; > > + Vector<UChar> m_buffer; > > + size_t m_overlap; > > + bool m_atBreak; > > + > > +#else > [...] > > + > > + bool m_atBreak; > > + > > +#endif > > Maybe m_atBreak should be common. I'm not sure it should be. I see it as a "coincidence" that they are currently the same, and it could change easily with future revision to either side's algorithm. > > + Node* startContainer = r->startContainer(); > > + int startOffset = r->startOffset(); > > + Node* endContainer = r->endContainer(); > > + int endOffset = r->endOffset(); > > + if (!startContainer) > > return; > > You can move the return right after calling startContainer(). I'll do it. > Everything else looks good, but I can't r+ this patch because I don't > understand the logic here: > > > + // Matches that are in the overlap area are only tentative. > > + // The same match may appear later, matching more characters, > > + // possibly including a combining character that's not yet in the buffer. > > + if (!m_atBreak && matchStart + matchLength + m_overlap >= size) { > > + memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap); > > + m_buffer.shrink(m_overlap); > > + return 0; > > + } I believe the idea is that if you have the following text: abcde<combing-umlaut> Then you can match abcde, but you can also match abcde<combining-umlaut>. And if the very last characters in the buffer are "abcde", then you should not accept the match until you know if the subsequent character is <combining-umlaut>. So a match where the match touches the end of the buffer is only tentative if you might be receiving additional characters in the future, and thus needs to be ignored. To detect that you're in such a situation, I suppose we could detect that the match is at the very end of the buffer. My check for "anywhere in the overlap area" is massive overkill. But "right at the end of the buffer" might also be to narrow. Could character show up even later that could make a longer match if the match wasn't at the end of the buffer? I don't know.

mitz

Comment 5 2009-01-01 12:49:16 PST

Comment on attachment 26332 [details] patch (In reply to comment #4) > > Everything else looks good, but I can't r+ this patch because I don't > > understand the logic here: > > > > > + // Matches that are in the overlap area are only tentative. > > > + // The same match may appear later, matching more characters, > > > + // possibly including a combining character that's not yet in the buffer. > > > + if (!m_atBreak && matchStart + matchLength + m_overlap >= size) { > > > + memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap); > > > + m_buffer.shrink(m_overlap); > > > + return 0; > > > + } > > I believe the idea is that if you have the following text: > > abcde<combing-umlaut> > > Then you can match abcde, but you can also match abcde<combining-umlaut>. And > if the very last characters in the buffer are "abcde", then you should not > accept the match until you know if the subsequent character is > <combining-umlaut>. So a match where the match touches the end of the buffer is > only tentative if you might be receiving additional characters in the future, > and thus needs to be ignored. It looks like the code checks if the match ends in the overlap area, but then it assumes that in that case, it also starts in the overlap area. I think that's what confused me, but it makes sense because the overlap area is guaranteed to be fairly large.

Darin Adler

Comment 6 2009-01-01 13:19:28 PST

(In reply to comment #5) > It looks like the code checks if the match ends in the overlap area, but then > it assumes that in that case, it also starts in the overlap area. Oops! That was a mistake. I changed it to only check if the match starts in the overlap area as I originally intended.

Darin Adler

Comment 7 2009-01-01 13:30:50 PST

http://trac.webkit.org/changeset/39536

Simon Fraser (smfr)

Comment 8 2009-01-01 19:51:48 PST

This seems to have caused 12-13 new test failures, due to whitespace changes.

Simon Fraser (smfr)

Comment 9 2009-01-01 19:54:00 PST

Fixed bug 23071 in the test failures.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Mac

OS OS X 10.5

Product WebKit

Component HTML Editing

Assignee

Darin Adler

Reported

2008-12-30 23:37 PST

Modified

2009-01-01 19:54 PST History

CC List

2 users Show

URL

Keywords InRadar

Depends on

Blocks