WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
23051
web page searching should use ICU's search so it can ignore diacritical differences
https://bugs.webkit.org/show_bug.cgi?id=23051
Summary
web page searching should use ICU's search so it can ignore diacritical diffe...
Darin Adler
Reported
2008-12-30 23:37:30 PST
Right now if you search for "test" you won't find "tëst", but that should work. A good way to fix that is to use ICU's usearch.
Attachments
patch
(70.35 KB, patch)
2008-12-31 00:20 PST
,
Darin Adler
mitz: review+
Details
Formatted Diff
Diff
View All
Add attachment
proposed patch, testcase, etc.
Darin Adler
Comment 1
2008-12-31 00:20:00 PST
<
rdar://problem/3574497
>
Darin Adler
Comment 2
2008-12-31 00:20:55 PST
Created
attachment 26332
[details]
patch
mitz
Comment 3
2008-12-31 16:40:04 PST
Comment on
attachment 26332
[details]
patch
> +#if HAVE(ICU_SEARCH) > + > +private: > + String m_target; > + Vector<UChar> m_buffer; > + size_t m_overlap; > + bool m_atBreak; > + > +#else
[...]
> + > + bool m_atBreak; > + > +#endif
Maybe m_atBreak should be common.
> + Node* startContainer = r->startContainer(); > + int startOffset = r->startOffset(); > + Node* endContainer = r->endContainer(); > + int endOffset = r->endOffset(); > + if (!startContainer) > return;
You can move the return right after calling startContainer().
> + Node* startNode = r->startContainer(); > + Node* endNode = r->endContainer(); > + int startOffset = r->startOffset(); > + int endOffset = r->endOffset(); > + if (!startNode) > return;
Here too. I guess it makes little difference in practice. Everything else looks good, but I can't r+ this patch because I don't understand the logic here:
> + // Matches that are in the overlap area are only tentative. > + // The same match may appear later, matching more characters, > + // possibly including a combining character that's not yet in the buffer. > + if (!m_atBreak && matchStart + matchLength + m_overlap >= size) { > + memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap); > + m_buffer.shrink(m_overlap); > + return 0; > + }
Darin Adler
Comment 4
2009-01-01 10:46:36 PST
(In reply to
comment #3
)
> > +#if HAVE(ICU_SEARCH) > > + > > +private: > > + String m_target; > > + Vector<UChar> m_buffer; > > + size_t m_overlap; > > + bool m_atBreak; > > + > > +#else > [...] > > + > > + bool m_atBreak; > > + > > +#endif > > Maybe m_atBreak should be common.
I'm not sure it should be. I see it as a "coincidence" that they are currently the same, and it could change easily with future revision to either side's algorithm.
> > + Node* startContainer = r->startContainer(); > > + int startOffset = r->startOffset(); > > + Node* endContainer = r->endContainer(); > > + int endOffset = r->endOffset(); > > + if (!startContainer) > > return; > > You can move the return right after calling startContainer().
I'll do it.
> Everything else looks good, but I can't r+ this patch because I don't > understand the logic here: > > > + // Matches that are in the overlap area are only tentative. > > + // The same match may appear later, matching more characters, > > + // possibly including a combining character that's not yet in the buffer. > > + if (!m_atBreak && matchStart + matchLength + m_overlap >= size) { > > + memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap); > > + m_buffer.shrink(m_overlap); > > + return 0; > > + }
I believe the idea is that if you have the following text: abcde<combing-umlaut> Then you can match abcde, but you can also match abcde<combining-umlaut>. And if the very last characters in the buffer are "abcde", then you should not accept the match until you know if the subsequent character is <combining-umlaut>. So a match where the match touches the end of the buffer is only tentative if you might be receiving additional characters in the future, and thus needs to be ignored. To detect that you're in such a situation, I suppose we could detect that the match is at the very end of the buffer. My check for "anywhere in the overlap area" is massive overkill. But "right at the end of the buffer" might also be to narrow. Could character show up even later that could make a longer match if the match wasn't at the end of the buffer? I don't know.
mitz
Comment 5
2009-01-01 12:49:16 PST
Comment on
attachment 26332
[details]
patch (In reply to
comment #4
)
> > Everything else looks good, but I can't r+ this patch because I don't > > understand the logic here: > > > > > + // Matches that are in the overlap area are only tentative. > > > + // The same match may appear later, matching more characters, > > > + // possibly including a combining character that's not yet in the buffer. > > > + if (!m_atBreak && matchStart + matchLength + m_overlap >= size) { > > > + memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap); > > > + m_buffer.shrink(m_overlap); > > > + return 0; > > > + } > > I believe the idea is that if you have the following text: > > abcde<combing-umlaut> > > Then you can match abcde, but you can also match abcde<combining-umlaut>. And > if the very last characters in the buffer are "abcde", then you should not > accept the match until you know if the subsequent character is > <combining-umlaut>. So a match where the match touches the end of the buffer is > only tentative if you might be receiving additional characters in the future, > and thus needs to be ignored.
It looks like the code checks if the match ends in the overlap area, but then it assumes that in that case, it also starts in the overlap area. I think that's what confused me, but it makes sense because the overlap area is guaranteed to be fairly large.
Darin Adler
Comment 6
2009-01-01 13:19:28 PST
(In reply to
comment #5
)
> It looks like the code checks if the match ends in the overlap area, but then > it assumes that in that case, it also starts in the overlap area.
Oops! That was a mistake. I changed it to only check if the match starts in the overlap area as I originally intended.
Darin Adler
Comment 7
2009-01-01 13:30:50 PST
http://trac.webkit.org/changeset/39536
Simon Fraser (smfr)
Comment 8
2009-01-01 19:51:48 PST
This seems to have caused 12-13 new test failures, due to whitespace changes.
Simon Fraser (smfr)
Comment 9
2009-01-01 19:54:00 PST
Fixed
bug 23071
in the test failures.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug