Bug 23051 - web page searching should use ICU's search so it can ignore diacritical differences
Summary: web page searching should use ICU's search so it can ignore diacritical diffe...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: HTML Editing (show other bugs)
Version: 528+ (Nightly build)
Hardware: Mac OS X 10.5
: P2 Normal
Assignee: Darin Adler
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2008-12-30 23:37 PST by Darin Adler
Modified: 2009-01-01 19:54 PST (History)
2 users (show)

See Also:


Attachments
patch (70.35 KB, patch)
2008-12-31 00:20 PST, Darin Adler
mitz: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Darin Adler 2008-12-30 23:37:30 PST
Right now if you search for "test" you won't find "tëst", but that should work. A good way to fix that is to use ICU's usearch.
Comment 1 Darin Adler 2008-12-31 00:20:00 PST
<rdar://problem/3574497>
Comment 2 Darin Adler 2008-12-31 00:20:55 PST
Created attachment 26332 [details]
patch
Comment 3 mitz 2008-12-31 16:40:04 PST
Comment on attachment 26332 [details]
patch

> +#if HAVE(ICU_SEARCH)
> +
> +private:
> +    String m_target;
> +    Vector<UChar> m_buffer;
> +    size_t m_overlap;
> +    bool m_atBreak;
> +
> +#else
[...]
> +
> +    bool m_atBreak;
> +
> +#endif

Maybe m_atBreak should be common.

> +    Node* startContainer = r->startContainer();
> +    int startOffset = r->startOffset();
> +    Node* endContainer = r->endContainer();
> +    int endOffset = r->endOffset();
> +    if (!startContainer)
>          return;

You can move the return right after calling startContainer().

> +    Node* startNode = r->startContainer();
> +    Node* endNode = r->endContainer();
> +    int startOffset = r->startOffset();
> +    int endOffset = r->endOffset();
> +    if (!startNode)
>          return;

Here too. I guess it makes little difference in practice.

Everything else looks good, but I can't r+ this patch because I don't understand the logic here:

> +    // Matches that are in the overlap area are only tentative.
> +    // The same match may appear later, matching more characters,
> +    // possibly including a combining character that's not yet in the buffer.
> +    if (!m_atBreak && matchStart + matchLength + m_overlap >= size) {
> +        memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap);
> +        m_buffer.shrink(m_overlap);
> +        return 0;
> +    }
Comment 4 Darin Adler 2009-01-01 10:46:36 PST
(In reply to comment #3)
> > +#if HAVE(ICU_SEARCH)
> > +
> > +private:
> > +    String m_target;
> > +    Vector<UChar> m_buffer;
> > +    size_t m_overlap;
> > +    bool m_atBreak;
> > +
> > +#else
> [...]
> > +
> > +    bool m_atBreak;
> > +
> > +#endif
> 
> Maybe m_atBreak should be common.

I'm not sure it should be. I see it as a "coincidence" that they are currently the same, and it could change easily with future revision to either side's algorithm.

> > +    Node* startContainer = r->startContainer();
> > +    int startOffset = r->startOffset();
> > +    Node* endContainer = r->endContainer();
> > +    int endOffset = r->endOffset();
> > +    if (!startContainer)
> >          return;
> 
> You can move the return right after calling startContainer().

I'll do it.

> Everything else looks good, but I can't r+ this patch because I don't
> understand the logic here:
> 
> > +    // Matches that are in the overlap area are only tentative.
> > +    // The same match may appear later, matching more characters,
> > +    // possibly including a combining character that's not yet in the buffer.
> > +    if (!m_atBreak && matchStart + matchLength + m_overlap >= size) {
> > +        memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap);
> > +        m_buffer.shrink(m_overlap);
> > +        return 0;
> > +    }

I believe the idea is that if you have the following text:

    abcde<combing-umlaut>

Then you can match abcde, but you can also match abcde<combining-umlaut>. And if the very last characters in the buffer are "abcde", then you should not accept the match until you know if the subsequent character is <combining-umlaut>. So a match where the match touches the end of the buffer is only tentative if you might be receiving additional characters in the future, and thus needs to be ignored. To detect that you're in such a situation, I suppose we could detect that the match is at the very end of the buffer. My check for "anywhere in the overlap area" is massive overkill. But "right at the end of the buffer" might also be to narrow. Could character show up even later that could make a longer match if the match wasn't at the end of the buffer? I don't know.
Comment 5 mitz 2009-01-01 12:49:16 PST
Comment on attachment 26332 [details]
patch

(In reply to comment #4)
> > Everything else looks good, but I can't r+ this patch because I don't
> > understand the logic here:
> > 
> > > +    // Matches that are in the overlap area are only tentative.
> > > +    // The same match may appear later, matching more characters,
> > > +    // possibly including a combining character that's not yet in the buffer.
> > > +    if (!m_atBreak && matchStart + matchLength + m_overlap >= size) {
> > > +        memcpy(m_buffer.data(), m_buffer.data() + m_buffer.size() - m_overlap, m_overlap);
> > > +        m_buffer.shrink(m_overlap);
> > > +        return 0;
> > > +    }
> 
> I believe the idea is that if you have the following text:
> 
>     abcde<combing-umlaut>
> 
> Then you can match abcde, but you can also match abcde<combining-umlaut>. And
> if the very last characters in the buffer are "abcde", then you should not
> accept the match until you know if the subsequent character is
> <combining-umlaut>. So a match where the match touches the end of the buffer is
> only tentative if you might be receiving additional characters in the future,
> and thus needs to be ignored.

It looks like the code checks if the match ends in the overlap area, but then it assumes that in that case, it also starts in the overlap area. I think that's what confused me, but it makes sense because the overlap area is guaranteed to be fairly large.
Comment 6 Darin Adler 2009-01-01 13:19:28 PST
(In reply to comment #5)
> It looks like the code checks if the match ends in the overlap area, but then
> it assumes that in that case, it also starts in the overlap area.

Oops! That was a mistake. I changed it to only check if the match starts in the overlap area as I originally intended.
Comment 7 Darin Adler 2009-01-01 13:30:50 PST
http://trac.webkit.org/changeset/39536
Comment 8 Simon Fraser (smfr) 2009-01-01 19:51:48 PST
This seems to have caused 12-13 new test failures, due to whitespace changes.
Comment 9 Simon Fraser (smfr) 2009-01-01 19:54:00 PST
Fixed bug 23071 in the test failures.