Bug 50245

Summary: Add word-prefix search options to the text search without ICU
Product: WebKit Reporter: Benjamin Poulain <benjamin>
Component: TextAssignee: Nobody <webkit-unassigned>
Status: NEW ---    
Severity: Normal CC: cshu, max.hong.shen, tonikitoo
Priority: P2 Keywords: Qt, QtTriaged
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on: 50038, 60786    
Bug Blocks:    
Attachments:
Description Flags
word-prefix search for no-ICU code path (v1.0) darin: review-

Description Benjamin Poulain 2010-11-30 10:24:25 PST
The patch r72887 (https://bugs.webkit.org/show_bug.cgi?id=50038 ) adds options to Editor::findString() but does not implement the changes to the non-ICU version of SearchBuffer.

The test editing/text-iterator/findString.html fails on any case using the new options.
Comment 1 Yi Shen 2011-05-26 10:39:05 PDT
Created attachment 95002 [details]
word-prefix search for no-ICU code path (v1.0)

For Qt, this patch makes the LayoutTests/editing/text-iterator/findString.html pass all the tests except the one for the Thai words.

 var thaiWords = [
        "\u0e01\u0e23",
        "\u0e1b\u0e39\u0e40\u0e25",
        "\u0e01\u0e0a",
        "\u0e01\u0e0a\u0e01\u0e23", // thaiWords[2] + thaiWords[0]
        "\u0e01\u0e23\u0e01\u0e0a", // thaiWords[0] + thaiWords[2]
        "\u0e1a\u0e07\u0e01\u0e0a", // ends with thaiWords[2]
    ];

    testFindString(thaiWords.join(""), thaiWords[0], ["AtWordStarts"], [[0, 2], [12, 14], []]); // Failed here

After traced down the problem in SearchBuffer::isWordStartMatch() function, I found the issue is that findNextWordFromIndex() doesn't return a correct word position for Thai words, which is implemented in TextBoundariesQt.cpp by using QTextBoundaryFinder. In other word, this is a Qt issue and I will fire a Qt bug for it.
Comment 2 Chang Shu 2011-05-26 11:12:21 PDT
Comment on attachment 95002 [details]
word-prefix search for no-ICU code path (v1.0)

View in context: https://bugs.webkit.org/attachment.cgi?id=95002&action=review

I understand you didn't unskip findString.html because of the Thai char issue. But it will be good to show some tests improvement after your fix. Maybe we can split findString.html and put Thai tests in a separate test?
Overall, it's a good patch. Thanks.

> Source/WebCore/editing/TextIterator.cpp:106
> +    bool isWordStartMatch(size_t, size_t length) const;

What do you think we move all the common apis to the common place?

> Source/WebCore/editing/TextIterator.cpp:2189
> +

same thing here. We can move identical code to the common place.

> Source/WebCore/editing/TextIterator.cpp:2236
> +            memcpy(m_context.data(), m_context.data() + 1, (m_context.size() - 1) * sizeof(UChar));

memcpy to overlapped place may cause unexpected results?
Comment 3 Chang Shu 2011-05-26 11:24:23 PDT
> > Source/WebCore/editing/TextIterator.cpp:2236
> > +            memcpy(m_context.data(), m_context.data() + 1, (m_context.size() - 1) * sizeof(UChar));
> 
> memcpy to overlapped place may cause unexpected results?

I think memmove is a better way to do it.
Comment 4 Darin Adler 2011-06-17 11:01:27 PDT
Comment on attachment 95002 [details]
word-prefix search for no-ICU code path (v1.0)

View in context: https://bugs.webkit.org/attachment.cgi?id=95002&action=review

Need to at least fix the memcpy problem. But I think that also using a circular buffer instead of a memmove each time is the way to go. It may be tricky to get a circular buffer to work with functions such as U16_GET, though, so not sure it will be easy.

>> Source/WebCore/editing/TextIterator.cpp:2189
>> +
> 
> same thing here. We can move identical code to the common place.

We don’t want to copy and paste this code, yes, so we need a version of the patch that does not involve a second copy.

>> Source/WebCore/editing/TextIterator.cpp:2236
>> +            memcpy(m_context.data(), m_context.data() + 1, (m_context.size() - 1) * sizeof(UChar));
> 
> memcpy to overlapped place may cause unexpected results?

That’s right. You can’t use memcpy for this; it has to be memmove instead.

And doing a memmove every time we add a character is really inefficient. I suggest considering a circular buffer instead.
Comment 5 Yi Shen 2011-06-17 13:38:08 PDT
I agree that memmove is a better way, but I think memcpy is fine in this case since the source is after the destination,right?

e.g,
str = "0123456789"
memcpy(str+2, str+1, 3);
returns: 0111356789

memcpy(str+1, str+2, 3);
returns: 0234456789

Anyway, I will change it.

(In reply to comment #3)
> > > Source/WebCore/editing/TextIterator.cpp:2236
> > > +            memcpy(m_context.data(), m_context.data() + 1, (m_context.size() - 1) * sizeof(UChar));
> > 
> > memcpy to overlapped place may cause unexpected results?
> 
> I think memmove is a better way to do it.
Comment 6 Yi Shen 2011-06-17 13:38:49 PDT
Thanks both for the reviewing :) I will update the patch based on your comments.

(In reply to comment #4)
> (From update of attachment 95002 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=95002&action=review
> 
> Need to at least fix the memcpy problem. But I think that also using a circular buffer instead of a memmove each time is the way to go. It may be tricky to get a circular buffer to work with functions such as U16_GET, though, so not sure it will be easy.
> 
> >> Source/WebCore/editing/TextIterator.cpp:2189
> >> +
> > 
> > same thing here. We can move identical code to the common place.
> 
> We don’t want to copy and paste this code, yes, so we need a version of the patch that does not involve a second copy.
> 
> >> Source/WebCore/editing/TextIterator.cpp:2236
> >> +            memcpy(m_context.data(), m_context.data() + 1, (m_context.size() - 1) * sizeof(UChar));
> > 
> > memcpy to overlapped place may cause unexpected results?
> 
> That’s right. You can’t use memcpy for this; it has to be memmove instead.
> 
> And doing a memmove every time we add a character is really inefficient. I suggest considering a circular buffer instead.
Comment 7 Benjamin Poulain 2011-06-17 15:31:54 PDT
(In reply to comment #5)
> I agree that memmove is a better way, but I think memcpy is fine in this case since the source is after the destination,right?
> 
> e.g,
> str = "0123456789"
> memcpy(str+2, str+1, 3);
> returns: 0111356789
> 
> memcpy(str+1, str+2, 3);
> returns: 0234456789

This is invalid, and will fail on some platform, some libc or even depending on the alignment.
The source and destination of memcpy cannot overlap (see http://linuxcertif.com/man/3/memcpy/en/ ).
Comment 8 Yi Shen 2011-06-22 07:54:29 PDT
(In reply to comment #7)
> (In reply to comment #5)
> > I agree that memmove is a better way, but I think memcpy is fine in this case since the source is after the destination,right?
> > 
> > e.g,
> > str = "0123456789"
> > memcpy(str+2, str+1, 3);
> > returns: 0111356789
> > 
> > memcpy(str+1, str+2, 3);
> > returns: 0234456789
> 
> This is invalid, and will fail on some platform, some libc or even depending on the alignment.
> The source and destination of memcpy cannot overlap (see http://linuxcertif.com/man/3/memcpy/en/ ).

Agree, will fix it :)