Summary: | Add word-prefix search options to the text search without ICU | ||||||
---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Benjamin Poulain <benjamin> | ||||
Component: | Text | Assignee: | Nobody <webkit-unassigned> | ||||
Status: | NEW --- | ||||||
Severity: | Normal | CC: | cshu, max.hong.shen, tonikitoo | ||||
Priority: | P2 | Keywords: | Qt, QtTriaged | ||||
Version: | 528+ (Nightly build) | ||||||
Hardware: | All | ||||||
OS: | All | ||||||
Bug Depends on: | 50038, 60786 | ||||||
Bug Blocks: | |||||||
Attachments: |
|
Description
Benjamin Poulain
2010-11-30 10:24:25 PST
Created attachment 95002 [details]
word-prefix search for no-ICU code path (v1.0)
For Qt, this patch makes the LayoutTests/editing/text-iterator/findString.html pass all the tests except the one for the Thai words.
var thaiWords = [
"\u0e01\u0e23",
"\u0e1b\u0e39\u0e40\u0e25",
"\u0e01\u0e0a",
"\u0e01\u0e0a\u0e01\u0e23", // thaiWords[2] + thaiWords[0]
"\u0e01\u0e23\u0e01\u0e0a", // thaiWords[0] + thaiWords[2]
"\u0e1a\u0e07\u0e01\u0e0a", // ends with thaiWords[2]
];
testFindString(thaiWords.join(""), thaiWords[0], ["AtWordStarts"], [[0, 2], [12, 14], []]); // Failed here
After traced down the problem in SearchBuffer::isWordStartMatch() function, I found the issue is that findNextWordFromIndex() doesn't return a correct word position for Thai words, which is implemented in TextBoundariesQt.cpp by using QTextBoundaryFinder. In other word, this is a Qt issue and I will fire a Qt bug for it.
Comment on attachment 95002 [details] word-prefix search for no-ICU code path (v1.0) View in context: https://bugs.webkit.org/attachment.cgi?id=95002&action=review I understand you didn't unskip findString.html because of the Thai char issue. But it will be good to show some tests improvement after your fix. Maybe we can split findString.html and put Thai tests in a separate test? Overall, it's a good patch. Thanks. > Source/WebCore/editing/TextIterator.cpp:106 > + bool isWordStartMatch(size_t, size_t length) const; What do you think we move all the common apis to the common place? > Source/WebCore/editing/TextIterator.cpp:2189 > + same thing here. We can move identical code to the common place. > Source/WebCore/editing/TextIterator.cpp:2236 > + memcpy(m_context.data(), m_context.data() + 1, (m_context.size() - 1) * sizeof(UChar)); memcpy to overlapped place may cause unexpected results? > > Source/WebCore/editing/TextIterator.cpp:2236
> > + memcpy(m_context.data(), m_context.data() + 1, (m_context.size() - 1) * sizeof(UChar));
>
> memcpy to overlapped place may cause unexpected results?
I think memmove is a better way to do it.
Comment on attachment 95002 [details] word-prefix search for no-ICU code path (v1.0) View in context: https://bugs.webkit.org/attachment.cgi?id=95002&action=review Need to at least fix the memcpy problem. But I think that also using a circular buffer instead of a memmove each time is the way to go. It may be tricky to get a circular buffer to work with functions such as U16_GET, though, so not sure it will be easy. >> Source/WebCore/editing/TextIterator.cpp:2189 >> + > > same thing here. We can move identical code to the common place. We don’t want to copy and paste this code, yes, so we need a version of the patch that does not involve a second copy. >> Source/WebCore/editing/TextIterator.cpp:2236 >> + memcpy(m_context.data(), m_context.data() + 1, (m_context.size() - 1) * sizeof(UChar)); > > memcpy to overlapped place may cause unexpected results? That’s right. You can’t use memcpy for this; it has to be memmove instead. And doing a memmove every time we add a character is really inefficient. I suggest considering a circular buffer instead. I agree that memmove is a better way, but I think memcpy is fine in this case since the source is after the destination,right? e.g, str = "0123456789" memcpy(str+2, str+1, 3); returns: 0111356789 memcpy(str+1, str+2, 3); returns: 0234456789 Anyway, I will change it. (In reply to comment #3) > > > Source/WebCore/editing/TextIterator.cpp:2236 > > > + memcpy(m_context.data(), m_context.data() + 1, (m_context.size() - 1) * sizeof(UChar)); > > > > memcpy to overlapped place may cause unexpected results? > > I think memmove is a better way to do it. Thanks both for the reviewing :) I will update the patch based on your comments. (In reply to comment #4) > (From update of attachment 95002 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=95002&action=review > > Need to at least fix the memcpy problem. But I think that also using a circular buffer instead of a memmove each time is the way to go. It may be tricky to get a circular buffer to work with functions such as U16_GET, though, so not sure it will be easy. > > >> Source/WebCore/editing/TextIterator.cpp:2189 > >> + > > > > same thing here. We can move identical code to the common place. > > We don’t want to copy and paste this code, yes, so we need a version of the patch that does not involve a second copy. > > >> Source/WebCore/editing/TextIterator.cpp:2236 > >> + memcpy(m_context.data(), m_context.data() + 1, (m_context.size() - 1) * sizeof(UChar)); > > > > memcpy to overlapped place may cause unexpected results? > > That’s right. You can’t use memcpy for this; it has to be memmove instead. > > And doing a memmove every time we add a character is really inefficient. I suggest considering a circular buffer instead. (In reply to comment #5) > I agree that memmove is a better way, but I think memcpy is fine in this case since the source is after the destination,right? > > e.g, > str = "0123456789" > memcpy(str+2, str+1, 3); > returns: 0111356789 > > memcpy(str+1, str+2, 3); > returns: 0234456789 This is invalid, and will fail on some platform, some libc or even depending on the alignment. The source and destination of memcpy cannot overlap (see http://linuxcertif.com/man/3/memcpy/en/ ). (In reply to comment #7) > (In reply to comment #5) > > I agree that memmove is a better way, but I think memcpy is fine in this case since the source is after the destination,right? > > > > e.g, > > str = "0123456789" > > memcpy(str+2, str+1, 3); > > returns: 0111356789 > > > > memcpy(str+1, str+2, 3); > > returns: 0234456789 > > This is invalid, and will fail on some platform, some libc or even depending on the alignment. > The source and destination of memcpy cannot overlap (see http://linuxcertif.com/man/3/memcpy/en/ ). Agree, will fix it :) |