RESOLVED FIXED Bug 50038
Add word-prefix search options to the text search API
https://bugs.webkit.org/show_bug.cgi?id=50038
Summary Add word-prefix search options to the text search API
mitz
Reported 2010-11-24 13:49:14 PST
Attachments
WebCore (ICU code path only) implementation, WebKit/mac WebKit2 API, DumpRenderTree/mac test support and tests (104.25 KB, patch)
2010-11-24 13:55 PST, mitz
darin: review+
WebCore (ICU code path only) implementation, WebKit/mac WebKit2 API, DumpRenderTree/mac test support and tests (109.33 KB, patch)
2010-11-29 10:46 PST, mitz
darin: review-
https://bugs.webkit.org/attachment.cgi?bugid=50038&action=enter (120.65 KB, patch)
2010-11-29 16:22 PST, mitz
darin: review+
mitz
Comment 1 2010-11-24 13:55:12 PST
Created attachment 74790 [details] WebCore (ICU code path only) implementation, WebKit/mac WebKit2 API, DumpRenderTree/mac test support and tests
Early Warning System Bot
Comment 2 2010-11-24 14:44:30 PST
WebKit Review Bot
Comment 3 2010-11-24 15:37:09 PST
Attachment 74790 [details] did not pass style-queue: Failed to run "['WebKitTools/Scripts/update-webkit']" exit_code: 2 Updating OpenSource RA layer request failed: OPTIONS of 'http://svn.webkit.org/repository/webkit': timed out waiting for server (http://svn.webkit.org) at /usr/lib/git-core/git-svn line 2295 Died at WebKitTools/Scripts/update-webkit line 129. If any of these errors are false positives, please file a bug against check-webkit-style.
Build Bot
Comment 4 2010-11-24 16:14:46 PST
Darin Adler
Comment 5 2010-11-27 18:00:57 PST
Comment on attachment 74790 [details] WebCore (ICU code path only) implementation, WebKit/mac WebKit2 API, DumpRenderTree/mac test support and tests View in context: https://bugs.webkit.org/attachment.cgi?id=74790&action=review r=me, assuming you get this to build on all platforms before landing it > WebCore/editing/Editor.h:333 > + bool findString(const String&, FindOptions); > bool findString(const String&, bool forward, bool caseFlag, bool wrapFlag, bool startInSelection); Longer term it might be nice to return the version with four bool arguments. Maybe even put a FIXME reminding us that we want to do that? > WebCore/editing/FindOptions.h:34 > + TreatMedialCapAsWordStart = 1 << 2, Might want a comment nearby explaining what medial capitals are. Why the abbreviation “cap” for capital? > WebCore/editing/TextIterator.cpp:1841 > + static const char latin1SeparatorTable[256] = { Is char the optimal type here? If bool is not good on some platforms, perhaps unsigned char is slightly better than char? > WebCore/editing/TextIterator.cpp:1863 > + return U_GET_GC_MASK(character) & (U_GC_S_MASK | U_GC_P_MASK | U_GC_Z_MASK | U_GC_CF_MASK); Should we make this part non-inline? > WebCore/editing/TextIterator.cpp:1923 > + if (m_prefixLength) > + m_prefixLength -= min(m_prefixLength, m_buffer.size() - m_overlap); I don’t think you need the if here. The min already has a suitable branch. Unless the if is a performance optimization. > WebCore/editing/TextIterator.cpp:2044 > + U16_FWD_1(m_buffer.data(), offset, size); I suspect the non-ICU-using platforms don’t have this macro. You may need to beef up the header they use to get some of the ICU macros. > WebCore/editing/TextIterator.cpp:2062 > + size_t wordBreakSearchStart = start + length; > + while (wordBreakSearchStart > start) > + wordBreakSearchStart = findNextWordFromIndex(m_buffer.data(), m_buffer.size(), wordBreakSearchStart, false /* backwards */); > + return wordBreakSearchStart == start; Is this sufficiently efficient when called repeatedly? > WebCore/editing/TextIterator.cpp:2109 > + if (m_prefixLength) > + m_prefixLength -= min(m_prefixLength, size - overlap); Again, I suspect the if is not needed. > WebCore/editing/TextIterator.cpp:2128 > + if (m_prefixLength) > + m_prefixLength -= min<size_t>(m_prefixLength, matchStart + 1); Yet again, the if is not necessary. > WebCore/editing/TextIterator.cpp:2469 > + int wordBoundaryContextStart = backwardsIterator.length(); > + U16_BACK_1(backwardsIterator.characters(), 0, wordBoundaryContextStart); > + wordBoundaryContextStart = startOfLastWordBoundaryContext(backwardsIterator.characters(), wordBoundaryContextStart); > + int contextLength = backwardsIterator.length() - wordBoundaryContextStart; Is int really the right type here? > WebCore/editing/TextIterator.cpp:2515 > - { > - CharacterIterator findIterator(range, TextIteratorEntersTextControls); > - matchLength = findPlainText(findIterator, target, forward, caseSensitive, matchStart); > - if (!matchLength) > - return collapsedToBoundary(range, forward); > - } > + > + CharacterIterator findIterator(range, TextIteratorEntersTextControls); > + matchLength = findPlainText(findIterator, target, options, matchStart); > + if (!matchLength) > + return collapsedToBoundary(range, !(options & Backwards)); You dropped the braces that make the character iterator fall out of scope early. Why? Was that not a helpful optimization? Another way to accomplish the same thing is to use a temporary for the character iterator instead of a named local. > WebCore/editing/TextIterator.h:62 > String plainText(const Range*, TextIteratorBehavior defaultBehavior = TextIteratorDefaultBehavior); > UChar* plainTextToMallocAllocatedBuffer(const Range*, unsigned& bufferLength, bool isDisplayString, TextIteratorBehavior defaultBehavior = TextIteratorDefaultBehavior); We should remove that misleading argument name. > WebCore/editing/TextIterator.h:64 > +PassRefPtr<Range> findPlainText(const Range*, const String&, FindOptions); > PassRefPtr<Range> findPlainText(const Range*, const String&, bool forward, bool caseSensitive); As above, should we consider deprecating and then removing the bool version? > WebCore/page/Page.h:205 > + bool findString(const String&, FindOptions); > bool findString(const String&, TextCaseSensitivity, FindDirection, bool shouldWrap); > + unsigned int markAllMatchesForText(const String&, FindOptions, bool shouldHighlight, unsigned); > unsigned int markAllMatchesForText(const String&, TextCaseSensitivity, bool shouldHighlight, unsigned); Should we deprecate the TextCaseSensitivity versions? > LayoutTests/editing/text-iterator/findString.html:12 > + log("Searching for \u2018" + target + "\u2019 " + (text.length <= 64 ? "in \u2018" + text + "\u2019 " : "in long string ") + "with options [" + options.join(", ") + "]:"); Why not use straight quotes instead of curly?
mitz
Comment 6 2010-11-29 10:46:14 PST
Created attachment 75040 [details] WebCore (ICU code path only) implementation, WebKit/mac WebKit2 API, DumpRenderTree/mac test support and tests (In reply to comment #5) > (From update of attachment 74790 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=74790&action=review > > r=me, assuming you get this to build on all platforms before landing it Thanks for the review. This version includes more build system changes. > > > WebCore/editing/Editor.h:333 > > + bool findString(const String&, FindOptions); > > bool findString(const String&, bool forward, bool caseFlag, bool wrapFlag, bool startInSelection); > > Longer term it might be nice to return the version with four bool arguments. Maybe even put a FIXME reminding us that we want to do that? Done. > > > WebCore/editing/FindOptions.h:34 > > + TreatMedialCapAsWordStart = 1 << 2, > > Might want a comment nearby explaining what medial capitals are. Why the abbreviation “cap” for capital? Added a comment. Changed to “capital”. > > > WebCore/editing/TextIterator.cpp:1841 > > + static const char latin1SeparatorTable[256] = { > > Is char the optimal type here? If bool is not good on some platforms, perhaps unsigned char is slightly better than char? Changed to bool but left the values as 0s and 1s. > > > WebCore/editing/TextIterator.cpp:1863 > > + return U_GET_GC_MASK(character) & (U_GC_S_MASK | U_GC_P_MASK | U_GC_Z_MASK | U_GC_CF_MASK); > > Should we make this part non-inline? Done, although the name might not be ideal. > > > WebCore/editing/TextIterator.cpp:1923 > > + if (m_prefixLength) > > + m_prefixLength -= min(m_prefixLength, m_buffer.size() - m_overlap); > > I don’t think you need the if here. The min already has a suitable branch. Unless the if is a performance optimization. Removed the if for now. > > > WebCore/editing/TextIterator.cpp:2044 > > + U16_FWD_1(m_buffer.data(), offset, size); > > I suspect the non-ICU-using platforms don’t have this macro. You may need to beef up the header they use to get some of the ICU macros. Done. > > > WebCore/editing/TextIterator.cpp:2062 > > + size_t wordBreakSearchStart = start + length; > > + while (wordBreakSearchStart > start) > > + wordBreakSearchStart = findNextWordFromIndex(m_buffer.data(), m_buffer.size(), wordBreakSearchStart, false /* backwards */); > > + return wordBreakSearchStart == start; > > Is this sufficiently efficient when called repeatedly? Don’t know. > > > WebCore/editing/TextIterator.cpp:2109 > > + if (m_prefixLength) > > + m_prefixLength -= min(m_prefixLength, size - overlap); > > Again, I suspect the if is not needed. Removed the if. > > > WebCore/editing/TextIterator.cpp:2128 > > + if (m_prefixLength) > > + m_prefixLength -= min<size_t>(m_prefixLength, matchStart + 1); > > Yet again, the if is not necessary. Removed it. > > > WebCore/editing/TextIterator.cpp:2469 > > + int wordBoundaryContextStart = backwardsIterator.length(); > > + U16_BACK_1(backwardsIterator.characters(), 0, wordBoundaryContextStart); > > + wordBoundaryContextStart = startOfLastWordBoundaryContext(backwardsIterator.characters(), wordBoundaryContextStart); > > + int contextLength = backwardsIterator.length() - wordBoundaryContextStart; > > Is int really the right type here? I deferred to startOfLastWordBoundary and SimplifiedBackwardsTextIterator::length(). > > > WebCore/editing/TextIterator.cpp:2515 > > - { > > - CharacterIterator findIterator(range, TextIteratorEntersTextControls); > > - matchLength = findPlainText(findIterator, target, forward, caseSensitive, matchStart); > > - if (!matchLength) > > - return collapsedToBoundary(range, forward); > > - } > > + > > + CharacterIterator findIterator(range, TextIteratorEntersTextControls); > > + matchLength = findPlainText(findIterator, target, options, matchStart); > > + if (!matchLength) > > + return collapsedToBoundary(range, !(options & Backwards)); > > You dropped the braces that make the character iterator fall out of scope early. Why? Was that not a helpful optimization? I thought I was cleaning up and didn’t realize the performance implication. Reinstated the braces. > > Another way to accomplish the same thing is to use a temporary for the character iterator instead of a named local. > > > WebCore/editing/TextIterator.h:62 > > String plainText(const Range*, TextIteratorBehavior defaultBehavior = TextIteratorDefaultBehavior); > > UChar* plainTextToMallocAllocatedBuffer(const Range*, unsigned& bufferLength, bool isDisplayString, TextIteratorBehavior defaultBehavior = TextIteratorDefaultBehavior); > > We should remove that misleading argument name. Done. > > > WebCore/editing/TextIterator.h:64 > > +PassRefPtr<Range> findPlainText(const Range*, const String&, FindOptions); > > PassRefPtr<Range> findPlainText(const Range*, const String&, bool forward, bool caseSensitive); > > As above, should we consider deprecating and then removing the bool version? Yes. Added comments to that effect. > > > WebCore/page/Page.h:205 > > + bool findString(const String&, FindOptions); > > bool findString(const String&, TextCaseSensitivity, FindDirection, bool shouldWrap); > > + unsigned int markAllMatchesForText(const String&, FindOptions, bool shouldHighlight, unsigned); > > unsigned int markAllMatchesForText(const String&, TextCaseSensitivity, bool shouldHighlight, unsigned); > > Should we deprecate the TextCaseSensitivity versions? Ditto. > > > LayoutTests/editing/text-iterator/findString.html:12 > > + log("Searching for \u2018" + target + "\u2019 " + (text.length <= 64 ? "in \u2018" + text + "\u2019 " : "in long string ") + "with options [" + options.join(", ") + "]:"); > > Why not use straight quotes instead of curly? Tough question.
Early Warning System Bot
Comment 7 2010-11-29 11:09:54 PST
mitz
Comment 8 2010-11-29 11:22:10 PST
(In reply to comment #7) > Attachment 75040 [details] did not build on qt: > Build output: http://queues.webkit.org/results/6393100 Any idea why this is happening, given the change to WebKit2’s FindOptions.h?
Darin Adler
Comment 9 2010-11-29 12:32:16 PST
Comment on attachment 75040 [details] WebCore (ICU code path only) implementation, WebKit/mac WebKit2 API, DumpRenderTree/mac test support and tests View in context: https://bugs.webkit.org/attachment.cgi?id=75040&action=review review- just because of those buffer under/overrun questions. I’m sure you also will get the Qt build working. I fear that your guess that it’s the FindOptions.h header may well be correct. > WebCore/editing/TextIterator.cpp:2026 > +inline bool SearchBuffer::isWordStartMatch(size_t start, size_t length) const This function might read more clearly decomposed a bit more. The names of the smaller functions could help make it clearer why it’s correct to do both of these checks (actual word starts are always word starts regardless of the medial capital flag). It’s OK as is, though. > WebCore/editing/TextIterator.cpp:2034 > + U16_GET(m_buffer.data(), 0, offset, size, firstCharacter); I assume that offset is guaranteed to be < size, so there is no problem here, but I did have to look twice at this to be sure. > WebCore/editing/TextIterator.cpp:2036 > + U16_PREV(m_buffer.data(), 0, offset, previousCharacter); If offset is 0, this will read past the beginning of the buffer, which is not good. > WebCore/editing/TextIterator.cpp:2051 > + U16_FWD_1(m_buffer.data(), offset, size); > + U16_GET(m_buffer.data(), 0, offset, size, nextCharacter); What guarantees this won’t read off the end of the buffer? > WebCore/editing/TextIterator.cpp:2109 > + int wordBoundaryContextStart = matchStart; > + U16_BACK_1(m_buffer.data(), 0, wordBoundaryContextStart); Is matchStart guaranteed to be > 0? If not, then U16_BACK_1 will read off the start of the buffer. > WebCore/editing/TextIterator.cpp:2124 > + if (isBadMatch(m_buffer.data() + matchStart, matchedLength) > + || m_options & AtWordStarts && !isWordStartMatch(matchStart, matchedLength)) { I would like this better all on one line. > WebCore/editing/TextIterator.cpp:2478 > + if (options & AtWordStarts) { > + RefPtr<Range> startRange = it.range(); > + RefPtr<Range> beforeStartRange = startRange->ownerDocument()->createRange(); > + ExceptionCode ec = 0; > + beforeStartRange->setEnd(startRange->startContainer(), startRange->startOffset(), ec); > + SimplifiedBackwardsTextIterator backwardsIterator(beforeStartRange.get(), TextIteratorDefaultBehavior); > + while (!backwardsIterator.atEnd()) { > + int wordBoundaryContextStart = backwardsIterator.length(); > + U16_BACK_1(backwardsIterator.characters(), 0, wordBoundaryContextStart); > + wordBoundaryContextStart = startOfLastWordBoundaryContext(backwardsIterator.characters(), wordBoundaryContextStart); > + int contextLength = backwardsIterator.length() - wordBoundaryContextStart; > + if (!buffer.prependContext(backwardsIterator.characters() + wordBoundaryContextStart, contextLength) || wordBoundaryContextStart) > + break; > + backwardsIterator.advance(); > + } > + } You could consider putting the entire body of this if statement into a separate function. It needs the iterator, the search buffer, and the options. Giving it a name might help document what it does. You could use a for loop, although it’s always annoying to do that when the initialization part of the for is long. It would be nice to have the call to the advance function explicitly part of the loop structure. The U16_BACK_1 call may need a check to see if wordBoundaryContextStart is 0. > WebCore/page/Page.cpp:522 > + if (frame->editor()->findString(target, options & ~WrapAround | StartInSelection)) { Some newer versions of gcc may warn about the mix of & and | without parentheses, correctly relying on operator precedence but confusing to some programmers. I suspect that could happening in the Qt build, or it’s possible it only applies to the && and || operators.
Andras Becsi
Comment 10 2010-11-29 14:49:43 PST
Comment on attachment 75040 [details] WebCore (ICU code path only) implementation, WebKit/mac WebKit2 API, DumpRenderTree/mac test support and tests View in context: https://bugs.webkit.org/attachment.cgi?id=75040&action=review Previously Qt had issues with clashing file names in WebCore and WebKit2, the build problem here seems to be caused by this, too. After renaming the WebCore/editing/FindOptions.h file to FindOptionFlag.h the Qt build had the following issues: > WebCore/ChangeLog:49 > + * platform/text/TextBoundaries.cpp: This file is not listed in WebCore.pro and Qt seems to lack the isAlphanumeric function, too, which is used in this source. Qt does not use ICU, AFAIK. > WebCore/ChangeLog:52 > + * platform/text/TextBoundaries.h: Ditto. > WebCore/editing/TextIterator.cpp:1941 > +size_t SearchBuffer::prependContext(const UChar* characters, size_t length) This is inside a long #if USE(ICU_UNICODE) && !UCONFIG_NO_COLLATION guard which seems to be false on Qt thus the linking fails with undefined reference. > WebCore/editing/TextIterator.cpp:2142 > -inline SearchBuffer::SearchBuffer(const String& target, bool isCaseSensitive) > +inline SearchBuffer::SearchBuffer(const String& target, FindOptions options) > : m_target(isCaseSensitive ? target : target.foldCase()) A stray isCaseSensitive parameter was left here. >> WebCore/page/Page.cpp:522 >> + if (frame->editor()->findString(target, options & ~WrapAround | StartInSelection)) { > > Some newer versions of gcc may warn about the mix of & and | without parentheses, correctly relying on operator precedence but confusing to some programmers. I suspect that could happening in the Qt build, or it’s possible it only applies to the && and || operators. The rule also applies to bitwise operators.
mitz
Comment 11 2010-11-29 16:22:34 PST
Created attachment 75080 [details] https://bugs.webkit.org/attachment.cgi?bugid=50038&action=enter Addressed Darin’s comments and made some changes needed to make this build on non-ICU platforms.
Darin Adler
Comment 12 2010-11-29 17:36:18 PST
Comment on attachment 75080 [details] https://bugs.webkit.org/attachment.cgi?bugid=50038&action=enter View in context: https://bugs.webkit.org/attachment.cgi?id=75080&action=review > WebCore/page/Page.h:205 > + unsigned int markAllMatchesForText(const String&, FindOptions, bool shouldHighlight, unsigned); Instead of “unsigned int” please do “unsigned”.
mitz
Comment 13 2010-11-29 22:41:12 PST
WebKit Review Bot
Comment 14 2010-11-29 23:28:50 PST
http://trac.webkit.org/changeset/72887 might have broken Qt Windows 32-bit Debug
Note You need to log in before you can comment on or make changes to this bug.