WebKit Bugzilla
New
Browse
Search+
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
NEW
213460
Replace unnecessary up-conversion of Strings
https://bugs.webkit.org/show_bug.cgi?id=213460
Summary
Replace unnecessary up-conversion of Strings
Sam Weinig
Reported
2020-06-22 07:17:54 PDT
Replace unnecessary upconversion of Strings
Attachments
Patch
(103.96 KB, patch)
2020-06-22 07:20 PDT
,
Sam Weinig
no flags
Details
Formatted Diff
Diff
Patch
(105.29 KB, patch)
2020-06-22 08:02 PDT
,
Sam Weinig
no flags
Details
Formatted Diff
Diff
WIP
(102.95 KB, patch)
2020-06-22 08:56 PDT
,
Sam Weinig
no flags
Details
Formatted Diff
Diff
Patch
(212.44 KB, patch)
2020-06-23 18:38 PDT
,
Sam Weinig
no flags
Details
Formatted Diff
Diff
Patch
(214.54 KB, patch)
2020-06-23 19:56 PDT
,
Sam Weinig
no flags
Details
Formatted Diff
Diff
Patch
(217.28 KB, patch)
2020-06-24 09:38 PDT
,
Sam Weinig
no flags
Details
Formatted Diff
Diff
Patch
(217.28 KB, patch)
2020-06-24 11:38 PDT
,
Sam Weinig
no flags
Details
Formatted Diff
Diff
Patch
(216.39 KB, patch)
2020-06-24 12:09 PDT
,
Sam Weinig
no flags
Details
Formatted Diff
Diff
Patch
(216.41 KB, patch)
2020-06-24 13:56 PDT
,
Sam Weinig
no flags
Details
Formatted Diff
Diff
Show Obsolete
(9)
View All
Add attachment
proposed patch, testcase, etc.
Sam Weinig
Comment 1
2020-06-22 07:20:11 PDT
Comment hidden (obsolete)
Created
attachment 402468
[details]
Patch
Sam Weinig
Comment 2
2020-06-22 08:02:58 PDT
Comment hidden (obsolete)
Created
attachment 402473
[details]
Patch
Sam Weinig
Comment 3
2020-06-22 08:56:34 PDT
Created
attachment 402479
[details]
WIP
Sam Weinig
Comment 4
2020-06-22 09:19:52 PDT
This change introduces a new WTF helper function to access the character buffer of a String like type, regardless of whether it is 8bit or 16bit via a generic lambda (or any type of function overloaded for both LChar* and UChar*. The function, currently named WTF::getCharacters(...), looks like this: template<typename StringType, typename Function> decltype(auto) getCharacters(const StringType& string, Function&& functor) { if (string.is8Bit()) { auto* ptr = string.characters8(); return functor(ptr, ptr + string.length()); } auto* ptr = string.characters16(); return functor(ptr, ptr + string.length()); } and is used like this: { .... return WTF::getCharacters(string, [](auto charactersBegin, auto charactersEnd) { /* do work with characters here */ }); } This can replace code that currently looks like this: { if (string.is8Bit()) { /* do work on string.characters8() */ } else { /* do exactly the same work on string.characters16() */ }); } and more importantly, can help remove code that does unnecessary up conversion and looks like this: { auto characters = StringView(string).upconvertedCharacters(); const UChar* begin = characters; const UChar* end = begin + value.length(); /* do work on characters */ } --- A few problems: 1) I am not sold on the name WTF::getCharacters(...). Some alternatives I thought of are: - WTF::accessCharacters(...) - WTF::characters(...) I am sure there are more. 2) I am not sold on this being a global function. It could also make sense to make it a member of the string classes: - e.g. return string.getCharacters([] (...) { ... }); 3) It's a bit annoying that you don't have easy access to the character type in the lambda. Currently, the idiom I used to get it was: using CharacterType = std::remove_pointer_t<decltype(charactersBegin)>; 4) I am not sure passing the beginning and end pointers to the lambda is the best interface for all. Other interfaces might be: - Just the beginning pointer. Let the caller construct the end via begin + string.length(). - The beginning pointer and length. - A struct with the characters and length. This would have the benefit of somewhere to put the CharacterType without requiring a std::remove_pointer_t e.g.: return getCharacters(string, [] (auto characters) { using CharacterType = decltype(characters)::CharacterType; auto being = characters.begin(); auto end = characters.end(); auto length = characters.length(); ... });
Darin Adler
Comment 5
2020-06-22 09:29:10 PDT
I like the struct idea. It could be helpful outside the getCharacters function. It’s sort of a CharacterRange. Could even have begin/end functions so it can be used to iterate code units. It’s like a compile time character type version of StringView. Almost wish it had all the same operations as StringView. The verb to me seems sort of like “read” as opposed to “get”.
Darin Adler
Comment 6
2020-06-22 12:23:25 PDT
OK, on your 4 comments: 1) I like readCharacters for this, although tastes may differ. accessCharacters sounds read/write to me. 2) Global function is better for generic programming. I think we may want to implement this for NSString and CFStringRef for example; sadly not sure when we can efficiently make them use the 8-bit path; if we knew they were all ASCII and stored as 8-bit. 3) Not that annoying. 4) Like the struct better than the separate arguments.
Sam Weinig
Comment 7
2020-06-22 18:39:39 PDT
(In reply to Darin Adler from
comment #6
)
> OK, on your 4 comments: > > 1) I like readCharacters for this, although tastes may differ. > accessCharacters sounds read/write to me. > > 2) Global function is better for generic programming. I think we may want to > implement this for NSString and CFStringRef for example; sadly not sure when > we can efficiently make them use the 8-bit path; if we knew they were all > ASCII and stored as 8-bit. > > 3) Not that annoying. > > 4) Like the struct better than the separate arguments.
I prototyped using a relatively plain class on its own. Something like this: template<typename CharacterType> class StringRange { public: constexpr const CharacterType* begin() const; constexpr const CharacterType* end() const; constexpr unsigned length() const; ... rest of StingView API private: CharacterType* m_begin; CharacterType* m_end; }; And it is kind of annoying to use in practice for these parsing routines, because you end up just pulling out the begin and end pointers. For instance, from SVGStringList.h: bool parse(const String& data, UChar delimiter) { clearItems(); return WTF::readCharacters(data, [&](auto characters) { auto ptr = characters.begin(); while (ptr < characters.end()) { auto* start = ptr; while (ptr < end && *ptr != delimiter && !isSVGSpace(*ptr)) ptr++; if (ptr == start) break; m_items.append(String(start, ptr - start)); skipOptionalSVGSpacesOrDelimiter(ptr, end, delimiter); } return ptr == characters.end(); }); } It's not horrible, but it's also not that much of an improvement over just passing the two pointers. It's actuall one line longer. Another option would be a class that acted more stream like: template<typename CharacterType> class StringParsingBuffer { public: constexpr const CharacterType* currentPosition() const; constexpr const CharacterType* end() const; constexpr CharacterType current() const { return *m_currentPosition; } constexpr unsigned charactersRemaining() const; constexpr bool hasCharactersRemaining() const { return m_currentPosition < m_end; } constexpr bool atEnd() const { return m_currentPosition == m_end; } constexpr bool advancePosition() constexpr bool advancePositionBy(unsigned); ... rest of StingView API private: CharacterType* m_currentPosition; CharacterType* m_end; }; Again, with that same SVGStringList.h example: bool parse(const String& data, UChar delimiter) { clearItems(); return WTF::readCharacters(data, [&](auto characters) { while (characters.hasCharactersRemaining()) { auto start = characters.currentPosition(); while (characters.hasCharactersRemaining() && characters.current() != delimiter && !isSVGSpace(characters.current())) characters.advance(); /* or could add StringParsingBuffer::operator++ */ if (characters.currentPosition() == start) break; m_items.append(String(start, characters.currentPosition() - start)); skipOptionalSVGSpacesOrDelimiter(characters, delimiter); /* This imagines a world where skipOptionalSVGSpacesOrDelimiter was updated to take a StringParsingBuffer */ } return characters.atEnd(); }); } We would want to make that inner while loop a skipUntil() by making those helpers a little more generic, but you get the idea. I would want to make the line: m_items.append(String(start, characters.currentPosition() - start)); easier to read as well, but I haven't fully thought of a good way. Maybe just making a String() constructor that takes two pointers: m_items.append(String { start, characters.currentPosition() }); Whay are your thoughts on this type of data structure / approach?
Sam Weinig
Comment 8
2020-06-23 18:38:28 PDT
Comment hidden (obsolete)
Created
attachment 402614
[details]
Patch
Sam Weinig
Comment 9
2020-06-23 19:56:10 PDT
Comment hidden (obsolete)
Created
attachment 402619
[details]
Patch
Sam Weinig
Comment 10
2020-06-24 09:38:02 PDT
Comment hidden (obsolete)
Created
attachment 402658
[details]
Patch
Sam Weinig
Comment 11
2020-06-24 11:38:34 PDT
Created
attachment 402665
[details]
Patch
Darin Adler
Comment 12
2020-06-24 12:08:47 PDT
Not a direct answers at all, but with the approach of calling it StringParsingBuffer, I think we can do all sorts of tricks to try to make it easy to use: 1) maybe the whole struct kind of acts like a smart pointer that just knows where the end is? works with styles where we advance the pointer as we parse 2) overload various parsing functions to use it directly, one at a time? 3) pull out the pointers and drop the buffer entirely for functions that are really full of things only pointers can do? 4) add lifetime checking like what we have in StringView if you use the buffer directly, so it’s not easy to accidentally keep it; just making it nonmovable and noncopyable would be sufficient for the getCharacters style, but if we ever use it as a return value, then this could be helpful 5) be willing to do other clever/tricky things to make it integrate really well with code that thinks it’s dealing with pointers
Sam Weinig
Comment 13
2020-06-24 12:09:37 PDT
Comment hidden (obsolete)
Created
attachment 402670
[details]
Patch
Sam Weinig
Comment 14
2020-06-24 13:56:58 PDT
Created
attachment 402682
[details]
Patch
Darin Adler
Comment 15
2020-06-24 14:07:05 PDT
Comment on
attachment 402682
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=402682&action=review
> Source/WebCore/loader/appcache/ManifestParser.cpp:76 > + return WTF::readCharactersForParsing(manifestAfterSignature, [&](auto parsingBuffer) {
The WTF:: should not be needed here since manifestAfterSignature is a WTF::StringView.
> Source/WebCore/page/csp/ContentSecurityPolicySourceList.cpp:87 > + skipWhile<CharacterType, isASCIISpace>(parsingBuffer);
We don’t need to pass CharacterType here, since it’s baked into the type of parsingBuffer. Should rearrange things so that can happen.
Sam Weinig
Comment 16
2020-06-24 15:00:12 PDT
(In reply to Darin Adler from
comment #15
)
> Comment on
attachment 402682
[details]
> Patch > > View in context: >
https://bugs.webkit.org/attachment.cgi?id=402682&action=review
> > > Source/WebCore/loader/appcache/ManifestParser.cpp:76 > > + return WTF::readCharactersForParsing(manifestAfterSignature, [&](auto parsingBuffer) { > > The WTF:: should not be needed here since manifestAfterSignature is a > WTF::StringView.
Nice. Will fix.
> > > Source/WebCore/page/csp/ContentSecurityPolicySourceList.cpp:87 > > + skipWhile<CharacterType, isASCIISpace>(parsingBuffer); > > We don’t need to pass CharacterType here, since it’s baked into the type of > parsingBuffer. Should rearrange things so that can happen.
Yeah, this is an artifact of passing the predicate, which is a template function itself, as a template parameter. I think if we switched this to take the predicate as a normal argument, and trusted that inlining will do its thing, we would have a nicer interface. e.g. replace: skipWhile<CharacterType, isASCIISpace>(parsingBuffer); with: skipWhile(parsingBuffer, isASCIISpace); This would have the added benefit of allowing one-off predicates: skipWhile(parsingBuffer, [](auto character) { return character != 's' || character != 'v' }); I think the patch has gotten too big, so I am going to split it up into more reviewable pieces.
Darin Adler
Comment 17
2020-06-24 17:09:36 PDT
(In reply to Sam Weinig from
comment #16
)
> Yeah, this is an artifact of passing the predicate, which is a template > function itself, as a template parameter. I think if we switched this to > take the predicate as a normal argument, and trusted that inlining will do > its thing, we would have a nicer interface.
I was sort of going the opposite direction. I was suggesting that we pass predicates as template parameters because some kinds of optimization might require that and I figured we can always "convert" from a template parameter to a function parameter, but not vice versa. I had a few places where I could choose, and I chose template parameters over function parameters. But I guess what you’re saying is that we can’t get the type for the predicate template parameter right without a character type template parameter. Maybe this turned the tide on that decision?
Sam Weinig
Comment 18
2020-06-30 12:03:28 PDT
(In reply to Darin Adler from
comment #17
)
> (In reply to Sam Weinig from
comment #16
) > > Yeah, this is an artifact of passing the predicate, which is a template > > function itself, as a template parameter. I think if we switched this to > > take the predicate as a normal argument, and trusted that inlining will do > > its thing, we would have a nicer interface. > > I was sort of going the opposite direction. I was suggesting that we pass > predicates as template parameters because some kinds of optimization might > require that and I figured we can always "convert" from a template parameter > to a function parameter, but not vice versa. I had a few places where I > could choose, and I chose template parameters over function parameters. > > But I guess what you’re saying is that we can’t get the type for the > predicate template parameter right without a character type template > parameter. > > Maybe this turned the tide on that decision?
Hm, I bet we could use function objects (struct with operator()(Character)), but I also think a function parameter is the way to go.
Sam Weinig
Comment 19
2020-06-30 12:57:20 PDT
Remaining uses of StringView::upconvertedCharacters() / StringView::getCharactersWithUpconvert() in WebCore by file: DataDetection.m YES. This seems addressable. Editor.cpp Needed for TelephoneNumberDetector. Probably not removable at the moment as the SPI requires a UChar*, but we should move the up-convert to TelephoneNumberDetectorCocoa.mm. TextIterator.cpp This is the SearchBuffer constructor. Needs more investigation, as the characters are passed to usearch_setPattern which seems to want the pattern as UChar*, but there might be other ways to set the pattern as well. HTMLTreeBuilder.cpp Same as Editor.cpp, Needed for TelephoneNumberDetector. Probably not removable at the moment as the SPI requires a UChar*, but we should move the up-convert to TelephoneNumberDetectorCocoa.mm. MediaPlayerPrivateAVFoundationObjC.mm Seems to be required. Passes the utf-16 string to the user as Uint16Array. StringTruncator.cpp YES. This seems addressable. At least for most cases. When an ellipse is inserted, up-conversion is required. CanvasRenderingContext2D.cpp YES. This seems addressable. WHLSLMangledNames.h Nothing to do here. This up-conversion only happens when writing into a StringBuilder (or makeString) that is using a 16 bit buffer, and no extra allocations occur. ThreadableWebSocketChannelClientWrapper.cpp YES. This seems addressable. SQLiteStatement.cpp Unclear. Needs more investigation. Used by SQLiteStatement::bindBlob(int, const String&) and SQLiteStatement::bindText(int, const String&); TextCodecICU.cpp This is likely needed, passed to ucnv_fromUnicode, though there may be ucnv API that works with latin input. Length.cpp YES. This seems addressable. SharedStringHash.cpp Probably nothing to do here. Only doing up-conversion in the case that the base url is 16-bit. It seems plausible the computation of the hash could be done without the intermediate buffer, in which case, it would probably be possible to remove the up-conversion. There are definitely some optimization opportunity even without that, such as delaying the up-conversion, but since most URLs are likely to be 8-bit still, seems unnecessary at this time. SVGTextLayoutEngine.cpp YES. This seems addressable. XMLDocumentParserLibxml2.cpp This already has the following fixme: // FIXME: Can we parse 8-bit strings directly as Latin-1 instead of upconverting to UTF-16? I guess we should answer that question. XSLStyleSheetLibxslt.cpp Potentially addressable. Need to find out if xmlCtxtReadMemory support latin1 as an encoding. Generally, it seems like libxml/libxslt are latin1 aware, so just needs a bit of digging.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug