213460 – Replace unnecessary up-conversion of Strings

NEW213460

Replace unnecessary up-conversion of Strings

https://bugs.webkit.org/show_bug.cgi?id=213460

Summary Replace unnecessary up-conversion of Strings

Sam Weinig

Reported 2020-06-22 07:17:54 PDT

Replace unnecessary upconversion of Strings

Attachments
Patch (103.96 KB, patch) 2020-06-22 07:20 PDT, Sam Weinig	no flags	Details Formatted Diff Diff
Patch (105.29 KB, patch) 2020-06-22 08:02 PDT, Sam Weinig	no flags	Details Formatted Diff Diff
WIP (102.95 KB, patch) 2020-06-22 08:56 PDT, Sam Weinig	no flags	Details Formatted Diff Diff
Patch (212.44 KB, patch) 2020-06-23 18:38 PDT, Sam Weinig	no flags	Details Formatted Diff Diff
Patch (214.54 KB, patch) 2020-06-23 19:56 PDT, Sam Weinig	no flags	Details Formatted Diff Diff
Patch (217.28 KB, patch) 2020-06-24 09:38 PDT, Sam Weinig	no flags	Details Formatted Diff Diff
Patch (217.28 KB, patch) 2020-06-24 11:38 PDT, Sam Weinig	no flags	Details Formatted Diff Diff
Patch (216.39 KB, patch) 2020-06-24 12:09 PDT, Sam Weinig	no flags	Details Formatted Diff Diff
Patch (216.41 KB, patch) 2020-06-24 13:56 PDT, Sam Weinig	no flags	Details Formatted Diff Diff
Show Obsolete (9) View All Add attachment proposed patch, testcase, etc.

Sam Weinig

Comment 1 2020-06-22 07:20:11 PDT Comment hidden (obsolete)

Created attachment 402468 [details] Patch

Sam Weinig

Comment 2 2020-06-22 08:02:58 PDT Comment hidden (obsolete)

Created attachment 402473 [details] Patch

Sam Weinig

Comment 3 2020-06-22 08:56:34 PDT

Created attachment 402479 [details] WIP

Sam Weinig

Comment 4 2020-06-22 09:19:52 PDT

This change introduces a new WTF helper function to access the character buffer of a String like type, regardless of whether it is 8bit or 16bit via a generic lambda (or any type of function overloaded for both LChar* and UChar*. The function, currently named WTF::getCharacters(...), looks like this: template<typename StringType, typename Function> decltype(auto) getCharacters(const StringType& string, Function&& functor) { if (string.is8Bit()) { auto* ptr = string.characters8(); return functor(ptr, ptr + string.length()); } auto* ptr = string.characters16(); return functor(ptr, ptr + string.length()); } and is used like this: { .... return WTF::getCharacters(string, [](auto charactersBegin, auto charactersEnd) { /* do work with characters here */ }); } This can replace code that currently looks like this: { if (string.is8Bit()) { /* do work on string.characters8() */ } else { /* do exactly the same work on string.characters16() */ }); } and more importantly, can help remove code that does unnecessary up conversion and looks like this: { auto characters = StringView(string).upconvertedCharacters(); const UChar* begin = characters; const UChar* end = begin + value.length(); /* do work on characters */ } --- A few problems: 1) I am not sold on the name WTF::getCharacters(...). Some alternatives I thought of are: - WTF::accessCharacters(...) - WTF::characters(...) I am sure there are more. 2) I am not sold on this being a global function. It could also make sense to make it a member of the string classes: - e.g. return string.getCharacters([] (...) { ... }); 3) It's a bit annoying that you don't have easy access to the character type in the lambda. Currently, the idiom I used to get it was: using CharacterType = std::remove_pointer_t<decltype(charactersBegin)>; 4) I am not sure passing the beginning and end pointers to the lambda is the best interface for all. Other interfaces might be: - Just the beginning pointer. Let the caller construct the end via begin + string.length(). - The beginning pointer and length. - A struct with the characters and length. This would have the benefit of somewhere to put the CharacterType without requiring a std::remove_pointer_t e.g.: return getCharacters(string, [] (auto characters) { using CharacterType = decltype(characters)::CharacterType; auto being = characters.begin(); auto end = characters.end(); auto length = characters.length(); ... });

Darin Adler

Comment 5 2020-06-22 09:29:10 PDT

I like the struct idea. It could be helpful outside the getCharacters function. It’s sort of a CharacterRange. Could even have begin/end functions so it can be used to iterate code units. It’s like a compile time character type version of StringView. Almost wish it had all the same operations as StringView. The verb to me seems sort of like “read” as opposed to “get”.

Darin Adler

Comment 6 2020-06-22 12:23:25 PDT

OK, on your 4 comments: 1) I like readCharacters for this, although tastes may differ. accessCharacters sounds read/write to me. 2) Global function is better for generic programming. I think we may want to implement this for NSString and CFStringRef for example; sadly not sure when we can efficiently make them use the 8-bit path; if we knew they were all ASCII and stored as 8-bit. 3) Not that annoying. 4) Like the struct better than the separate arguments.

Sam Weinig

Comment 7 2020-06-22 18:39:39 PDT

(In reply to Darin Adler from comment #6) > OK, on your 4 comments: > > 1) I like readCharacters for this, although tastes may differ. > accessCharacters sounds read/write to me. > > 2) Global function is better for generic programming. I think we may want to > implement this for NSString and CFStringRef for example; sadly not sure when > we can efficiently make them use the 8-bit path; if we knew they were all > ASCII and stored as 8-bit. > > 3) Not that annoying. > > 4) Like the struct better than the separate arguments. I prototyped using a relatively plain class on its own. Something like this: template<typename CharacterType> class StringRange { public: constexpr const CharacterType* begin() const; constexpr const CharacterType* end() const; constexpr unsigned length() const; ... rest of StingView API private: CharacterType* m_begin; CharacterType* m_end; }; And it is kind of annoying to use in practice for these parsing routines, because you end up just pulling out the begin and end pointers. For instance, from SVGStringList.h: bool parse(const String& data, UChar delimiter) { clearItems(); return WTF::readCharacters(data, [&](auto characters) { auto ptr = characters.begin(); while (ptr < characters.end()) { auto* start = ptr; while (ptr < end && *ptr != delimiter && !isSVGSpace(*ptr)) ptr++; if (ptr == start) break; m_items.append(String(start, ptr - start)); skipOptionalSVGSpacesOrDelimiter(ptr, end, delimiter); } return ptr == characters.end(); }); } It's not horrible, but it's also not that much of an improvement over just passing the two pointers. It's actuall one line longer. Another option would be a class that acted more stream like: template<typename CharacterType> class StringParsingBuffer { public: constexpr const CharacterType* currentPosition() const; constexpr const CharacterType* end() const; constexpr CharacterType current() const { return *m_currentPosition; } constexpr unsigned charactersRemaining() const; constexpr bool hasCharactersRemaining() const { return m_currentPosition < m_end; } constexpr bool atEnd() const { return m_currentPosition == m_end; } constexpr bool advancePosition() constexpr bool advancePositionBy(unsigned); ... rest of StingView API private: CharacterType* m_currentPosition; CharacterType* m_end; }; Again, with that same SVGStringList.h example: bool parse(const String& data, UChar delimiter) { clearItems(); return WTF::readCharacters(data, [&](auto characters) { while (characters.hasCharactersRemaining()) { auto start = characters.currentPosition(); while (characters.hasCharactersRemaining() && characters.current() != delimiter && !isSVGSpace(characters.current())) characters.advance(); /* or could add StringParsingBuffer::operator++ */ if (characters.currentPosition() == start) break; m_items.append(String(start, characters.currentPosition() - start)); skipOptionalSVGSpacesOrDelimiter(characters, delimiter); /* This imagines a world where skipOptionalSVGSpacesOrDelimiter was updated to take a StringParsingBuffer */ } return characters.atEnd(); }); } We would want to make that inner while loop a skipUntil() by making those helpers a little more generic, but you get the idea. I would want to make the line: m_items.append(String(start, characters.currentPosition() - start)); easier to read as well, but I haven't fully thought of a good way. Maybe just making a String() constructor that takes two pointers: m_items.append(String { start, characters.currentPosition() }); Whay are your thoughts on this type of data structure / approach?

Sam Weinig

Comment 8 2020-06-23 18:38:28 PDT Comment hidden (obsolete)

Created attachment 402614 [details] Patch

Sam Weinig

Comment 9 2020-06-23 19:56:10 PDT Comment hidden (obsolete)

Created attachment 402619 [details] Patch

Sam Weinig

Comment 10 2020-06-24 09:38:02 PDT Comment hidden (obsolete)

Created attachment 402658 [details] Patch

Sam Weinig

Comment 11 2020-06-24 11:38:34 PDT

Created attachment 402665 [details] Patch

Darin Adler

Comment 12 2020-06-24 12:08:47 PDT

Not a direct answers at all, but with the approach of calling it StringParsingBuffer, I think we can do all sorts of tricks to try to make it easy to use: 1) maybe the whole struct kind of acts like a smart pointer that just knows where the end is? works with styles where we advance the pointer as we parse 2) overload various parsing functions to use it directly, one at a time? 3) pull out the pointers and drop the buffer entirely for functions that are really full of things only pointers can do? 4) add lifetime checking like what we have in StringView if you use the buffer directly, so it’s not easy to accidentally keep it; just making it nonmovable and noncopyable would be sufficient for the getCharacters style, but if we ever use it as a return value, then this could be helpful 5) be willing to do other clever/tricky things to make it integrate really well with code that thinks it’s dealing with pointers

Sam Weinig

Comment 13 2020-06-24 12:09:37 PDT Comment hidden (obsolete)

Created attachment 402670 [details] Patch

Sam Weinig

Comment 14 2020-06-24 13:56:58 PDT

Created attachment 402682 [details] Patch

Darin Adler

Comment 15 2020-06-24 14:07:05 PDT

Comment on attachment 402682 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=402682&action=review > Source/WebCore/loader/appcache/ManifestParser.cpp:76 > + return WTF::readCharactersForParsing(manifestAfterSignature, [&](auto parsingBuffer) { The WTF:: should not be needed here since manifestAfterSignature is a WTF::StringView. > Source/WebCore/page/csp/ContentSecurityPolicySourceList.cpp:87 > + skipWhile<CharacterType, isASCIISpace>(parsingBuffer); We don’t need to pass CharacterType here, since it’s baked into the type of parsingBuffer. Should rearrange things so that can happen.

Sam Weinig

Comment 16 2020-06-24 15:00:12 PDT

(In reply to Darin Adler from comment #15) > Comment on attachment 402682 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=402682&action=review > > > Source/WebCore/loader/appcache/ManifestParser.cpp:76 > > + return WTF::readCharactersForParsing(manifestAfterSignature, [&](auto parsingBuffer) { > > The WTF:: should not be needed here since manifestAfterSignature is a > WTF::StringView. Nice. Will fix. > > > Source/WebCore/page/csp/ContentSecurityPolicySourceList.cpp:87 > > + skipWhile<CharacterType, isASCIISpace>(parsingBuffer); > > We don’t need to pass CharacterType here, since it’s baked into the type of > parsingBuffer. Should rearrange things so that can happen. Yeah, this is an artifact of passing the predicate, which is a template function itself, as a template parameter. I think if we switched this to take the predicate as a normal argument, and trusted that inlining will do its thing, we would have a nicer interface. e.g. replace: skipWhile<CharacterType, isASCIISpace>(parsingBuffer); with: skipWhile(parsingBuffer, isASCIISpace); This would have the added benefit of allowing one-off predicates: skipWhile(parsingBuffer, [](auto character) { return character != 's' || character != 'v' }); I think the patch has gotten too big, so I am going to split it up into more reviewable pieces.

Darin Adler

Comment 17 2020-06-24 17:09:36 PDT

(In reply to Sam Weinig from comment #16) > Yeah, this is an artifact of passing the predicate, which is a template > function itself, as a template parameter. I think if we switched this to > take the predicate as a normal argument, and trusted that inlining will do > its thing, we would have a nicer interface. I was sort of going the opposite direction. I was suggesting that we pass predicates as template parameters because some kinds of optimization might require that and I figured we can always "convert" from a template parameter to a function parameter, but not vice versa. I had a few places where I could choose, and I chose template parameters over function parameters. But I guess what you’re saying is that we can’t get the type for the predicate template parameter right without a character type template parameter. Maybe this turned the tide on that decision?

Sam Weinig

Comment 18 2020-06-30 12:03:28 PDT

(In reply to Darin Adler from comment #17) > (In reply to Sam Weinig from comment #16) > > Yeah, this is an artifact of passing the predicate, which is a template > > function itself, as a template parameter. I think if we switched this to > > take the predicate as a normal argument, and trusted that inlining will do > > its thing, we would have a nicer interface. > > I was sort of going the opposite direction. I was suggesting that we pass > predicates as template parameters because some kinds of optimization might > require that and I figured we can always "convert" from a template parameter > to a function parameter, but not vice versa. I had a few places where I > could choose, and I chose template parameters over function parameters. > > But I guess what you’re saying is that we can’t get the type for the > predicate template parameter right without a character type template > parameter. > > Maybe this turned the tide on that decision? Hm, I bet we could use function objects (struct with operator()(Character)), but I also think a function parameter is the way to go.

Sam Weinig

Comment 19 2020-06-30 12:57:20 PDT

Remaining uses of StringView::upconvertedCharacters() / StringView::getCharactersWithUpconvert() in WebCore by file: DataDetection.m YES. This seems addressable. Editor.cpp Needed for TelephoneNumberDetector. Probably not removable at the moment as the SPI requires a UChar*, but we should move the up-convert to TelephoneNumberDetectorCocoa.mm. TextIterator.cpp This is the SearchBuffer constructor. Needs more investigation, as the characters are passed to usearch_setPattern which seems to want the pattern as UChar*, but there might be other ways to set the pattern as well. HTMLTreeBuilder.cpp Same as Editor.cpp, Needed for TelephoneNumberDetector. Probably not removable at the moment as the SPI requires a UChar*, but we should move the up-convert to TelephoneNumberDetectorCocoa.mm. MediaPlayerPrivateAVFoundationObjC.mm Seems to be required. Passes the utf-16 string to the user as Uint16Array. StringTruncator.cpp YES. This seems addressable. At least for most cases. When an ellipse is inserted, up-conversion is required. CanvasRenderingContext2D.cpp YES. This seems addressable. WHLSLMangledNames.h Nothing to do here. This up-conversion only happens when writing into a StringBuilder (or makeString) that is using a 16 bit buffer, and no extra allocations occur. ThreadableWebSocketChannelClientWrapper.cpp YES. This seems addressable. SQLiteStatement.cpp Unclear. Needs more investigation. Used by SQLiteStatement::bindBlob(int, const String&) and SQLiteStatement::bindText(int, const String&); TextCodecICU.cpp This is likely needed, passed to ucnv_fromUnicode, though there may be ucnv API that works with latin input. Length.cpp YES. This seems addressable. SharedStringHash.cpp Probably nothing to do here. Only doing up-conversion in the case that the base url is 16-bit. It seems plausible the computation of the hash could be done without the intermediate buffer, in which case, it would probably be possible to remove the up-conversion. There are definitely some optimization opportunity even without that, such as delaying the up-conversion, but since most URLs are likely to be 8-bit still, seems unnecessary at this time. SVGTextLayoutEngine.cpp YES. This seems addressable. XMLDocumentParserLibxml2.cpp This already has the following fixme: // FIXME: Can we parse 8-bit strings directly as Latin-1 instead of upconverting to UTF-16? I guess we should answer that question. XSLStyleSheetLibxslt.cpp Potentially addressable. Need to find out if xmlCtxtReadMemory support latin1 as an encoding. Generally, it seems like libxml/libxslt are latin1 aware, so just needs a bit of digging.

Note You need to log in before you can comment on or make changes to this bug.

Status NEW

Resolution

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component New Bugs

Assignee

Sam Weinig

Reported

2020-06-22 07:17 PDT

Modified

2022-04-19 10:37 PDT History

CC List

24 users Show

URL

Keywords

Depends on

213588 213631 213633 213635

Blocks

Dependencies

tree graph

annulen
benjamin
cdumez
changseok
cmarcelo
darin
dino
eric.carlson
esprehn+autocc
ews-watchlist
fmalita
glenn
gyuyoung.kim
japhet
jer.noble
koivisto
mifenton
mkwst
pdr
philipj
ryuan.choi
sabouhallawa
schenney
sergio