The PAL::TextEncoding() constructor can take a StringView instead of a String. This allows some call sites to be more efficient.
Created attachment 456875 [details] Patch
Created attachment 456882 [details] Patch
Comment on attachment 456882 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=456882&action=review > Source/WebCore/platform/network/HTTPParsers.cpp:419 > + while (pos < length && mediaType[pos] <= ' ') Turns out this function was doing unsafe indexing in the String but we didn't realize until I switched to a StringView because this is how String's operator[] is implemented: ``` inline UChar String::characterAt(unsigned index) const { if (!m_impl || index >= m_impl->length()) return 0; return (*m_impl)[index]; } ``` It returns 0 for out of bound access instead of crashing. The StringView's operator[] isn't as forgiving :)
Comment on attachment 456882 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=456882&action=review > Source/WebCore/Modules/fetch/FetchResponse.cpp:136 > + r->m_internalResponse.setTextEncodingName(extractCharsetFromMediaType(contentType).toString()); Seems unfortunate that setTextEncodingName can’t take the String we allocate with move semantics. > Source/WebCore/PAL/pal/text/TextEncoding.h:39 > PAL_EXPORT TextEncoding(const char* name); Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code. When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete. > Source/WebCore/html/parser/HTMLMetaCharsetParser.cpp:124 > - return PAL::TextEncoding(stripLeadingAndTrailingHTMLSpaces(charset.toStringWithoutCopying())); > + return PAL::TextEncoding(charset.stripLeadingAndTrailingMatchedCharacters(isHTMLSpace<UChar>)); What a strange idiom shift with a quite different way to saying the same thing. Doesn’t need to be fixed in this patch, but it seems like we should add overloads or something so code doesn’t have to be so different in the two cases like this. We still need the String one for some call sites, I guess, because it optimizes the case where the string doesn’t have any leading and trailing spaces. But at least we can have String and StringView supported side by side or something. >> Source/WebCore/platform/network/HTTPParsers.cpp:419 >> + while (pos < length && mediaType[pos] <= ' ') > > Turns out this function was doing unsafe indexing in the String but we didn't realize until I switched to a StringView because this is how String's operator[] is implemented: > ``` > inline UChar String::characterAt(unsigned index) const > { > if (!m_impl || index >= m_impl->length()) > return 0; > return (*m_impl)[index]; > } > ``` > > It returns 0 for out of bound access instead of crashing. The StringView's operator[] isn't as forgiving :) I would say "this function was relying on String's operator[]"; it wasn’t *unsafe* indexing for String, although agreed, it would be unsafe for an array or StringView. On the other hand, the case below where it checks pos instead of endpos was more obviously a mistake rather than intentionally relying on how String works, although the code behaved correctly despite the mistake because of the String operator[] behavior. This function is written in a peculiar way. For example, the while loop condition isn’t needed at all; redundant since the findIgnoringASCIICase will return notFound. It seems to me the loop should be more like this: while ((pos = mediaType.findIgnoringASCIICase("charset", pos)) != notFound) The "charsetLen = 0" is also unneeded. Seems like we could rewrite this with some more modern parsing helpers to be a lot clearer, straighten out the syntax, and do less checking on individual characters. But it’s OK with me to just fix it without changing its idioms like we have done here.
Comment on attachment 456882 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=456882&action=review >> Source/WebCore/PAL/pal/text/TextEncoding.h:39 >> PAL_EXPORT TextEncoding(const char* name); > > Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code. > > When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete. I did not look too deeply into it but the constructor taking a `const char*` looked substantially different so I didn't dare drop it. In particular, it ends up calling atomCanonicalTextEncodingName(const char* name) which does a simple lookup in textEncodingNameMap. atomCanonicalTextEncodingName(const String& alias) / atomCanonicalTextEncodingName(StringView) however, they end up calling atomCanonicalTextEncodingName(const CharacterType* characters, size_t length) which ends up allocating a new char array and copying the characters over one by one. Then it will call the atomCanonicalTextEncodingName(const char* name). So it looks to me that the TextEncoding(StringView) constructor is currently less efficient than the TextEncoding(const char*) one.
(In reply to Chris Dumez from comment #5) > Comment on attachment 456882 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=456882&action=review > > >> Source/WebCore/PAL/pal/text/TextEncoding.h:39 > >> PAL_EXPORT TextEncoding(const char* name); > > > > Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code. > > > > When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete. > > I did not look too deeply into it but the constructor taking a `const char*` > looked substantially different so I didn't dare drop it. > In particular, it ends up calling atomCanonicalTextEncodingName(const char* > name) which does a simple lookup in textEncodingNameMap. > > atomCanonicalTextEncodingName(const String& alias) / > atomCanonicalTextEncodingName(StringView) however, they end up calling > atomCanonicalTextEncodingName(const CharacterType* characters, size_t > length) which ends up allocating a new char array and copying the characters > over one by one. > Then it will call the atomCanonicalTextEncodingName(const char* name). So it > looks to me that the TextEncoding(StringView) constructor is currently less > efficient than the TextEncoding(const char*) one. Maybe I could optimize the atomCanonicalTextEncodingName(StringView s) to simply call atomCanonicalTextEncodingName(const char* name) with a reinterpret_cast<const char*>(s.characters8()) if s.is8Bit()? And copy the characters to a new char array if !is8Bit(). Then I think the performance would be more comparable and we could get rid of the TextEncoding(const char*) constructor.
Comment on attachment 456882 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=456882&action=review >>>> Source/WebCore/PAL/pal/text/TextEncoding.h:39 >>>> PAL_EXPORT TextEncoding(const char* name); >>> >>> Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code. >>> >>> When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete. >> >> I did not look too deeply into it but the constructor taking a `const char*` looked substantially different so I didn't dare drop it. >> In particular, it ends up calling atomCanonicalTextEncodingName(const char* name) which does a simple lookup in textEncodingNameMap. >> >> atomCanonicalTextEncodingName(const String& alias) / atomCanonicalTextEncodingName(StringView) however, they end up calling atomCanonicalTextEncodingName(const CharacterType* characters, size_t length) which ends up allocating a new char array and copying the characters over one by one. >> Then it will call the atomCanonicalTextEncodingName(const char* name). So it looks to me that the TextEncoding(StringView) constructor is currently less efficient than the TextEncoding(const char*) one. > > Maybe I could optimize the atomCanonicalTextEncodingName(StringView s) to simply call atomCanonicalTextEncodingName(const char* name) with a reinterpret_cast<const char*>(s.characters8()) if s.is8Bit()? And copy the characters to a new char array if !is8Bit(). Then I think the performance would be more comparable and we could get rid of the TextEncoding(const char*) constructor. Doesn’t seem that important.
Created attachment 456969 [details] Patch
Committed r292588 (249421@main): <https://commits.webkit.org/249421@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 456969 [details].
<rdar://problem/91459274>