238905 – Add PAL::TextEncoding() constructor that takes in a StringView

RESOLVED FIXED 238905

Add PAL::TextEncoding() constructor that takes in a StringView

https://bugs.webkit.org/show_bug.cgi?id=238905

Summary Add PAL::TextEncoding() constructor that takes in a StringView

Chris Dumez

Reported 2022-04-06 15:46:44 PDT

The PAL::TextEncoding() constructor can take a StringView instead of a String. This allows some call sites to be more efficient.

Attachments
Patch (20.36 KB, patch) 2022-04-06 15:50 PDT, Chris Dumez	no flags	Details Formatted Diff Diff
Patch (21.87 KB, patch) 2022-04-06 19:45 PDT, Chris Dumez	no flags	Details Formatted Diff Diff
Patch (24.00 KB, patch) 2022-04-07 14:32 PDT, Chris Dumez	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Chris Dumez

Comment 1 2022-04-06 15:50:55 PDT

Created attachment 456875 [details] Patch

Chris Dumez

Comment 2 2022-04-06 19:45:53 PDT

Created attachment 456882 [details] Patch

Chris Dumez

Comment 3 2022-04-07 07:57:22 PDT

Comment on attachment 456882 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=456882&action=review > Source/WebCore/platform/network/HTTPParsers.cpp:419 > + while (pos < length && mediaType[pos] <= ' ') Turns out this function was doing unsafe indexing in the String but we didn't realize until I switched to a StringView because this is how String's operator[] is implemented: ``` inline UChar String::characterAt(unsigned index) const { if (!m_impl || index >= m_impl->length()) return 0; return (*m_impl)[index]; } ``` It returns 0 for out of bound access instead of crashing. The StringView's operator[] isn't as forgiving :)

Darin Adler

Comment 4 2022-04-07 12:05:40 PDT

Comment on attachment 456882 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=456882&action=review > Source/WebCore/Modules/fetch/FetchResponse.cpp:136 > + r->m_internalResponse.setTextEncodingName(extractCharsetFromMediaType(contentType).toString()); Seems unfortunate that setTextEncodingName can’t take the String we allocate with move semantics. > Source/WebCore/PAL/pal/text/TextEncoding.h:39 > PAL_EXPORT TextEncoding(const char* name); Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code. When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete. > Source/WebCore/html/parser/HTMLMetaCharsetParser.cpp:124 > - return PAL::TextEncoding(stripLeadingAndTrailingHTMLSpaces(charset.toStringWithoutCopying())); > + return PAL::TextEncoding(charset.stripLeadingAndTrailingMatchedCharacters(isHTMLSpace<UChar>)); What a strange idiom shift with a quite different way to saying the same thing. Doesn’t need to be fixed in this patch, but it seems like we should add overloads or something so code doesn’t have to be so different in the two cases like this. We still need the String one for some call sites, I guess, because it optimizes the case where the string doesn’t have any leading and trailing spaces. But at least we can have String and StringView supported side by side or something. >> Source/WebCore/platform/network/HTTPParsers.cpp:419 >> + while (pos < length && mediaType[pos] <= ' ') > > Turns out this function was doing unsafe indexing in the String but we didn't realize until I switched to a StringView because this is how String's operator[] is implemented: > ``` > inline UChar String::characterAt(unsigned index) const > { > if (!m_impl || index >= m_impl->length()) > return 0; > return (*m_impl)[index]; > } > ``` > > It returns 0 for out of bound access instead of crashing. The StringView's operator[] isn't as forgiving :) I would say "this function was relying on String's operator[]"; it wasn’t *unsafe* indexing for String, although agreed, it would be unsafe for an array or StringView. On the other hand, the case below where it checks pos instead of endpos was more obviously a mistake rather than intentionally relying on how String works, although the code behaved correctly despite the mistake because of the String operator[] behavior. This function is written in a peculiar way. For example, the while loop condition isn’t needed at all; redundant since the findIgnoringASCIICase will return notFound. It seems to me the loop should be more like this: while ((pos = mediaType.findIgnoringASCIICase("charset", pos)) != notFound) The "charsetLen = 0" is also unneeded. Seems like we could rewrite this with some more modern parsing helpers to be a lot clearer, straighten out the syntax, and do less checking on individual characters. But it’s OK with me to just fix it without changing its idioms like we have done here.

Chris Dumez

Comment 5 2022-04-07 12:15:18 PDT

Comment on attachment 456882 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=456882&action=review >> Source/WebCore/PAL/pal/text/TextEncoding.h:39 >> PAL_EXPORT TextEncoding(const char* name); > > Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code. > > When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete. I did not look too deeply into it but the constructor taking a `const char*` looked substantially different so I didn't dare drop it. In particular, it ends up calling atomCanonicalTextEncodingName(const char* name) which does a simple lookup in textEncodingNameMap. atomCanonicalTextEncodingName(const String& alias) / atomCanonicalTextEncodingName(StringView) however, they end up calling atomCanonicalTextEncodingName(const CharacterType* characters, size_t length) which ends up allocating a new char array and copying the characters over one by one. Then it will call the atomCanonicalTextEncodingName(const char* name). So it looks to me that the TextEncoding(StringView) constructor is currently less efficient than the TextEncoding(const char*) one.

Chris Dumez

Comment 6 2022-04-07 12:20:20 PDT

(In reply to Chris Dumez from comment #5) > Comment on attachment 456882 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=456882&action=review > > >> Source/WebCore/PAL/pal/text/TextEncoding.h:39 > >> PAL_EXPORT TextEncoding(const char* name); > > > > Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code. > > > > When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete. > > I did not look too deeply into it but the constructor taking a `const char*` > looked substantially different so I didn't dare drop it. > In particular, it ends up calling atomCanonicalTextEncodingName(const char* > name) which does a simple lookup in textEncodingNameMap. > > atomCanonicalTextEncodingName(const String& alias) / > atomCanonicalTextEncodingName(StringView) however, they end up calling > atomCanonicalTextEncodingName(const CharacterType* characters, size_t > length) which ends up allocating a new char array and copying the characters > over one by one. > Then it will call the atomCanonicalTextEncodingName(const char* name). So it > looks to me that the TextEncoding(StringView) constructor is currently less > efficient than the TextEncoding(const char*) one. Maybe I could optimize the atomCanonicalTextEncodingName(StringView s) to simply call atomCanonicalTextEncodingName(const char* name) with a reinterpret_cast<const char*>(s.characters8()) if s.is8Bit()? And copy the characters to a new char array if !is8Bit(). Then I think the performance would be more comparable and we could get rid of the TextEncoding(const char*) constructor.

Darin Adler

Comment 7 2022-04-07 12:39:36 PDT

Comment on attachment 456882 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=456882&action=review >>>> Source/WebCore/PAL/pal/text/TextEncoding.h:39 >>>> PAL_EXPORT TextEncoding(const char* name); >>> >>> Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code. >>> >>> When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete. >> >> I did not look too deeply into it but the constructor taking a `const char*` looked substantially different so I didn't dare drop it. >> In particular, it ends up calling atomCanonicalTextEncodingName(const char* name) which does a simple lookup in textEncodingNameMap. >> >> atomCanonicalTextEncodingName(const String& alias) / atomCanonicalTextEncodingName(StringView) however, they end up calling atomCanonicalTextEncodingName(const CharacterType* characters, size_t length) which ends up allocating a new char array and copying the characters over one by one. >> Then it will call the atomCanonicalTextEncodingName(const char* name). So it looks to me that the TextEncoding(StringView) constructor is currently less efficient than the TextEncoding(const char*) one. > > Maybe I could optimize the atomCanonicalTextEncodingName(StringView s) to simply call atomCanonicalTextEncodingName(const char* name) with a reinterpret_cast<const char*>(s.characters8()) if s.is8Bit()? And copy the characters to a new char array if !is8Bit(). Then I think the performance would be more comparable and we could get rid of the TextEncoding(const char*) constructor. Doesn’t seem that important.

Chris Dumez

Comment 8 2022-04-07 14:32:05 PDT

Created attachment 456969 [details] Patch

EWS

Comment 9 2022-04-07 19:31:45 PDT

Committed r292588 (249421@main): <https://commits.webkit.org/249421@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 456969 [details].

Radar WebKit Bug Importer

Comment 10 2022-04-07 19:32:18 PDT

<rdar://problem/91459274>

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component WebCore Misc.

Assignee

Chris Dumez

Reported

2022-04-06 15:46 PDT

Modified

2022-04-07 19:32 PDT History

CC List

4 users Show

URL

Keywords InRadar

Depends on

Blocks