WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
238905
Add PAL::TextEncoding() constructor that takes in a StringView
https://bugs.webkit.org/show_bug.cgi?id=238905
Summary
Add PAL::TextEncoding() constructor that takes in a StringView
Chris Dumez
Reported
2022-04-06 15:46:44 PDT
The PAL::TextEncoding() constructor can take a StringView instead of a String. This allows some call sites to be more efficient.
Attachments
Patch
(20.36 KB, patch)
2022-04-06 15:50 PDT
,
Chris Dumez
no flags
Details
Formatted Diff
Diff
Patch
(21.87 KB, patch)
2022-04-06 19:45 PDT
,
Chris Dumez
no flags
Details
Formatted Diff
Diff
Patch
(24.00 KB, patch)
2022-04-07 14:32 PDT
,
Chris Dumez
no flags
Details
Formatted Diff
Diff
Show Obsolete
(2)
View All
Add attachment
proposed patch, testcase, etc.
Chris Dumez
Comment 1
2022-04-06 15:50:55 PDT
Created
attachment 456875
[details]
Patch
Chris Dumez
Comment 2
2022-04-06 19:45:53 PDT
Created
attachment 456882
[details]
Patch
Chris Dumez
Comment 3
2022-04-07 07:57:22 PDT
Comment on
attachment 456882
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=456882&action=review
> Source/WebCore/platform/network/HTTPParsers.cpp:419 > + while (pos < length && mediaType[pos] <= ' ')
Turns out this function was doing unsafe indexing in the String but we didn't realize until I switched to a StringView because this is how String's operator[] is implemented: ``` inline UChar String::characterAt(unsigned index) const { if (!m_impl || index >= m_impl->length()) return 0; return (*m_impl)[index]; } ``` It returns 0 for out of bound access instead of crashing. The StringView's operator[] isn't as forgiving :)
Darin Adler
Comment 4
2022-04-07 12:05:40 PDT
Comment on
attachment 456882
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=456882&action=review
> Source/WebCore/Modules/fetch/FetchResponse.cpp:136 > + r->m_internalResponse.setTextEncodingName(extractCharsetFromMediaType(contentType).toString());
Seems unfortunate that setTextEncodingName can’t take the String we allocate with move semantics.
> Source/WebCore/PAL/pal/text/TextEncoding.h:39 > PAL_EXPORT TextEncoding(const char* name);
Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code. When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete.
> Source/WebCore/html/parser/HTMLMetaCharsetParser.cpp:124 > - return PAL::TextEncoding(stripLeadingAndTrailingHTMLSpaces(charset.toStringWithoutCopying())); > + return PAL::TextEncoding(charset.stripLeadingAndTrailingMatchedCharacters(isHTMLSpace<UChar>));
What a strange idiom shift with a quite different way to saying the same thing. Doesn’t need to be fixed in this patch, but it seems like we should add overloads or something so code doesn’t have to be so different in the two cases like this. We still need the String one for some call sites, I guess, because it optimizes the case where the string doesn’t have any leading and trailing spaces. But at least we can have String and StringView supported side by side or something.
>> Source/WebCore/platform/network/HTTPParsers.cpp:419 >> + while (pos < length && mediaType[pos] <= ' ') > > Turns out this function was doing unsafe indexing in the String but we didn't realize until I switched to a StringView because this is how String's operator[] is implemented: > ``` > inline UChar String::characterAt(unsigned index) const > { > if (!m_impl || index >= m_impl->length()) > return 0; > return (*m_impl)[index]; > } > ``` > > It returns 0 for out of bound access instead of crashing. The StringView's operator[] isn't as forgiving :)
I would say "this function was relying on String's operator[]"; it wasn’t *unsafe* indexing for String, although agreed, it would be unsafe for an array or StringView. On the other hand, the case below where it checks pos instead of endpos was more obviously a mistake rather than intentionally relying on how String works, although the code behaved correctly despite the mistake because of the String operator[] behavior. This function is written in a peculiar way. For example, the while loop condition isn’t needed at all; redundant since the findIgnoringASCIICase will return notFound. It seems to me the loop should be more like this: while ((pos = mediaType.findIgnoringASCIICase("charset", pos)) != notFound) The "charsetLen = 0" is also unneeded. Seems like we could rewrite this with some more modern parsing helpers to be a lot clearer, straighten out the syntax, and do less checking on individual characters. But it’s OK with me to just fix it without changing its idioms like we have done here.
Chris Dumez
Comment 5
2022-04-07 12:15:18 PDT
Comment on
attachment 456882
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=456882&action=review
>> Source/WebCore/PAL/pal/text/TextEncoding.h:39 >> PAL_EXPORT TextEncoding(const char* name); > > Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code. > > When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete.
I did not look too deeply into it but the constructor taking a `const char*` looked substantially different so I didn't dare drop it. In particular, it ends up calling atomCanonicalTextEncodingName(const char* name) which does a simple lookup in textEncodingNameMap. atomCanonicalTextEncodingName(const String& alias) / atomCanonicalTextEncodingName(StringView) however, they end up calling atomCanonicalTextEncodingName(const CharacterType* characters, size_t length) which ends up allocating a new char array and copying the characters over one by one. Then it will call the atomCanonicalTextEncodingName(const char* name). So it looks to me that the TextEncoding(StringView) constructor is currently less efficient than the TextEncoding(const char*) one.
Chris Dumez
Comment 6
2022-04-07 12:20:20 PDT
(In reply to Chris Dumez from
comment #5
)
> Comment on
attachment 456882
[details]
> Patch > > View in context: >
https://bugs.webkit.org/attachment.cgi?id=456882&action=review
> > >> Source/WebCore/PAL/pal/text/TextEncoding.h:39 > >> PAL_EXPORT TextEncoding(const char* name); > > > > Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code. > > > > When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete. > > I did not look too deeply into it but the constructor taking a `const char*` > looked substantially different so I didn't dare drop it. > In particular, it ends up calling atomCanonicalTextEncodingName(const char* > name) which does a simple lookup in textEncodingNameMap. > > atomCanonicalTextEncodingName(const String& alias) / > atomCanonicalTextEncodingName(StringView) however, they end up calling > atomCanonicalTextEncodingName(const CharacterType* characters, size_t > length) which ends up allocating a new char array and copying the characters > over one by one. > Then it will call the atomCanonicalTextEncodingName(const char* name). So it > looks to me that the TextEncoding(StringView) constructor is currently less > efficient than the TextEncoding(const char*) one.
Maybe I could optimize the atomCanonicalTextEncodingName(StringView s) to simply call atomCanonicalTextEncodingName(const char* name) with a reinterpret_cast<const char*>(s.characters8()) if s.is8Bit()? And copy the characters to a new char array if !is8Bit(). Then I think the performance would be more comparable and we could get rid of the TextEncoding(const char*) constructor.
Darin Adler
Comment 7
2022-04-07 12:39:36 PDT
Comment on
attachment 456882
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=456882&action=review
>>>> Source/WebCore/PAL/pal/text/TextEncoding.h:39 >>>> PAL_EXPORT TextEncoding(const char* name); >>> >>> Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code. >>> >>> When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete. >> >> I did not look too deeply into it but the constructor taking a `const char*` looked substantially different so I didn't dare drop it. >> In particular, it ends up calling atomCanonicalTextEncodingName(const char* name) which does a simple lookup in textEncodingNameMap. >> >> atomCanonicalTextEncodingName(const String& alias) / atomCanonicalTextEncodingName(StringView) however, they end up calling atomCanonicalTextEncodingName(const CharacterType* characters, size_t length) which ends up allocating a new char array and copying the characters over one by one. >> Then it will call the atomCanonicalTextEncodingName(const char* name). So it looks to me that the TextEncoding(StringView) constructor is currently less efficient than the TextEncoding(const char*) one. > > Maybe I could optimize the atomCanonicalTextEncodingName(StringView s) to simply call atomCanonicalTextEncodingName(const char* name) with a reinterpret_cast<const char*>(s.characters8()) if s.is8Bit()? And copy the characters to a new char array if !is8Bit(). Then I think the performance would be more comparable and we could get rid of the TextEncoding(const char*) constructor.
Doesn’t seem that important.
Chris Dumez
Comment 8
2022-04-07 14:32:05 PDT
Created
attachment 456969
[details]
Patch
EWS
Comment 9
2022-04-07 19:31:45 PDT
Committed
r292588
(
249421@main
): <
https://commits.webkit.org/249421@main
> All reviewed patches have been landed. Closing bug and clearing flags on
attachment 456969
[details]
.
Radar WebKit Bug Importer
Comment 10
2022-04-07 19:32:18 PDT
<
rdar://problem/91459274
>
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug