Bug 238905 - Add PAL::TextEncoding() constructor that takes in a StringView
Summary: Add PAL::TextEncoding() constructor that takes in a StringView
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Chris Dumez
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2022-04-06 15:46 PDT by Chris Dumez
Modified: 2022-04-07 19:32 PDT (History)
4 users (show)

See Also:


Attachments
Patch (20.36 KB, patch)
2022-04-06 15:50 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (21.87 KB, patch)
2022-04-06 19:45 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (24.00 KB, patch)
2022-04-07 14:32 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Dumez 2022-04-06 15:46:44 PDT
The PAL::TextEncoding() constructor can take a StringView instead of a String. This allows some call sites to be more efficient.
Comment 1 Chris Dumez 2022-04-06 15:50:55 PDT
Created attachment 456875 [details]
Patch
Comment 2 Chris Dumez 2022-04-06 19:45:53 PDT
Created attachment 456882 [details]
Patch
Comment 3 Chris Dumez 2022-04-07 07:57:22 PDT
Comment on attachment 456882 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=456882&action=review

> Source/WebCore/platform/network/HTTPParsers.cpp:419
> +        while (pos < length && mediaType[pos] <= ' ')

Turns out this function was doing unsafe indexing in the String but we didn't realize until I switched to a StringView because this is how String's operator[] is implemented:
```
inline UChar String::characterAt(unsigned index) const
{
    if (!m_impl || index >= m_impl->length())
        return 0;
    return (*m_impl)[index];
}
```

It returns 0 for out of bound access instead of crashing. The StringView's operator[] isn't as forgiving :)
Comment 4 Darin Adler 2022-04-07 12:05:40 PDT
Comment on attachment 456882 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=456882&action=review

> Source/WebCore/Modules/fetch/FetchResponse.cpp:136
> +    r->m_internalResponse.setTextEncodingName(extractCharsetFromMediaType(contentType).toString());

Seems unfortunate that setTextEncodingName can’t take the String we allocate with move semantics.

> Source/WebCore/PAL/pal/text/TextEncoding.h:39
>      PAL_EXPORT TextEncoding(const char* name);

Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code.

When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete.

> Source/WebCore/html/parser/HTMLMetaCharsetParser.cpp:124
> -        return PAL::TextEncoding(stripLeadingAndTrailingHTMLSpaces(charset.toStringWithoutCopying()));
> +        return PAL::TextEncoding(charset.stripLeadingAndTrailingMatchedCharacters(isHTMLSpace<UChar>));

What a strange idiom shift with a quite different way to saying the same thing. Doesn’t need to be fixed in this patch, but it seems like we should add overloads or something so code doesn’t have to be so different in the two cases like this. We still need the String one for some call sites, I guess, because it optimizes the case where the string doesn’t have any leading and trailing spaces. But at least we can have String and StringView supported side by side or something.

>> Source/WebCore/platform/network/HTTPParsers.cpp:419
>> +        while (pos < length && mediaType[pos] <= ' ')
> 
> Turns out this function was doing unsafe indexing in the String but we didn't realize until I switched to a StringView because this is how String's operator[] is implemented:
> ```
> inline UChar String::characterAt(unsigned index) const
> {
>     if (!m_impl || index >= m_impl->length())
>         return 0;
>     return (*m_impl)[index];
> }
> ```
> 
> It returns 0 for out of bound access instead of crashing. The StringView's operator[] isn't as forgiving :)

I would say "this function was relying on String's operator[]"; it wasn’t *unsafe* indexing for String, although agreed, it would be unsafe for an array or StringView. On the other hand, the case below where it checks pos instead of endpos was more obviously a mistake rather than intentionally relying on how String works, although the code behaved correctly despite the mistake because of the String operator[] behavior.

This function is written in a peculiar way. For example, the while loop condition isn’t needed at all; redundant since the findIgnoringASCIICase will return notFound. It seems to me the loop should be more like this:

    while ((pos = mediaType.findIgnoringASCIICase("charset", pos)) != notFound)

The "charsetLen = 0" is also unneeded. Seems like we could rewrite this with some more modern parsing helpers to be a lot clearer, straighten out the syntax, and do less checking on individual characters. But it’s OK with me to just fix it without changing its idioms like we have done here.
Comment 5 Chris Dumez 2022-04-07 12:15:18 PDT
Comment on attachment 456882 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=456882&action=review

>> Source/WebCore/PAL/pal/text/TextEncoding.h:39
>>      PAL_EXPORT TextEncoding(const char* name);
> 
> Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code.
> 
> When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete.

I did not look too deeply into it but the constructor taking a `const char*` looked substantially different so I didn't dare drop it.
In particular, it ends up calling atomCanonicalTextEncodingName(const char* name) which does a simple lookup in textEncodingNameMap.

atomCanonicalTextEncodingName(const String& alias) / atomCanonicalTextEncodingName(StringView) however, they end up calling atomCanonicalTextEncodingName(const CharacterType* characters, size_t length) which ends up allocating a new char array and copying the characters over one by one.
Then it will call the atomCanonicalTextEncodingName(const char* name). So it looks to me that the TextEncoding(StringView) constructor is currently less efficient than the TextEncoding(const char*) one.
Comment 6 Chris Dumez 2022-04-07 12:20:20 PDT
(In reply to Chris Dumez from comment #5)
> Comment on attachment 456882 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=456882&action=review
> 
> >> Source/WebCore/PAL/pal/text/TextEncoding.h:39
> >>      PAL_EXPORT TextEncoding(const char* name);
> > 
> > Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code.
> > 
> > When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete.
> 
> I did not look too deeply into it but the constructor taking a `const char*`
> looked substantially different so I didn't dare drop it.
> In particular, it ends up calling atomCanonicalTextEncodingName(const char*
> name) which does a simple lookup in textEncodingNameMap.
> 
> atomCanonicalTextEncodingName(const String& alias) /
> atomCanonicalTextEncodingName(StringView) however, they end up calling
> atomCanonicalTextEncodingName(const CharacterType* characters, size_t
> length) which ends up allocating a new char array and copying the characters
> over one by one.
> Then it will call the atomCanonicalTextEncodingName(const char* name). So it
> looks to me that the TextEncoding(StringView) constructor is currently less
> efficient than the TextEncoding(const char*) one.

Maybe I could optimize the atomCanonicalTextEncodingName(StringView s) to simply call atomCanonicalTextEncodingName(const char* name) with a reinterpret_cast<const char*>(s.characters8()) if s.is8Bit()? And copy the characters to a new char array if !is8Bit(). Then I think the performance would be more comparable and we could get rid of the TextEncoding(const char*) constructor.
Comment 7 Darin Adler 2022-04-07 12:39:36 PDT
Comment on attachment 456882 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=456882&action=review

>>>> Source/WebCore/PAL/pal/text/TextEncoding.h:39
>>>>      PAL_EXPORT TextEncoding(const char* name);
>>> 
>>> Do we still need the const char* overload? Before, we clearly did because otherwise there would be an extra String create/destroy every time. But now maybe we could get rid of it since we have a StringView version. If we did that and it didn’t cause a performance problem, we could also get rid of the const String& overload. Nice cleanup changing this to be more straightforward code.
>>> 
>>> When I originally wrote this very long ago, I thought of text encoding names as all ASCII literals, and back then this class was sort of a AtomString for ASCII literals. Seems sort of obsolete.
>> 
>> I did not look too deeply into it but the constructor taking a `const char*` looked substantially different so I didn't dare drop it.
>> In particular, it ends up calling atomCanonicalTextEncodingName(const char* name) which does a simple lookup in textEncodingNameMap.
>> 
>> atomCanonicalTextEncodingName(const String& alias) / atomCanonicalTextEncodingName(StringView) however, they end up calling atomCanonicalTextEncodingName(const CharacterType* characters, size_t length) which ends up allocating a new char array and copying the characters over one by one.
>> Then it will call the atomCanonicalTextEncodingName(const char* name). So it looks to me that the TextEncoding(StringView) constructor is currently less efficient than the TextEncoding(const char*) one.
> 
> Maybe I could optimize the atomCanonicalTextEncodingName(StringView s) to simply call atomCanonicalTextEncodingName(const char* name) with a reinterpret_cast<const char*>(s.characters8()) if s.is8Bit()? And copy the characters to a new char array if !is8Bit(). Then I think the performance would be more comparable and we could get rid of the TextEncoding(const char*) constructor.

Doesn’t seem that important.
Comment 8 Chris Dumez 2022-04-07 14:32:05 PDT
Created attachment 456969 [details]
Patch
Comment 9 EWS 2022-04-07 19:31:45 PDT
Committed r292588 (249421@main): <https://commits.webkit.org/249421@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 456969 [details].
Comment 10 Radar WebKit Bug Importer 2022-04-07 19:32:18 PDT
<rdar://problem/91459274>