Bug 232962 - Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8
Summary: Add WTF::String::utf8Length and use it to allocate correct size buffer when c...
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Alex Christensen
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2021-11-10 12:21 PST by Alex Christensen
Modified: 2021-11-29 20:56 PST (History)
12 users (show)

See Also:


Attachments
Patch (20.55 KB, patch)
2021-11-10 12:22 PST, Alex Christensen
ews-feeder: commit-queue-
Details | Formatted Diff | Diff
Patch (26.20 KB, patch)
2021-11-10 12:32 PST, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (25.47 KB, patch)
2021-11-10 14:42 PST, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (29.58 KB, patch)
2021-11-29 15:45 PST, Alex Christensen
ews-feeder: commit-queue-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Christensen 2021-11-10 12:21:21 PST
Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8
Comment 1 Alex Christensen 2021-11-10 12:22:39 PST
Created attachment 443845 [details]
Patch
Comment 2 Alex Christensen 2021-11-10 12:32:10 PST
Created attachment 443848 [details]
Patch
Comment 3 Darin Adler 2021-11-10 14:10:24 PST
Comment on attachment 443848 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=443848&action=review

> Source/JavaScriptCore/ChangeLog:3
> +        Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8

Is scanning the characters twice to compute the length and then convert better for performance? Did you measure?

> Source/WTF/wtf/unicode/UTF8Conversion.cpp:46
> +        uint8_t buffer[U8_MAX_LENGTH];
> +        int32_t i = 0;
> +        // All 256 possible code points are valid, and the buffer is always big enough,
> +        // so it's safe to use U8_APPEND_UNSAFE.
> +        U8_APPEND_UNSAFE(buffer, i, *source);
> +        length += i;

Why use U8_APPEND_UNSAFE instead of U8_LENGTH?

> Source/WTF/wtf/unicode/UTF8Conversion.cpp:75
> +    const UChar* source = sourceStart;
> +    while (source < sourceEnd) {

Going with the flow of the U16_NEXT macro, we would use sourceStart and j, and not need a source pointer.

> Source/WTF/wtf/unicode/UTF8Conversion.cpp:78
> +        U16_NEXT(source, j, sourceEnd - source, ch);

Is this fast enough? Should we make a faster version for when there are no surrogates at all?

> Source/WTF/wtf/unicode/UTF8Conversion.cpp:93
> +        // We already checked that ch is a valid code point, and the buffer is always big enough,
> +        // so it's safe to use U8_APPEND_UNSAFE.
> +        uint8_t buffer[U8_MAX_LENGTH];
> +        int32_t i = 0;
> +        U8_APPEND_UNSAFE(buffer, i, ch);
> +        length += i;

Why use U8_APPEND_UNSAFE instead of U8_LENGTH?

> Source/WTF/wtf/unicode/UTF8Conversion.h:35
>      ConversionOK, // conversion successful

I’d rename this to just "OK" or "Success" when making this an enum class. Conversion::Conversion is inelegant.

> Source/WTF/wtf/unicode/UTF8Conversion.h:50
> +WTF_EXPORT_PRIVATE size_t latin1ToUTF8Length(const LChar* sourceStart, const LChar* sourceEnd);

I don’t totally love this name, and it’s not quite as nice about the other names in this header. Maybe I would call it something closer to lengthLatin1AsUTF8 or lengthUTF8FromLatin1. I like having length be the first word.

> Source/WTF/wtf/unicode/UTF8Conversion.h:52
> +WTF_EXPORT_PRIVATE size_t utf16ToUTF8Length(const UChar* sourceStart, const UChar* sourceEnd, bool strict = true);

And lengthUTF8FromUTF16 maybe?
Comment 4 Alex Christensen 2021-11-10 14:42:33 PST
Created attachment 443867 [details]
Patch
Comment 5 Alex Christensen 2021-11-10 14:44:11 PST
(didn't respond to feedback yet, just trying to get it to compile and not crash)
Comment 6 Alex Christensen 2021-11-10 15:33:46 PST
Comment on attachment 443867 [details]
Patch

Also needs MUCH better tests.  Will do later
Comment 7 Alex Christensen 2021-11-10 15:47:15 PST
(In reply to Darin Adler from comment #3)
> Comment on attachment 443848 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=443848&action=review
> 
> > Source/JavaScriptCore/ChangeLog:3
> > +        Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8
> 
> Is scanning the characters twice to compute the length and then convert
> better for performance? Did you measure?

Operations needed to UTF-8 encode a string before this patch:
1. Read each character
2. Encode each character
3. Write each converted character
4. memcpy (another read and write) converted characters to another buffer
Allocations: 1 or 2

After this patch:
1. Read each character
2. Encode each character
3. Read each character
4. Encode each character again
5. Write each converted character
Allocations: 1

I'm definitely planning to measure the performance of long/short/ascii/latin1/utf16 strings on different chips to be sure, but depending on the ratio of memcpy speed to encoding speed and allocation speed I think this may be a win in most cases.  There is definitely always one fewer write operation per character after this change.
Comment 8 Darin Adler 2021-11-10 15:59:09 PST
(In reply to Alex Christensen from comment #7)
> I'm definitely planning to measure the performance of
> long/short/ascii/latin1/utf16 strings on different chips to be sure, but
> depending on the ratio of memcpy speed to encoding speed and allocation
> speed I think this may be a win in most cases.

Makes sense.

If we were always doing an allocation, then I’m sure this would be a win because fastMalloc/free is quite costly. Not as sure for smaller strings where we don’t do any extra allocation.

I hope it’s always faster.
Comment 9 Radar WebKit Bug Importer 2021-11-17 12:22:21 PST
<rdar://problem/85515842>
Comment 10 Alex Christensen 2021-11-29 15:45:27 PST
Created attachment 445357 [details]
Patch
Comment 11 Alex Christensen 2021-11-29 20:56:43 PST
Performance measurements indicate that this is a performance regression.  It's hard to beat memcpy.
I broke out the code cleanup and a slight optimization into https://bugs.webkit.org/show_bug.cgi?id=233612
I'll use this bug to add String::utf8Length but not use it in String::utf8 once that's  landed.