232962 – Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8

NEW232962

Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8

https://bugs.webkit.org/show_bug.cgi?id=232962

Summary Add WTF::String::utf8Length and use it to allocate correct size buffer when c...

Alex Christensen

Reported 2021-11-10 12:21:21 PST

Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8

Attachments
Patch (20.55 KB, patch) 2021-11-10 12:22 PST, Alex Christensen	ews-feeder: commit-queue-	Details Formatted Diff Diff
Patch (26.20 KB, patch) 2021-11-10 12:32 PST, Alex Christensen	no flags	Details Formatted Diff Diff
Patch (25.47 KB, patch) 2021-11-10 14:42 PST, Alex Christensen	no flags	Details Formatted Diff Diff
Patch (29.58 KB, patch) 2021-11-29 15:45 PST, Alex Christensen	ews-feeder: commit-queue-	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Alex Christensen

Comment 1 2021-11-10 12:22:39 PST

Created attachment 443845 [details] Patch

Alex Christensen

Comment 2 2021-11-10 12:32:10 PST

Created attachment 443848 [details] Patch

Darin Adler

Comment 3 2021-11-10 14:10:24 PST

Comment on attachment 443848 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=443848&action=review > Source/JavaScriptCore/ChangeLog:3 > + Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8 Is scanning the characters twice to compute the length and then convert better for performance? Did you measure? > Source/WTF/wtf/unicode/UTF8Conversion.cpp:46 > + uint8_t buffer[U8_MAX_LENGTH]; > + int32_t i = 0; > + // All 256 possible code points are valid, and the buffer is always big enough, > + // so it's safe to use U8_APPEND_UNSAFE. > + U8_APPEND_UNSAFE(buffer, i, *source); > + length += i; Why use U8_APPEND_UNSAFE instead of U8_LENGTH? > Source/WTF/wtf/unicode/UTF8Conversion.cpp:75 > + const UChar* source = sourceStart; > + while (source < sourceEnd) { Going with the flow of the U16_NEXT macro, we would use sourceStart and j, and not need a source pointer. > Source/WTF/wtf/unicode/UTF8Conversion.cpp:78 > + U16_NEXT(source, j, sourceEnd - source, ch); Is this fast enough? Should we make a faster version for when there are no surrogates at all? > Source/WTF/wtf/unicode/UTF8Conversion.cpp:93 > + // We already checked that ch is a valid code point, and the buffer is always big enough, > + // so it's safe to use U8_APPEND_UNSAFE. > + uint8_t buffer[U8_MAX_LENGTH]; > + int32_t i = 0; > + U8_APPEND_UNSAFE(buffer, i, ch); > + length += i; Why use U8_APPEND_UNSAFE instead of U8_LENGTH? > Source/WTF/wtf/unicode/UTF8Conversion.h:35 > ConversionOK, // conversion successful I’d rename this to just "OK" or "Success" when making this an enum class. Conversion::Conversion is inelegant. > Source/WTF/wtf/unicode/UTF8Conversion.h:50 > +WTF_EXPORT_PRIVATE size_t latin1ToUTF8Length(const LChar* sourceStart, const LChar* sourceEnd); I don’t totally love this name, and it’s not quite as nice about the other names in this header. Maybe I would call it something closer to lengthLatin1AsUTF8 or lengthUTF8FromLatin1. I like having length be the first word. > Source/WTF/wtf/unicode/UTF8Conversion.h:52 > +WTF_EXPORT_PRIVATE size_t utf16ToUTF8Length(const UChar* sourceStart, const UChar* sourceEnd, bool strict = true); And lengthUTF8FromUTF16 maybe?

Alex Christensen

Comment 4 2021-11-10 14:42:33 PST

Created attachment 443867 [details] Patch

Alex Christensen

Comment 5 2021-11-10 14:44:11 PST

(didn't respond to feedback yet, just trying to get it to compile and not crash)

Alex Christensen

Comment 6 2021-11-10 15:33:46 PST

Comment on attachment 443867 [details] Patch Also needs MUCH better tests. Will do later

Alex Christensen

Comment 7 2021-11-10 15:47:15 PST

(In reply to Darin Adler from comment #3) > Comment on attachment 443848 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=443848&action=review > > > Source/JavaScriptCore/ChangeLog:3 > > + Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8 > > Is scanning the characters twice to compute the length and then convert > better for performance? Did you measure? Operations needed to UTF-8 encode a string before this patch: 1. Read each character 2. Encode each character 3. Write each converted character 4. memcpy (another read and write) converted characters to another buffer Allocations: 1 or 2 After this patch: 1. Read each character 2. Encode each character 3. Read each character 4. Encode each character again 5. Write each converted character Allocations: 1 I'm definitely planning to measure the performance of long/short/ascii/latin1/utf16 strings on different chips to be sure, but depending on the ratio of memcpy speed to encoding speed and allocation speed I think this may be a win in most cases. There is definitely always one fewer write operation per character after this change.

Darin Adler

Comment 8 2021-11-10 15:59:09 PST

(In reply to Alex Christensen from comment #7) > I'm definitely planning to measure the performance of > long/short/ascii/latin1/utf16 strings on different chips to be sure, but > depending on the ratio of memcpy speed to encoding speed and allocation > speed I think this may be a win in most cases. Makes sense. If we were always doing an allocation, then I’m sure this would be a win because fastMalloc/free is quite costly. Not as sure for smaller strings where we don’t do any extra allocation. I hope it’s always faster.

Radar WebKit Bug Importer

Comment 9 2021-11-17 12:22:21 PST

<rdar://problem/85515842>

Alex Christensen

Comment 10 2021-11-29 15:45:27 PST

Created attachment 445357 [details] Patch

Alex Christensen

Comment 11 2021-11-29 20:56:43 PST

Performance measurements indicate that this is a performance regression. It's hard to beat memcpy. I broke out the code cleanup and a slight optimization into https://bugs.webkit.org/show_bug.cgi?id=233612 I'll use this bug to add String::utf8Length but not use it in String::utf8 once that's landed.

Note You need to log in before you can comment on or make changes to this bug.

Status NEW

Resolution

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component New Bugs

Assignee

Alex Christensen

Reported

2021-11-10 12:21 PST

Modified

2021-11-29 20:56 PST History

CC List

12 users Show

URL

Keywords InRadar

Depends on

Blocks