WebKit Bugzilla
New
Browse
Search+
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
NEW
232962
Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8
https://bugs.webkit.org/show_bug.cgi?id=232962
Summary
Add WTF::String::utf8Length and use it to allocate correct size buffer when c...
Alex Christensen
Reported
2021-11-10 12:21:21 PST
Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8
Attachments
Patch
(20.55 KB, patch)
2021-11-10 12:22 PST
,
Alex Christensen
ews-feeder
: commit-queue-
Details
Formatted Diff
Diff
Patch
(26.20 KB, patch)
2021-11-10 12:32 PST
,
Alex Christensen
no flags
Details
Formatted Diff
Diff
Patch
(25.47 KB, patch)
2021-11-10 14:42 PST
,
Alex Christensen
no flags
Details
Formatted Diff
Diff
Patch
(29.58 KB, patch)
2021-11-29 15:45 PST
,
Alex Christensen
ews-feeder
: commit-queue-
Details
Formatted Diff
Diff
Show Obsolete
(3)
View All
Add attachment
proposed patch, testcase, etc.
Alex Christensen
Comment 1
2021-11-10 12:22:39 PST
Created
attachment 443845
[details]
Patch
Alex Christensen
Comment 2
2021-11-10 12:32:10 PST
Created
attachment 443848
[details]
Patch
Darin Adler
Comment 3
2021-11-10 14:10:24 PST
Comment on
attachment 443848
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=443848&action=review
> Source/JavaScriptCore/ChangeLog:3 > + Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8
Is scanning the characters twice to compute the length and then convert better for performance? Did you measure?
> Source/WTF/wtf/unicode/UTF8Conversion.cpp:46 > + uint8_t buffer[U8_MAX_LENGTH]; > + int32_t i = 0; > + // All 256 possible code points are valid, and the buffer is always big enough, > + // so it's safe to use U8_APPEND_UNSAFE. > + U8_APPEND_UNSAFE(buffer, i, *source); > + length += i;
Why use U8_APPEND_UNSAFE instead of U8_LENGTH?
> Source/WTF/wtf/unicode/UTF8Conversion.cpp:75 > + const UChar* source = sourceStart; > + while (source < sourceEnd) {
Going with the flow of the U16_NEXT macro, we would use sourceStart and j, and not need a source pointer.
> Source/WTF/wtf/unicode/UTF8Conversion.cpp:78 > + U16_NEXT(source, j, sourceEnd - source, ch);
Is this fast enough? Should we make a faster version for when there are no surrogates at all?
> Source/WTF/wtf/unicode/UTF8Conversion.cpp:93 > + // We already checked that ch is a valid code point, and the buffer is always big enough, > + // so it's safe to use U8_APPEND_UNSAFE. > + uint8_t buffer[U8_MAX_LENGTH]; > + int32_t i = 0; > + U8_APPEND_UNSAFE(buffer, i, ch); > + length += i;
Why use U8_APPEND_UNSAFE instead of U8_LENGTH?
> Source/WTF/wtf/unicode/UTF8Conversion.h:35 > ConversionOK, // conversion successful
I’d rename this to just "OK" or "Success" when making this an enum class. Conversion::Conversion is inelegant.
> Source/WTF/wtf/unicode/UTF8Conversion.h:50 > +WTF_EXPORT_PRIVATE size_t latin1ToUTF8Length(const LChar* sourceStart, const LChar* sourceEnd);
I don’t totally love this name, and it’s not quite as nice about the other names in this header. Maybe I would call it something closer to lengthLatin1AsUTF8 or lengthUTF8FromLatin1. I like having length be the first word.
> Source/WTF/wtf/unicode/UTF8Conversion.h:52 > +WTF_EXPORT_PRIVATE size_t utf16ToUTF8Length(const UChar* sourceStart, const UChar* sourceEnd, bool strict = true);
And lengthUTF8FromUTF16 maybe?
Alex Christensen
Comment 4
2021-11-10 14:42:33 PST
Created
attachment 443867
[details]
Patch
Alex Christensen
Comment 5
2021-11-10 14:44:11 PST
(didn't respond to feedback yet, just trying to get it to compile and not crash)
Alex Christensen
Comment 6
2021-11-10 15:33:46 PST
Comment on
attachment 443867
[details]
Patch Also needs MUCH better tests. Will do later
Alex Christensen
Comment 7
2021-11-10 15:47:15 PST
(In reply to Darin Adler from
comment #3
)
> Comment on
attachment 443848
[details]
> Patch > > View in context: >
https://bugs.webkit.org/attachment.cgi?id=443848&action=review
> > > Source/JavaScriptCore/ChangeLog:3 > > + Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8 > > Is scanning the characters twice to compute the length and then convert > better for performance? Did you measure?
Operations needed to UTF-8 encode a string before this patch: 1. Read each character 2. Encode each character 3. Write each converted character 4. memcpy (another read and write) converted characters to another buffer Allocations: 1 or 2 After this patch: 1. Read each character 2. Encode each character 3. Read each character 4. Encode each character again 5. Write each converted character Allocations: 1 I'm definitely planning to measure the performance of long/short/ascii/latin1/utf16 strings on different chips to be sure, but depending on the ratio of memcpy speed to encoding speed and allocation speed I think this may be a win in most cases. There is definitely always one fewer write operation per character after this change.
Darin Adler
Comment 8
2021-11-10 15:59:09 PST
(In reply to Alex Christensen from
comment #7
)
> I'm definitely planning to measure the performance of > long/short/ascii/latin1/utf16 strings on different chips to be sure, but > depending on the ratio of memcpy speed to encoding speed and allocation > speed I think this may be a win in most cases.
Makes sense. If we were always doing an allocation, then I’m sure this would be a win because fastMalloc/free is quite costly. Not as sure for smaller strings where we don’t do any extra allocation. I hope it’s always faster.
Radar WebKit Bug Importer
Comment 9
2021-11-17 12:22:21 PST
<
rdar://problem/85515842
>
Alex Christensen
Comment 10
2021-11-29 15:45:27 PST
Created
attachment 445357
[details]
Patch
Alex Christensen
Comment 11
2021-11-29 20:56:43 PST
Performance measurements indicate that this is a performance regression. It's hard to beat memcpy. I broke out the code cleanup and a slight optimization into
https://bugs.webkit.org/show_bug.cgi?id=233612
I'll use this bug to add String::utf8Length but not use it in String::utf8 once that's landed.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug