Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8
Created attachment 443845 [details] Patch
Created attachment 443848 [details] Patch
Comment on attachment 443848 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=443848&action=review

> Source/JavaScriptCore/ChangeLog:3
> +        Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8

Is scanning the characters twice to compute the length and then convert better for performance? Did you measure?

> Source/WTF/wtf/unicode/UTF8Conversion.cpp:46
> +        uint8_t buffer[U8_MAX_LENGTH];
> +        int32_t i = 0;
> +        // All 256 possible code points are valid, and the buffer is always big enough,
> +        // so it's safe to use U8_APPEND_UNSAFE.
> +        U8_APPEND_UNSAFE(buffer, i, *source);
> +        length += i;

Why use U8_APPEND_UNSAFE instead of U8_LENGTH?

> Source/WTF/wtf/unicode/UTF8Conversion.cpp:75
> +    const UChar* source = sourceStart;
> +    while (source < sourceEnd) {

Going with the flow of the U16_NEXT macro, we would use sourceStart and j, and not need a source pointer.

> Source/WTF/wtf/unicode/UTF8Conversion.cpp:78
> +        U16_NEXT(source, j, sourceEnd - source, ch);

Is this fast enough? Should we make a faster version for when there are no surrogates at all?

> Source/WTF/wtf/unicode/UTF8Conversion.cpp:93
> +        // We already checked that ch is a valid code point, and the buffer is always big enough,
> +        // so it's safe to use U8_APPEND_UNSAFE.
> +        uint8_t buffer[U8_MAX_LENGTH];
> +        int32_t i = 0;
> +        U8_APPEND_UNSAFE(buffer, i, ch);
> +        length += i;

Why use U8_APPEND_UNSAFE instead of U8_LENGTH?

> Source/WTF/wtf/unicode/UTF8Conversion.h:35
>     ConversionOK, // conversion successful

I'd rename this to just "OK" or "Success" when making this an enum class. Conversion::ConversionOK is inelegant.

> Source/WTF/wtf/unicode/UTF8Conversion.h:50
> +WTF_EXPORT_PRIVATE size_t latin1ToUTF8Length(const LChar* sourceStart, const LChar* sourceEnd);

I don't totally love this name, and it's not quite as nice as the other names in this header. Maybe I would call it something closer to lengthLatin1AsUTF8 or lengthUTF8FromLatin1. I like having length be the first word.

> Source/WTF/wtf/unicode/UTF8Conversion.h:52
> +WTF_EXPORT_PRIVATE size_t utf16ToUTF8Length(const UChar* sourceStart, const UChar* sourceEnd, bool strict = true);

And lengthUTF8FromUTF16 maybe?
Created attachment 443867 [details] Patch
(didn't respond to feedback yet, just trying to get it to compile and not crash)
Comment on attachment 443867 [details] Patch Also needs MUCH better tests. Will do later
(In reply to Darin Adler from comment #3)
> Comment on attachment 443848 [details]
> Patch
>
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=443848&action=review
>
> > Source/JavaScriptCore/ChangeLog:3
> > +        Add WTF::String::utf8Length and use it to allocate correct size buffer when converting to UTF-8
>
> Is scanning the characters twice to compute the length and then convert
> better for performance? Did you measure?

Operations needed to UTF-8 encode a string before this patch:
1. Read each character
2. Encode each character
3. Write each converted character
4. memcpy (another read and write) converted characters to another buffer
Allocations: 1 or 2

After this patch:
1. Read each character
2. Encode each character
3. Read each character again
4. Encode each character again
5. Write each converted character
Allocations: 1

I'm definitely planning to measure the performance of long/short/ascii/latin1/utf16 strings on different chips to be sure, but depending on the ratio of memcpy speed to encoding speed and allocation speed I think this may be a win in most cases. There is always one fewer write operation per character after this change.
(In reply to Alex Christensen from comment #7)
> I'm definitely planning to measure the performance of
> long/short/ascii/latin1/utf16 strings on different chips to be sure, but
> depending on the ratio of memcpy speed to encoding speed and allocation
> speed I think this may be a win in most cases.

Makes sense. If we were always doing an allocation, then I'm sure this would be a win because fastMalloc/free is quite costly. Not as sure for smaller strings where we don't do any extra allocation. I hope it's always faster.
<rdar://problem/85515842>
Created attachment 445357 [details] Patch
Performance measurements indicate that this is a performance regression. It's hard to beat memcpy.

I broke out the code cleanup and a slight optimization into https://bugs.webkit.org/show_bug.cgi?id=233612

Once that's landed, I'll use this bug to add String::utf8Length, but not use it in String::utf8.