Bug 195535 - WebKit has too much of its own UTF-8 code and should rely more on ICU's UTF-8 support
Summary: WebKit has too much of its own UTF-8 code and should rely more on ICU's UTF-8...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Template Framework (show other bugs)
Version: WebKit Nightly Build
Hardware: All All
: P2 Normal
Assignee: Darin Adler
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2019-03-10 14:29 PDT by Darin Adler
Modified: 2020-08-19 17:15 PDT (History)
22 users (show)

See Also:


Attachments
Patch (47.62 KB, patch)
2019-03-10 14:57 PDT, Darin Adler
no flags Details | Formatted Diff | Diff
Patch (47.89 KB, patch)
2019-03-10 15:15 PDT, Darin Adler
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ews100 for mac-highsierra (663.30 KB, application/zip)
2019-03-10 15:58 PDT, EWS Watchlist
no flags Details
Archive of layout-test-results from ews106 for mac-highsierra-wk2 (769.16 KB, application/zip)
2019-03-10 16:04 PDT, EWS Watchlist
no flags Details
Archive of layout-test-results from ews115 for mac-highsierra (574.30 KB, application/zip)
2019-03-10 16:12 PDT, EWS Watchlist
no flags Details
Patch (50.41 KB, patch)
2019-04-21 12:53 PDT, Darin Adler
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ews123 for ios-simulator-wk2 (2.91 MB, application/zip)
2019-04-21 15:03 PDT, EWS Watchlist
no flags Details
Patch (65.11 KB, patch)
2019-04-21 17:12 PDT, Darin Adler
no flags Details | Formatted Diff | Diff
Patch (67.00 KB, patch)
2019-04-28 09:12 PDT, Darin Adler
no flags Details | Formatted Diff | Diff
Patch (69.53 KB, patch)
2019-04-29 19:03 PDT, Darin Adler
ap: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Darin Adler 2019-03-10 14:29:56 PDT
WebKit has too much of its own UTF-8 code and should rely more on ICU's UTF-8 support
Comment 1 Darin Adler 2019-03-10 14:57:18 PDT Comment hidden (obsolete)
Comment 2 Darin Adler 2019-03-10 15:15:33 PDT Comment hidden (obsolete)
Comment 3 EWS Watchlist 2019-03-10 15:58:46 PDT Comment hidden (obsolete)
Comment 4 EWS Watchlist 2019-03-10 15:58:48 PDT Comment hidden (obsolete)
Comment 5 EWS Watchlist 2019-03-10 16:04:08 PDT Comment hidden (obsolete)
Comment 6 EWS Watchlist 2019-03-10 16:04:10 PDT Comment hidden (obsolete)
Comment 7 EWS Watchlist 2019-03-10 16:12:27 PDT Comment hidden (obsolete)
Comment 8 EWS Watchlist 2019-03-10 16:12:29 PDT Comment hidden (obsolete)
Comment 9 Darin Adler 2019-03-10 16:14:06 PDT
Trying to figure out what kind of failure/crash this is.
Comment 10 EWS Watchlist 2019-03-10 17:53:22 PDT Comment hidden (obsolete)
Comment 11 Darin Adler 2019-04-21 12:53:11 PDT Comment hidden (obsolete)
Comment 12 Darin Adler 2019-04-21 12:54:43 PDT
I figured it out: previous patch broke AtomicString::fromUTF8 when passed a null-terminated string, and that’s fixed now.
Comment 13 EWS Watchlist 2019-04-21 15:03:57 PDT Comment hidden (obsolete)
Comment 14 EWS Watchlist 2019-04-21 15:03:59 PDT Comment hidden (obsolete)
Comment 15 jsc-armv7 EWS 2019-04-21 15:17:57 PDT Comment hidden (obsolete)
Comment 16 EWS Watchlist 2019-04-21 15:20:24 PDT Comment hidden (obsolete)
Comment 17 Darin Adler 2019-04-21 17:12:12 PDT Comment hidden (obsolete)
Comment 18 jsc-armv7 EWS 2019-04-21 19:36:28 PDT Comment hidden (obsolete)
Comment 19 EWS Watchlist 2019-04-21 19:40:32 PDT Comment hidden (obsolete)
Comment 20 Darin Adler 2019-04-22 08:34:13 PDT
Not sure how to run those tests, but I’ll look into why they are failing next time I have a chance to work on this.
Comment 21 Alexey Proskuryakov 2019-04-22 09:18:04 PDT
This may not be the optimal way to run just a subset of tests, but at the high level, the command is "run-javascriptcore-tests".
Comment 22 Saam Barati 2019-04-22 11:26:11 PDT
(In reply to Alexey Proskuryakov from comment #21)
> This may not be the optimal way to run just a subset of tests, but at the
> high level, the command is "run-javascriptcore-tests".

You can also run only the layout JSC tests by:
$ Tools/Scripts/run-jsc-stress-tests (--release or --debug) LayoutTests/jsc-layout-tests.yaml


You can run an individual test using:
$ ./Tools/Scripts/run-jsc (--release or --debug) LayoutTests/resources/standalone-pre.js <test> LayoutTests/resources/standalone-post.js

e.g:
$ ./Tools/Scripts/run-jsc (--release or --debug) LayoutTests/resources/standalone-pre.js LayoutTests/js/script-tests/invalid-utf8-in-syntax-error.js LayoutTests/resources/standalone-post.js
Comment 23 Yusuke Suzuki 2019-04-22 11:32:59 PDT
You can also pass `--filter="text"` option to filter out tests :)
Comment 24 Darin Adler 2019-04-28 09:12:31 PDT Comment hidden (obsolete)
Comment 25 Darin Adler 2019-04-28 09:14:15 PDT
Comments in ChangeLog explain why that one test was failing. Not a real problem, just a test that needed to be rebased since the test *output* now includes a U+FFFD rather than an unpaired UTF-16 surrogate. The test behavior is identical, and correct, in both cases.
Comment 26 Alexey Proskuryakov 2019-04-29 10:46:09 PDT
EWS points out that Windows build is broken:

C:\cygwin\home\buildbot\WebKit\Source\WTF\wtf\unicode\UTF8Conversion.cpp(45): error C2220: warning treated as error - no 'object' file generated [C:\cygwin\home\buildbot\WebKit\WebKitBuild\Release\Source\WTF\wtf\WTF.vcxproj]
C:\cygwin\home\buildbot\WebKit\Source\WTF\wtf\unicode\UTF8Conversion.cpp(45): warning C4333: '>>': right shift by too large amount, data loss [C:\cygwin\home\buildbot\WebKit\WebKitBuild\Release\Source\WTF\wtf\WTF.vcxproj]
Comment 27 Darin Adler 2019-04-29 13:05:17 PDT
Yes, I have a patch locally on my machine at home that works around that bug in either the Windows compiler's warning logic or in the Windows version of the ICU headers. Planning to upload it for review. Didn’t get a chance to upload it last night.
Comment 28 Darin Adler 2019-04-29 19:03:07 PDT
Created attachment 368529 [details]
Patch
Comment 29 Darin Adler 2019-04-30 08:22:54 PDT
All EWS tests passing now. Finally ready for someone to review.
Comment 30 Alexey Proskuryakov 2019-04-30 09:30:45 PDT
Comment on attachment 368529 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=368529&action=review

Large patch, somewhat superficial review.

Overall comment: I don't think that hiding the types in "auto result = convertUTF16ToUTF8" and "auto success = convertLatin1ToUTF8" is an improvement, especially as the types are different, and the plan is to change them further. This took me an outstanding amount of effort to follow, with all the subtle differences with regards to where "target exhausted" is a failure, and where it is a success.

> Source/WTF/wtf/text/WTFString.cpp:862
> +    if (!convertUTF8ToUTF16(stringCurrent, reinterpret_cast<const char *>(stringStart + length), &bufferCurrent, bufferCurrent + buffer.size()))

Pre-existing issue, misplaced star.

> Source/WTF/wtf/unicode/UTF8Conversion.cpp:2
> - * Copyright (C) 2007, 2014 Apple Inc. All rights reserved.
> + * Copyright (C) 2007-2019 Apple Inc. All rights reserved.

Have we actually made non-trivial changes every year?

> Source/WebCore/platform/SharedBuffer.cpp:340
> -            result = WTF::Unicode::convertLatin1ToUTF8(&d, d + length, &p, p + buffer.size());
> +            if (!WTF::Unicode::convertLatin1ToUTF8(&d, d + length, &p, p + buffer.size()))

Typically, we export symbols in WTF into the global namespace, and don't use the WTF prefix at call sites. Is there a good reason for WTF::Unicode to be special?

> Source/WebKit/Shared/API/APIString.h:26
> +#pragma once

Do we now fully rely on pragma once even in cross-platform API headers?

> LayoutTests/ChangeLog:14
> +        * css3/escape-dom-api-expected.txt:
> +        * fast/text/dangling-surrogates-expected.txt:
> +        * js/dom/webidl-type-mapping-expected.txt:
> +        * js/invalid-utf8-in-syntax-error-expected.txt:
> +        Updated expected results to have the Unicode replacement character in cases where the
> +        text contains unpaired surrogates. The tests are still doing the same operations, and
> +        still getting the same results, but the text output no longer includes illegal UTF-8.

It would be useful to mention whether the new behavior matches other browser engines.

> LayoutTests/js/invalid-utf8-in-syntax-error.html:1
> +<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">

This should use HTML5 doctype.

How did you create the wrapper? I just verified that "make-new-script-test js/invalid-utf8-in-syntax-error.html" creates a modern wrapper, also without pre/post scripts.
Comment 31 Darin Adler 2019-04-30 11:32:49 PDT
Comment on attachment 368529 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=368529&action=review

Thank you for the review.

> Source/WTF/wtf/text/StringImpl.cpp:1857
> +        auto success = convertLatin1ToUTF8(&characters, characters + length, &buffer, buffer + bufferVector.size());

This should be bool success, not auto success!

>> Source/WTF/wtf/text/WTFString.cpp:862
>> +    if (!convertUTF8ToUTF16(stringCurrent, reinterpret_cast<const char *>(stringStart + length), &bufferCurrent, bufferCurrent + buffer.size()))
> 
> Pre-existing issue, misplaced star.

I’ll fix that.

>> Source/WTF/wtf/unicode/UTF8Conversion.cpp:2
>> + * Copyright (C) 2007-2019 Apple Inc. All rights reserved.
> 
> Have we actually made non-trivial changes every year?

I’ll check and update it correctly.

>> Source/WebCore/platform/SharedBuffer.cpp:340
>> +            if (!WTF::Unicode::convertLatin1ToUTF8(&d, d + length, &p, p + buffer.size()))
> 
> Typically, we export symbols in WTF into the global namespace, and don't use the WTF prefix at call sites. Is there a good reason for WTF::Unicode to be special?

Good point. Worth reconsidering later. I was not changing anything here at this time, but I am willing to look into this.

Long term, maybe we should stop doing the "export WTF symbols into the global namespace" entirely. It’s less important in an era where we don’t write out whole identifiers so often. But my choice to follow the pattern of the nearby code here was not intentionally to move us in any direction.

>> Source/WebKit/Shared/API/APIString.h:26
>> +#pragma once
> 
> Do we now fully rely on pragma once even in cross-platform API headers?

As far as I know, this is not a cross-platform API header. This is a header for a class used in the back end of the C API, not itself part of an API. The API directory is confusingly named since many of the headers in it are to *implement* the API, not part of the API. The entire WebKit API namespace is not itself part of API.

>> LayoutTests/ChangeLog:14
>> +        still getting the same results, but the text output no longer includes illegal UTF-8.
> 
> It would be useful to mention whether the new behavior matches other browser engines.

As I understand it, this is a change in the behavior of our test harnesses, to the text dumping done by DumpRenderTree and WebKitTestRunner; no change to behavior of the browser engine.

>> LayoutTests/js/invalid-utf8-in-syntax-error.html:1
>> +<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
> 
> This should use HTML5 doctype.
> 
> How did you create the wrapper? I just verified that "make-new-script-test js/invalid-utf8-in-syntax-error.html" creates a modern wrapper, also without pre/post scripts.

I’ll use make-new-script-test to regenerate this file. I copied another wrapper from another file in this directory and edited the filename; did not remember there was a make-new-script-test script.
Comment 32 Darin Adler 2019-05-01 08:52:22 PDT
Committed r244821: <https://trac.webkit.org/changeset/244821>
Comment 33 Radar WebKit Bug Importer 2019-05-01 08:54:11 PDT
<rdar://problem/50370040>
Comment 34 Shawn Roberts 2019-05-01 10:02:39 PDT
Seeing 4 Test262 JSC failures on Release and Debug after this change.

FAIL test/built-ins/RegExp/prototype/Symbol.match/builtin-infer-unicode.js (default)
Full Output:
Exception: Test262Error: Expected SameValue(«�», «null») to be true

FAIL test/built-ins/RegExp/prototype/Symbol.match/builtin-infer-unicode.js (strict mode)
Full Output:
Exception: Test262Error: Expected SameValue(«�», «null») to be true

FAIL test/built-ins/RegExp/prototype/exec/u-lastindex-adv.js (default)
Full Output:
Exception: Test262Error: Expected SameValue(«�», «null») to be true

FAIL test/built-ins/RegExp/prototype/exec/u-lastindex-adv.js (strict mode)
Full Output:
Exception: Test262Error: Expected SameValue(«�», «null») to be true

https://build.webkit.org/builders/Apple%20High%20Sierra%20Release%20Test262%20%28Tests%29/builds/2784/steps/test262-test/logs/stdio

https://build.webkit.org/builders/Apple%20High%20Sierra%20Debug%20Test262%20%28Tests%29/builds/2108/steps/test262-test/logs/stdio
Comment 35 Ross Kirsling 2019-05-01 10:11:23 PDT
(In reply to Shawn Roberts from comment #34)
> Seeing 4 Test262 JSC failures on Release and Debug after this change.

Those are already-failing tests so we just need to update test262/expectations.yaml.
Comment 36 Shawn Roberts 2019-05-01 10:14:35 PDT
Reverted r244821 for reason:

Causing

Committed r244827: <https://trac.webkit.org/changeset/244827>
Comment 37 Ryan Haddad 2019-05-01 10:22:28 PDT
(In reply to Ross Kirsling from comment #35)
> (In reply to Shawn Roberts from comment #34)
> > Seeing 4 Test262 JSC failures on Release and Debug after this change.
> 
> Those are already-failing tests so we just need to update
> test262/expectations.yaml.
Hm, the test run before this revision didn't report the tests as failing. The bottom of the stdio shows:

"0 tests newly fail"

I do see the failures when I search the test names, though.

https://build.webkit.org/builders/Apple%20High%20Sierra%20Debug%20Test262%20%28Tests%29/builds/2107/steps/test262-test/logs/stdio
Comment 38 Ryan Haddad 2019-05-01 10:28:44 PDT
(In reply to Ryan Haddad from comment #37)
> (In reply to Ross Kirsling from comment #35)
> > (In reply to Shawn Roberts from comment #34)
> > > Seeing 4 Test262 JSC failures on Release and Debug after this change.
> > 
> > Those are already-failing tests so we just need to update
> > test262/expectations.yaml.

Oh, I see what you are saying. The test was already failing but the output changed from:

Exception: Test262Error: Expected SameValue(«���», «null») to be true

to

Exception: Test262Error: Expected SameValue(«�», «null») to be true
Comment 39 Ryan Haddad 2019-05-01 10:33:55 PDT
I re-landed r244821 and r244822 in https://trac.webkit.org/changeset/244828/webkit.

Will address the test262 failures separately.
Comment 40 Ross Kirsling 2019-05-01 10:49:35 PDT
(In reply to Ryan Haddad from comment #39)
> I re-landed r244821 and r244822 in
> https://trac.webkit.org/changeset/244828/webkit.
> 
> Will address the test262 failures separately.

Committed r244833: <https://trac.webkit.org/changeset/244833>
Comment 41 Darin Adler 2019-05-01 13:02:56 PDT
I see now. The test262 failures are analogous to the layout tests, where we had to rebase because the test harness now writes out strings differently (more correctly) but the actual test results are unchanged. So the expected results need to be rebased.
Comment 42 Alex Christensen 2020-08-19 13:12:02 PDT
Comment on attachment 368529 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=368529&action=review

> Source/WTF/wtf/unicode/UTF8Conversion.cpp:99
> +        U8_NEXT(reinterpret_cast<const uint8_t*>(source), sourceOffset, sourceEnd - source, character);

It appears we would more closely match the behavior of Chrome and Firefox if we used U8_NEXT_OR_FFFD here instead of U8_NEXT.  I'm planning to propose that change in an upcoming patch.
Comment 43 Darin Adler 2020-08-19 13:20:52 PDT
Comment on attachment 368529 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=368529&action=review

>> Source/WTF/wtf/unicode/UTF8Conversion.cpp:99
>> +        U8_NEXT(reinterpret_cast<const uint8_t*>(source), sourceOffset, sourceEnd - source, character);
> 
> It appears we would more closely match the behavior of Chrome and Firefox if we used U8_NEXT_OR_FFFD here instead of U8_NEXT.  I'm planning to propose that change in an upcoming patch.

Given that this is a widely used function, I am not sure that all call sites need the same behavior.
Comment 44 Alex Christensen 2020-08-19 17:15:05 PDT
Ok, feel free to discuss in https://bugs.webkit.org/show_bug.cgi?id=215671