WebKit Bugzilla
New
Browse
Search+
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
178207
UTF-8 decoding produces one replacement character per byte; Encoding standard requires one replacement character per illegal sequence instead
https://bugs.webkit.org/show_bug.cgi?id=178207
Summary
UTF-8 decoding produces one replacement character per byte; Encoding standard...
Marja Hölttä
Reported
2017-10-12 05:10:32 PDT
See
https://bugs.chromium.org/p/chromium/issues/detail?id=773320
for the same bug in Blink. In that bug, we saw a script where an invalid byte sequence occurs and is split between two chunks (of size 4096): 0b11100000 << lead << 0xe0 0b10100101 << cont << 0xa5 0b00111111 << ascii << 0x3f The bug is that TextCodecUTF8::HandlePartialSequence calls TextCodecUTF8::HandleError which assumes that each error consumes one byte from the byte stream and produces an invalid char. However, that ignores the fact that we need to consume multiple bytes (i.e., the maximal subpart).
Attachments
Patch
(35.76 KB, patch)
2017-10-15 00:07 PDT
,
Darin Adler
no flags
Details
Formatted Diff
Diff
Patch
(39.06 KB, patch)
2017-10-15 00:52 PDT
,
Darin Adler
no flags
Details
Formatted Diff
Diff
Patch
(37.86 KB, patch)
2017-10-15 08:16 PDT
,
Darin Adler
no flags
Details
Formatted Diff
Diff
Show Obsolete
(2)
View All
Add attachment
proposed patch, testcase, etc.
Radar WebKit Bug Importer
Comment 1
2017-10-13 09:44:53 PDT
<
rdar://problem/34979603
>
Alexey Proskuryakov
Comment 2
2017-10-13 09:45:16 PDT
Test case and patch available in the Chromium bug. Thanks!
David Kilzer (:ddkilzer)
Comment 3
2017-10-14 02:49:16 PDT
Blink commit:
https://chromium.googlesource.com/chromium/src.git/+/79806338f4762c005b99808c8e97edf4d176b621
Darin Adler
Comment 4
2017-10-14 15:04:38 PDT
I think I’ll fix this. This might be better tested by writing a unit test for TextCodecUTF8 rather than trying to use a layout test. That would avoid the dependency on buffer sizes. I don’t know where the knowledge that a buffer is 4kB comes from and I would not want the test to stop being effective if we switched to, say, an 8kB buffer size. I also think a better fix would be to get rid of the handleError function; it’s not really working well.
Darin Adler
Comment 5
2017-10-14 15:06:04 PDT
To be thorough need to test both the case where the replacement character is the first 16-bit character and the case where there is a preceding 16-bit character.
Darin Adler
Comment 6
2017-10-14 15:06:50 PDT
And also the "flush" and "no flush" variants.
Darin Adler
Comment 7
2017-10-14 22:23:17 PDT
There is not a bug here that is specific to chunk boundaries. The code handles chunk boundaries correctly. The issue is that the UTF-8 decoder produces too many replacement characters. The Encoding standard and "Best Practices for Using U+FFFD" from the Unicode specification require one replacement character for each illegal sequence, not one per byte in an illegal sequence. We weren’t trying to implement that before, but now know that we should.
Darin Adler
Comment 8
2017-10-14 22:27:35 PDT
Henri Sivonen wrote a paper about this topic <
https://hsivonen.fi/broken-utf-8/
>.
Darin Adler
Comment 9
2017-10-14 22:31:29 PDT
I believe that the Chromium comments about this being a "very old bug" are incorrect. I think someone implemented new behavior in Chromium and didn’t handle chunk boundaries correctly. The very old code was correct for the old desired behavior, and became incorrect only recently when the behavior of the non-chunk-boundaries code was changed.
Darin Adler
Comment 10
2017-10-14 23:46:32 PDT
This code is not in WTF; it’s in the platform layer. And a tiny bit of trivia: Henri’s paper incorrectly claims that it is in WTF.
Darin Adler
Comment 11
2017-10-15 00:07:37 PDT
Created
attachment 323840
[details]
Patch
Darin Adler
Comment 12
2017-10-15 00:52:11 PDT
Created
attachment 323841
[details]
Patch
Darin Adler
Comment 13
2017-10-15 08:16:36 PDT
Created
attachment 323843
[details]
Patch
Darin Adler
Comment 14
2017-10-15 08:33:10 PDT
WTF has a different set of UTF-8 functions. Not sure how they handle replacement characters.
Sam Weinig
Comment 15
2017-10-15 12:46:36 PDT
Comment on
attachment 323843
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=323843&action=review
> Source/WebCore/platform/text/TextCodecUTF8.cpp:441 > + Vector<char, 3000> bytes(length * 3);
How did you decide on 3000?
> Source/WebCore/platform/text/TextCodecUTF8.cpp:448 > + return CString { bytes.data(), bytesWritten };
It's too bad we don't have a way to avoid this copy at least in some instances. Perhaps at some point, we should rework CStringBuffer to allow for this.
WebKit Commit Bot
Comment 16
2017-10-15 13:13:55 PDT
Comment on
attachment 323843
[details]
Patch Clearing flags on attachment: 323843 Committed
r223329
: <
https://trac.webkit.org/changeset/223329
>
WebKit Commit Bot
Comment 17
2017-10-15 13:13:57 PDT
All reviewed patches have been landed. Closing bug.
Darin Adler
Comment 18
2017-10-15 17:00:56 PDT
Comment on
attachment 323843
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=323843&action=review
>> Source/WebCore/platform/text/TextCodecUTF8.cpp:441 >> + Vector<char, 3000> bytes(length * 3); > > How did you decide on 3000?
I just kind of thought "1000 * 3", but it’s so random. I am getting a little annoyed at Vector<uint8_t> vs. Vector<char> vs. CString in different places. For the same kinds of purposes. We really need to standardize on Vector<uint8_t> vs. Vector<char>. I think that encode should probably return a Vector and not a CString.
>> Source/WebCore/platform/text/TextCodecUTF8.cpp:448 >> + return CString { bytes.data(), bytesWritten }; > > It's too bad we don't have a way to avoid this copy at least in some instances. Perhaps at some point, we should rework CStringBuffer to allow for this.
I think we should minimize the use of CString in the future. Not sure we need that class.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug