178207 – UTF-8 decoding produces one replacement character per byte; Encoding standard requires one replacement character per illegal sequence instead

RESOLVED FIXED178207

UTF-8 decoding produces one replacement character per byte; Encoding standard requires one replacement character per illegal sequence instead

https://bugs.webkit.org/show_bug.cgi?id=178207

Summary UTF-8 decoding produces one replacement character per byte; Encoding standard...

Marja Hölttä

Reported 2017-10-12 05:10:32 PDT

See https://bugs.chromium.org/p/chromium/issues/detail?id=773320 for the same bug in Blink. In that bug, we saw a script where an invalid byte sequence occurs and is split between two chunks (of size 4096): 0b11100000 << lead << 0xe0 0b10100101 << cont << 0xa5 0b00111111 << ascii << 0x3f The bug is that TextCodecUTF8::HandlePartialSequence calls TextCodecUTF8::HandleError which assumes that each error consumes one byte from the byte stream and produces an invalid char. However, that ignores the fact that we need to consume multiple bytes (i.e., the maximal subpart).

Attachments
Patch (35.76 KB, patch) 2017-10-15 00:07 PDT, Darin Adler	no flags	Details Formatted Diff Diff
Patch (39.06 KB, patch) 2017-10-15 00:52 PDT, Darin Adler	no flags	Details Formatted Diff Diff
Patch (37.86 KB, patch) 2017-10-15 08:16 PDT, Darin Adler	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Radar WebKit Bug Importer

Comment 1 2017-10-13 09:44:53 PDT

<rdar://problem/34979603>

Alexey Proskuryakov

Comment 2 2017-10-13 09:45:16 PDT

Test case and patch available in the Chromium bug. Thanks!

David Kilzer (:ddkilzer)

Comment 3 2017-10-14 02:49:16 PDT

Blink commit: https://chromium.googlesource.com/chromium/src.git/+/79806338f4762c005b99808c8e97edf4d176b621

Darin Adler

Comment 4 2017-10-14 15:04:38 PDT

I think I’ll fix this. This might be better tested by writing a unit test for TextCodecUTF8 rather than trying to use a layout test. That would avoid the dependency on buffer sizes. I don’t know where the knowledge that a buffer is 4kB comes from and I would not want the test to stop being effective if we switched to, say, an 8kB buffer size. I also think a better fix would be to get rid of the handleError function; it’s not really working well.

Darin Adler

Comment 5 2017-10-14 15:06:04 PDT

To be thorough need to test both the case where the replacement character is the first 16-bit character and the case where there is a preceding 16-bit character.

Darin Adler

Comment 6 2017-10-14 15:06:50 PDT

And also the "flush" and "no flush" variants.

Darin Adler

Comment 7 2017-10-14 22:23:17 PDT

There is not a bug here that is specific to chunk boundaries. The code handles chunk boundaries correctly. The issue is that the UTF-8 decoder produces too many replacement characters. The Encoding standard and "Best Practices for Using U+FFFD" from the Unicode specification require one replacement character for each illegal sequence, not one per byte in an illegal sequence. We weren’t trying to implement that before, but now know that we should.

Darin Adler

Comment 8 2017-10-14 22:27:35 PDT

Henri Sivonen wrote a paper about this topic <https://hsivonen.fi/broken-utf-8/>.

Darin Adler

Comment 9 2017-10-14 22:31:29 PDT

I believe that the Chromium comments about this being a "very old bug" are incorrect. I think someone implemented new behavior in Chromium and didn’t handle chunk boundaries correctly. The very old code was correct for the old desired behavior, and became incorrect only recently when the behavior of the non-chunk-boundaries code was changed.

Darin Adler

Comment 10 2017-10-14 23:46:32 PDT

This code is not in WTF; it’s in the platform layer. And a tiny bit of trivia: Henri’s paper incorrectly claims that it is in WTF.

Darin Adler

Comment 11 2017-10-15 00:07:37 PDT

Created attachment 323840 [details] Patch

Darin Adler

Comment 12 2017-10-15 00:52:11 PDT

Created attachment 323841 [details] Patch

Darin Adler

Comment 13 2017-10-15 08:16:36 PDT

Created attachment 323843 [details] Patch

Darin Adler

Comment 14 2017-10-15 08:33:10 PDT

WTF has a different set of UTF-8 functions. Not sure how they handle replacement characters.

Sam Weinig

Comment 15 2017-10-15 12:46:36 PDT

Comment on attachment 323843 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=323843&action=review > Source/WebCore/platform/text/TextCodecUTF8.cpp:441 > + Vector<char, 3000> bytes(length * 3); How did you decide on 3000? > Source/WebCore/platform/text/TextCodecUTF8.cpp:448 > + return CString { bytes.data(), bytesWritten }; It's too bad we don't have a way to avoid this copy at least in some instances. Perhaps at some point, we should rework CStringBuffer to allow for this.

WebKit Commit Bot

Comment 16 2017-10-15 13:13:55 PDT

Comment on attachment 323843 [details] Patch Clearing flags on attachment: 323843 Committed r223329: <https://trac.webkit.org/changeset/223329>

WebKit Commit Bot

Comment 17 2017-10-15 13:13:57 PDT

All reviewed patches have been landed. Closing bug.

Darin Adler

Comment 18 2017-10-15 17:00:56 PDT

Comment on attachment 323843 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=323843&action=review >> Source/WebCore/platform/text/TextCodecUTF8.cpp:441 >> + Vector<char, 3000> bytes(length * 3); > > How did you decide on 3000? I just kind of thought "1000 * 3", but it’s so random. I am getting a little annoyed at Vector<uint8_t> vs. Vector<char> vs. CString in different places. For the same kinds of purposes. We really need to standardize on Vector<uint8_t> vs. Vector<char>. I think that encode should probably return a Vector and not a CString. >> Source/WebCore/platform/text/TextCodecUTF8.cpp:448 >> + return CString { bytes.data(), bytesWritten }; > > It's too bad we don't have a way to avoid this copy at least in some instances. Perhaps at some point, we should rework CStringBuffer to allow for this. I think we should minimize the use of CString in the future. Not sure we need that class.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component Platform

Assignee

Darin Adler

Reported

2017-10-12 05:10 PDT

Modified

2017-11-05 17:03 PST History

CC List

8 users Show

URL

Keywords BlinkMergeCandidate, InRadar

Depends on

Blocks