WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
233921
TextDecoder doesn't detect invalid UTF-8 sequences early enough
https://bugs.webkit.org/show_bug.cgi?id=233921
Summary
TextDecoder doesn't detect invalid UTF-8 sequences early enough
Andreu Botella
Reported
2021-12-07 03:55:22 PST
WPT tests:
https://wpt.fyi/results/encoding/textdecoder-eof.any.html?label=experimental&label=master&aligned
(stream: true case),
https://wpt.fyi/results/encoding/textdecoder-streaming.any.html?label=experimental&label=master&aligned
,
https://wpt.fyi/results/encoding/streams/decode-utf8.any.html?label=experimental&label=master&aligned
(non-SharedArrayBuffer cases) Related Chromium bug:
https://bugs.chromium.org/p/chromium/issues/detail?id=796697
When the TextCodecUTF8 decoder finds a non-ASCII lead byte, it waits until enough bytes are consumed to make a valid sequence starting at that position, before starting to process the bytes. This goes against the encoding spec, which requires the replacement character to be emitted as soon as enough bytes are consumed to tell that the sequence is in fact invalid. While this does not make a difference for non-streaming input, or for streaming data coming from the network, it does make a difference in that TextDecoder returns the wrong result as per the spec when in streaming mode: const decoder = new TextDecoder(); console.log(decoder.decode(new Uint8Array([0xF0, 0x9F]), { stream: true })); console.log(decoder.decode(new Uint8Array([0x41]), { stream: true })); console.log(decoder.decode(new Uint8Array([0x42]), { stream: true })); As per the spec, and in Firefox and Chromium 98, this prints "", "�A", "B". In WebKit and previous versions of Chromium, it prints "", "", "�AB".
Attachments
Patch
(16.63 KB, patch)
2021-12-13 05:50 PST
,
Andreu Botella
no flags
Details
Formatted Diff
Diff
Patch
(18.30 KB, patch)
2021-12-13 14:38 PST
,
Andreu Botella
no flags
Details
Formatted Diff
Diff
Show Obsolete
(1)
View All
Add attachment
proposed patch, testcase, etc.
Alex Christensen
Comment 1
2021-12-08 13:23:00 PST
I think we will need to do basically the same thing as Chromium, but it won't be a simple cherry pick because their DecodeNonASCIISequence returns a different value than our decodeNonASCIISequence and doesn't change the length, and their handlePartialSequence for LChar and UChar are more similar than ours are.
Alex Christensen
Comment 2
2021-12-09 09:50:17 PST
***
Bug 234030
has been marked as a duplicate of this bug. ***
Andreu Botella
Comment 3
2021-12-13 05:50:55 PST
Created
attachment 446994
[details]
Patch
EWS Watchlist
Comment 4
2021-12-13 05:52:03 PST
This patch modifies the imported WPT tests. Please ensure that any changes on the tests (not coming from a WPT import) are exported to WPT. Please see
https://trac.webkit.org/wiki/WPTExportProcess
Andreu Botella
Comment 5
2021-12-13 14:38:43 PST
Created
attachment 447070
[details]
Patch
Radar WebKit Bug Importer
Comment 6
2021-12-14 03:56:16 PST
<
rdar://problem/86461983
>
EWS
Comment 7
2021-12-14 08:30:29 PST
Committed
r287024
(
245229@main
): <
https://commits.webkit.org/245229@main
> All reviewed patches have been landed. Closing bug and clearing flags on
attachment 447070
[details]
.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug