233921 – TextDecoder doesn't detect invalid UTF-8 sequences early enough

RESOLVED FIXED 233921

https://bugs.webkit.org/show_bug.cgi?id=233921

Andreu Botella

Reported 2021-12-07 03:55:22 PST

WPT tests:
https://wpt.fyi/results/encoding/textdecoder-eof.any.html?label=experimental&label=master&aligned (stream: true case)
https://wpt.fyi/results/encoding/textdecoder-streaming.any.html?label=experimental&label=master&aligned
https://wpt.fyi/results/encoding/streams/decode-utf8.any.html?label=experimental&label=master&aligned (non-SharedArrayBuffer cases)

Related Chromium bug: https://bugs.chromium.org/p/chromium/issues/detail?id=796697

When the TextCodecUTF8 decoder finds a non-ASCII lead byte, it waits until enough bytes are consumed to make a valid sequence starting at that position, before starting to process the bytes. This goes against the encoding spec, which requires the replacement character to be emitted as soon as enough bytes are consumed to tell that the sequence is in fact invalid.

While this does not make a difference for non-streaming input, or for streaming data coming from the network, it does make a difference in that TextDecoder returns the wrong result as per the spec when in streaming mode:

const decoder = new TextDecoder();
console.log(decoder.decode(new Uint8Array([0xF0, 0x9F]), { stream: true }));
console.log(decoder.decode(new Uint8Array([0x41]), { stream: true }));
console.log(decoder.decode(new Uint8Array([0x42]), { stream: true }));

As per the spec, and in Firefox and Chromium 98, this prints "", "�A", "B". In WebKit and previous versions of Chromium, it prints "", "", "�AB".
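For illustration, the behavior the spec asks for can be sketched as a small byte-at-a-time state machine. This is only a sketch written for this report, not WebKit's TextCodecUTF8 code and not the Chromium patch: the class name SketchUtf8Decoder and its fields are hypothetical, it always uses replacement error handling, and it skips BOM handling. The point it shows is that an out-of-range continuation byte produces U+FFFD immediately and is then reprocessed as the start of a new sequence, so no output is held back across decode() calls.

```javascript
// Hypothetical sketch of a spec-style streaming UTF-8 decoder.
// Not WebKit code; no BOM handling; errors always become U+FFFD.
class SketchUtf8Decoder {
  constructor() {
    this.reset();
  }

  // Return to the "no pending sequence" state.
  reset() {
    this.codePoint = 0;
    this.bytesSeen = 0;
    this.bytesNeeded = 0;
    this.lower = 0x80; // allowed range for the next continuation byte
    this.upper = 0xBF;
  }

  decode(bytes, { stream = false } = {}) {
    let out = "";
    for (let i = 0; i < bytes.length; i++) {
      const b = bytes[i];
      if (this.bytesNeeded === 0) {
        if (b <= 0x7F) {
          out += String.fromCharCode(b);
        } else if (b >= 0xC2 && b <= 0xDF) {
          this.bytesNeeded = 1;
          this.codePoint = b & 0x1F;
        } else if (b >= 0xE0 && b <= 0xEF) {
          if (b === 0xE0) this.lower = 0xA0; // reject overlong forms
          if (b === 0xED) this.upper = 0x9F; // reject surrogates
          this.bytesNeeded = 2;
          this.codePoint = b & 0x0F;
        } else if (b >= 0xF0 && b <= 0xF4) {
          if (b === 0xF0) this.lower = 0x90; // reject overlong forms
          if (b === 0xF4) this.upper = 0x8F; // reject values above U+10FFFF
          this.bytesNeeded = 3;
          this.codePoint = b & 0x07;
        } else {
          out += "\uFFFD"; // byte can never start a sequence
        }
        continue;
      }
      if (b < this.lower || b > this.upper) {
        // The pending sequence is now known to be invalid: emit U+FFFD
        // right away and reprocess this byte as a fresh lead byte.
        this.reset();
        out += "\uFFFD";
        i--;
        continue;
      }
      this.lower = 0x80;
      this.upper = 0xBF;
      this.codePoint = (this.codePoint << 6) | (b & 0x3F);
      this.bytesSeen++;
      if (this.bytesSeen === this.bytesNeeded) {
        out += String.fromCodePoint(this.codePoint);
        this.reset();
      }
    }
    if (!stream) {
      // End of input: a truncated sequence becomes one U+FFFD.
      if (this.bytesNeeded !== 0) out += "\uFFFD";
      this.reset();
    }
    return out;
  }
}

const sketch = new SketchUtf8Decoder();
console.log(JSON.stringify(sketch.decode([0xF0, 0x9F], { stream: true }))); // ""
console.log(JSON.stringify(sketch.decode([0x41], { stream: true })));       // "\uFFFDA" (shown as "�A")
console.log(JSON.stringify(sketch.decode([0x42], { stream: true })));       // "B"
```

With the same three chunks as the test case above, the sketch yields "", "�A", "B", which is the result the spec requires. A decoder that waits for four bytes after seeing 0xF0 would instead hold "�A" back until the third call.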

Attachments
Patch (16.63 KB, patch) 2021-12-13 05:50 PST, Andreu Botella, no flags
Patch (18.30 KB, patch) 2021-12-13 14:38 PST, Andreu Botella, no flags

Alex Christensen

Comment 1 2021-12-08 13:23:00 PST

I think we will need to do basically the same thing as Chromium, but it won't be a simple cherry pick because their DecodeNonASCIISequence returns a different value than our decodeNonASCIISequence and doesn't change the length, and their handlePartialSequence for LChar and UChar are more similar than ours are.

Alex Christensen

Comment 2 2021-12-09 09:50:17 PST

*** Bug 234030 has been marked as a duplicate of this bug. ***

Andreu Botella

Comment 3 2021-12-13 05:50:55 PST

Created attachment 446994 [details] Patch

EWS Watchlist

Comment 4 2021-12-13 05:52:03 PST

This patch modifies the imported WPT tests. Please ensure that any changes on the tests (not coming from a WPT import) are exported to WPT. Please see https://trac.webkit.org/wiki/WPTExportProcess

Andreu Botella

Comment 5 2021-12-13 14:38:43 PST

Created attachment 447070 [details] Patch

Radar WebKit Bug Importer

Comment 6 2021-12-14 03:56:16 PST

<rdar://problem/86461983>

EWS

Comment 7 2021-12-14 08:30:29 PST

Committed r287024 (245229@main): <https://commits.webkit.org/245229@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 447070.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component DOM

Assignee Nobody

Reported 2021-12-07 03:55 PST

Modified 2021-12-14 08:30 PST

CC List 10 users

URL

Keywords InRadar

Duplicates (1) 234030

Depends on

Blocks