Bug 234030

Summary: TextCodecUTF8 can skip characters after an invalid sequence near EOF
Product: WebKit Reporter: Andreu Botella <andreu>
Component: Page Loading Assignee: Nobody <webkit-unassigned>
Status: RESOLVED DUPLICATE    
Severity: Normal CC: achristensen, andreu, beidson, darin
Priority: P2    
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=233921
Attachments:
Description Flags
Sample to show that this bug affects page loading. none

Description Andreu Botella 2021-12-08 12:55:22 PST
Created attachment 446414
Sample to show that this bug affects page loading.

WPT tests: https://wpt.fyi/results/encoding/textdecoder-eof.any.html?label=experimental&label=master&aligned (also tests for bug 233921).

When the TextCodecUTF8 decoder finds a non-ASCII lead byte, it waits until enough bytes have arrived to form a complete sequence starting at that position before it begins processing them. If the stream is flushed before then, the decoder assumes that all of the remaining bytes belong to a single truncated sequence, so it discards them and emits a single replacement character. That assumption doesn't necessarily hold, and it can result in non-replacement characters being skipped:

// "�A" in Firefox and Chromium 98, and according to the spec.
// "��A" in earlier versions of Chromium.
// "�" in WebKit.
new TextDecoder().decode(new Uint8Array([0xF0, 0x9F, 0x41]));

This can also result in fewer replacement characters being emitted than should be the case:

// "��A" in Firefox, Chrome, and according to the spec.
// "�" in WebKit.
new TextDecoder().decode(new Uint8Array([0xF0, 0x80, 0x41]));
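
For reference, the expected results above can be reproduced with a minimal sketch of a decoder that validates each continuation byte as it arrives rather than waiting for the full sequence length. This is modeled on the UTF-8 decoder steps in the WHATWG Encoding Standard; it is not WebKit's code, it does not stream, and the name decodeUtf8Sketch is hypothetical.

```javascript
// Hypothetical helper, not WebKit code: a non-streaming UTF-8 decoder sketch
// modeled on the WHATWG Encoding Standard's UTF-8 decoder steps.
function decodeUtf8Sketch(bytes) {
  let out = "";
  let codePoint = 0, bytesSeen = 0, bytesNeeded = 0;
  let lower = 0x80, upper = 0xBF; // allowed range for the next continuation byte

  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    if (bytesNeeded === 0) {
      if (b <= 0x7F) {
        out += String.fromCharCode(b);
      } else if (b >= 0xC2 && b <= 0xDF) {
        bytesNeeded = 1; codePoint = b & 0x1F;
      } else if (b >= 0xE0 && b <= 0xEF) {
        if (b === 0xE0) lower = 0xA0; // reject overlong 3-byte forms
        if (b === 0xED) upper = 0x9F; // reject surrogates
        bytesNeeded = 2; codePoint = b & 0x0F;
      } else if (b >= 0xF0 && b <= 0xF4) {
        if (b === 0xF0) lower = 0x90; // reject overlong 4-byte forms
        if (b === 0xF4) upper = 0x8F; // reject values above U+10FFFF
        bytesNeeded = 3; codePoint = b & 0x07;
      } else {
        out += "\uFFFD"; // stray continuation byte or invalid lead byte
      }
      continue;
    }
    if (b < lower || b > upper) {
      // Invalid continuation: emit one U+FFFD for the partial sequence,
      // then reprocess this same byte as the start of a new sequence.
      codePoint = bytesSeen = bytesNeeded = 0;
      lower = 0x80; upper = 0xBF;
      out += "\uFFFD";
      i--;
      continue;
    }
    lower = 0x80; upper = 0xBF;
    codePoint = (codePoint << 6) | (b & 0x3F);
    if (++bytesSeen === bytesNeeded) {
      out += String.fromCodePoint(codePoint);
      codePoint = bytesSeen = bytesNeeded = 0;
    }
  }
  // Input ended in the middle of a still-valid sequence: one U+FFFD.
  if (bytesNeeded !== 0) out += "\uFFFD";
  return out;
}
```

With this sketch, decodeUtf8Sketch([0xF0, 0x9F, 0x41]) yields U+FFFD followed by "A", and decodeUtf8Sketch([0xF0, 0x80, 0x41]) yields two U+FFFD followed by "A". A single U+FFFD at end of input is produced only when every buffered byte was a valid prefix of a sequence.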

This bug also affects page loading, as the attached sample shows.

Comment 1 Alex Christensen 2021-12-09 09:50:17 PST

*** This bug has been marked as a duplicate of bug 233921 ***

Comment 2 Alex Christensen 2021-12-09 09:50:33 PST
This will be fixed with the same fix as bug 233921.