235307 – TextCodec should treat lone surrogates as the replacement character

https://bugs.webkit.org/show_bug.cgi?id=235307

Andreu Botella

Reported 2022-01-17 17:46:23 PST

Created attachment 449360: Test case for lone surrogate character entities in URL parsing

WebCore uses TextCodec as its implementation of the Encoding Standard, and the standard's "encode" algorithm (https://encoding.spec.whatwg.org/#encode) corresponds to calling TextCodec::encode with UnencodableHandling::Entities. However, the Encoding Standard algorithm only accepts a scalar value string as its input (or anything else that can be converted to an I/O queue of scalar values; https://encoding.spec.whatwg.org/#to-i-o-queue-convert), whereas TextCodec::encode is called with arbitrary `StringView`s, which can contain lone surrogates.

This would be fine if the codecs handled lone surrogates as if they were the replacement character, but they don't: the UTF-8 encoder emits invalid UTF-8 (for example, U+D800 encodes to 0xED 0xA0 0x80; see https://simonsapin.github.io/wtf-8), and the other codecs emit a character entity for the lone surrogate (&#55296; for example).

The non-UTF-8 case can be observed through URL parsing; see the attached test case. The spec here is a bit convoluted, but when the URL parser is in the query state (https://url.spec.whatwg.org/#query-state), it calls into "percent-encode after encoding", which itself calls the Encoding Standard's "encode or fail", and if that fails, it emits a URL-encoded character entity for the unencodable code point. The input to "encode or fail" must also be a scalar value string, and therefore the unencodable code point in the error returned by that algorithm can't be a surrogate code point. This bug doesn't affect URL parsing with the UTF-8 encoding because it uses a different code path.
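
The encoder behavior described above can be illustrated outside WebKit. The following is a minimal Python sketch, not WebKit code: `to_scalar_values` is a hypothetical helper that approximates the Encoding Standard's conversion of UTF-16 code units to scalar values, and Python's `surrogatepass` error handler stands in (as an assumption) for a UTF-8 encoder that performs no surrogate check.

```python
REPLACEMENT_CHARACTER = 0xFFFD


def to_scalar_values(code_units):
    """Convert UTF-16 code units to scalar values.

    Valid surrogate pairs are combined; lone surrogates become U+FFFD.
    (Hypothetical helper for illustration, not a WebKit function.)
    """
    result = []
    i = 0
    while i < len(code_units):
        unit = code_units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        next_is_low = (
            i + 1 < len(code_units) and 0xDC00 <= code_units[i + 1] <= 0xDFFF
        )
        if is_high and next_is_low:
            low = code_units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            result.append(REPLACEMENT_CHARACTER)
            i += 1
        else:
            result.append(unit)
            i += 1
    return result


# U+D800 encoded with no surrogate check: the three bytes quoted in the report.
unchecked_utf8 = chr(0xD800).encode("utf-8", "surrogatepass")

# The character entity a non-UTF-8 codec would emit for the same code unit.
entity = "&#%d;" % 0xD800

# "a", lone high surrogate, "b", converted to scalar values before encoding.
scalars = to_scalar_values([0x61, 0xD800, 0x62])
checked_utf8 = "".join(chr(s) for s in scalars).encode("utf-8")
```

Here `unchecked_utf8` is the byte sequence ED A0 80, which a strict UTF-8 decoder rejects, while `checked_utf8` contains EF BF BD (U+FFFD) in place of the surrogate.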
Observing the UTF-8 case in web content is not easy, but on a Windows system you can create files with lone surrogates in their names, and in the Windows WebKit port, uploading those in a multipart/form-data form will result in the form payload containing invalid UTF-8 (or surrogate character entities, if the page uses some other encoding); see https://github.com/whatwg/html/issues/7413.

Both of these bugs could be fixed by converting the strings in the URL parsing and form submission code before passing them to TextCodec, but it arguably should be fixed in the TextCodec implementations in any case to prevent regressions.

However, there does seem to be one case where surrogate character entities are not necessarily a bug, and that is saving a page to disk. Right click / Save As will serialize the DOM and any associated resources as MHTML, with the intent that opening it will round-trip the DOM. If there is a lone surrogate in the DOM, and the page's encoding isn't UTF-8, the serialization will emit a surrogate character entity which will correctly round-trip as a surrogate in the DOM. However, if the page's encoding is UTF-8, the surrogate will serialize as (quoted-printable) invalid UTF-8. If this is a case worth keeping (which it might not be), there should be a new variant in the UnencodableHandling enum to have lone surrogates serialize as surrogate character entities in all encodings.

See also the corresponding Chromium bug: https://crbug.com/1285987
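
For the MHTML case with a UTF-8 page, the quoted-printable output can be sketched in Python. This is an illustration under assumptions, not the WebKit serializer: the `surrogatepass` handler stands in for an encoder with no surrogate check, and `is_valid_utf8` is a hypothetical helper.

```python
import quopri

# U+D800 encoded with no surrogate check (assumption: this mirrors the
# unchecked UTF-8 encoder described in the report).
invalid_utf8 = chr(0xD800).encode("utf-8", "surrogatepass")

# Quoted-printable form, as it would appear in an MHTML part.
quoted = quopri.encodestring(invalid_utf8)


def is_valid_utf8(data):
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Decoding the quoted-printable text gives back the same invalid bytes.
round_tripped = quopri.decodestring(quoted)
```

The quoted-printable layer round-trips the bytes exactly, so the invalid sequence survives into whatever decodes the saved part as UTF-8.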

Attachments
Test case for lone surrogate character entities in URL parsing (766 bytes, text/html), 2022-01-17 17:46 PST, Andreu Botella

Radar WebKit Bug Importer

Comment 1 2022-01-24 17:47:14 PST

<rdar://problem/88000033>

Ahmad Saleem

Comment 2 2022-11-30 02:42:37 PST

*** Firefox Nightly 109 ***
href attribute of link is: "?a\ud800b" (should be "?a\ud800b")
href property of link is: "https://bug-235307-attachments.webkit.org/attachment.cgi?a%EF%BF%BDb" (should end in "?a%26%2365533%3Bb")

*** Chrome Canary 110 ***
href attribute of link is: "?a\ud800b" (should be "?a\ud800b")
href property of link is: "https://bug-235307-attachments.webkit.org/attachment.cgi?a%EF%BF%BDb" (should end in "?a%26%2365533%3Bb")

*** Safari 16.1 ***
href attribute of link is: "?a\ud800b" (should be "?a\ud800b")
href property of link is: "https://bug-235307-attachments.webkit.org/attachment.cgi?a%EF%BF%BDb" (should end in "?a%26%2365533%3Bb")

All browsers match, or am I testing it wrong?

JSFiddle - https://jsfiddle.net/b50n7e2s/ (same test, but taken from the Chrome / Blink bug).

Anne van Kesteren

Comment 3 2022-11-30 05:23:11 PST

That test doesn't seem to test windows-1252 (due to JSFiddle forcing UTF-8), but when actually testing windows-1252 all browsers seem to agree as well: https://github.com/web-platform-tests/wpt/pull/37250.

However,

1. Comment 0 also describes a problem on Windows that might still exist.
2. Code inspection shows that https://github.com/WebKit/WebKit/blob/5e81d33ff5c0150dbabbebbe2e96fb08ff4d6ad3/Source/WebCore/PAL/pal/text/TextCodecUTF8.cpp#L461-L472 does not do surrogate handling.

(Also, if as comment 0 suggests this is somehow intentional, which I suspect it's not, it shouldn't be called UTF-8.)
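
The two query strings in comment 2 ("?a%EF%BF%BDb" observed, "?a%26%2365533%3Bb" expected) differ only in the encoding used for the query. A simplified Python sketch of the URL Standard's "percent-encode after encoding" shows how each arises. Several things here are assumptions: `percent_encode_query` is a hypothetical name, Python's `cp1252` codec stands in for windows-1252 (the two differ for a few bytes, but both treat U+FFFD as unencodable), the percent-encode set is written from memory of the special-query set, and encoding one character at a time would not be correct for stateful encodings.

```python
# Bytes that are percent-encoded in addition to C0 controls and bytes
# above 0x7E: space, ", #, ', <, > (assumed special-query set).
QUERY_SET = set(b" \"#'<>")


def percent_encode_query(text, encoding):
    """Simplified "percent-encode after encoding" for a URL query."""
    pieces = []
    for ch in text:
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            # Unencodable code point: emit a percent-encoded "&#N;" entity.
            pieces.append("%26%23" + str(ord(ch)) + "%3B")
            continue
        for byte in encoded:
            if byte <= 0x1F or byte > 0x7E or byte in QUERY_SET:
                pieces.append("%%%02X" % byte)
            else:
                pieces.append(chr(byte))
    return "".join(pieces)


# Input after the lone surrogate has been replaced by U+FFFD.
as_utf8 = percent_encode_query("a\ufffdb", "utf-8")
as_windows_1252 = percent_encode_query("a\ufffdb", "cp1252")

# Input with the lone surrogate left in place, non-UTF-8 encoding.
unreplaced = percent_encode_query("a\ud800b", "cp1252")
```

With UTF-8 the result is "a%EF%BF%BDb", matching what the JSFiddle test observed. With the windows-1252 stand-in the result is "a%26%2365533%3Bb", the expected value from the attached test. Leaving the surrogate in place yields "a%26%2355296%3Bb", the surrogate entity described in comment 0.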

Anne van Kesteren

Comment 4 2023-03-27 07:50:07 PDT

Hmm, I'm no longer convinced there's a problem here. Especially since Windows is no longer targeted. Andreu, what do you think?

Anne van Kesteren

Comment 5 2023-04-02 08:04:06 PDT

I was wrong about the non-UTF-8 encoders: https://github.com/web-platform-tests/wpt/pull/39324. I created bug 254888 to fix that. Keeping this open to find out if the UTF-8 issue is exposed somewhere.


Status NEW
Priority P2
Severity Normal
Classification Unclassified
Version WebKit Nightly Build
Hardware Unspecified
OS Unspecified
Product WebKit
Component WebCore Misc.
Assignee Nobody
Reported 2022-01-17 17:46 PST
Modified 2023-04-02 08:04 PDT
CC List 6 users
Keywords InRadar
Depends on 254888
Blocks 179303