Bug 235307 - TextCodec should treat lone surrogates as the replacement character
Summary: TextCodec should treat lone surrogates as the replacement character
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Nobody
Keywords: InRadar
Depends on:
Blocks: 179303
  Show dependency treegraph
Reported: 2022-01-17 17:46 PST by Andreu Botella
Modified: 2022-09-27 07:39 PDT (History)
5 users (show)

See Also:

Test case for lone surrogate character entities in URL parsing (766 bytes, text/html)
2022-01-17 17:46 PST, Andreu Botella
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andreu Botella 2022-01-17 17:46:23 PST
Created attachment 449360 [details]
Test case for lone surrogate character entities in URL parsing

WebCore uses TextCodec as its implementation of the encoding standard, and the "encode" algorithm in it (https://encoding.spec.whatwg.org/#encode) corresponds to calling TextCodec::encode with UnencodableHandling::Entities. However, the encoding standard algorithm only allows a scalar value string as its input (or anything else which can be converted to an I/O queue of scalar values; https://encoding.spec.whatwg.org/#to-i-o-queue-convert), whereas TextCodec::encode is called with arbitrary `StringView`s, which can contain lone surrogates. This would be fine if the codecs handled lone surrogates as if they were the replacement character, but they don't: the UTF-8 encoder emits invalid UTF-8 (for example, U+D800 encodes to 0xED 0xA0 0x80, which is invalid UTF-8; see https://simonsapin.github.io/wtf-8), and the rest of codecs emit a character entity for the lone surrogate (� for example).

The non-UTF-8 case can be observed through URL parsing; see the attached test case. The spec here is a bit convoluted, but when the URL parser is in the query state (https://url.spec.whatwg.org/#query-state), it calls into "percent-encode after encoding", which itself calls the Encoding Standard's "encode or fail", and if that fails, it emits a URL-encoded character entity for the unencodable code point. The input to "encode or fail" must also be a scalar value string, and therefore the unencodable code point in the error thrown by that algorithm can't be a surrogate code point. This bug doesn't affect URL parsing with the UTF-8 encoding because it uses a different code path.

Observing the UTF-8 case in web content is not easy, but in a Windows system you can create files with lone surrogates, and in the Windows WebKit port, uploading those in a multipart/form-data form will result in the form payload containing invalid UTF-8 (or surrogate character entities, if the page uses some other encoding) – see https://github.com/whatwg/html/issues/7413. Both of these bugs could be fixed by converting the strings in the URL parsing and form submission code before passing them to TextCodec, but it arguably should be fixed in the TextCodec implementations in any case to prevent regressions.

However, there does seem to be one case where surrogate character entities are not necessarily a bug, and that is saving a page to disk. Right click / Save As will serialize the DOM and any associated resources as MHTML, with the intent that opening it will round-trip the DOM. If there is a lone surrogate in the DOM, and the page's encoding isn't UTF-8, the serialization will emit a surrogate character entity which will correctly round-trip as a surrogate in the DOM. However, if the page's encoding is UTF-8, the surrogate will serialize as (quoted-printable) invalid UTF-8. If this is a case worth keeping –which it might not be–, there should be a new variant in the UnencodableHandling enum to have lone surrogates serialize as surrogate character entities in all encodings.

See also the corresponding Chromium bug: https://crbug.com/1285987
Comment 1 Radar WebKit Bug Importer 2022-01-24 17:47:14 PST