Bug 235307 - TextCodec should treat lone surrogates as the replacement character
Summary: TextCodec should treat lone surrogates as the replacement character
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Nobody
URL:
Keywords: InRadar
Depends on: 254888
Blocks: 179303
  Show dependency treegraph
 
Reported: 2022-01-17 17:46 PST by Andreu Botella
Modified: 2023-04-02 08:04 PDT (History)
6 users (show)

See Also:


Attachments
Test case for lone surrogate character entities in URL parsing (766 bytes, text/html)
2022-01-17 17:46 PST, Andreu Botella
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andreu Botella 2022-01-17 17:46:23 PST
Created attachment 449360 [details]
Test case for lone surrogate character entities in URL parsing

WebCore uses TextCodec as its implementation of the encoding standard, and the "encode" algorithm in it (https://encoding.spec.whatwg.org/#encode) corresponds to calling TextCodec::encode with UnencodableHandling::Entities. However, the encoding standard algorithm only allows a scalar value string as its input (or anything else which can be converted to an I/O queue of scalar values; https://encoding.spec.whatwg.org/#to-i-o-queue-convert), whereas TextCodec::encode is called with arbitrary `StringView`s, which can contain lone surrogates. This would be fine if the codecs handled lone surrogates as if they were the replacement character, but they don't: the UTF-8 encoder emits invalid UTF-8 (for example, U+D800 encodes to 0xED 0xA0 0x80, which is invalid UTF-8; see https://simonsapin.github.io/wtf-8), and the rest of codecs emit a character entity for the lone surrogate (� for example).

The non-UTF-8 case can be observed through URL parsing; see the attached test case. The spec here is a bit convoluted, but when the URL parser is in the query state (https://url.spec.whatwg.org/#query-state), it calls into "percent-encode after encoding", which itself calls the Encoding Standard's "encode or fail", and if that fails, it emits a URL-encoded character entity for the unencodable code point. The input to "encode or fail" must also be a scalar value string, and therefore the unencodable code point in the error thrown by that algorithm can't be a surrogate code point. This bug doesn't affect URL parsing with the UTF-8 encoding because it uses a different code path.

Observing the UTF-8 case in web content is not easy, but in a Windows system you can create files with lone surrogates, and in the Windows WebKit port, uploading those in a multipart/form-data form will result in the form payload containing invalid UTF-8 (or surrogate character entities, if the page uses some other encoding) – see https://github.com/whatwg/html/issues/7413. Both of these bugs could be fixed by converting the strings in the URL parsing and form submission code before passing them to TextCodec, but it arguably should be fixed in the TextCodec implementations in any case to prevent regressions.

However, there does seem to be one case where surrogate character entities are not necessarily a bug, and that is saving a page to disk. Right click / Save As will serialize the DOM and any associated resources as MHTML, with the intent that opening it will round-trip the DOM. If there is a lone surrogate in the DOM, and the page's encoding isn't UTF-8, the serialization will emit a surrogate character entity which will correctly round-trip as a surrogate in the DOM. However, if the page's encoding is UTF-8, the surrogate will serialize as (quoted-printable) invalid UTF-8. If this is a case worth keeping –which it might not be–, there should be a new variant in the UnencodableHandling enum to have lone surrogates serialize as surrogate character entities in all encodings.

See also the corresponding Chromium bug: https://crbug.com/1285987
Comment 1 Radar WebKit Bug Importer 2022-01-24 17:47:14 PST
<rdar://problem/88000033>
Comment 2 Ahmad Saleem 2022-11-30 02:42:37 PST
*** Firefox Nightly 109 ***

 href attribute of link is: "?a\ud800b" (should be "?a\ud800b")

href property of link is: "https://bug-235307-attachments.webkit.org/attachment.cgi?a%EF%BF%BDb" (should end in "?a%26%2365533%3Bb") 

*** Chrome Canary 110 ***

href attribute of link is: "?a\ud800b" (should be "?a\ud800b")

href property of link is: "https://bug-235307-attachments.webkit.org/attachment.cgi?a%EF%BF%BDb" (should end in "?a%26%2365533%3Bb")

*** Safari 16.1 ***

href attribute of link is: "?a\ud800b" (should be "?a\ud800b")

href property of link is: "https://bug-235307-attachments.webkit.org/attachment.cgi?a%EF%BF%BDb" (should end in "?a%26%2365533%3Bb")

_______-

All browsers are matching or I am testing it wrong?

JSFiddle - https://jsfiddle.net/b50n7e2s/ (same test but took from Chrome / Blink bug).
Comment 3 Anne van Kesteren 2022-11-30 05:23:11 PST
That test doesn't seem to test windows-1252 (due to JSFiddle forcing UTF-8), but when actually testing windows-1252 all browsers seem to agree as well: https://github.com/web-platform-tests/wpt/pull/37250.

However,

1. Comment 0 also describes a problem on Windows that might still exist.
2. Code inspection shows that https://github.com/WebKit/WebKit/blob/5e81d33ff5c0150dbabbebbe2e96fb08ff4d6ad3/Source/WebCore/PAL/pal/text/TextCodecUTF8.cpp#L461-L472 does not do surrogate handling.

(Also, if as comment 0 suggests this is somehow intentional, which I suspect it's not, it shouldn't be called UTF-8.)
Comment 4 Anne van Kesteren 2023-03-27 07:50:07 PDT
Hmm, I'm no longer convinced there's a problem here. Especially since Windows is no longer targeted.

Andreu, what do you think?
Comment 5 Anne van Kesteren 2023-04-02 08:04:06 PDT
I was wrong about the non-UTF-8 encoders: https://github.com/web-platform-tests/wpt/pull/39324. I created bug 179303 to fix that.

Keeping this open to find out if the UTF-8 issue is exposed somewhere.