WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
NEW
Bug 235307
TextCodec should treat lone surrogates as the replacement character
https://bugs.webkit.org/show_bug.cgi?id=235307
Summary
TextCodec should treat lone surrogates as the replacement character
Andreu Botella
Reported
2022-01-17 17:46:23 PST
Created
attachment 449360
[details]
Test case for lone surrogate character entities in URL parsing WebCore uses TextCodec as its implementation of the encoding standard, and the "encode" algorithm in it (
https://encoding.spec.whatwg.org/#encode
) corresponds to calling TextCodec::encode with UnencodableHandling::Entities. However, the encoding standard algorithm only allows a scalar value string as its input (or anything else which can be converted to an I/O queue of scalar values;
https://encoding.spec.whatwg.org/#to-i-o-queue-convert
), whereas TextCodec::encode is called with arbitrary `StringView`s, which can contain lone surrogates. This would be fine if the codecs handled lone surrogates as if they were the replacement character, but they don't: the UTF-8 encoder emits invalid UTF-8 (for example, U+D800 encodes to 0xED 0xA0 0x80, which is invalid UTF-8; see
https://simonsapin.github.io/wtf-8
), and the rest of codecs emit a character entity for the lone surrogate (� for example). The non-UTF-8 case can be observed through URL parsing; see the attached test case. The spec here is a bit convoluted, but when the URL parser is in the query state (
https://url.spec.whatwg.org/#query-state
), it calls into "percent-encode after encoding", which itself calls the Encoding Standard's "encode or fail", and if that fails, it emits a URL-encoded character entity for the unencodable code point. The input to "encode or fail" must also be a scalar value string, and therefore the unencodable code point in the error thrown by that algorithm can't be a surrogate code point. This bug doesn't affect URL parsing with the UTF-8 encoding because it uses a different code path. Observing the UTF-8 case in web content is not easy, but in a Windows system you can create files with lone surrogates, and in the Windows WebKit port, uploading those in a multipart/form-data form will result in the form payload containing invalid UTF-8 (or surrogate character entities, if the page uses some other encoding) – see
https://github.com/whatwg/html/issues/7413
. Both of these bugs could be fixed by converting the strings in the URL parsing and form submission code before passing them to TextCodec, but it arguably should be fixed in the TextCodec implementations in any case to prevent regressions. However, there does seem to be one case where surrogate character entities are not necessarily a bug, and that is saving a page to disk. Right click / Save As will serialize the DOM and any associated resources as MHTML, with the intent that opening it will round-trip the DOM. If there is a lone surrogate in the DOM, and the page's encoding isn't UTF-8, the serialization will emit a surrogate character entity which will correctly round-trip as a surrogate in the DOM. However, if the page's encoding is UTF-8, the surrogate will serialize as (quoted-printable) invalid UTF-8. If this is a case worth keeping –which it might not be–, there should be a new variant in the UnencodableHandling enum to have lone surrogates serialize as surrogate character entities in all encodings. See also the corresponding Chromium bug:
https://crbug.com/1285987
Attachments
Test case for lone surrogate character entities in URL parsing
(766 bytes, text/html)
2022-01-17 17:46 PST
,
Andreu Botella
no flags
Details
View All
Add attachment
proposed patch, testcase, etc.
Radar WebKit Bug Importer
Comment 1
2022-01-24 17:47:14 PST
<
rdar://problem/88000033
>
Ahmad Saleem
Comment 2
2022-11-30 02:42:37 PST
*** Firefox Nightly 109 *** href attribute of link is: "?a\ud800b" (should be "?a\ud800b") href property of link is: "
https://bug-235307-attachments.webkit.org/attachment.cgi?a%EF%BF%BDb
" (should end in "?a%26%2365533%3Bb") *** Chrome Canary 110 *** href attribute of link is: "?a\ud800b" (should be "?a\ud800b") href property of link is: "
https://bug-235307-attachments.webkit.org/attachment.cgi?a%EF%BF%BDb
" (should end in "?a%26%2365533%3Bb") *** Safari 16.1 *** href attribute of link is: "?a\ud800b" (should be "?a\ud800b") href property of link is: "
https://bug-235307-attachments.webkit.org/attachment.cgi?a%EF%BF%BDb
" (should end in "?a%26%2365533%3Bb") _______- All browsers are matching or I am testing it wrong? JSFiddle -
https://jsfiddle.net/b50n7e2s/
(same test but took from Chrome / Blink bug).
Anne van Kesteren
Comment 3
2022-11-30 05:23:11 PST
That test doesn't seem to test windows-1252 (due to JSFiddle forcing UTF-8), but when actually testing windows-1252 all browsers seem to agree as well:
https://github.com/web-platform-tests/wpt/pull/37250
. However, 1.
Comment 0
also describes a problem on Windows that might still exist. 2. Code inspection shows that
https://github.com/WebKit/WebKit/blob/5e81d33ff5c0150dbabbebbe2e96fb08ff4d6ad3/Source/WebCore/PAL/pal/text/TextCodecUTF8.cpp#L461-L472
does not do surrogate handling. (Also, if as
comment 0
suggests this is somehow intentional, which I suspect it's not, it shouldn't be called UTF-8.)
Anne van Kesteren
Comment 4
2023-03-27 07:50:07 PDT
Hmm, I'm no longer convinced there's a problem here. Especially since Windows is no longer targeted. Andreu, what do you think?
Anne van Kesteren
Comment 5
2023-04-02 08:04:06 PDT
I was wrong about the non-UTF-8 encoders:
https://github.com/web-platform-tests/wpt/pull/39324
. I created
bug 179303
to fix that. Keeping this open to find out if the UTF-8 issue is exposed somewhere.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug