Bug 306742
| Summary: | URL query percent encoding is incorrect | ||
|---|---|---|---|
| Product: | WebKit | Reporter: | Nikita Skovoroda <chalkerx> |
| Component: | Platform | Assignee: | Alex Christensen <achristensen> |
| Status: | RESOLVED FIXED | ||
| Severity: | Normal | CC: | achristensen, annevk, ap, darin, karlcow, webkit-bug-importer |
| Priority: | P2 | Keywords: | InRadar |
| Version: | WebKit Nightly Build | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| See Also: | https://bugs.webkit.org/show_bug.cgi?id=162068 | ||
Nikita Skovoroda
See https://encoding.spec.whatwg.org/index-iso-8859-2.txt -- there is no `U+00A2` here, and `128+34` maps to `0x02D8`, 128+13 maps to `0x008D`
Which is what TextDecoder does:
```
> new TextDecoder('iso-8859-2').decode(Uint8Array.of(0xa2)).codePointAt(0).toString(16)
'2d8'
> new TextDecoder('iso-8859-2').decode(Uint8Array.of(0x8d)).codePointAt(0).toString(16)
'8d'
```
But in https://url.spec.whatwg.org/#string-percent-encode-after-encoding + https://encoding.spec.whatwg.org/#encode-or-fail though, the result is wrong with `iso-8859-2` encoing:
```
var a = document.createElement('a');
a.href = 'https://example.com/?' + String.fromCodePoint(0xa2)
console.log(a.search.slice(1))
var b = document.createElement('a');
b.href = 'https://example.com/?' + String.fromCodePoint(0x8d)
console.log(b.search.slice(1))
```
WebKit prints `%8D` for both!
Per spec (and in Chrome, Firefox, Servo) it is `%26%23162%3B` for `U+A2` and `%8D` for `U+8D`.
---
This is not limited to just `iso-8859-2`
Here is the list of encodings where trivial encoding checks fails:
```
✖ FAIL percent-encode after encoding matches browser > iso-8859-2
✖ FAIL percent-encode after encoding matches browser > iso-8859-4
✖ FAIL percent-encode after encoding matches browser > iso-8859-5
✖ FAIL percent-encode after encoding matches browser > iso-8859-13
✖ FAIL percent-encode after encoding matches browser > iso-8859-15
✖ FAIL percent-encode after encoding matches browser > koi8-r
✖ FAIL percent-encode after encoding matches browser > macintosh
✖ FAIL percent-encode after encoding matches browser > windows-1250
✖ FAIL percent-encode after encoding matches browser > windows-1251
✖ FAIL percent-encode after encoding matches browser > windows-1254
✖ FAIL percent-encode after encoding matches browser > windows-1256
✖ FAIL percent-encode after encoding matches browser > windows-1258
✖ FAIL percent-encode after encoding matches browser > x-mac-cyrillic
✖ FAIL percent-encode after encoding matches browser > gbk
✖ FAIL percent-encode after encoding matches browser > gb18030
```
It passes on Chrome, Firefox and Servo.
| Attachments | ||
|---|---|---|
| Add attachment proposed patch, testcase, etc. |
Nikita Skovoroda
Quick isolated test, just run it in `about:blank`:
```
const iframe = document.createElement('iframe')
document.body.append(iframe)
const encoding = 'iso-8859-2'
const codepoint = 0x5a7a
const html = `
<!DOCTYPE html>
<script>
var a = document.createElement('a');
a.href = 'https://example.com/?' + String.fromCodePoint(0xa2)
console.log(a.search.slice(1))
var b = document.createElement('a');
b.href = 'https://example.com/?' + String.fromCodePoint(0x8d)
console.log(b.search.slice(1))
</script>`
iframe.src = `data:text/html;charset=${encoding},${encodeURI(html)}`
```
Alexey Proskuryakov
Thank you for the report! There is certainly a discrepancy in results across WebKit and other engines here. I don't understand what's going on without tracing through all the specs though.
In particular, how do the other engines arrive at '%26%23162%3B', which is percent encoding for '¢'? HTML entities should be meaningless in URLs. And I expected UTF-8 anyway, regardless of document encoding.
Nikita Skovoroda
See the links to the spec in the issue
Here they are again:
1. https://url.spec.whatwg.org/#string-percent-encode-after-encoding
2. https://encoding.spec.whatwg.org/#encode-or-fail
They mention how `%26%23`-escaping gets there.
Nikita Skovoroda
Moreover, WebKit encodes other bytes correctly and has the same logic.
Just it misbehaves on certain input
Try e.g.
```
(() => {
const iframe = document.createElement('iframe')
document.body.append(iframe)
const encoding = 'iso-8859-2'
const html = `
<!DOCTYPE html>
<script>
var a = document.createElement('a');
a.href = 'https://example.com/?' + String.fromCodePoint(0xa2)
console.log(a.search.slice(1))
var b = document.createElement('a');
b.href = 'https://example.com/?' + String.fromCodePoint(0xa3)
console.log(b.search.slice(1))
</script>`
iframe.src = `data:text/html;charset=${encoding},${encodeURI(html)}`
})();
```
WebKit prints:
```
%8D
%26%23163%3B
```
The behavior on 0xa3 is correct, just 0xa2 is broken somewhy.
Karl Dubost
Nikita, does it affect one of your sites? library?
Nikita Skovoroda
Karl, no. It was found in cross-tests.
The only thing it affects for me is this: https://github.com/ExodusOSS/bytes/blob/6d0030bfe/tests/whatwg.browser.test.js#L78-L83
I do have another impls to compare to though, so this is not important for my case.
Nikita Skovoroda
Re: title change: I doubt that this affects only query percent-encoding
The methods are common for all non-utf8 encoding, query or not.
So it's likely `non-utf8 encoding`, not `URL query percent encoding`
Alexey Proskuryakov
I meant to highlight that it's not affecting TextDecoder API, thus it's likely not in the underlying decoder code, and seems to be more about URLs. As already mentioned, I'm not sure about where the problem actually is.
Alex Christensen
This is a part of the encoding, not decoding, hence why TextDecoder is unaffected. This seems to be specified in https://url.spec.whatwg.org/#string-percent-encode-after-encoding at the very end.
Firefox implements it in a function named nsStandardURL::nsSegmentEncoder::EncodeSegmentCount
Chromium implements it in a function named appendURLEscapedChar
Our analogous function urlEscapedEntityCallback calls TextCodec::getUnencodableReplacement which implements it. I'm not completely sure why it isn't being called in this case. I don't see the (reason == UCNV_UNASSIGNED) condition being met, which seems strange to me.
Alex Christensen
Seems related to our use of ucnv_setFallback
Radar WebKit Bug Importer
<rdar://problem/169566553>
Alex Christensen
Pull request: https://github.com/WebKit/WebKit/pull/57812
EWS
Committed 306768@main (47164fceaef4): <https://commits.webkit.org/306768@main>
Reviewed commits have been landed. Closing PR #57812 and removing active labels.