Bug 181181
Summary: | The percent encoding in anchorElement.search depends on the encoding of the page | |
---|---|---|---
Product: | WebKit | Reporter: | Pierre-Yves Gérardy <org.webkit>
Component: | WebCore JavaScript | Assignee: | Nobody <webkit-unassigned>
Status: | RESOLVED FIXED | |
Severity: | Normal | CC: | achristensen, annevk, ap
Priority: | P2 | |
Version: | WebKit Nightly Build | |
Hardware: | Mac | |
OS: | macOS 10.12 | |
Pierre-Yves Gérardy
On a page loaded with iso-8859-1 encoding, run this code:
var a = document.createElement("a")
a.href = "?" + String.fromCodePoint(246)
console.log(a.search)
You get back "?%F6", not "?%C3%B6". According to the URL spec, all percent-encoded bytes in URLs should represent valid UTF-8 code points.
`location.search` and `new URL().search` are not affected, and neither are the `.pathname` and `.hash` getters (they all return percent-encoded UTF-8 bytes).
Repro here (you must set the encoding manually using the "view/text encoding" menu):
http://bl.ocks.org/pygy/raw/b4f638659162c321d40694a38c16a6e7/8e718d92c41228d5681cc989627f80e5f8573a20/
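To make the two results concrete, here is a minimal sketch of where "%F6" and "%C3%B6" come from. It only illustrates the byte-level difference between the two encodings and does not reproduce the browser's URL parser; `percentEncodeBytes` is a hypothetical helper written for this example.

```javascript
// Hypothetical helper: percent-encode every byte as %XX (uppercase hex).
function percentEncodeBytes(bytes) {
  return Array.from(bytes, (b) =>
    "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

const ch = String.fromCodePoint(246); // "ö", U+00F6

// UTF-8 encodes U+00F6 as two bytes: 0xC3 0xB6.
const utf8 = percentEncodeBytes(new TextEncoder().encode(ch));

// ISO-8859-1 maps code points 0..255 to single bytes, so U+00F6 becomes 0xF6.
const latin1 = percentEncodeBytes([ch.codePointAt(0)]);

console.log(utf8);   // "%C3%B6"
console.log(latin1); // "%F6"
```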
Alexey Proskuryakov
This behavior is of course intentional, and used to be necessary for web compatibility. Maybe it's not needed any more.
Pierre-Yves Gérardy
That would explain why Chrome and Firefox behave in the same way... In Firefox, `location.search` also depends on the page encoding... Was it also the case in earlier WebKit versions?
If it is still needed for Web compat, then I suppose that the URL spec must be updated accordingly...
I can open an issue on the WhatWG tracker if needed.
Alexey Proskuryakov
If three browsers do this, then updating the spec would seem like the logical next step indeed.
I do not know if anything changed with regards to location.search in WebKit.
Anne van Kesteren
https://url.spec.whatwg.org/#query-state takes the encoding into account, no? Note that new URL() and some other code paths in the browser will always force UTF-8, but <a> and location will use the encoding of the document.
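The `new URL()` half of this can be checked outside a document, for example in Node.js, where no page encoding exists. This is a small sketch; `http://example.com/` is a placeholder base URL, and the `<a>` and `location` behavior cannot be shown this way because it needs a document with a non-UTF-8 encoding.

```javascript
// new URL() always uses UTF-8 for the query, whatever the page encoding is.
const url = new URL("?" + String.fromCodePoint(246), "http://example.com/");

console.log(url.search); // "?%C3%B6"
console.log(url.href);   // "http://example.com/?%C3%B6"
```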
Alex Christensen
Yep, this is intentional, all browsers behave this way, and it is in the URL specification.
Pierre-Yves Gérardy
Firefox is the only browser that treats location and <a> identically. Chrome and Safari both have location and new URL() work the same.
Also, the URL specification is not consistent, because it also states that
"""
A percent-encoded byte is U+0025 (%), followed by two ASCII hex digits. Sequences of percent-encoded bytes, after conversion to bytes, should not cause UTF-8 decode without BOM or fail to return failure.
"""
Yet it proceeds to describe an algorithm that produces non-UTF-8 sequences.
https://url.spec.whatwg.org/#percent-encoded-bytes
decodeURI and friends rely on this and choke on non-UTF-8 sequences of percent-encoded bytes. For Latin-1, unescape() works, but that's about it.
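A short sketch of that difference, runnable in any JavaScript engine that still ships the legacy `unescape()` function; `tryDecode` is a hypothetical wrapper used only to show the error instead of throwing.

```javascript
// Hypothetical wrapper: return the decoded string, or the error name on failure.
function tryDecode(s) {
  try {
    return decodeURIComponent(s);
  } catch (e) {
    return e.name;
  }
}

console.log(tryDecode("%C3%B6")); // "ö" (valid UTF-8 sequence)
console.log(tryDecode("%F6"));    // "URIError" (0xF6 alone is not valid UTF-8)

// unescape() maps each %XX straight to the code point XX, which matches Latin-1.
console.log(unescape("%F6"));     // "ö"
```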
This is in the context of a SPA router that supports routes as pathname, search or hash (only one at a time :-). Since I can't even rely on location and <a> behaving consistently (for feature detection) I'll probably disable non-ascii routes if document.characterSet.toUpperCase() is not "UTF-8".
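The guard described above could look roughly like this. It is a sketch under assumptions, not the router's actual code: the function names are hypothetical, and the character set is passed in as a parameter so the logic can be exercised without a document.

```javascript
// Hypothetical helper: true when the page encoding is UTF-8.
function isUtf8Document(characterSet) {
  return String(characterSet).toUpperCase() === "UTF-8";
}

// Hypothetical helper: true when the route contains only ASCII characters.
function isAsciiOnly(route) {
  return /^[\x00-\x7F]*$/.test(route);
}

// Accept ASCII routes everywhere; accept non-ASCII routes only on UTF-8 pages.
function isRouteAllowed(route, characterSet) {
  return isAsciiOnly(route) || isUtf8Document(characterSet);
}
```

In a browser the call would be `isRouteAllowed(route, document.characterSet)`.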
Anne van Kesteren
Having different requirements for web developers and user agents is fairly common in standards. Web developers are also supposed to exclusively use UTF-8, for instance.
Not sure what you mean with regards to interoperability issues. It might be worth filing an issue against https://github.com/whatwg/url/issues/new with more detail so we can add the necessary tests and make browsers fully consistent where they're currently not (or if you want to work on web-platform-tests for that yourself that'd be great too).