Bug 181181

Summary:	The percent encoding in anchorElement.search depends on the encoding of the page
Product:	WebKit	Reporter:	Pierre-Yves Gérardy <org.webkit>
Component:	WebCore JavaScript	Assignee:	Nobody <webkit-unassigned>
Status:	RESOLVED FIXED
Severity:	Normal	CC:	achristensen, annevk, ap
Priority:	P2
Version:	WebKit Nightly Build
Hardware:	Mac
OS:	macOS 10.12

Pierre-Yves Gérardy

Reported 2017-12-28 07:02:44 PST

On a page loaded with iso-8859-1 encoding, run this code:

    var a = document.createElement("a")
    a.href = "?" + String.fromCodePoint(246)
    console.log(a.search)

You get back "?%F6", not "?%C3%B6".

According to the URL spec, all percent-encoded bytes in URLs should decode as valid UTF-8. `location.search` and `new URL().search` are not affected, and neither are the `.pathname` and `.hash` getters (they all return percent-encoded UTF-8 bytes).

Repro here (you must set the encoding manually using the "view/text encoding" menu):
http://bl.ocks.org/pygy/raw/b4f638659162c321d40694a38c16a6e7/8e718d92c41228d5681cc989627f80e5f8573a20/
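
For reference, the two outputs differ only in which bytes get percent-encoded. The following is a minimal sketch, not browser code: the helper names `percentEncodeBytes` and `latin1Bytes` are made up for illustration, and the Latin-1 mapping is a simplification (browsers treat the iso-8859-1 label as windows-1252, which agrees with Latin-1 for code point 246).

```javascript
// Percent-encode every byte as %XX with uppercase hex digits.
function percentEncodeBytes(bytes) {
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

// Hypothetical helper: map each code point <= U+00FF to a single byte.
function latin1Bytes(text) {
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    if (cp > 0xff) {
      throw new RangeError("not representable in ISO-8859-1");
    }
    return cp;
  });
}

const ch = String.fromCodePoint(246); // "ö"
const asLatin1 = percentEncodeBytes(latin1Bytes(ch));
const asUtf8 = percentEncodeBytes(new TextEncoder().encode(ch));

console.log(asLatin1); // %F6
console.log(asUtf8);   // %C3%B6
```

The same character therefore yields one escaped byte under the legacy encoding and two under UTF-8, which is the difference reported above.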

Alexey Proskuryakov

Comment 1 2017-12-31 14:23:31 PST

This behavior is of course intentional, and used to be necessary for web compatibility. Maybe it's not needed any more.

Pierre-Yves Gérardy

Comment 2 2018-01-02 04:34:29 PST

That would explain why Chrome and Firefox behave in the same way... In Firefox, `location.search` also depends on the page encoding... Was that also the case in earlier WebKit versions? If it is still needed for Web compat, then I suppose that the URL spec must be updated accordingly... I can open an issue on the WHATWG tracker if needed.

Alexey Proskuryakov

Comment 3 2018-01-02 09:19:03 PST

If three browsers do this, then updating the spec would indeed seem like the logical next step. I do not know if anything changed with regard to location.search in WebKit.

Anne van Kesteren

Comment 4 2018-01-02 10:14:37 PST

https://url.spec.whatwg.org/#query-state takes the encoding into account, no? Note that new URL() and some other code paths in the browser will always force UTF-8, but <a> and location will use the encoding of the document.
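
The `new URL()` half of this can be checked without a document, for example in Node.js. This is only a small sketch; the base URL `http://example.com/` is a placeholder chosen for illustration.

```javascript
// The URL constructor has no document encoding to consult, so the query
// is percent-encoded as UTF-8 regardless of any page encoding.
const base = "http://example.com/"; // placeholder base URL (assumption)
const url = new URL("?" + String.fromCodePoint(246), base);

console.log(url.search); // ?%C3%B6
```

The `<a>` and `location` cases need a real document with a non-UTF-8 encoding and so cannot be reproduced this way.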

Alex Christensen

Comment 5 2018-01-02 11:45:57 PST

Yep, this is intentional, all browsers behave this way, and it is in the URL specification.

Pierre-Yves Gérardy

Comment 6 2018-01-03 01:38:23 PST

Firefox is the only browser that treats location and <a> identically. Chrome and Safari both have location and new URL() work the same.

Also, the URL specification is not consistent, because it also states that

"""
A percent-encoded byte is U+0025 (%), followed by two ASCII hex digits. Sequences of percent-encoded bytes, after conversion to bytes, should not cause UTF-8 decode without BOM or fail to return failure.
"""

Yet it proceeds to describe an algorithm that produces non-UTF-8 sequences.
https://url.spec.whatwg.org/#percent-encoded-bytes

decodeURI and friends rely on this and choke on non-UTF-8 sequences of percent-encoded bytes. For Latin-1, unescape() works, but that's about it.

This is in the context of a SPA router that supports routes as pathname, search or hash (only one at a time :-). Since I can't even rely on location and <a> behaving consistently (for feature detection), I'll probably disable non-ASCII routes if document.characterSet.toUpperCase() is not "UTF-8".
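
To make the decoding problem and the proposed guard concrete, here is a rough sketch. `tryDecode` and `allowsNonAsciiRoutes` are hypothetical names, not part of any router, and the character set is passed in as a string so the snippet runs outside a browser; in a page it would come from `document.characterSet`.

```javascript
// Hypothetical helper: return the decoded string, or null when the
// percent-encoded bytes are not valid UTF-8.
function tryDecode(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (e) {
    if (e instanceof URIError) return null;
    throw e;
  }
}

// Hypothetical guard: only allow non-ASCII routes on UTF-8 documents.
function allowsNonAsciiRoutes(characterSet) {
  return String(characterSet).toUpperCase() === "UTF-8";
}

console.log(tryDecode("%C3%B6")); // ö
console.log(tryDecode("%F6"));    // null (decodeURIComponent threw URIError)
console.log(unescape("%F6"));     // ö (legacy function, treats %XX as Latin-1)

console.log(allowsNonAsciiRoutes("UTF-8"));        // true
console.log(allowsNonAsciiRoutes("windows-1252")); // false
```

Passing the character set in explicitly keeps the guard testable; whether to fall back to `unescape()` for legacy pages is a separate design decision that this sketch does not make.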

Anne van Kesteren

Comment 7 2018-01-03 01:50:34 PST

Having different requirements for web developers and user agents is fairly common in standards. Web developers are also supposed to exclusively use UTF-8, for instance. Not sure what you mean with regard to interoperability issues. It might be worth filing an issue at https://github.com/whatwg/url/issues/new with more detail so we can add the necessary tests and make browsers fully consistent where they're currently not (or if you want to work on web-platform-tests for that yourself, that'd be great too).
