Bug 43225 - Webkit converts &not_ to %AC in a link's query arguments
Status: RESOLVED WONTFIX
Alias: None
Product: WebKit
Classification: Unclassified
Component: DOM
Version: 528+ (Nightly build)
Hardware: PC All
Importance: P2 Normal
Assignee: Nobody
URL: data:text/html,&not_
Keywords:
Depends on:
Blocks:
 
Reported: 2010-07-29 18:02 PDT by Dan Ciliske
Modified: 2010-08-02 11:33 PDT

See Also:


Attachments

Description Dan Ciliske 2010-07-29 18:02:09 PDT
Trying to track down a bug in our system, I stumbled across this:
WebKit does not require the semicolon at the end of HTML special characters (entities) for them to be interpolated.
As a result, a link whose query string contains an ampersand-separated variable ends up with the Unicode character for the entity in place of the ampersand and the start of the variable name.

Example:
The source contains: 
http://.../Inventory/?payment_status=paid&not_receipt_status=fully%20received

It gets rendered as:
http://.../Inventory/?payment_status=paid%AC_receipt_status=fully%20received
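
The conversion can be reproduced outside a browser. The following is only a rough sketch, not WebKit's code: it assumes Python's `html.unescape`, which applies the HTML5 text-content rules, and it assumes the query is percent-encoded as Latin-1, which is one way to arrive at `%AC` (UTF-8 would give `%C2%AC`). The variable names are made up for illustration, and only the query part of the URL is used because the host is elided in the report.

```python
from html import unescape
from urllib.parse import quote

# Query part of the link exactly as it appears in the page source.
source_query = "?payment_status=paid&not_receipt_status=fully%20received"

# "&not" is a legacy entity name that is recognised without a trailing
# semicolon, so it decodes to U+00AC NOT SIGN.
decoded_query = unescape(source_query)

# Percent-encode the non-ASCII character as Latin-1 (an assumption), leaving
# the existing "%20" escape and the URL delimiters untouched.
encoded_query = quote(decoded_query, safe="?=&%", encoding="latin-1")

print(decoded_query)
print(encoded_query)
```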


Expected result:
non semicolon terminated special characters do not get converted


I have confirmed this issue on Chromium 5.0.375.99 for both Ubuntu and Windows and in Safari on 10.5 with today's nightly build.
Comment 1 Mark Rowe (bdash) 2010-07-30 10:28:22 PDT
Literal ampersands should be encoded as &amp; in the markup to remove any possibility of ambiguity.
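
As a minimal sketch of that advice, assuming the page is generated by a script: escaping the URL before writing it into the markup turns each literal ampersand into `&amp;`, which decodes back to the original URL regardless of what follows it. Python's `html.escape` stands in here for whatever templating layer is actually in use, and the variable names are hypothetical.

```python
from html import escape, unescape

# Query part of the link as the application intends it.
query = "?payment_status=paid&not_receipt_status=fully%20received"

# Escape before writing into an href attribute: "&" becomes "&amp;".
attribute_value = escape(query, quote=True)
print(attribute_value)

# An HTML parser decodes "&amp;" back to "&", so the link is unchanged.
round_tripped = unescape(attribute_value)
print(round_tripped == query)
```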

That said, Firefox has different behavior for this particular snippet: data:text/html,&not_. We should work out what the correct behavior is here.
Comment 2 Dan Ciliske 2010-07-30 11:09:53 PDT
I will admit, I'm not wholly aware of HTML standards (I've only been doing serious web development for a few months), so I did not know this is an ambiguous case. That being said, it seems that recovery from an unterminated entity should be consistent. Currently, when the unterminated string is not a valid entity, the ampersand-initiated string is inserted unmodified. However, when it is a valid entity, the valid portion is interpolated into the Unicode character it represents.

Unless there's a strong argument against doing so, for consistency with the handling of invalid entities and with Gecko and IE, it would seem advisable not to interpolate these non-terminated entities.
Comment 3 Alexey Proskuryakov 2010-08-02 07:42:52 PDT
See also: bug 4948, bug 14391, and most importantly, bug 41345.

This behavior is unchanged between Safari 5.0 and current builds.
Comment 4 Adam Barth 2010-08-02 11:25:00 PDT
Whether &not_ gets treated as an entity depends on whether those characters appear in an attribute value.  In an attribute value, they don't get turned into an entity (for precisely the reason Dan mentions), but outside of an attribute value (e.g., between tags) they do get turned into an entity.
Comment 5 Adam Barth 2010-08-02 11:27:03 PDT
data:text/html,<div bar='&not_'></div><script>alert(document.getElementsByTagName('div')[0].getAttribute('bar'));</script>
Comment 6 Adam Barth 2010-08-02 11:33:06 PDT
I lied.  &not_ is always tokenized as an HTML entity, even in attribute values.

http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references

[[
If the character reference is being consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next character is either a U+003D EQUALS SIGN character (=) or in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned.
]]

and...

http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references

[[
not	 U+000AC	¬
]]
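
The two quoted passages can be combined into a rough sketch of the rule. This is not WebKit's tokenizer; it is an illustration that assumes Python's `html.entities.html5` table as the list of named references (it includes legacy names such as `not` without a semicolon). The function name and return shape are made up for this example.

```python
from html.entities import html5

ASCII_ALNUM = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
MAX_NAME = max(len(name) for name in html5)  # longest name in the table


def consume_named_reference(text, in_attribute):
    """Decode one named reference at the start of `text`, which begins with '&'.

    Returns (decoded, rest). When nothing is consumed, decoded is the
    literal '&' and rest is everything after it.
    """
    body = text[1:]
    # Try the longest candidate name first, as the spec requires.
    for end in range(min(len(body), MAX_NAME), 0, -1):
        name = body[:end]
        if name not in html5:
            continue
        next_char = body[end:end + 1]
        # Historical exception for attribute values: an unterminated match
        # followed by '=' or an ASCII alphanumeric is left alone.
        if (in_attribute and not name.endswith(";")
                and (next_char == "=" or next_char in ASCII_ALNUM)):
            return "&", body
        return html5[name], body[end:]
    return "&", body


# '_' is neither '=' nor alphanumeric, so "&not_" is decoded even in an attribute.
print(consume_named_reference("&not_", in_attribute=True))
# A following '=' triggers the exception inside an attribute value only.
print(consume_named_reference("&not=1", in_attribute=True))
print(consume_named_reference("&not=1", in_attribute=False))
```

Under these assumptions, the underscore in `&not_` is what keeps the exception from applying, which matches the correction in comment 6.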