WebKit issue: UTF standards require parsers to reject sequences encoded using more bytes than absolutely necessary (overlong forms); for example, a standard 7-bit character such as "j" encoded as a 2- or 4-byte sequence, either as raw bytes or as an HTML entity. Modify the renderer to reject such sequences: they have no legitimate use, but are routinely abused to carry out cross-site scripting attacks (attempts to close HTML tags and inject code routinely bypass filters when obfuscated this way).
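A minimal sketch of the byte-level case, assuming a strict UTF-8 decoder (Python's is one): "j" is U+006A, a single byte 0x6A, but the same code point can be packed into the two-byte pattern 110xxxxx 10xxxxxx, giving 0xC1 0xAA. A conforming decoder must reject that overlong sequence rather than silently produce "j".

```python
# "j" (U+006A) in canonical UTF-8 is the single byte 0x6A.
canonical = "j".encode("utf-8")          # b'j'

# Overlong two-byte form of the same code point: 0xC1 0xAA.
overlong_j = bytes([0xC0 | (0x6A >> 6), 0x80 | (0x6A & 0x3F)])

try:
    overlong_j.decode("utf-8")
    print("accepted (non-conforming)")
except UnicodeDecodeError:
    # A strict decoder refuses overlong forms outright.
    print("rejected")
```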
Created attachment 19565 [details] test case (works as expected) Yes, our decoder does reject non-shortest UTF forms in all cases I'm aware of. Do you have a specific example of the problem?
Created attachment 20014 [details] reduction
Looks like the only remaining worrisome case is multibyte HTML entities. These could be used to bypass filters that differentiate between absolute and relative URLs, and apply restrictions based on this distinction: <a href="javascriptΪlert(1)">Long HTML entity notation might be used to bypass some URL filters</a> This is not strictly a browser bug, but it has no legitimate uses, and is a common XSS vector against applications, so locking it down is certainly beneficial.
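To illustrate the bypass, here is a small sketch using Python's `html.unescape`, which decodes character references much like an HTML tokenizer does (the zero-padded, semicolon-less entity is an assumed example of the "long notation" described above): a filter that scans the raw attribute value for "javascript:" never sees the colon, but the entity decoder produces it afterwards.

```python
from html import unescape

# Zero-padded decimal character reference for ":" (code point 58),
# written without the terminating semicolon.
raw = "javascript&#0000058alert(1)"

print("javascript:" in raw)   # naive raw-string filter finds nothing
print(unescape(raw))          # what the entity decoder actually yields
```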
In this example, the entity is not only long, but it is not terminated with a semicolon. As such, it is covered by bug 4948. I am not aware of any reason to reject "&#0000058;", though; other browsers handle this just fine, and standards do not disallow it AFAIK.
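For application authors, the practical takeaway is to normalize before filtering. A minimal sketch (the function name `is_safe_href` and the allowed-scheme list are hypothetical, not from this bug report): decode character references the same way the browser will, then inspect the resulting scheme.

```python
from html import unescape
from urllib.parse import urlparse

def is_safe_href(raw_href: str) -> bool:
    # Decode entities BEFORE inspecting the scheme; checking the raw
    # attribute string first is exactly what entity tricks exploit.
    decoded = unescape(raw_href).strip()
    scheme = urlparse(decoded).scheme.lower()
    # Empty scheme covers relative URLs.
    return scheme in ("", "http", "https")
```

The key design point is ordering: any canonicalization the consumer (here, the browser) performs must happen before the security check, not after.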
Indeed, this behavior is covered by the HTML Standard.