Bug 17689 - Reject long UTF sequences
Summary: Reject long UTF sequences
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit Misc.
Version: 528+ (Nightly build)
Hardware: PC Windows XP
Importance: P3 Normal
Assignee: Nobody
Depends on:
Reported: 2008-03-05 15:57 PST by jasneet
Modified: 2008-03-25 00:26 PDT

test case (works as expected) (87 bytes, text/html)
2008-03-05 22:58 PST, Alexey Proskuryakov
reduction (140 bytes, text/html)
2008-03-24 15:07 PDT, jasneet

Description jasneet 2008-03-05 15:57:23 PST
WebKit issue:
UTF standards require parsers to reject sequences that were encoded using more bytes than necessary (for example, a standard 7-bit character encoded as a 2- or 4-byte sequence, e.g. &#0000106, either as a binary value or as an HTML entity).

Modify the renderer to reject such characters: they have no legitimate use, but they are routinely abused to carry out cross-site scripting attacks (attempts to close HTML tags and inject code routinely bypass filters when obfuscated this way).
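
For illustration only, here is a minimal Python sketch of the shortest-form check described above. It is not WebKit's decoder; the helper names `decode_one_lenient` and `is_overlong` are hypothetical. It decodes a single UTF-8 sequence without the shortest-form rule, then compares the result against the smallest code point that legitimately needs that many bytes.

```python
# Smallest code point that legitimately needs a UTF-8 sequence of each length.
MIN_CODE_POINT = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_one_lenient(seq: bytes) -> int:
    """Decode one 1- to 4-byte UTF-8 sequence WITHOUT the shortest-form check.

    Hypothetical helper: it shows what a lenient decoder would accept.
    """
    n = len(seq)
    if n == 1:
        return seq[0]
    # The lead byte carries (7 - n) payload bits; each continuation byte carries 6.
    cp = seq[0] & (0xFF >> (n + 1))
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp


def is_overlong(seq: bytes) -> bool:
    """True if the sequence uses more bytes than its code point requires."""
    return decode_one_lenient(seq) < MIN_CODE_POINT[len(seq)]


# 'j' (decimal 106, hex 0x6A) padded out to two and four bytes.
two_byte_j = b"\xc1\xaa"
four_byte_j = b"\xf0\x80\x81\xaa"
```

Python's built-in strict decoder already refuses both padded forms: `two_byte_j.decode("utf-8")` raises `UnicodeDecodeError`, which is the behaviour the report asks a conforming parser to have.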
Comment 1 Alexey Proskuryakov 2008-03-05 22:58:38 PST
Created attachment 19565
test case (works as expected)

Yes, our decoder does reject non-shortest UTF forms in all cases I'm aware of. Do you have a specific example of the problem?
Comment 2 jasneet 2008-03-24 15:07:57 PDT
Created attachment 20014
reduction
Comment 3 jasneet 2008-03-24 15:08:30 PDT
Looks like the only remaining worrisome case is overlong (zero-padded) numeric HTML entities. These could be used to bypass filters that differentiate between absolute and relative URLs and apply restrictions based on this distinction:

<a href="javascript&#x0000003aalert(1)">Long HTML entity notation might be used to bypass some URL filters</a>

This is not strictly a browser bug, but it has no legitimate uses, and is a common XSS vector against applications, so locking it down is certainly beneficial.
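
As a rough illustration of the filter bypass described in this comment, the Python sketch below compares a filter that inspects the raw attribute value with one that decodes character references first. The function names are hypothetical, the check is deliberately simplistic, and the example uses the semicolon-terminated form of the entity; it is not a recommended sanitizer.

```python
import html


def naive_is_safe(href: str) -> bool:
    """Hypothetical filter that looks only at the raw attribute text."""
    return not href.lower().startswith("javascript:")


def decoded_is_safe(href: str) -> bool:
    """Same check, applied after decoding HTML character references."""
    return naive_is_safe(html.unescape(href).strip())


# Zero-padded hex reference for ':' (0x3a).
padded = "javascript&#x0000003a;alert(1)"
```

Here `naive_is_safe(padded)` returns `True` because the raw text contains no literal colon, while `decoded_is_safe(padded)` returns `False` once the reference has been decoded to `javascript:alert(1)`. This suggests the weakness lies in filters that inspect undecoded input.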

Comment 4 Alexey Proskuryakov 2008-03-25 00:26:08 PDT
In this example, the entity is not only long, but it is not terminated with a semicolon. As such, it is covered by bug 4948.

I am not aware of any reason to reject "&#x0000003a;", though - other browsers handle this just fine, and standards do not disallow it AFAIK.
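
To make the semicolon distinction concrete, here is a small sketch using Python's `html.unescape`, which follows HTML5 character-reference rules. This is an assumption about one modern decoder, not a statement about how WebKit or other browsers behaved in 2008. Leading zeros do not change the decoded value, but without a terminating semicolon a decoder that greedily consumes hex digits also absorbs the "a" of "alert".

```python
import html

terminated = "javascript&#x0000003a;alert(1)"
unterminated = "javascript&#x0000003aalert(1)"

# With the semicolon, the zero-padded reference decodes to ':' (0x3a).
decoded_terminated = html.unescape(terminated)

# Without it, the hex digits "0000003aa" are read as one number (0x3aa),
# so the result contains U+03AA instead of a colon.
decoded_unterminated = html.unescape(unterminated)
```

Under these rules `decoded_terminated` is `javascript:alert(1)`, whereas `decoded_unterminated` is `javascript` followed by U+03AA and `lert(1)`, so the two spellings are not equivalent.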