Bug 66588

Summary: XSS filter bypass via non-standard URL encoding
Product: WebKit Reporter: Adam Barth <abarth>
Component: WebKit Misc.Assignee: Daniel Bates <dbates>
Status: RESOLVED FIXED    
Severity: Normal CC: dbates
Priority: P2 Keywords: XSSAuditor
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 66579, 67134, 67842    
Attachments:
Description Flags
Work-in-progress patch
none
Patch and layout tests
none
Patch and layout tests
abarth: review-
Patch and layout tests abarth: review+

Description Adam Barth 2011-08-19 14:07:01 PDT
(I thought this bug was filed somewhere already, but I couldn't find it.)

Quoting Wikipedia:

[[
There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The third edition of ECMA-262 still includes an escape(string) function that uses this syntax, but also an encodeURI(uri) function that converts to UTF-8 and percent-encodes each octet.
]]
-- http://en.wikipedia.org/wiki/Percent-encoding#Non-standard_implementations

It turns out ASP (or possibly ASP.NET) servers decode these sequences, which leads to an XSS filter bypass because we don't understand that transformation.  ASP servers are a large enough population that it's probably worth teaching the XSS auditor to understand this case.
Comment 1 Daniel Bates 2011-09-01 22:35:49 PDT
Created attachment 106094 [details]
Work-in-progress patch
Comment 2 Adam Barth 2011-09-01 22:43:09 PDT
Comment on attachment 106094 [details]
Work-in-progress patch

View in context: https://bugs.webkit.org/attachment.cgi?id=106094&action=review

> Source/WebCore/html/parser/XSSAuditor.cpp:119
> +static inline String decodeFancyUnicodeEscapeSequences(const String& string)

Love the name.

> Source/WebCore/platform/text/DecodeEscapeSequences.h:38
> +static inline int hexDigitValue(UChar c)

Supposedly including static functions in a header makes the linker sad.  I think this function is in ASCIIType anyway.
Comment 3 Thomas Sepez 2011-09-02 11:20:15 PDT
Comment on attachment 106094 [details]
Work-in-progress patch

View in context: https://bugs.webkit.org/attachment.cgi?id=106094&action=review

>> Source/WebCore/html/parser/XSSAuditor.cpp:119
>> +static inline String decodeFancyUnicodeEscapeSequences(const String& string)
> 
> Love the name.

Might call it decode16BitUnicodeEscapeSequences.

> Source/WebCore/html/parser/XSSAuditor.cpp:135
>          String decodedString = decoder->encoding().decode(workingStringUTF8.data(), workingStringUTF8.length());

We know this can't work.  OK for now, but the decoding has to happen inside decodeURLEscapeSequences so maybe pass it the decoder.

> Source/WebCore/platform/text/DecodeEscapeSequences.h:62
> +            *p++ = (hexDigitValue(run[2]) << 12) | (hexDigitValue(run[3]) << 8) | (hexDigitValue(run[4]) << 4) | hexDigitValue(run[5]);

*p is a char, but you're assigning 16 bit value to it.

> Source/WebCore/platform/text/DecodeEscapeSequences.h:112
> +        String decoded = (encoding.isValid() ? encoding : UTF8Encoding()).decode(buffer.data(), numBytesWritten);

There's a difference between the 8bit version where the %-escapes are interepreted relative to an encoding vs. the %u-style escapes, which are expected to represent a unicode code point no matter what the encoding and shouldn't get decoded.  I think your palceDecodedRunInBuffer needs to be passed the encoding so it can decide whether to do this or not, with the output buffer being a String (or UChar equivalent).
Comment 4 Thomas Sepez 2011-09-02 11:23:08 PDT
Comment on attachment 106094 [details]
Work-in-progress patch

View in context: https://bugs.webkit.org/attachment.cgi?id=106094&action=review

> LayoutTests/http/tests/security/xssAuditor/script-tag-with-fancy-unicode2.html:13
> +</iframe>

Really need to stick some high-valued codepoints in here -- that may show the *p bug above.  Also %252525u0061 should be tried as well to test interaction between the two decoders.
Comment 5 Thomas Sepez 2011-09-02 12:00:53 PDT
Comment on attachment 106094 [details]
Work-in-progress patch

View in context: https://bugs.webkit.org/attachment.cgi?id=106094&action=review

> Source/WebCore/html/parser/XSSAuditor.cpp:130
>          workingString = decodeURLEscapeSequences(workingString);

Might want to circumvent the path through KURL.cpp to avoid schizophrenic issues about whether KURLGoogle.cpp is used on a given build.  You could just call the templated function directly.
Comment 6 Daniel Bates 2011-09-03 21:59:44 PDT
(In reply to comment #2)
> (From update of attachment 106094 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=106094&action=review
> 
> > Source/WebCore/html/parser/XSSAuditor.cpp:119
> > +static inline String decodeFancyUnicodeEscapeSequences(const String& string)
> 
> Love the name.

Although we understand the meaning of "fancy", I suggest we rename this function to decode16BitUnicodeEscapeSequences() as suggested by Thomas Sepez because this name better conveys the fanciness of these Unicode escape sequences.

> 
> > Source/WebCore/platform/text/DecodeEscapeSequences.h:38
> > +static inline int hexDigitValue(UChar c)
> 
> Supposedly including static functions in a header makes the linker sad.  I think this function is in ASCIIType anyway.

Will remove.
Comment 7 Adam Barth 2011-09-03 22:01:36 PDT
Okiedokes.
Comment 8 Daniel Bates 2011-09-03 22:06:46 PDT
(In reply to comment #3)
> (From update of attachment 106094 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=106094&action=review
> 
> >> Source/WebCore/html/parser/XSSAuditor.cpp:119
> >> +static inline String decodeFancyUnicodeEscapeSequences(const String& string)
> > 
> > Love the name.
> 
> Might call it decode16BitUnicodeEscapeSequences.

Will rename.

> 
> > Source/WebCore/html/parser/XSSAuditor.cpp:135
> >          String decodedString = decoder->encoding().decode(workingStringUTF8.data(), workingStringUTF8.length());
> 
> We know this can't work.  OK for now, but the decoding has to happen inside decodeURLEscapeSequences so maybe pass it the decoder.

Will remove and instead pass text encoding to decodeURLEscapeSequences().

> 
> > Source/WebCore/platform/text/DecodeEscapeSequences.h:62
> > +            *p++ = (hexDigitValue(run[2]) << 12) | (hexDigitValue(run[3]) << 8) | (hexDigitValue(run[4]) << 4) | hexDigitValue(run[5]);
> 
> *p is a char, but you're assigning 16 bit value to it.
> 

Will fix.

> > Source/WebCore/platform/text/DecodeEscapeSequences.h:112
> > +        String decoded = (encoding.isValid() ? encoding : UTF8Encoding()).decode(buffer.data(), numBytesWritten);
> 
> There's a difference between the 8bit version where the %-escapes are interepreted relative to an encoding vs. the %u-style escapes, which are expected to represent a unicode code point no matter what the encoding and shouldn't get decoded.  I think your palceDecodedRunInBuffer needs to be passed the encoding so it can decide whether to do this or not, with the output buffer being a String (or UChar equivalent).

Will rename placeDecodedRunInBuffer() to decodeRun(), have it take as input a text encoding, and have it return a String object.
Comment 9 Daniel Bates 2011-09-03 22:08:04 PDT
(In reply to comment #4)
> (From update of attachment 106094 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=106094&action=review
> 
> > LayoutTests/http/tests/security/xssAuditor/script-tag-with-fancy-unicode2.html:13
> > +</iframe>
> 
> Really need to stick some high-valued codepoints in here -- that may show the *p bug above.  Also %252525u0061 should be tried as well to test interaction between the two decoders.

Will add test cases that include high-valued Unicode code points.
Comment 10 Daniel Bates 2011-09-03 22:21:06 PDT
(In reply to comment #5)
> (From update of attachment 106094 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=106094&action=review
> 
> > Source/WebCore/html/parser/XSSAuditor.cpp:130
> >          workingString = decodeURLEscapeSequences(workingString);
> 
> Might want to circumvent the path through KURL.cpp to avoid schizophrenic issues about whether KURLGoogle.cpp is used on a given build.  You could just call the templated function directly.

Will call templated function decodeEscapeSequences<URLEscapeSequence>() instead of KURL's decodeURLEscapeSequences().
Comment 11 Daniel Bates 2011-09-03 22:32:54 PDT
Created attachment 106279 [details]
Patch and layout tests

Needs change log. We could also look to bolster the test case script-tag-with-fancy-unicode5.html with more high-valued Unicode code points.

Additionally, added an early return in XSSAuditor::init() when the document's URL is null or the empty string (i.e. url.isEmpty() evaluates to true). This can happen when opening a new browser window or calling window.open("") (i.e. not specifying a URL to window.open()).
Comment 12 Daniel Bates 2011-09-04 16:37:33 PDT
Created attachment 106296 [details]
Patch and layout tests

Add more high-valued Unicode code points to test case script-tag-with-16bit-unicode5.html (formerly named script-tag-with-fancy-unicode5.html).

I renamed the test cases such that they have the suffix 16bit-unicode in their name so as to more closely match the references to 16-bit Unicode escape sequences in the code. I am open to suggestions. Should we decide to use this suffix then we should rename the existing test case script-tag-with-fancy-unicode.html for consistency.

Added an additional test case, window-open-without-url-should-not-assert.html, to ensure we don't assert when opening a browser window without a URL using window.open("") (verbatim).
Comment 13 Adam Barth 2011-09-05 11:38:47 PDT
Comment on attachment 106296 [details]
Patch and layout tests

View in context: https://bugs.webkit.org/attachment.cgi?id=106296&action=review

This looks good.  Below are a bunch of minor comments.  I'd like to look at it one more time to make sure we get the multi-byte unicode stuff right.

> Source/WebCore/html/parser/XSSAuditor.cpp:126
> +static inline String commonDecodeURLEscapeSequences(const String& string, const TextEncoding& encoding)

decodeCommonURLEscapeSequences ?  As written, the name doesn't seem to read quite right.  Maybe decodeStandardURLEscapSequences ?

> Source/WebCore/html/parser/XSSAuditor.cpp:142
> +    workingString = decode16BitUnicodeEscapeSequences(workingString);

Don't we need to do this inside the loop too?

> Source/WebCore/platform/text/DecodeEscapeSequences.h:46
> +            && WTF::isASCIIHexDigit(string[start + 2]) && WTF::isASCIIHexDigit(string[start + 3])
> +            && WTF::isASCIIHexDigit(string[start + 4]) && WTF::isASCIIHexDigit(string[start + 5]);

You probably don't need the WTF:: parts.

> Source/WebCore/platform/text/DecodeEscapeSequences.h:52
> +        size_t numSequences = runLength / size;

numSequences => numberOfSequences

> Source/WebCore/platform/text/DecodeEscapeSequences.h:53
> +        Vector<UChar, 512> buffer;

Given that we need to return a String anyway, we should just use StringBuilder and reserveCapacity.  We can't avoid the malloc in any case.

> Source/WebCore/platform/text/DecodeEscapeSequences.h:57
> +            *p++ = (toASCIIHexValue(run[2]) << 12) | (toASCIIHexValue(run[3]) << 8) | (toASCIIHexValue(run[4]) << 4) | toASCIIHexValue(run[5]);

Is this correct even with surrogates?  I would have expected you to do something like what we do here:

http://trac.webkit.org/browser/trunk/Source/WebCore/html/parser/HTMLEntityParser.cpp#L73

> Source/WebCore/platform/text/DecodeEscapeSequences.h:76
> +        DEFINE_STATIC_LOCAL(CharBuffer, buffer, ());

Why are we declaring it static?  That can cause problems if this code is run on multiple threads or re-entered.  Just allocating it on the stack should be find (and essentially free).

> Source/WebCore/platform/text/DecodeEscapeSequences.h:78
> +        size_t numSequences = runLength / size;

numSequences => numberOfSequences

> Source/WebCore/platform/text/DecodeEscapeSequences.h:84
> +        }

I'd add an ASSERT about buffer.size.

> Source/WebCore/platform/text/DecodeEscapeSequences.h:92
> +    Vector<UChar> result;

StringBuilder is probably better here.

> Source/WebCore/platform/text/DecodeEscapeSequences.h:93
> +    unsigned length = string.length();

size_t <-- amounts of memory are measured in size_t so they work properly on 64-bit machines.

> Source/WebCore/platform/text/DecodeEscapeSequences.h:95
> +    unsigned decodedPosition = 0;
> +    unsigned searchPosition = 0;

Same here.
Comment 14 Daniel Bates 2011-09-08 16:17:35 PDT
(In reply to comment #13)
> [...]
> > Source/WebCore/html/parser/XSSAuditor.cpp:126
> > +static inline String commonDecodeURLEscapeSequences(const String& string, const TextEncoding& encoding)
> 
> decodeCommonURLEscapeSequences ?  As written, the name doesn't seem to read quite right.  Maybe decodeStandardURLEscapSequences ?
> 

I'll use decodeStandardURLEscapeSequences().

Both decodeCommonURLEscapeSequences() and decodeStandardURLEscapeSequences() seem to imply that there exists some uncommon/non-standard URL escape sequences that this function doesn't decode.

I wanted to convey that this function doesn't use the platform-specific KURL URL decoding machinery. So, I chose the prefix "common" as a synonym for cross-platform. Maybe prefixing with "crossPlatform" would be more descriptive? Another idea is to rename decodeURLEscapeSequences() to platformDecodeURLEscapeSequences() in KURL.{cpp, h} and KURLGoogle.{cpp, h}. Then we can reuse the name decodeURLEscapeSequences for a function that calls decodeEscapeSequences<URLEscapeSequence>().

> > Source/WebCore/html/parser/XSSAuditor.cpp:142
> > +    workingString = decode16BitUnicodeEscapeSequences(workingString);
> 
> Don't we need to do this inside the loop too?

Yes, will fix.

Then we'll catch the following attacks that intermix the %u- and URL- encodings, such as: http://localhost:8000/security/xssAuditor/resources/echo-intertag-decode-16bit-unicode.pl?q=<script>alert(/XS%u002525u0053/)</script>.

> 
> > Source/WebCore/platform/text/DecodeEscapeSequences.h:46
> > +            && WTF::isASCIIHexDigit(string[start + 2]) && WTF::isASCIIHexDigit(string[start + 3])
> > +            && WTF::isASCIIHexDigit(string[start + 4]) && WTF::isASCIIHexDigit(string[start + 5]);
> 
> You probably don't need the WTF:: parts.

Will remove.

> 
> > Source/WebCore/platform/text/DecodeEscapeSequences.h:52
> > +        size_t numSequences = runLength / size;
> 
> numSequences => numberOfSequences

Will rename.

> 
> > Source/WebCore/platform/text/DecodeEscapeSequences.h:53
> > +        Vector<UChar, 512> buffer;
> 
> Given that we need to return a String anyway, we should just use StringBuilder and reserveCapacity.  We can't avoid the malloc in any case.

Will use StringBuilder.

> 
> > Source/WebCore/platform/text/DecodeEscapeSequences.h:57
> > +            *p++ = (toASCIIHexValue(run[2]) << 12) | (toASCIIHexValue(run[3]) << 8) | (toASCIIHexValue(run[4]) << 4) | toASCIIHexValue(run[5]);
> 
> Is this correct even with surrogates?  I would have expected you to do something like what we do here:
> 
> http://trac.webkit.org/browser/trunk/Source/WebCore/html/parser/HTMLEntityParser.cpp#L73

Yes, this works with surrogates.

We discussed this over IRC on 09/06/2011. Will follow this comment with a transcript from that discussion.

> 
> > Source/WebCore/platform/text/DecodeEscapeSequences.h:76
> > +        DEFINE_STATIC_LOCAL(CharBuffer, buffer, ());
> 
> Why are we declaring it static?  That can cause problems if this code is run on multiple threads or re-entered.  Just allocating it on the stack should be find (and essentially free).

Will remove. Instead, will stack allocate a buffer.

> 
> > Source/WebCore/platform/text/DecodeEscapeSequences.h:78
> > +        size_t numSequences = runLength / size;
> 
> numSequences => numberOfSequences

Will rename.

> 
> > Source/WebCore/platform/text/DecodeEscapeSequences.h:84
> > +        }
> 
> I'd add an ASSERT about buffer.size.

Will add ASSERT(buffer.size() == static_cast<size_t>(p - buffer.data())).

> 
> > Source/WebCore/platform/text/DecodeEscapeSequences.h:92
> > +    Vector<UChar> result;
> 
> StringBuilder is probably better here.

Will use StringBuilder.

> 
> > Source/WebCore/platform/text/DecodeEscapeSequences.h:93
> > +    unsigned length = string.length();
> 
> size_t <-- amounts of memory are measured in size_t so they work properly on 64-bit machines.
> 
> > Source/WebCore/platform/text/DecodeEscapeSequences.h:95
> > +    unsigned decodedPosition = 0;
> > +    unsigned searchPosition = 0;
> 
> Same here.

Will use size_t.
Comment 15 Daniel Bates 2011-09-08 16:24:49 PDT
For completeness, I discussed with Adam over IRC on 09/06/2011 about the correctness of the decoding routine for %u-escape sequences that he asked about in comment 13:

"> Source/WebCore/platform/text/DecodeEscapeSequences.h:57
> +            *p++ = (toASCIIHexValue(run[2]) << 12) | (toASCIIHexValue(run[3]) << 8) | (toASCIIHexValue(run[4]) << 4) | toASCIIHexValue(run[5]);

Is this correct even with surrogates?  I would have expected you to do something like what we do here:

http://trac.webkit.org/browser/trunk/Source/WebCore/html/parser/HTMLEntityParser.cpp#L73"

The paraphrased IRC transcript follows:

[7:56pm] abarth: dydz: were you looking for me earlier?
[7:57pm] dydz: abarth: Hi abarth. I was. I wanted to talk about <https://bugs.webkit.org/show_bug.cgi?id=66588#c13>
[7:57pm] dydz: abarth: In particular, about the surrogates.
[7:57pm] abarth: yessir
[7:58pm] dydz: abarth: From my understanding a String obejct represents UTF-16 string
[7:58pm] dydz: And %u-escape sequences represent UTF-16 code points.
[7:58pm] dydz: err
[7:58pm] dydz: UTF-16 code unit
[7:58pm] dydz: That is, the escape sequence doesn't represent a Unicode code point.
[7:59pm] abarth: dydz: why do you think that?
[7:59pm] abarth: maybe ECMA-262 says?
[8:00pm] abarth: "%uxxxx, where xxxx is a Unicode value represented as four hexadecimal digits"
[8:00pm] abarth: is what wikipedia says
[8:00pm] abarth: which isn't all that precise 
[8:00pm] abarth: but you're right that 262 usually deals in code units
[8:01pm] abarth: http://www.w3.org/International/iri-edit/draft-duerst-iri.html
[8:01pm] abarth: might be a good source
[8:02pm] abarth: "Instead of using the existing percent-encoding convention of URIs, which is based on octets, the idea was to create a new encoding convention, for example to use '%u' to introduce UCS code points."
[8:02pm] dydz: abarth: I haven't read ECMA-262 yet, but for your reference the definition of Unicode code units seems consistent with the implementation of unescape <http://trac.webkit.org/browser/trunk/Source/JavaScriptCore/runtime/JSGlobalObjectFunctions.cpp#L568>
[8:03pm] pizlo joined the chat room.
[8:03pm] abarth: Lexer::convertUnicode
[8:03pm] abarth: ok
[8:03pm] abarth: you're probably right
[8:03pm] abarth: maybe add a comment explaining?
[8:04pm] dydz: abarth: sure
[8:04pm] dydz: abarth: Can you save me some time, is a UCS-2 code point the same as a UTF-16 code unit?
[8:05pm] abarth: http://en.wikipedia.org/wiki/UTF-16/UCS-2
[8:05pm] abarth: "The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996.[2] It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time."
[8:05pm] abarth: sounds like UTF-16 without the surroages
[8:06pm] abarth: and unable to represent characters that UTF-16 uses surrogates for
[8:10pm] dydz: abarth: ok. So a UCS-2 code point is analogous to a UTF-16 code unit.
Comment 16 Daniel Bates 2011-09-08 17:09:33 PDT
Created attachment 106817 [details]
Patch and layout tests

Updated patch based on Adam Barth's comments.

Added additional tests, including script-tag-with-16bit-unicode-surrogate-pair.html.
Comment 17 Adam Barth 2011-09-08 17:13:28 PDT
Comment on attachment 106817 [details]
Patch and layout tests

Thanks Dan.
Comment 18 Daniel Bates 2011-09-08 19:39:16 PDT
Committed r94828: <http://trac.webkit.org/changeset/94828>