Bug 67134

Summary: XSSAuditor bypass under big5 encoding (also sjis).
Product: WebKit Reporter: Thomas Sepez <tsepez>
Component: WebCore Misc.Assignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, ap, dbates, webkit.review.bot
Priority: P2 Keywords: XSSAuditor
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on: 66588    
Bug Blocks: 66579    
Attachments:
Description Flags
Proposed tests.
dbates: review-, dbates: commit-queue-
patch using perl for testcase. none

Description Thomas Sepez 2011-08-29 10:13:41 PDT
XSSAuditor can be tricked under those encodings where not all bytes of a multibyte character are greater than 0x80.  Big5 is one such example where trailing bytes need only be greater than 0x40.

third_party/WebKit/Source/WebCore/platform/KURLGoogle.cpp: decodeURLEscapeSequences() isn't character-set aware.  So given a single big5 character input like 0xc8 0x5f, this gets transformed into two characters (code points): c8 and 5f, due to the (correct) recovery when trying to interpret this as utf8.

Later, the XSSAuditor removes all non-ascii code points in both the page contents and the URL, so as to be immune to these kinds of misinterpretations when comparing the page contents against the URL. This works well in the cases where a misinterpretation introduces two high-valued characters instead of one high-valued character, as typically happens when all the bytes in the multibyte sequence are greater than 0x80.  But in the example page, it removes the (one) high-valued character, but in the URL removes the first misinterpreted byte but leaves the 5f.  Not having such a character in the page, the match fails.

Correct fix this is to make decodeURLEscapeSequences() encoding-aware (as is the non-G version in KURL.ccp), but this has its own issues as detailed in the comments in KURLGoogle.cpp).
Comment 1 Thomas Sepez 2011-09-07 11:25:42 PDT
I've written a test for this, but it manifests only under --chromium.  There's a happy co-incidence that generally masks this on the KURL.cpp side: namely what happens when a bad sequence is encountered.

As indicated above, the KURLGoogle path is something like:
%c9%5f => 00c9 005f => c3 89 5f => [ c389 big5 char ] 5f => [ hi unicode char ] 5f => 5f
Where each byte of the invalid sequence is passed through as the equivlaent codepoint (after first arrow above).

The KURL path is something like:
%c9%5f => fffd 005f => ef  bf bd 5f => [ efbf big 5 char] [ bd5f big5 char] => [hi unicode][hi unicode]=> empty string
where each byte of the invalid sequence is replaced by the unicode replacement character U+fffd.

The co-incidence is that during the KURL case, the replacement codepoint turns into an odd number of utf8 bytes (after the second arrow) but the KURLGoogle replacement code point turns into an even number of utf8 bytes. Hence the byte for the following 5f gets eaten as part of a second big5 character in one case, but gets passed through as-is in the other. 

There will be variations involving other charsets that don't hit this happy co-incidence, but this explains why the test case I was working on passed under the normal build.
Comment 2 Thomas Sepez 2011-09-07 15:13:16 PDT
Have constructed an example in shift_jis that will flunk against the present code but is expected to pass once the issue is fixed.

We believe the fix to https://bugs.webkit.org/show_bug.cgi?id=66588 as proposed by Daniel Bates will cover this case.  I'm just uploading the two prospective tests while waiting for that bug to close.
Comment 3 Thomas Sepez 2011-09-07 15:17:05 PDT
Created attachment 106655 [details]
Proposed tests.

Just to get code up for review even though we can't land at present.
Comment 4 Thomas Sepez 2011-09-09 10:13:35 PDT
These now passed for me on mac under both normal and --chromium cases following the fix to 66588.
Comment 5 Daniel Bates 2011-09-09 10:35:13 PDT
Comment on attachment 106655 [details]
Proposed tests.

Can we write these tests using some kind of Perl script that echos back the input data instead of hardcoding the response page? Adding such a script or modifying an existing script will make it easier to extend our test coverage for such encoding issues. If it's not possible to write such a script please elaborate on why.

Take a look at the Encode module <http://perldoc.perl.org/Encode.html> for encoding text.
Comment 6 Thomas Sepez 2011-09-09 10:47:19 PDT
I'm just concerned about the correct functioning of the test on any given platform depending on the correctness of the perl implementation on that platform.  Aren't there already a bunch of these xssauditor tests that flunk only on certain ports?  Perhaps because the cgi implementation does some unexpected transformation or the perhaps the encoding doesn't come back as expected. 

I guess I want a test that says when these bytes go up, and these bytes come back, this is what happens.
Comment 7 Thomas Sepez 2011-09-09 11:56:37 PDT
Created attachment 106895 [details]
patch using perl for testcase.

Updated echo-intertag.pl to allow specification of charset in content-type header.  We don't decode the strings because we want the exact byte sequence returned as passed in escaped in the q parameter.  This appears to be the case when CGI is invoked without the utf8 option, treating the parameter as if it were a binary string.
Comment 8 Daniel Bates 2011-09-09 15:03:07 PDT
Comment on attachment 106895 [details]
patch using perl for testcase.

Thanks Tom. This looks good to me.
Comment 9 WebKit Review Bot 2011-09-09 16:04:34 PDT
Comment on attachment 106895 [details]
patch using perl for testcase.

Clearing flags on attachment: 106895

Committed r94884: <http://trac.webkit.org/changeset/94884>
Comment 10 WebKit Review Bot 2011-09-09 16:04:39 PDT
All reviewed patches have been landed.  Closing bug.