Bug 35831

Summary: WebCore PreloadScanner Entity Detection Bug - Non-HTML Entities are being treated as entities
Product: WebKit Reporter: Jeff <mirthy>
Component: Page LoadingAssignee: Nobody <webkit-unassigned>
Status: RESOLVED DUPLICATE    
Severity: Normal CC: abarth, ap, inhoahiephcm, koivisto
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Mac (Intel)   
OS: OS X 10.6   
URL: http://www.vistaprint.com/gallery.aspx

Description Jeff 2010-03-06 12:46:02 PST
The entity detector in WebCore's PreloadScanner is broken.

The HTML tokenizer used will accept things that look like entities but aren't and convert them into Unicode characters.

For example, in scanning the HTML to pull out IMG tags, we might have a case like this:

<img src="http://www.webkit.org/getImage.aspx?id=12345&lang_id=1"/>

The tokenizer spots &lang_id=1 and thinks it might be an entity (it isn't!), but the test for entities isn't correct in the PreloadScanner (as it is in HTMLTokenizer).

Code area:
http://trac.webkit.org/browser/trunk/WebCore/html/PreloadScanner.cpp#L257

The actual problematic line:
http://trac.webkit.org/browser/trunk/WebCore/html/PreloadScanner.cpp#L268

The loop actually halts and the text is check for entities when a non-alphanumeric character is reached.  It should really only be checking when a semicolon is reached.

This causes query strings to get truncated and replaced with a unicode < symbol.  The mangled URL is then passed back to the preloader looking like:
<img src="http://www.webkit.org/getImage.aspx?id=12345<_id=1"/>

The preloader then tries to fetch it with an invalid URL (which will most likely 404).

Other examples where this might be problematic:
&amp_energy=100
&lt-now=10

Basically, any query string variable name that starts like a HTML entity name and has a non-alphanumeric separator.

Proposed fix would to just remove the alphanumeric check.  The semicolon check above should be sufficient, if there are cases of bad entities that that are too long or don't contain a semicolon, then leave them be.

Build Info:
SVN Rev: 55620
Regular WebKit on Mac OS X 
XCode 3.2.1
Comment 1 Jeff 2010-03-06 12:49:36 PST
Also replicated on Safari 4 Mac/Win, Google Chrome Mac/Win.

You can spot the preload issue in the Resource Inspector.
Comment 2 Antti Koivisto 2010-03-07 12:03:20 PST
PreloadScanner (unlike the main tokenizer currently) implements the HTML5 entity parsing. The spec says (10.2.4):

"Consume the maximum number of characters possible, with the consumed characters matching one of the identifiers in the first column of the named character references table (in a case-sensitive manner).

If no match can be made, then this is a parse error. No characters are consumed, and nothing is returned.

If the last character matched is not a U+003B SEMICOLON character (;), there is a parse error.

If the character reference is being consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next character is in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned.

Otherwise, return a character token for the character corresponding to the character reference name (as given by the second column of the named character references table)."

Basically, if a named entity in attribute ends in non-alphanumeric character other than ; it is considered a parse error but the entity is still returned.

As far as I see the implementation matches the spec. If you think HTML5 is wrong here, you should send mail to the whatwg list and explain why.
Comment 3 Jeff 2010-04-06 12:37:46 PDT
The HTML spec has been amended to address these situations:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9207
Comment 4 Antti Koivisto 2010-04-06 14:41:33 PDT
Reopening based on above. EQUALS SIGN character (=) at the end should cause the entity to be unconsumed.
Comment 5 Alexey Proskuryakov 2010-06-29 11:33:32 PDT
See also: bug 41345.
Comment 6 Adam Barth 2010-06-29 17:24:16 PDT
The preload scanner uses the same tokenizer as the main parser now, so we'll fix both these bugs at once.

*** This bug has been marked as a duplicate of bug 41345 ***
Comment 7 Alexey Proskuryakov 2010-08-02 07:42:13 PDT
Jeff, could you please verify whether this is indeed fixed in current nightly builds?
Comment 8 Jeff 2010-08-03 07:18:23 PDT
(In reply to comment #7)
> Jeff, could you please verify whether this is indeed fixed in current nightly builds?

Verified as fixed.  Thanks guys!
Comment 9 Jeff 2010-08-03 07:19:52 PDT
(In reply to comment #8)
> (In reply to comment #7)
> > Jeff, could you please verify whether this is indeed fixed in current nightly builds?
> 
> Verified as fixed.  Thanks guys!

Verified in 6533.16, r64451