WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED DUPLICATE of
bug 41345
Bug 35831
WebCore PreloadScanner Entity Detection Bug - Non-HTML Entities are being treated as entities
https://bugs.webkit.org/show_bug.cgi?id=35831
Summary
WebCore PreloadScanner Entity Detection Bug - Non-HTML Entities are being tre...
Jeff
Reported
2010-03-06 12:46:02 PST
The entity detector in WebCore's PreloadScanner is broken. The HTML tokenizer used will accept things that look like entities but aren't and convert them into Unicode characters. For example, in scanning the HTML to pull out IMG tags, we might have a case like this: <img src="
http://www.webkit.org/getImage.aspx?id=12345&lang_id=1
"/> The tokenizer spots &lang_id=1 and thinks it might be an entity (it isn't!), but the test for entities isn't correct in the PreloadScanner (as it is in HTMLTokenizer). Code area:
http://trac.webkit.org/browser/trunk/WebCore/html/PreloadScanner.cpp#L257
The actual problematic line:
http://trac.webkit.org/browser/trunk/WebCore/html/PreloadScanner.cpp#L268
The loop actually halts and the text is check for entities when a non-alphanumeric character is reached. It should really only be checking when a semicolon is reached. This causes query strings to get truncated and replaced with a unicode < symbol. The mangled URL is then passed back to the preloader looking like: <img src="
http://www.webkit.org/getImage.aspx?id=12345
<_id=1"/> The preloader then tries to fetch it with an invalid URL (which will most likely 404). Other examples where this might be problematic: &_energy=100 <-now=10 Basically, any query string variable name that starts like a HTML entity name and has a non-alphanumeric separator. Proposed fix would to just remove the alphanumeric check. The semicolon check above should be sufficient, if there are cases of bad entities that that are too long or don't contain a semicolon, then leave them be. Build Info: SVN Rev: 55620 Regular WebKit on Mac OS X XCode 3.2.1
Attachments
Add attachment
proposed patch, testcase, etc.
Jeff
Comment 1
2010-03-06 12:49:36 PST
Also replicated on Safari 4 Mac/Win, Google Chrome Mac/Win. You can spot the preload issue in the Resource Inspector.
Antti Koivisto
Comment 2
2010-03-07 12:03:20 PST
PreloadScanner (unlike the main tokenizer currently) implements the HTML5 entity parsing. The spec says (10.2.4): "Consume the maximum number of characters possible, with the consumed characters matching one of the identifiers in the first column of the named character references table (in a case-sensitive manner). If no match can be made, then this is a parse error. No characters are consumed, and nothing is returned. If the last character matched is not a U+003B SEMICOLON character (;), there is a parse error. If the character reference is being consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next character is in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. Otherwise, return a character token for the character corresponding to the character reference name (as given by the second column of the named character references table)." Basically, if a named entity in attribute ends in non-alphanumeric character other than ; it is considered a parse error but the entity is still returned. As far as I see the implementation matches the spec. If you think HTML5 is wrong here, you should send mail to the whatwg list and explain why.
Jeff
Comment 3
2010-04-06 12:37:46 PDT
The HTML spec has been amended to address these situations:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9207
Antti Koivisto
Comment 4
2010-04-06 14:41:33 PDT
Reopening based on above. EQUALS SIGN character (=) at the end should cause the entity to be unconsumed.
Alexey Proskuryakov
Comment 5
2010-06-29 11:33:32 PDT
See also:
bug 41345
.
Adam Barth
Comment 6
2010-06-29 17:24:16 PDT
The preload scanner uses the same tokenizer as the main parser now, so we'll fix both these bugs at once. *** This bug has been marked as a duplicate of
bug 41345
***
Alexey Proskuryakov
Comment 7
2010-08-02 07:42:13 PDT
Jeff, could you please verify whether this is indeed fixed in current nightly builds?
Jeff
Comment 8
2010-08-03 07:18:23 PDT
(In reply to
comment #7
)
> Jeff, could you please verify whether this is indeed fixed in current nightly builds?
Verified as fixed. Thanks guys!
Jeff
Comment 9
2010-08-03 07:19:52 PDT
(In reply to
comment #8
)
> (In reply to
comment #7
) > > Jeff, could you please verify whether this is indeed fixed in current nightly builds? > > Verified as fixed. Thanks guys!
Verified in 6533.16,
r64451
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug