The entity detector in WebCore's PreloadScanner is broken.
The HTML tokenizer used will accept things that look like entities but aren't and convert them into Unicode characters.
For example, in scanning the HTML to pull out IMG tags, we might have a case like this:
The tokenizer spots &lang_id=1 and thinks it might be an entity (it isn't!), but the test for entities isn't correct in the PreloadScanner (as it is in HTMLTokenizer).
The actual problematic line:
The loop actually halts and the text is check for entities when a non-alphanumeric character is reached. It should really only be checking when a semicolon is reached.
This causes query strings to get truncated and replaced with a unicode < symbol. The mangled URL is then passed back to the preloader looking like:
The preloader then tries to fetch it with an invalid URL (which will most likely 404).
Other examples where this might be problematic:
Basically, any query string variable name that starts like a HTML entity name and has a non-alphanumeric separator.
Proposed fix would to just remove the alphanumeric check. The semicolon check above should be sufficient, if there are cases of bad entities that that are too long or don't contain a semicolon, then leave them be.
SVN Rev: 55620
Regular WebKit on Mac OS X
Also replicated on Safari 4 Mac/Win, Google Chrome Mac/Win.
You can spot the preload issue in the Resource Inspector.
PreloadScanner (unlike the main tokenizer currently) implements the HTML5 entity parsing. The spec says (10.2.4):
"Consume the maximum number of characters possible, with the consumed characters matching one of the identifiers in the first column of the named character references table (in a case-sensitive manner).
If no match can be made, then this is a parse error. No characters are consumed, and nothing is returned.
If the last character matched is not a U+003B SEMICOLON character (;), there is a parse error.
If the character reference is being consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next character is in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned.
Otherwise, return a character token for the character corresponding to the character reference name (as given by the second column of the named character references table)."
Basically, if a named entity in attribute ends in non-alphanumeric character other than ; it is considered a parse error but the entity is still returned.
As far as I see the implementation matches the spec. If you think HTML5 is wrong here, you should send mail to the whatwg list and explain why.
The HTML spec has been amended to address these situations:
Reopening based on above. EQUALS SIGN character (=) at the end should cause the entity to be unconsumed.
See also: bug 41345.
The preload scanner uses the same tokenizer as the main parser now, so we'll fix both these bugs at once.
*** This bug has been marked as a duplicate of bug 41345 ***
Jeff, could you please verify whether this is indeed fixed in current nightly builds?
(In reply to comment #7)
> Jeff, could you please verify whether this is indeed fixed in current nightly builds?
Verified as fixed. Thanks guys!
(In reply to comment #8)
> (In reply to comment #7)
> > Jeff, could you please verify whether this is indeed fixed in current nightly builds?
> Verified as fixed. Thanks guys!
Verified in 6533.16, r64451