Bug 41345 - Parameter names in frame src URLs parsed incorrectly if resembles HTML entity code followed by underscore
Summary: Parameter names in frame src URLs parsed incorrectly if resembles HTML entity...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Frames (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC Windows XP
: P2 Normal
Assignee: Adam Barth
URL:
Keywords:
: 35831 (view as bug list)
Depends on:
Blocks: 41115
  Show dependency treegraph
 
Reported: 2010-06-29 05:52 PDT by atomicules
Modified: 2010-07-01 02:47 PDT (History)
7 users (show)

See Also:


Attachments
Test case showing error. (534 bytes, text/html)
2010-06-29 05:52 PDT, atomicules
no flags Details
reduction (417 bytes, text/html)
2010-06-29 10:50 PDT, Adam Barth
no flags Details
Patch (4.33 KB, patch)
2010-06-29 17:59 PDT, Adam Barth
no flags Details | Formatted Diff | Diff
Patch (4.59 KB, patch)
2010-06-30 23:25 PDT, Adam Barth
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description atomicules 2010-06-29 05:52:21 PDT
Created attachment 60014 [details]
Test case showing error.

(Note, originally filed as Issue 47509 on Chromium Issues, but then I read about Chrome using the new HTML5 parser in the Chromium-dev Google group, so verified with webkit nightly (r61877) and it is indeed a webkit bug).

Chrome Version       : 6.0.447.0 (Official Build 50594) dev
URLs (if applicable) : See attached file
Other browsers tested:
Add OK or FAIL after other browsers where you have tested this issue:
          Safari 5: OK
  Firefox 3.7a6pre: OK
              IE 6: OK
         

What steps will reproduce the problem?
1. Open up the attached webpage
2. Developer Tools > Elements will show the first paramter of the URL for the bottom frame as "setno=0∏_id="
3. View source on the page, or viewing the attached file in a text editor will show the first parameters as "setno=0&prod_id="

What is the expected result?

The URL parameters should be parsed correctly

What happens instead?

Parsed incorrectly.


Please provide any additional information below. Attach a screenshot if
possible.

The attached file is based on an internal intranet page of the company I work for. I have just changed the URLs to something that will load for everyone (google search), but have included the parameter from the intranet page as a dummy parameter to the google search.

This issue is new with 6.0.447.0. As it was working correctly in the previous build.

In testing this parsing error seems to be caused if a parameter name resembles a HTML entity code (i.e. just missing the semi-colon) followed by an underscore:

E.g: 

&pound_id  => £_id
&amp_id => &_id
&quot_id => "_id
Comment 1 Adam Barth 2010-06-29 10:50:02 PDT
Created attachment 60037 [details]
reduction

Here's a simpler reduction.  For some reason, Minefield is treating the ∏ entity differently from the £ entity.  We'll need to investigate further to understand why.
Comment 2 Alexey Proskuryakov 2010-06-29 11:32:51 PDT
See also: bug 35831.
Comment 3 Adam Barth 2010-06-29 17:24:16 PDT
*** Bug 35831 has been marked as a duplicate of this bug. ***
Comment 4 Adam Barth 2010-06-29 17:25:16 PDT
Some folks in the HTML working group explained the issue to me.  We're ignoring an important bit in the entity table.  Given that the old parser has the right behavior here, we must have that bit somewhere.  Investigating.
Comment 5 Adam Barth 2010-06-29 17:38:31 PDT
Looks like I missed this check:

http://trac.webkit.org/browser/trunk/WebCore/html/LegacyHTMLDocumentParser.cpp#L834
Comment 6 Adam Barth 2010-06-29 17:59:44 PDT
Created attachment 60073 [details]
Patch
Comment 7 Darin Adler 2010-06-29 18:01:33 PDT
Comment on attachment 60073 [details]
Patch

> +                        if (entity->code > 255)
> +                            break;

What's the meaning of the magic number 255 here? If it was a named constant it could have a name or comment indicating why 255 is the correct number.
Comment 8 Adam Barth 2010-06-29 18:03:27 PDT
(In reply to comment #7)
> (From update of attachment 60073 [details])
> > +                        if (entity->code > 255)
> > +                            break;
> 
> What's the meaning of the magic number 255 here? If it was a named constant it could have a name or comment indicating why 255 is the correct number.

I bet it's the characters that fit in Latin-1.  That's how the code in the legacy tokenizer is written (although it only applies that branch to attribute values, which doesn't match the spec or Minefield).

The spec just has a giant table:

http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references
Comment 9 Eric Seidel (no email) 2010-06-30 23:07:01 PDT
Comment on attachment 60073 [details]
Patch

Holy shit that needs a comment.  No sir, I cannot r+ such madness w/o explanation (or at least a spec link).
Comment 10 Adam Barth 2010-06-30 23:25:26 PDT
Created attachment 60203 [details]
Patch
Comment 11 Eric Seidel (no email) 2010-06-30 23:26:26 PDT
Comment on attachment 60203 [details]
Patch

Fantastic!  Thank you!
Comment 12 WebKit Commit Bot 2010-07-01 02:29:32 PDT
Comment on attachment 60203 [details]
Patch

Clearing flags on attachment: 60203

Committed r62241: <http://trac.webkit.org/changeset/62241>
Comment 13 WebKit Commit Bot 2010-07-01 02:29:43 PDT
All reviewed patches have been landed.  Closing bug.
Comment 14 WebKit Review Bot 2010-07-01 02:47:50 PDT
http://trac.webkit.org/changeset/62241 might have broken Qt Linux Release