Bug 16179 - HTML parser should allow non-ASCII characters in tag and attribute names
Summary: HTML parser should allow non-ASCII characters in tag and attribute names
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC Windows XP
: P2 Normal
Assignee: Nobody
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2007-11-28 15:11 PST by johnnyding
Modified: 2007-12-20 14:12 PST (History)
4 users (show)

See Also:


Attachments
some Chinese websites use one Chinese space symbol +U3000 as space to separate attribute name/value group, then cause WebKit got assert failed (424 bytes, text/html)
2007-11-28 15:13 PST, johnnyding
no flags Details
same test case encoding by unicode (1.06 KB, text/html)
2007-11-28 17:17 PST, johnnyding
no flags Details
dump attributs of one html element which has illegal characters in attribute name (1.40 KB, text/html)
2007-12-04 11:40 PST, johnnyding
no flags Details
patch for fixing this bug (8.90 KB, patch)
2007-12-05 16:13 PST, johnnyding
darin: review-
Details | Formatted Diff | Diff
patch for fixing this bug (8.30 KB, patch)
2007-12-05 18:40 PST, johnnyding
darin: review+
Details | Formatted Diff | Diff
Patch v2. (8.30 KB, patch)
2007-12-07 11:10 PST, johnnyding
ap: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description johnnyding 2007-11-28 15:11:59 PST
in HTML spec, any attribute should be one of basic HTML data typwe: NAME tokens, which  must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").

However in WebKit's HTMLTokenizer, it did not check the composed characters of attribute name whether follow HTML spec or not, it just cut off the UChar's high 8bits and assign the low 8 bits to attribute name buffer which is a char buffer and is used to gather attribute name characters and generate final attribute atomicstring name. (section:case AttributeName, func:HTMLTokenizer::parseTag,  file:Webkit\html\HTMLTokenizer.cpp)
So if any attribute name start with a Unicode which like #xx00, then finally the attribute name buffer will get data like  #00  #xx ..., which cause current attribute name will be a empty atomicstring, then  section:case QuotedValue, empty attribute name cause attribute name is same with attribute value which is CDATA type and maybe contain some characters which are illegal in attribute name , however the function Token::addAttribute will check the attribute name must not contain '/'. if the attribute value is URL, then we got assert failed.

The following is a testcase.  some Chinese websites use one Chinese space symbol +U3000 as space to separate attribute name/value group, then cause WebKit got assert failed.

For fixing this problem, I think
1) we may change the temporary attribute name/value buffer cBuffer as UChar buffer, of course, some other code need to be changed.
2) detect the illegal character and discard it.
Comment 1 johnnyding 2007-11-28 15:13:20 PST
Created attachment 17582 [details]
some Chinese websites use one Chinese space symbol +U3000 as space to separate attribute name/value group, then cause WebKit got assert failed
Comment 2 johnnyding 2007-11-28 17:17:52 PST
Created attachment 17583 [details]
same test case encoding by unicode
Comment 3 David Kilzer (:ddkilzer) 2007-11-28 23:53:23 PST
<rdar://problem/5619399>
Comment 4 johnnyding 2007-11-29 15:04:12 PST
Hi David, is that mean the bug has been upstreamed to apple and apple engineer will handle this, or they had found this bug before?
Comment 5 David Kilzer (:ddkilzer) 2007-12-02 09:37:14 PST
(In reply to comment #4)
> Hi David, is that mean the bug has been upstreamed to apple and apple engineer
> will handle this, or they had found this bug before?

A new Radar has been created for tracking purposes.  There are no guarantees about who will work on this or when it will get fixed.  Patches from the open source community are always welcome!

Comment 6 johnnyding 2007-12-04 11:39:04 PST
Great, I have a patch for this problem, I will upload it later and let you guy review it.

For fixing this problem, I test it in IE6/7, Firefox 2/3, Opera9. They have same behavior about this problem, safari got crash.

The characters which should be illegal in attribute name according HTML spec are in attribute name will be treated as legal characters, they are part of attribute name.

Please check the following test case.
Comment 7 johnnyding 2007-12-04 11:40:38 PST
Created attachment 17705 [details]
dump attributs of one html element which has illegal characters in attribute name
Comment 8 johnnyding 2007-12-05 16:13:57 PST
Created attachment 17727 [details]
patch for fixing this bug
Comment 9 Darin Adler 2007-12-05 16:25:39 PST
Comment on attachment 17727 [details]
patch for fixing this bug

+                    // from statck, its' length is 10, should be OK.

Misspellings here: statck should be stack and its' should be its.

+                    unsigned int testedEntityNameLen = 0;

We normally delcar these as unsigned.

+                        if (cBuffer[testedEntityNameLen] > 0xff)

Seems to me we could use a smaller cutoff. Only ASCII characters can be in the entity names, so the cutoff could be 7F or 7E.

+                            break;
+                        else
+                            chTmpEntityNameBuffer[testedEntityNameLen] = cBuffer[testedEntityNameLen];

No need for an else after a break; we don't like to nest code like this.

The layout test should not be in the webarchive directory -- it's not a test of web archives. It should go into the fast/parser directory since it's a test of the HTML parser.

Also, the test should use "dumpAsText()" so it can work cross-platform.

Since this is super-hot code, we need to do some performance testing to make sure it doesn't slow things down.
Comment 10 johnnyding 2007-12-05 18:40:21 PST
Created attachment 17733 [details]
patch for fixing this bug

Hi Darin. Thanks for your comment. Please have another look of my new patch.

About the performance test. Anything need I add to patch?
Also I think since only ASCII characters can be as entity name, if I use one stand-alone char buffer to temporarily save and deal with entity name, it should not effect the performance, right?

Thanks!
Comment 11 Darin Adler 2007-12-06 14:44:44 PST
Comment on attachment 17733 [details]
patch for fixing this bug

Alexey informed me that there may be some security concerns with supporting these additional characters in tag and attribute names. I don't know the details yet. Alexey would you be willing to comment?

+        (WebCore::HTMLTokenizer::parseEntity): Handle unicode Entity Name by using acsii version findentity.

"ASCII, not acssi". "version of findEntity", not "version findentity".

"Unicode", not "unicode".

Someone should fix the title of the bug; it no longer matches what's being fixed here.

+                    // Since the maximum length of entity name only
+                    // can be 9, so one char array which is allocated
+                    // from stack, its length is 10, should be OK.
+                    // Also if we have illegal character, we treat it
+                    // as illegal entity name.

"maximum length is 9", not "maximum length can be 9"

"a single char array", not "one char array"

"on the stack", not "from stack"

"have an illegal character", not "have illegal character"

+                    char chTmpEntityNameBuffer[10];

We don't normally use type prefixes like "ch" in code.

+
Comment 12 Alexey Proskuryakov 2007-12-06 22:37:25 PST
(In reply to comment #11)
> Alexey informed me that there may be some security concerns with supporting
> these additional characters in tag and attribute names. I don't know the
> details yet. Alexey would you be willing to comment?

I tried to say that treating U+3000 as whitespace could be dangerous, referring to the comment that "some Chinese websites use one Chinese space symbol +U3000 as space to separate attribute name/value group" - I'm not aware of any issues with treating them as non-whitespace.
Comment 13 johnnyding 2007-12-06 23:20:31 PST
(In reply to comment #12)
> (In reply to comment #11)
> > Alexey informed me that there may be some security concerns with supporting
> > these additional characters in tag and attribute names. I don't know the
> > details yet. Alexey would you be willing to comment?
> 
> I tried to say that treating U+3000 as whitespace could be dangerous, referring
> to the comment that "some Chinese websites use one Chinese space symbol +U3000
> as space to separate attribute name/value group" - I'm not aware of any issues
> with treating them as non-whitespace.
> 
Oh, I guess it's my fault, My previous meaning was that some authors of Chinese sites tried to use one Chinese space symbol +U3000 as space to separate attribute name/value group, or maybe they just did not realize they used some symbols which look like space. it does not mean WebKit need to treat them as space. Also in my patch, I just treat those characters as normal, right? My previous sentence just guess the motivation about why they used those strange symbol characters in their pages.

So are we clear?
Comment 14 johnnyding 2007-12-06 23:32:08 PST
Sorry for my pool wording, I will fix that and keep it in my mind!:-)
The new patch will uploaded soon.

I agree we should change the title to avoid misunderstanding. maybe like:
Need WebKit to support Unicode characters as part of attribute name?

About the performance test you mentioned in previous review, any update? How about my approach by using stand-alone buffer to temporarily save entity name?

Thanks!

(In reply to comment #11)
> (From update of attachment 17733 [details] [edit])
> Alexey informed me that there may be some security concerns with supporting
> these additional characters in tag and attribute names. I don't know the
> details yet. Alexey would you be willing to comment?
> 
> +        (WebCore::HTMLTokenizer::parseEntity): Handle unicode Entity Name by
> using acsii version findentity.
> 
> "ASCII, not acssi". "version of findEntity", not "version findentity".
> 
> "Unicode", not "unicode".
> 
> Someone should fix the title of the bug; it no longer matches what's being
> fixed here.
> 
> +                    // Since the maximum length of entity name only
> +                    // can be 9, so one char array which is allocated
> +                    // from stack, its length is 10, should be OK.
> +                    // Also if we have illegal character, we treat it
> +                    // as illegal entity name.
> 
> "maximum length is 9", not "maximum length can be 9"
> 
> "a single char array", not "one char array"
> 
> "on the stack", not "from stack"
> 
> "have an illegal character", not "have illegal character"
> 
> +                    char chTmpEntityNameBuffer[10];
> 
> We don't normally use type prefixes like "ch" in code.
> 
> +
> 

(In reply to comment #13)
> (In reply to comment #12)
> > (In reply to comment #11)
> > > Alexey informed me that there may be some security concerns with supporting
> > > these additional characters in tag and attribute names. I don't know the
> > > details yet. Alexey would you be willing to comment?
> > 
> > I tried to say that treating U+3000 as whitespace could be dangerous, referring
> > to the comment that "some Chinese websites use one Chinese space symbol +U3000
> > as space to separate attribute name/value group" - I'm not aware of any issues
> > with treating them as non-whitespace.
> > 
> Oh, I guess it's my fault, My previous meaning was that some authors of Chinese
> sites tried to use one Chinese space symbol +U3000 as space to separate
> attribute name/value group, or maybe they just did not realize they used some
> symbols which look like space. it does not mean WebKit need to treat them as
> space. Also in my patch, I just treat those characters as normal, right? My
> previous sentence just guess the motivation about why they used those strange
> symbol characters in their pages.
> 
> So are we clear?
> 

Comment 15 johnnyding 2007-12-06 23:36:41 PST
(In reply to comment #13)
> (In reply to comment #12)
> > (In reply to comment #11)
> > > Alexey informed me that there may be some security concerns with supporting
> > > these additional characters in tag and attribute names. I don't know the
> > > details yet. Alexey would you be willing to comment?
> > 
> > I tried to say that treating U+3000 as whitespace could be dangerous, referring
> > to the comment that "some Chinese websites use one Chinese space symbol +U3000
> > as space to separate attribute name/value group" - I'm not aware of any issues
> > with treating them as non-whitespace.
> > 
> Oh, I guess it's my fault, My previous meaning was that some authors of Chinese
> sites tried to use one Chinese space symbol +U3000 as space to separate
> attribute name/value group, or maybe they just did not realize they used some
> symbols which look like space. it does not mean WebKit need to treat them as
> space. Also in my patch, I just treat those characters as normal, right? My
> previous sentence just guess the motivation about why they used those strange
> symbol characters in their pages.
> 
> So are we clear?
> 

Hi Alexey,
Thanks very much for correcting me!
Comment 16 Darin Adler 2007-12-07 01:18:20 PST
Comment on attachment 17733 [details]
patch for fixing this bug

OK, r=me.

We'll have to get someone to do some performance tests after landing this.
Comment 17 johnnyding 2007-12-07 11:10:18 PST
Created attachment 17774 [details]
Patch v2. 

Thanks Darin!

New patch is already. According to your review comments, I modified some wording and variable name based on patch 17733.
Please take a look when you are available.
Comment 18 johnnyding 2007-12-10 13:46:44 PST
Hi Darin,

Can I consider my patch v2 passed review? If yes, I will submit this change. 

Thanks!
Comment 19 Alexey Proskuryakov 2007-12-16 09:15:51 PST
Comment on attachment 17774 [details]
Patch v2. 

Marking reviewed, as this is basically the same patch.

Not sure how to proceed - who can run the performance tests?
Comment 20 Darin Adler 2007-12-16 09:20:59 PST
(In reply to comment #19)
> Not sure how to proceed - who can run the performance tests?

Probably Stephanie. You should send her a note.
Comment 21 Stephanie Lewis 2007-12-20 14:12:18 PST
Tested and Landed revision 28908