WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
16179
HTML parser should allow non-ASCII characters in tag and attribute names
https://bugs.webkit.org/show_bug.cgi?id=16179
Summary
HTML parser should allow non-ASCII characters in tag and attribute names
johnnyding
Reported
2007-11-28 15:11:59 PST
in HTML spec, any attribute should be one of basic HTML data typwe: NAME tokens, which must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods ("."). However in WebKit's HTMLTokenizer, it did not check the composed characters of attribute name whether follow HTML spec or not, it just cut off the UChar's high 8bits and assign the low 8 bits to attribute name buffer which is a char buffer and is used to gather attribute name characters and generate final attribute atomicstring name. (section:case AttributeName, func:HTMLTokenizer::parseTag, file:Webkit\html\HTMLTokenizer.cpp) So if any attribute name start with a Unicode which like #xx00, then finally the attribute name buffer will get data like #00 #xx ..., which cause current attribute name will be a empty atomicstring, then section:case QuotedValue, empty attribute name cause attribute name is same with attribute value which is CDATA type and maybe contain some characters which are illegal in attribute name , however the function Token::addAttribute will check the attribute name must not contain '/'. if the attribute value is URL, then we got assert failed. The following is a testcase. some Chinese websites use one Chinese space symbol +U3000 as space to separate attribute name/value group, then cause WebKit got assert failed. For fixing this problem, I think 1) we may change the temporary attribute name/value buffer cBuffer as UChar buffer, of course, some other code need to be changed. 2) detect the illegal character and discard it.
Attachments
some Chinese websites use one Chinese space symbol +U3000 as space to separate attribute name/value group, then cause WebKit got assert failed
(424 bytes, text/html)
2007-11-28 15:13 PST
,
johnnyding
no flags
Details
same test case encoding by unicode
(1.06 KB, text/html)
2007-11-28 17:17 PST
,
johnnyding
no flags
Details
dump attributs of one html element which has illegal characters in attribute name
(1.40 KB, text/html)
2007-12-04 11:40 PST
,
johnnyding
no flags
Details
patch for fixing this bug
(8.90 KB, patch)
2007-12-05 16:13 PST
,
johnnyding
darin
: review-
Details
Formatted Diff
Diff
patch for fixing this bug
(8.30 KB, patch)
2007-12-05 18:40 PST
,
johnnyding
darin
: review+
Details
Formatted Diff
Diff
Patch v2.
(8.30 KB, patch)
2007-12-07 11:10 PST
,
johnnyding
ap
: review+
Details
Formatted Diff
Diff
Show Obsolete
(1)
View All
Add attachment
proposed patch, testcase, etc.
johnnyding
Comment 1
2007-11-28 15:13:20 PST
Created
attachment 17582
[details]
some Chinese websites use one Chinese space symbol +U3000 as space to separate attribute name/value group, then cause WebKit got assert failed
johnnyding
Comment 2
2007-11-28 17:17:52 PST
Created
attachment 17583
[details]
same test case encoding by unicode
David Kilzer (:ddkilzer)
Comment 3
2007-11-28 23:53:23 PST
<
rdar://problem/5619399
>
johnnyding
Comment 4
2007-11-29 15:04:12 PST
Hi David, is that mean the bug has been upstreamed to apple and apple engineer will handle this, or they had found this bug before?
David Kilzer (:ddkilzer)
Comment 5
2007-12-02 09:37:14 PST
(In reply to
comment #4
)
> Hi David, is that mean the bug has been upstreamed to apple and apple engineer > will handle this, or they had found this bug before?
A new Radar has been created for tracking purposes. There are no guarantees about who will work on this or when it will get fixed. Patches from the open source community are always welcome!
johnnyding
Comment 6
2007-12-04 11:39:04 PST
Great, I have a patch for this problem, I will upload it later and let you guy review it. For fixing this problem, I test it in IE6/7, Firefox 2/3, Opera9. They have same behavior about this problem, safari got crash. The characters which should be illegal in attribute name according HTML spec are in attribute name will be treated as legal characters, they are part of attribute name. Please check the following test case.
johnnyding
Comment 7
2007-12-04 11:40:38 PST
Created
attachment 17705
[details]
dump attributs of one html element which has illegal characters in attribute name
johnnyding
Comment 8
2007-12-05 16:13:57 PST
Created
attachment 17727
[details]
patch for fixing this bug
Darin Adler
Comment 9
2007-12-05 16:25:39 PST
Comment on
attachment 17727
[details]
patch for fixing this bug + // from statck, its' length is 10, should be OK. Misspellings here: statck should be stack and its' should be its. + unsigned int testedEntityNameLen = 0; We normally delcar these as unsigned. + if (cBuffer[testedEntityNameLen] > 0xff) Seems to me we could use a smaller cutoff. Only ASCII characters can be in the entity names, so the cutoff could be 7F or 7E. + break; + else + chTmpEntityNameBuffer[testedEntityNameLen] = cBuffer[testedEntityNameLen]; No need for an else after a break; we don't like to nest code like this. The layout test should not be in the webarchive directory -- it's not a test of web archives. It should go into the fast/parser directory since it's a test of the HTML parser. Also, the test should use "dumpAsText()" so it can work cross-platform. Since this is super-hot code, we need to do some performance testing to make sure it doesn't slow things down.
johnnyding
Comment 10
2007-12-05 18:40:21 PST
Created
attachment 17733
[details]
patch for fixing this bug Hi Darin. Thanks for your comment. Please have another look of my new patch. About the performance test. Anything need I add to patch? Also I think since only ASCII characters can be as entity name, if I use one stand-alone char buffer to temporarily save and deal with entity name, it should not effect the performance, right? Thanks!
Darin Adler
Comment 11
2007-12-06 14:44:44 PST
Comment on
attachment 17733
[details]
patch for fixing this bug Alexey informed me that there may be some security concerns with supporting these additional characters in tag and attribute names. I don't know the details yet. Alexey would you be willing to comment? + (WebCore::HTMLTokenizer::parseEntity): Handle unicode Entity Name by using acsii version findentity. "ASCII, not acssi". "version of findEntity", not "version findentity". "Unicode", not "unicode". Someone should fix the title of the bug; it no longer matches what's being fixed here. + // Since the maximum length of entity name only + // can be 9, so one char array which is allocated + // from stack, its length is 10, should be OK. + // Also if we have illegal character, we treat it + // as illegal entity name. "maximum length is 9", not "maximum length can be 9" "a single char array", not "one char array" "on the stack", not "from stack" "have an illegal character", not "have illegal character" + char chTmpEntityNameBuffer[10]; We don't normally use type prefixes like "ch" in code. +
Alexey Proskuryakov
Comment 12
2007-12-06 22:37:25 PST
(In reply to
comment #11
)
> Alexey informed me that there may be some security concerns with supporting > these additional characters in tag and attribute names. I don't know the > details yet. Alexey would you be willing to comment?
I tried to say that treating U+3000 as whitespace could be dangerous, referring to the comment that "some Chinese websites use one Chinese space symbol +U3000 as space to separate attribute name/value group" - I'm not aware of any issues with treating them as non-whitespace.
johnnyding
Comment 13
2007-12-06 23:20:31 PST
(In reply to
comment #12
)
> (In reply to
comment #11
) > > Alexey informed me that there may be some security concerns with supporting > > these additional characters in tag and attribute names. I don't know the > > details yet. Alexey would you be willing to comment? > > I tried to say that treating U+3000 as whitespace could be dangerous, referring > to the comment that "some Chinese websites use one Chinese space symbol +U3000 > as space to separate attribute name/value group" - I'm not aware of any issues > with treating them as non-whitespace. >
Oh, I guess it's my fault, My previous meaning was that some authors of Chinese sites tried to use one Chinese space symbol +U3000 as space to separate attribute name/value group, or maybe they just did not realize they used some symbols which look like space. it does not mean WebKit need to treat them as space. Also in my patch, I just treat those characters as normal, right? My previous sentence just guess the motivation about why they used those strange symbol characters in their pages. So are we clear?
johnnyding
Comment 14
2007-12-06 23:32:08 PST
Sorry for my pool wording, I will fix that and keep it in my mind!:-) The new patch will uploaded soon. I agree we should change the title to avoid misunderstanding. maybe like: Need WebKit to support Unicode characters as part of attribute name? About the performance test you mentioned in previous review, any update? How about my approach by using stand-alone buffer to temporarily save entity name? Thanks! (In reply to
comment #11
)
> (From update of
attachment 17733
[details]
[edit]) > Alexey informed me that there may be some security concerns with supporting > these additional characters in tag and attribute names. I don't know the > details yet. Alexey would you be willing to comment? > > + (WebCore::HTMLTokenizer::parseEntity): Handle unicode Entity Name by > using acsii version findentity. > > "ASCII, not acssi". "version of findEntity", not "version findentity". > > "Unicode", not "unicode". > > Someone should fix the title of the bug; it no longer matches what's being > fixed here. > > + // Since the maximum length of entity name only > + // can be 9, so one char array which is allocated > + // from stack, its length is 10, should be OK. > + // Also if we have illegal character, we treat it > + // as illegal entity name. > > "maximum length is 9", not "maximum length can be 9" > > "a single char array", not "one char array" > > "on the stack", not "from stack" > > "have an illegal character", not "have illegal character" > > + char chTmpEntityNameBuffer[10]; > > We don't normally use type prefixes like "ch" in code. > > + >
(In reply to
comment #13
)
> (In reply to
comment #12
) > > (In reply to
comment #11
) > > > Alexey informed me that there may be some security concerns with supporting > > > these additional characters in tag and attribute names. I don't know the > > > details yet. Alexey would you be willing to comment? > > > > I tried to say that treating U+3000 as whitespace could be dangerous, referring > > to the comment that "some Chinese websites use one Chinese space symbol +U3000 > > as space to separate attribute name/value group" - I'm not aware of any issues > > with treating them as non-whitespace. > > > Oh, I guess it's my fault, My previous meaning was that some authors of Chinese > sites tried to use one Chinese space symbol +U3000 as space to separate > attribute name/value group, or maybe they just did not realize they used some > symbols which look like space. it does not mean WebKit need to treat them as > space. Also in my patch, I just treat those characters as normal, right? My > previous sentence just guess the motivation about why they used those strange > symbol characters in their pages. > > So are we clear? >
johnnyding
Comment 15
2007-12-06 23:36:41 PST
(In reply to
comment #13
)
> (In reply to
comment #12
) > > (In reply to
comment #11
) > > > Alexey informed me that there may be some security concerns with supporting > > > these additional characters in tag and attribute names. I don't know the > > > details yet. Alexey would you be willing to comment? > > > > I tried to say that treating U+3000 as whitespace could be dangerous, referring > > to the comment that "some Chinese websites use one Chinese space symbol +U3000 > > as space to separate attribute name/value group" - I'm not aware of any issues > > with treating them as non-whitespace. > > > Oh, I guess it's my fault, My previous meaning was that some authors of Chinese > sites tried to use one Chinese space symbol +U3000 as space to separate > attribute name/value group, or maybe they just did not realize they used some > symbols which look like space. it does not mean WebKit need to treat them as > space. Also in my patch, I just treat those characters as normal, right? My > previous sentence just guess the motivation about why they used those strange > symbol characters in their pages. > > So are we clear? >
Hi Alexey, Thanks very much for correcting me!
Darin Adler
Comment 16
2007-12-07 01:18:20 PST
Comment on
attachment 17733
[details]
patch for fixing this bug OK, r=me. We'll have to get someone to do some performance tests after landing this.
johnnyding
Comment 17
2007-12-07 11:10:18 PST
Created
attachment 17774
[details]
Patch v2. Thanks Darin! New patch is already. According to your review comments, I modified some wording and variable name based on patch 17733. Please take a look when you are available.
johnnyding
Comment 18
2007-12-10 13:46:44 PST
Hi Darin, Can I consider my patch v2 passed review? If yes, I will submit this change. Thanks!
Alexey Proskuryakov
Comment 19
2007-12-16 09:15:51 PST
Comment on
attachment 17774
[details]
Patch v2. Marking reviewed, as this is basically the same patch. Not sure how to proceed - who can run the performance tests?
Darin Adler
Comment 20
2007-12-16 09:20:59 PST
(In reply to
comment #19
)
> Not sure how to proceed - who can run the performance tests?
Probably Stephanie. You should send her a note.
Stephanie Lewis
Comment 21
2007-12-20 14:12:18 PST
Tested and Landed revision 28908
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug