16179 – HTML parser should allow non-ASCII characters in tag and attribute names

RESOLVED FIXED 16179

HTML parser should allow non-ASCII characters in tag and attribute names

https://bugs.webkit.org/show_bug.cgi?id=16179

Summary HTML parser should allow non-ASCII characters in tag and attribute names

johnnyding

Reported 2007-11-28 15:11:59 PST

in HTML spec, any attribute should be one of basic HTML data typwe: NAME tokens, which must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods ("."). However in WebKit's HTMLTokenizer, it did not check the composed characters of attribute name whether follow HTML spec or not, it just cut off the UChar's high 8bits and assign the low 8 bits to attribute name buffer which is a char buffer and is used to gather attribute name characters and generate final attribute atomicstring name. (section:case AttributeName, func:HTMLTokenizer::parseTag, file:Webkit\html\HTMLTokenizer.cpp) So if any attribute name start with a Unicode which like #xx00, then finally the attribute name buffer will get data like #00 #xx ..., which cause current attribute name will be a empty atomicstring, then section:case QuotedValue, empty attribute name cause attribute name is same with attribute value which is CDATA type and maybe contain some characters which are illegal in attribute name , however the function Token::addAttribute will check the attribute name must not contain '/'. if the attribute value is URL, then we got assert failed. The following is a testcase. some Chinese websites use one Chinese space symbol +U3000 as space to separate attribute name/value group, then cause WebKit got assert failed. For fixing this problem, I think 1) we may change the temporary attribute name/value buffer cBuffer as UChar buffer, of course, some other code need to be changed. 2) detect the illegal character and discard it.

Attachments
some Chinese websites use one Chinese space symbol +U3000 as space to separate attribute name/value group, then cause WebKit got assert failed (424 bytes, text/html) 2007-11-28 15:13 PST, johnnyding	no flags	Details
same test case encoding by unicode (1.06 KB, text/html) 2007-11-28 17:17 PST, johnnyding	no flags	Details
dump attributs of one html element which has illegal characters in attribute name (1.40 KB, text/html) 2007-12-04 11:40 PST, johnnyding	no flags	Details
patch for fixing this bug (8.90 KB, patch) 2007-12-05 16:13 PST, johnnyding	darin: review-	Details Formatted Diff Diff
patch for fixing this bug (8.30 KB, patch) 2007-12-05 18:40 PST, johnnyding	darin: review+	Details Formatted Diff Diff
Patch v2. (8.30 KB, patch) 2007-12-07 11:10 PST, johnnyding	ap: review+	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

johnnyding

Comment 1 2007-11-28 15:13:20 PST

Created attachment 17582 [details] some Chinese websites use one Chinese space symbol +U3000 as space to separate attribute name/value group, then cause WebKit got assert failed

johnnyding

Comment 2 2007-11-28 17:17:52 PST

Created attachment 17583 [details] same test case encoding by unicode

David Kilzer (:ddkilzer)

Comment 3 2007-11-28 23:53:23 PST

<rdar://problem/5619399>

johnnyding

Comment 4 2007-11-29 15:04:12 PST

Hi David, is that mean the bug has been upstreamed to apple and apple engineer will handle this, or they had found this bug before?

David Kilzer (:ddkilzer)

Comment 5 2007-12-02 09:37:14 PST

(In reply to comment #4) > Hi David, is that mean the bug has been upstreamed to apple and apple engineer > will handle this, or they had found this bug before? A new Radar has been created for tracking purposes. There are no guarantees about who will work on this or when it will get fixed. Patches from the open source community are always welcome!

johnnyding

Comment 6 2007-12-04 11:39:04 PST

Great, I have a patch for this problem, I will upload it later and let you guy review it. For fixing this problem, I test it in IE6/7, Firefox 2/3, Opera9. They have same behavior about this problem, safari got crash. The characters which should be illegal in attribute name according HTML spec are in attribute name will be treated as legal characters, they are part of attribute name. Please check the following test case.

johnnyding

Comment 7 2007-12-04 11:40:38 PST

Created attachment 17705 [details] dump attributs of one html element which has illegal characters in attribute name

johnnyding

Comment 8 2007-12-05 16:13:57 PST

Created attachment 17727 [details] patch for fixing this bug

Darin Adler

Comment 9 2007-12-05 16:25:39 PST

Comment on attachment 17727 [details] patch for fixing this bug + // from statck, its' length is 10, should be OK. Misspellings here: statck should be stack and its' should be its. + unsigned int testedEntityNameLen = 0; We normally delcar these as unsigned. + if (cBuffer[testedEntityNameLen] > 0xff) Seems to me we could use a smaller cutoff. Only ASCII characters can be in the entity names, so the cutoff could be 7F or 7E. + break; + else + chTmpEntityNameBuffer[testedEntityNameLen] = cBuffer[testedEntityNameLen]; No need for an else after a break; we don't like to nest code like this. The layout test should not be in the webarchive directory -- it's not a test of web archives. It should go into the fast/parser directory since it's a test of the HTML parser. Also, the test should use "dumpAsText()" so it can work cross-platform. Since this is super-hot code, we need to do some performance testing to make sure it doesn't slow things down.

johnnyding

Comment 10 2007-12-05 18:40:21 PST

Created attachment 17733 [details] patch for fixing this bug Hi Darin. Thanks for your comment. Please have another look of my new patch. About the performance test. Anything need I add to patch? Also I think since only ASCII characters can be as entity name, if I use one stand-alone char buffer to temporarily save and deal with entity name, it should not effect the performance, right? Thanks!

Darin Adler

Comment 11 2007-12-06 14:44:44 PST

Comment on attachment 17733 [details] patch for fixing this bug Alexey informed me that there may be some security concerns with supporting these additional characters in tag and attribute names. I don't know the details yet. Alexey would you be willing to comment? + (WebCore::HTMLTokenizer::parseEntity): Handle unicode Entity Name by using acsii version findentity. "ASCII, not acssi". "version of findEntity", not "version findentity". "Unicode", not "unicode". Someone should fix the title of the bug; it no longer matches what's being fixed here. + // Since the maximum length of entity name only + // can be 9, so one char array which is allocated + // from stack, its length is 10, should be OK. + // Also if we have illegal character, we treat it + // as illegal entity name. "maximum length is 9", not "maximum length can be 9" "a single char array", not "one char array" "on the stack", not "from stack" "have an illegal character", not "have illegal character" + char chTmpEntityNameBuffer[10]; We don't normally use type prefixes like "ch" in code. +

Alexey Proskuryakov

Comment 12 2007-12-06 22:37:25 PST

(In reply to comment #11) > Alexey informed me that there may be some security concerns with supporting > these additional characters in tag and attribute names. I don't know the > details yet. Alexey would you be willing to comment? I tried to say that treating U+3000 as whitespace could be dangerous, referring to the comment that "some Chinese websites use one Chinese space symbol +U3000 as space to separate attribute name/value group" - I'm not aware of any issues with treating them as non-whitespace.

johnnyding

Comment 13 2007-12-06 23:20:31 PST

(In reply to comment #12) > (In reply to comment #11) > > Alexey informed me that there may be some security concerns with supporting > > these additional characters in tag and attribute names. I don't know the > > details yet. Alexey would you be willing to comment? > > I tried to say that treating U+3000 as whitespace could be dangerous, referring > to the comment that "some Chinese websites use one Chinese space symbol +U3000 > as space to separate attribute name/value group" - I'm not aware of any issues > with treating them as non-whitespace. > Oh, I guess it's my fault, My previous meaning was that some authors of Chinese sites tried to use one Chinese space symbol +U3000 as space to separate attribute name/value group, or maybe they just did not realize they used some symbols which look like space. it does not mean WebKit need to treat them as space. Also in my patch, I just treat those characters as normal, right? My previous sentence just guess the motivation about why they used those strange symbol characters in their pages. So are we clear?

johnnyding

Comment 14 2007-12-06 23:32:08 PST

Sorry for my pool wording, I will fix that and keep it in my mind!:-) The new patch will uploaded soon. I agree we should change the title to avoid misunderstanding. maybe like: Need WebKit to support Unicode characters as part of attribute name? About the performance test you mentioned in previous review, any update? How about my approach by using stand-alone buffer to temporarily save entity name? Thanks! (In reply to comment #11) > (From update of attachment 17733 [details] [edit]) > Alexey informed me that there may be some security concerns with supporting > these additional characters in tag and attribute names. I don't know the > details yet. Alexey would you be willing to comment? > > + (WebCore::HTMLTokenizer::parseEntity): Handle unicode Entity Name by > using acsii version findentity. > > "ASCII, not acssi". "version of findEntity", not "version findentity". > > "Unicode", not "unicode". > > Someone should fix the title of the bug; it no longer matches what's being > fixed here. > > + // Since the maximum length of entity name only > + // can be 9, so one char array which is allocated > + // from stack, its length is 10, should be OK. > + // Also if we have illegal character, we treat it > + // as illegal entity name. > > "maximum length is 9", not "maximum length can be 9" > > "a single char array", not "one char array" > > "on the stack", not "from stack" > > "have an illegal character", not "have illegal character" > > + char chTmpEntityNameBuffer[10]; > > We don't normally use type prefixes like "ch" in code. > > + > (In reply to comment #13) > (In reply to comment #12) > > (In reply to comment #11) > > > Alexey informed me that there may be some security concerns with supporting > > > these additional characters in tag and attribute names. I don't know the > > > details yet. Alexey would you be willing to comment? > > > > I tried to say that treating U+3000 as whitespace could be dangerous, referring > > to the comment that "some Chinese websites use one Chinese space symbol +U3000 > > as space to separate attribute name/value group" - I'm not aware of any issues > > with treating them as non-whitespace. > > > Oh, I guess it's my fault, My previous meaning was that some authors of Chinese > sites tried to use one Chinese space symbol +U3000 as space to separate > attribute name/value group, or maybe they just did not realize they used some > symbols which look like space. it does not mean WebKit need to treat them as > space. Also in my patch, I just treat those characters as normal, right? My > previous sentence just guess the motivation about why they used those strange > symbol characters in their pages. > > So are we clear? >

johnnyding

Comment 15 2007-12-06 23:36:41 PST

(In reply to comment #13) > (In reply to comment #12) > > (In reply to comment #11) > > > Alexey informed me that there may be some security concerns with supporting > > > these additional characters in tag and attribute names. I don't know the > > > details yet. Alexey would you be willing to comment? > > > > I tried to say that treating U+3000 as whitespace could be dangerous, referring > > to the comment that "some Chinese websites use one Chinese space symbol +U3000 > > as space to separate attribute name/value group" - I'm not aware of any issues > > with treating them as non-whitespace. > > > Oh, I guess it's my fault, My previous meaning was that some authors of Chinese > sites tried to use one Chinese space symbol +U3000 as space to separate > attribute name/value group, or maybe they just did not realize they used some > symbols which look like space. it does not mean WebKit need to treat them as > space. Also in my patch, I just treat those characters as normal, right? My > previous sentence just guess the motivation about why they used those strange > symbol characters in their pages. > > So are we clear? > Hi Alexey, Thanks very much for correcting me!

Darin Adler

Comment 16 2007-12-07 01:18:20 PST

Comment on attachment 17733 [details] patch for fixing this bug OK, r=me. We'll have to get someone to do some performance tests after landing this.

johnnyding

Comment 17 2007-12-07 11:10:18 PST

Created attachment 17774 [details] Patch v2. Thanks Darin! New patch is already. According to your review comments, I modified some wording and variable name based on patch 17733. Please take a look when you are available.

johnnyding

Comment 18 2007-12-10 13:46:44 PST

Hi Darin, Can I consider my patch v2 passed review? If yes, I will submit this change. Thanks!

Alexey Proskuryakov

Comment 19 2007-12-16 09:15:51 PST

Comment on attachment 17774 [details] Patch v2. Marking reviewed, as this is basically the same patch. Not sure how to proceed - who can run the performance tests?

Darin Adler

Comment 20 2007-12-16 09:20:59 PST

(In reply to comment #19) > Not sure how to proceed - who can run the performance tests? Probably Stephanie. You should send her a note.

Stephanie Lewis

Comment 21 2007-12-20 14:12:18 PST

Tested and Landed revision 28908

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware PC

OS Windows XP

Product WebKit

Component Page Loading

Assignee

Nobody

Reported

2007-11-28 15:11 PST

Modified

2007-12-20 14:12 PST History

CC List

4 users Show

URL

Keywords InRadar

Depends on

Blocks