RESOLVED FIXED Bug 4885
problems in Unicode handling in JavaScript parsing
https://bugs.webkit.org/show_bug.cgi?id=4885
Darin Adler
Reported 2005-09-08 08:42:25 PDT
Non-ASCII letters aren't handled by the lexer for JavaScript. This is causing some JavaScriptCore Mozilla tests to fail.
Attachments
patch to handle non-ASCII characters in the tokenizer (12.12 KB, patch)
2005-09-08 08:44 PDT, Darin Adler
no flags
Deseret test case (245 bytes, text/html)
2005-09-08 10:42 PDT, Alexey Proskuryakov
no flags
normalization test case (298 bytes, text/html)
2005-09-08 21:43 PDT, Alexey Proskuryakov
no flags
Darin Adler
Comment 1 2005-09-08 08:44:03 PDT
Created attachment 3816: patch to handle non-ASCII characters in the tokenizer
Alexey Proskuryakov
Comment 2 2005-09-08 10:42:37 PDT
Created attachment 3817: Deseret test case

Is JavaScript supposed to handle non-BMP Unicode? Cyrillic starts to work, but Deseret characters do not work even with this patch. Firefox exhibits the same behavior.
Darin Adler
Comment 3 2005-09-08 10:55:04 PDT
Yes, JavaScript is supposed to handle non-BMP Unicode, in a sense. Everything's defined in terms of UTF-16. But I believe there are no non-BMP characters that are legal in identifiers. The ECMA specification lists the legal character classes for identifiers. So I think this patch is completely fine.
Alexey Proskuryakov
Comment 4 2005-09-08 12:36:30 PDT
I'm certainly not an expert here (I found the documentation by keywords you mentioned)... But the character in my example has an "Lu" category - permitted in identifiers by Ecma-262.
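
To illustrate the point about non-BMP letters, here is a minimal sketch in modern JavaScript. It relies on features that engines of 2005 did not have (`\u{...}` escapes and `\p{...}` property escapes), and U+10400 DESERET CAPITAL LETTER LONG I is an assumed example; the attached test case may use a different Deseret letter.

```javascript
// Assumed example character: U+10400 DESERET CAPITAL LETTER LONG I.
const deseret = "\u{10400}";

// In UTF-16 the character occupies two code units (a surrogate pair).
const codeUnits = Array.from(
  { length: deseret.length },
  (_, i) => deseret.charCodeAt(i).toString(16)
);

// Taken as a whole code point, the character is an uppercase letter (Lu),
// a category that Ecma-262 permits in identifiers.
const isUppercaseLetter = /^\p{Lu}$/u.test(deseret);

// Taken one code unit at a time, the first half is a lone surrogate
// (category Cs), which is not a letter at all.
const highSurrogateIsLetter = /^\p{L}$/u.test(deseret[0]);

console.log(codeUnits, isUppercaseLetter, highSurrogateIsLetter);
```

A lexer that classifies individual UTF-16 code units would therefore reject the character unless it first combines the surrogate pair into one code point.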
Alexey Proskuryakov
Comment 5 2005-09-08 21:43:25 PDT
Created attachment 3821: normalization test case

One more questionable test :) Ecma-262 says:

> Two identifiers that are canonically equivalent according to the Unicode standard are not equal
> unless they are represented by the exact same sequence of code points (in other words, conforming
> ECMAScript implementations are only required to do bitwise comparison on identifiers). The intent
> is that the incoming source text has been converted to normalised form C before it reaches the
> compiler.

Well, if the source text must be normalized before it reaches the compiler, then it shouldn't matter whether decomposed or precomposed forms are used... Common sense says the same (otherwise, it's too easy to make mistakes that are extremely hard to find), but this test case shows that it doesn't work this way in either Safari or Firefox.
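
The difference between the two forms can be sketched with property names, which are compared code unit by code unit much as the quoted passage describes for identifiers. This is an illustration only: the letter é is an assumed example rather than the content of the attached test case, and `String.prototype.normalize` is a modern API that was not available in 2005.

```javascript
const precomposed = "\u00e9";  // é as the single code point U+00E9
const decomposed = "e\u0301";  // e followed by U+0301 COMBINING ACUTE ACCENT

// Without normalization the two spellings are different names.
const scope = {};
scope[precomposed] = "first";
scope[decomposed] = "second";
const distinctNames = Object.keys(scope).length;

// After conversion to normalization form C they are the same sequence.
const sameAfterNfc =
  precomposed.normalize("NFC") === decomposed.normalize("NFC");

console.log(distinctNames, sameAfterNfc);
```

If source text were converted to form C before lexing, both spellings would resolve to one identifier; with a plain bitwise comparison they stay distinct.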
Darin Adler
Comment 6 2005-09-08 22:19:46 PDT
OK. Let's file a separate bug about non-BMP characters. It's particularly awkward to handle those since JavaScript is so heavily based on UTF-16, but I'm sure we can get it to work. And let's file yet another separate bug about normalization. Also note, the patch has some unrelated changes that are not part of the fix to this bug.
Darin Adler
Comment 7 2005-09-08 22:26:11 PDT
I think I'd like to fix both of those issues, but let's not do them all at once. Let's break the issues from this one bug report into 5 separate bug reports: (1) non-Latin-1 BMP characters in identifiers, (2) normalization of character sequences in identifiers, (3) \u sequences in identifiers, (4) skipping Cf characters in incoming text, and (5) non-BMP characters in identifiers. This patch addresses (1), (3), and (4), and I think it should be broken into separate pieces and landed separately with separate tests for each one.
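
Issues (1) and (3) can be illustrated together with a short sketch. The identifier below is an arbitrary Cyrillic word chosen for the example and is not taken from the patch or its tests.

```javascript
// Issue (1): an identifier made of non-Latin-1 BMP letters (Cyrillic, category Ll).
var имя = "value";

// Issue (3): the same identifier spelled with \u escapes. Both spellings
// name one binding, so the comparison reads the variable declared above.
const escapedSpellingMatches = \u0438\u043c\u044f === "value";

// Assigning through the escaped spelling is visible through the literal one.
\u0438\u043c\u044f = "updated";
const literalSpellingSeesUpdate = имя === "updated";

console.log(escapedSpellingMatches, literalSpellingSeesUpdate);
```

A lexer that only accepts ASCII or Latin-1 letters, or that does not decode `\u` sequences inside identifiers, would fail to parse this script.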
Alexey Proskuryakov
Comment 8 2005-09-10 13:13:50 PDT
So, this becomes a meta-bug, did I understand you correctly? I have filed separate bugs: bug 4918 for (1), bug 4919 for (2), bug 4920 for (3), and bug 4921 for (5). I couldn't find why Cf characters should be skipped, so I didn't file a bug on that.
Darin Adler
Comment 9 2005-09-10 21:25:55 PDT
I was planning to use this for only issue (1), but keeping this as a "meta-bug" seems OK too, even though it's not what I had in mind. The reason we need to skip Cf characters is section 7.1 of the ECMA 262 standard: "The format control characters can occur anywhere in the source text of an ECMAScript program. These characters are removed from the source text before applying the lexical grammar."
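
The quoted rule can be sketched as a preprocessing step. This is a hypothetical helper written in modern JavaScript, not the code from the patch: `stripFormatControls` is an invented name, and the `\p{Cf}` property escape assumes a recent engine.

```javascript
// Hypothetical helper: remove every character in Unicode general
// category Cf (format controls) before the text reaches the lexer.
function stripFormatControls(source) {
  return source.replace(/\p{Cf}/gu, "");
}

// U+FEFF ZERO WIDTH NO-BREAK SPACE and U+200C ZERO WIDTH NON-JOINER
// are both in category Cf.
const raw = "\uFEFFvar to\u200Ctal = 1;";
const cleaned = stripFormatControls(raw);

console.log(JSON.stringify(cleaned));
```

Note that this reflects the edition of the standard quoted above; later editions of ECMA-262 no longer strip format control characters from source text, so current engines behave differently.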
Alexey Proskuryakov
Comment 10 2005-09-11 02:19:42 PDT
Filed (4) as bug 4931.
David Kilzer (:ddkilzer)
Comment 11 2006-03-28 19:29:09 PST
See also Bug 8043.
Doug Wright
Comment 12 2006-08-12 11:16:15 PDT
Gavin Barraclough
Comment 13 2011-06-13 22:39:40 PDT
All related bugs closed, closing umbrella bug.