4885 – problems in Unicode handling in JavaScript parsing

RESOLVED FIXED 4885

https://bugs.webkit.org/show_bug.cgi?id=4885

Darin Adler

Reported 2005-09-08 08:42:25 PDT

Non-ASCII letters aren't handled by the lexer for JavaScript. This is causing some JavaScriptCore Mozilla tests to fail.

Attachments
patch to handle non-ASCII characters in the tokenizer (12.12 KB, patch), 2005-09-08 08:44 PDT, Darin Adler
Deseret test case (245 bytes, text/html), 2005-09-08 10:42 PDT, Alexey Proskuryakov
normalization test case (298 bytes, text/html), 2005-09-08 21:43 PDT, Alexey Proskuryakov

Darin Adler

Comment 1 2005-09-08 08:44:03 PDT

Created attachment 3816: patch to handle non-ASCII characters in the tokenizer

Alexey Proskuryakov

Comment 2 2005-09-08 10:42:37 PDT

Created attachment 3817: Deseret test case

Is JavaScript supposed to handle non-BMP Unicode? Cyrillic starts to work, but Deseret characters do not work even with this patch. Firefox exhibits the same behavior.

Darin Adler

Comment 3 2005-09-08 10:55:04 PDT

Yes, JavaScript is supposed to handle non-BMP Unicode, in a sense. Everything's defined in terms of UTF-16. But I believe there are no non-BMP characters that are legal in identifiers. The ECMA specification lists the legal character classes for identifiers. So I think this patch is completely fine.
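
For reference, the character classes mentioned above can be sketched in a few lines of Python. This is only an illustration of the general categories that ECMA-262 (3rd edition, section 7.6) allows in identifiers, not the code from the attached patch. The function names are hypothetical, reserved words and \u escapes are ignored, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# General categories grouped under UnicodeLetter in ECMA-262 3rd edition, section 7.6.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Extra categories allowed after the first character:
# combining marks (Mn, Mc), digits (Nd), connector punctuation (Pc).
PART_CATEGORIES = LETTER_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_identifier_start(ch):
    """True if ch may begin an identifier (hypothetical helper)."""
    return ch in ("$", "_") or unicodedata.category(ch) in LETTER_CATEGORIES


def is_identifier_part(ch):
    """True if ch may appear after the first character (hypothetical helper)."""
    return ch in ("$", "_") or unicodedata.category(ch) in PART_CATEGORIES


def is_identifier(name):
    """Check a whole name, one Python character (code point) at a time."""
    return (
        bool(name)
        and is_identifier_start(name[0])
        and all(is_identifier_part(ch) for ch in name[1:])
    )


print(is_identifier("переменная"))  # Cyrillic letters
print(is_identifier("1abc"))        # starts with a digit
```

Note that Python strings are indexed by code point, whereas a lexer working on UTF-16 sees code units, so this sketch sidesteps the surrogate-pair question entirely.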

Alexey Proskuryakov

Comment 4 2005-09-08 12:36:30 PDT

I'm certainly not an expert here (I found the documentation by keywords you mentioned)... But the character in my example has an "Lu" category - permitted in identifiers by Ecma-262.
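
To make the point concrete, here is a small Python check. It assumes U+10400 (DESERET CAPITAL LETTER LONG I) as the example character; the character actually used in the attached test case is not stated in this report. The sketch shows that the character has category Lu, and that its UTF-16 form is a surrogate pair whose individual code units have category Cs. A lexer that classifies one UTF-16 code unit at a time would therefore not see a letter, which is one plausible reason the test fails.

```python
import unicodedata

ch = "\U00010400"  # assumed example: a Deseret capital letter outside the BMP

# General category of the full code point.
category = unicodedata.category(ch)

# Split the UTF-16 encoding into 16-bit code units.
encoded = ch.encode("utf-16-be")
units = [int.from_bytes(encoded[i:i + 2], "big") for i in range(0, len(encoded), 2)]

# General category of each code unit taken on its own.
unit_categories = [unicodedata.category(chr(u)) for u in units]

print(category)
print([hex(u) for u in units])
print(unit_categories)
```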

Alexey Proskuryakov

Comment 5 2005-09-08 21:43:25 PDT

Created attachment 3821: normalization test case

One more questionable test :) Ecma-262 says:

>Two identifiers that are canonically equivalent according to the Unicode standard are not equal
>unless they are represented by the exact same sequence of code points (in other words, conforming
>ECMAScript implementations are only required to do bitwise comparison on identifiers). The intent
>is that the incoming source text has been converted to normalised form C before it reaches the
>compiler.

Well, if the source text must be normalized before it reaches the compiler, then it shouldn't matter whether decomposed or precomposed forms are used... Common sense says the same (otherwise, it's too easy to make extremely hard-to-find mistakes), but this test case shows it doesn't work this way in either Safari or Firefox.
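
The difference between the two comparisons can be sketched in Python. The identifier "café" is a made-up example, not the content of the attached test case. The precomposed and decomposed spellings differ as code point sequences, but become identical once both are converted to normalization form C.

```python
import unicodedata

precomposed = "caf\u00e9"   # é as a single code point (U+00E9)
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

# Bitwise comparison, which is all the quoted text requires of an implementation.
equal_bitwise = precomposed == decomposed

# Comparison after converting both spellings to normalization form C.
equal_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(equal_bitwise)
print(equal_after_nfc)
```

If the source text were normalized before lexing, the two spellings would name the same variable; without that step they name two different ones.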

Darin Adler

Comment 6 2005-09-08 22:19:46 PDT

OK. Let's file a separate bug about non-BMP characters. It's particularly awkward to handle those since JavaScript is so heavily based on UTF-16, but I'm sure we can get it to work. And let's file yet another separate bug about normalization. Also note, the patch has some unrelated changes that are not part of the fix to this bug.

Darin Adler

Comment 7 2005-09-08 22:26:11 PDT

I think I'd like to fix both of those issues, but let's not do them all at once. Let's break the issues from this one bug report into 5 separate bug reports: (1) non-Latin-1 BMP characters in identifiers, (2) normalization of character sequences in identifiers, (3) \u sequences in identifiers, (4) skipping Cf characters in incoming text, and (5) non-BMP characters in identifiers. This patch addresses (1), (3), and (4), and I think it should be broken into separate pieces and landed separately with separate tests for each one.
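
As an illustration of item (3), a \uXXXX sequence inside an identifier stands for the character it names, so a\u0062c and abc are the same identifier. The Python sketch below shows that decoding step only. The helper name is hypothetical, it is not taken from the patch, and it does not check that the decoded character is actually legal in an identifier, which the specification also requires.

```python
import re

# Matches a backslash, the letter u, and exactly four hexadecimal digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_identifier_escapes(raw):
    """Replace each \\uXXXX sequence with the UTF-16 code unit it names.

    Hypothetical helper: no validation of the resulting characters is done.
    """
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw)


print(decode_identifier_escapes(r"a\u0062c"))
```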

Alexey Proskuryakov

Comment 8 2005-09-10 13:13:50 PDT

So this becomes a meta-bug; did I understand you correctly? I have filed separate bugs: bug 4918 for (1), bug 4919 for (2), bug 4920 for (3), and bug 4921 for (5). I couldn't find why Cf characters should be skipped, so I didn't file a bug on that.

Darin Adler

Comment 9 2005-09-10 21:25:55 PDT

I was planning to use this for only issue (1), but keeping this as a "meta-bug" seems OK too, even though it's not what I had in mind. The reason we need to skip Cf characters is section 7.1 of the ECMA 262 standard: "The format control characters can occur anywhere in the source text of an ECMAScript program. These characters are removed from the source text before applying the lexical grammar."
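
The rule quoted above amounts to a filtering pass that runs before tokenizing. A minimal Python sketch, assuming the whole source is available as one string (the function name is hypothetical and this is not how the patch implements it):

```python
import unicodedata


def strip_format_controls(source):
    """Drop every character whose general category is Cf (format control)."""
    return "".join(ch for ch in source if unicodedata.category(ch) != "Cf")


# U+200C ZERO WIDTH NON-JOINER and U+FEFF are both in category Cf.
sample = "\ufeffva\u200cr x = 1;"
print(strip_format_controls(sample))
```

A real lexer would more likely skip these characters as it reads the input rather than build a filtered copy; the copy is used here only to keep the example short.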

Alexey Proskuryakov

Comment 10 2005-09-11 02:19:42 PDT

Filed (4) as bug 4931.

David Kilzer (:ddkilzer)

Comment 11 2006-03-28 19:29:09 PST