The JavaScript lexer doesn't handle non-ASCII letters. This causes some JavaScriptCore Mozilla tests to fail.
Created attachment 3816 [details] patch to handle non-ASCII characters in the tokenizer
Created attachment 3817 [details] Deseret test case

Is JavaScript supposed to handle non-BMP Unicode? Cyrillic starts to work, but Deseret characters do not work even with this patch. Firefox exhibits the same behavior.
Yes, JavaScript is supposed to handle non-BMP Unicode, in a sense. Everything's defined in terms of UTF-16. But I believe there are no non-BMP characters that are legal in identifiers. The ECMA specification lists the legal character classes for identifiers. So I think this patch is completely fine.
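To illustrate the BMP case the patch is meant to fix (a sketch run in a modern engine, not part of the patch itself): Cyrillic letters fall in the Unicode letter categories that ECMA-262 permits in identifiers, so code like this should lex and run.

```javascript
// BMP non-ASCII identifier: ECMA-262's UnicodeLetter covers categories
// Lu, Ll, Lt, Lm, Lo, and Nl. Cyrillic letters are category Ll/Lu.
var счётчик = 1;       // "counter" in Russian, all category Ll
счётчик += 1;
console.log(счётчик);  // 2
```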
I'm certainly no expert here (I found the documentation via the keywords you mentioned)... But the character in my example has the "Lu" (uppercase letter) category, which Ecma-262 permits in identifiers.
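The claim can be checked directly in a modern engine using Unicode property escapes (a feature that didn't exist when this bug was filed; at the time you'd consult the Unicode character database instead): U+10400 DESERET CAPITAL LETTER LONG I really is category Lu.

```javascript
// U+10400 is outside the BMP (it needs a surrogate pair in UTF-16),
// yet its General_Category is Lu, so per the identifier grammar it
// should be legal in an identifier.
const deseret = '\u{10400}';
console.log(deseret.length);             // 2 (two UTF-16 code units)
console.log(/\p{Lu}/u.test(deseret));    // true
```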
Created attachment 3821 [details] normalization test case

One more questionable test :) Ecma-262 says:

> Two identifiers that are canonically equivalent according to the Unicode standard are not equal
> unless they are represented by the exact same sequence of code points (in other words, conforming
> ECMAScript implementations are only required to do bitwise comparison on identifiers). The intent
> is that the incoming source text has been converted to normalised form C before it reaches the
> compiler.

Well, if the source text must be normalized before it reaches the compiler, then it shouldn't matter whether decomposed or precomposed forms are used. Common sense says the same (otherwise it's too easy to make extremely hard-to-find mistakes), but this test case shows that neither Safari nor Firefox works this way.
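The distinction the spec is drawing can be seen with a precomposed vs. decomposed "é" (an illustration using the modern `String.prototype.normalize`, which didn't exist at the time): the two forms are canonically equivalent but are different code point sequences, so under bitwise comparison they would name different identifiers.

```javascript
// NFC (precomposed) vs. NFD (decomposed) forms of the same character.
const precomposed = '\u00E9';   // é as a single code point
const decomposed  = 'e\u0301';  // 'e' + U+0301 COMBINING ACUTE ACCENT

console.log(precomposed === decomposed);                   // false
console.log(decomposed.normalize('NFC') === precomposed);  // true
```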
OK. Let's file a separate bug about non-BMP characters. It's particularly awkward to handle those since JavaScript is so heavily based on UTF-16, but I'm sure we can get it to work. And let's file yet another separate bug about normalization. Also note that the patch has some unrelated changes that are not part of the fix for this bug.
I think I'd like to fix both of those issues, but let's not do them all at once. Let's break the issues from this one bug report into 5 separate bug reports: (1) non-Latin-1 BMP characters in identifiers, (2) normalization of character sequences in identifiers, (3) \u sequences in identifiers, (4) skipping Cf characters in incoming text, and (5) non-BMP characters in identifiers. This patch addresses (1), (3), and (4); it should be broken into separate pieces and landed separately, with separate tests for each one.
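For issue (3), a quick sketch of what \u sequences in identifiers mean (this is the standard ECMA-262 behavior, not anything specific to the patch): a UnicodeEscapeSequence in an identifier contributes the escaped character, so the escaped and literal spellings refer to the same binding.

```javascript
// '\u0041' spells 'A', so this declares the identifier A.
var \u0041 = 9;   // same binding as: var A = 9;
console.log(A);   // 9
```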
So this becomes a meta-bug, if I understand you correctly? I have filed separate bugs: bug 4918 for (1), bug 4919 for (2), bug 4920 for (3), and bug 4921 for (5). I couldn't find out why Cf characters should be skipped, so I didn't file a bug for that.
I was planning to use this for only issue (1), but keeping this as a "meta-bug" seems OK too, even though it's not what I had in mind. The reason we need to skip Cf characters is section 7.1 of the ECMA-262 standard: "The format control characters can occur anywhere in the source text of an ECMAScript program. These characters are removed from the source text before applying the lexical grammar."
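The rule quoted above amounts to a preprocessing pass. Here is a minimal sketch of it, using a modern `\p{Cf}` property escape for brevity (the actual lexer would consult its own Unicode tables rather than a regex):

```javascript
// Remove format-control (category Cf) characters from source text
// before lexing, per ECMA-262 3rd ed. section 7.1.
function stripFormatControl(source) {
  return source.replace(/\p{Cf}/gu, '');
}

// U+200D ZERO WIDTH JOINER is a Cf character.
console.log(stripFormatControl('var a\u200Db = 1;'));  // prints: var ab = 1;
```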
Filed (4) as bug 4931.
See also Bug 8043.
See also bug 10370.
All related bugs closed, closing umbrella bug.