tests to fail.
Created attachment 3816 [details]
patch to handle non-ASCII characters in the tokenizer
Created attachment 3817 [details]
Deseret test case
Deseret characters do not work even with this patch.
Firefox exhibits the same behavior.
But I believe there are no non-BMP characters that are legal in identifiers. The ECMA specification lists the
legal character classes for identifiers. So I think this patch is completely fine.
I'm certainly not an expert here (I found the documentation by keywords you mentioned)... But the
character in my example has an "Lu" category - permitted in identifiers by Ecma-262.
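For reference, a minimal sketch of how the category claim can be checked. It assumes U+10400 (DESERET CAPITAL LETTER LONG I) as a stand-in, since the exact character in the attached test case is not quoted here. It also shows why such a character is awkward for a tokenizer that reads 16-bit code units: it arrives as a surrogate pair.

```python
import unicodedata

# Assumption: U+10400 stands in for the character used in the attached test case.
ch = "\U00010400"

category = unicodedata.category(ch)   # Unicode general category of the character
is_non_bmp = ord(ch) > 0xFFFF         # True when outside the Basic Multilingual Plane

# Split the big-endian UTF-16 encoding (no BOM) into 16-bit code units.
utf16 = ch.encode("utf-16-be")
code_units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

print(category, is_non_bmp, [hex(u) for u in code_units])
```

A tokenizer that classifies one 16-bit unit at a time sees two surrogates here, neither of which is a letter on its own.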
Created attachment 3821 [details]
normalization test case
One more questionable test :)
>Two identifiers that are canonically equivalent according to the Unicode standard are not equal
>unless they are represented by the exact same sequence of code points (in other words, conforming
>ECMAScript implementations are only required to do bitwise comparison on identifiers). The intent
>is that the incoming source text has been converted to normalised form C before it reaches the
Well, if the source text must be normalized before it reaches the compiler,
then it shouldn't matter whether decomposed or precomposed forms are used.
Common sense says the same (otherwise it's too easy to make mistakes that are
extremely hard to find), but this test case shows that it doesn't work this way
in either Safari or Firefox.
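To make the distinction concrete, here is a small sketch (the identifier spelling is made up for illustration, not taken from the attached test case). Two spellings that are canonically equivalent differ under code-point comparison and only become equal once both are converted to normalization form C.

```python
import unicodedata

precomposed = "caf\u00e9"   # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Code-point comparison, which is all the quoted text requires of an implementation.
same_code_points = precomposed == decomposed

# Comparison after converting both spellings to normalization form C.
same_after_nfc = (unicodedata.normalize("NFC", precomposed)
                  == unicodedata.normalize("NFC", decomposed))

print(same_code_points, same_after_nfc)
```

So an engine that compares identifiers code point by code point treats the two spellings as different names unless something normalizes the source first.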
OK. Let's file a separate bug about non-BMP characters. It's particularly awkward to handle those since
And let's file yet another separate bug about normalization.
Also note, the patch has some unrelated changes that are not part of the fix to this bug.
I think I'd like to fix both of those issues, but let's not do them all at once.
Let's break the issues from this one bug report into 5 separate bug reports: (1) non-Latin-1 BMP
characters in identifiers, (2) normalization of character sequences in identifiers, (3) \u sequences in
identifiers, (4) skipping Cf characters in incoming text, and (5) non-BMP characters in identifiers.
This patch addresses (1), (3), and (4), and I think it should be broken into separate pieces and landed
separately with separate tests for each one.
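As a rough illustration of issues (1) and (3), the sketch below classifies identifier characters by Unicode general category, following the character classes in ECMA-262 (3rd edition), and decodes \uXXXX escapes first. This is not the code from the patch; the function names are hypothetical, and reserved words are ignored.

```python
import re
import unicodedata

# General categories that ECMA-262 (3rd edition) groups under UnicodeLetter.
LETTER = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Additional categories allowed after the first character:
# combining marks, decimal digits, connector punctuation.
PART_EXTRA = {"Mn", "Mc", "Nd", "Pc"}

def decode_unicode_escapes(text):
    # Replace each backslash-u escape with the character it names (issue 3).
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)

def is_identifier_start(c):
    return c in ("$", "_") or unicodedata.category(c) in LETTER

def is_identifier_part(c):
    return is_identifier_start(c) or unicodedata.category(c) in PART_EXTRA

def is_identifier(text):
    # Hypothetical helper: True if `text` is a single identifier once escapes
    # are decoded. Reserved words are not checked.
    name = decode_unicode_escapes(text)
    return (bool(name)
            and is_identifier_start(name[0])
            and all(is_identifier_part(c) for c in name[1:]))

print(is_identifier("\u043f\u0440\u0438\u0432\u0435\u0442"))  # Cyrillic, non-Latin-1 BMP
print(is_identifier("\\u0061bc"))                             # escape for 'a' plus "bc"
```

Decoding escapes before classifying keeps the two concerns separable, which matches the suggestion to land and test them independently.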
So, this becomes a meta-bug, did I understand you correctly? I have filed separate bugs:
bug 4918 for (1)
bug 4919 for (2)
bug 4920 for (3)
bug 4921 for (5).
I couldn't find why Cf characters should be skipped, so I didn't file a bug on that.
I was planning to use this for only issue (1), but keeping this as a "meta-bug" seems OK too, even though
it's not what I had in mind.
The reason we need to skip Cf characters is section 7.1 of the ECMA-262 standard:
"The format control characters can occur anywhere in the source text of an ECMAScript program. These
characters are removed from the source text before applying the lexical grammar."
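A minimal sketch of what that requirement amounts to, assuming "format control characters" means everything in Unicode general category Cf. The function name is hypothetical and this is not the code from the patch.

```python
import unicodedata

def strip_format_controls(source):
    # Drop every character whose Unicode general category is Cf,
    # before the text is handed to the lexer.
    return "".join(c for c in source if unicodedata.category(c) != "Cf")

# U+200C ZERO WIDTH NON-JOINER and U+FEFF ZERO WIDTH NO-BREAK SPACE are both Cf.
raw = "va\u200cr x\ufeff = 1;"
cleaned = strip_format_controls(raw)
print(cleaned)
```

Doing this as a pass over the source before tokenizing means the keyword `var` above is still recognized even though a format control character was embedded in it.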
Filed (4) as bug 4931.
See also Bug 8043.
See bug 10370.
All related bugs closed, closing umbrella bug.