Bug 4885 - problems in Unicode handling in JavaScript parsing
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: 420+
Hardware: Mac OS X 10.4
Importance: P2 Normal
Assignee: Geoffrey Garen
URL:
Keywords:
Depends on: 4918 4919 4920 4921 4931
Blocks:
Reported: 2005-09-08 08:42 PDT by Darin Adler
Modified: 2011-06-13 22:39 PDT


Attachments
patch to handle non-ASCII characters in the tokenizer (12.12 KB, patch)
2005-09-08 08:44 PDT, Darin Adler
Deseret test case (245 bytes, text/html)
2005-09-08 10:42 PDT, Alexey Proskuryakov
normalization test case (298 bytes, text/html)
2005-09-08 21:43 PDT, Alexey Proskuryakov

Description Darin Adler 2005-09-08 08:42:25 PDT
The JavaScript lexer doesn't handle non-ASCII letters. This is causing some of the Mozilla tests in JavaScriptCore to fail.
Comment 1 Darin Adler 2005-09-08 08:44:03 PDT
Created attachment 3816
patch to handle non-ASCII characters in the tokenizer
Comment 2 Alexey Proskuryakov 2005-09-08 10:42:37 PDT
Created attachment 3817
Deseret test case

Is JavaScript supposed to handle non-BMP Unicode? Cyrillic starts to work, but
Deseret characters do not work even with this patch.

Firefox exhibits the same behavior.
Comment 3 Darin Adler 2005-09-08 10:55:04 PDT
Yes, JavaScript is supposed to handle non-BMP Unicode, in a sense. Everything's defined in terms of 
UTF-16.

But I believe there are no non-BMP characters that are legal in identifiers. The ECMA specification lists the 
legal character classes for identifiers. So I think this patch is completely fine.
Comment 4 Alexey Proskuryakov 2005-09-08 12:36:30 PDT
I'm certainly not an expert here (I found the documentation by the keywords you mentioned)... But the
character in my example has the "Lu" category, which Ecma-262 permits in identifiers.
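
For illustration only, here is a minimal sketch of the point under discussion. It assumes a modern engine with ES2018 Unicode property escapes, which is far newer than the engines this bug was filed against, and the helper name startsWithUppercaseLetter is hypothetical. It shows that a Deseret letter takes two UTF-16 code units, is one code point, and has the Lu category.

```javascript
// U+10400 DESERET CAPITAL LETTER LONG I, general category Lu,
// written here as a UTF-16 surrogate pair.
const deseret = "\uD801\uDC00";

// A non-BMP character occupies two UTF-16 code units...
const codeUnits = deseret.length;

// ...but it is a single code point.
const codePoint = deseret.codePointAt(0);

// Hypothetical helper: does the text start with an uppercase letter (Lu)?
// The "u" flag makes the regular expression work on code points
// rather than on individual UTF-16 code units.
function startsWithUppercaseLetter(text) {
  return /^\p{Lu}/u.test(text);
}

console.log(codeUnits, codePoint.toString(16), startsWithUppercaseLetter(deseret));
```

A lexer that inspects one UTF-16 code unit at a time sees only the two surrogates, neither of which is a letter on its own, which would explain why Cyrillic works and Deseret does not.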
Comment 5 Alexey Proskuryakov 2005-09-08 21:43:25 PDT
Created attachment 3821
normalization test case

One more questionable test :)

Ecma-262 says:
>Two identifiers that are canonically equivalent according to the Unicode
>standard are not equal unless they are represented by the exact same sequence
>of code points (in other words, conforming ECMAScript implementations are only
>required to do bitwise comparison on identifiers). The intent is that the
>incoming source text has been converted to normalised form C before it reaches
>the compiler.

Well, if the source text must be normalized before it reaches the compiler,
then it shouldn't matter whether decomposed or precomposed forms are used...
Common sense says the same (otherwise, it's too easy to make mistakes that are
extremely hard to find), but this test case shows it doesn't work this way in
either Safari or Firefox.
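
The difference can be sketched with property names, which are compared the same bitwise way. This is only an illustration, not the attached test case; it assumes an engine that provides String.prototype.normalize (ES2015 or later), and the helper name sameAfterNFC is hypothetical.

```javascript
const precomposed = "\u00E9"; // "é" as a single code point
const decomposed = "e\u0301"; // "e" followed by COMBINING ACUTE ACCENT

// Without normalization the two spellings are different names,
// so they create two separate properties.
const table = {};
table[precomposed] = "first";
table[decomposed] = "second";
const distinctKeys = Object.keys(table).length;

// Hypothetical helper: compare two names after converting both to
// normalization form C.
function sameAfterNFC(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(precomposed === decomposed, distinctKeys, sameAfterNFC(precomposed, decomposed));
```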
Comment 6 Darin Adler 2005-09-08 22:19:46 PDT
OK. Let's file a separate bug about non-BMP characters. It's particularly awkward to handle those since 
JavaScript is so heavily based on UTF-16, but I'm sure we can get it to work.

And let's file yet another separate bug about normalization.

Also note, the patch has some unrelated changes that are not part of the fix to this bug.
Comment 7 Darin Adler 2005-09-08 22:26:11 PDT
I think I'd like to fix both of those issues, but let's not do them all at once.

Let's break the issues from this one bug report into 5 separate bug reports: (1) non-Latin-1 BMP 
characters in identifiers, (2) normalization of character sequences in identifiers, (3) \u sequences in 
identifiers, (4) skipping Cf characters in incoming text, and (5) non-BMP characters in identifiers.

This patch addresses (1), (3), and (4), and I think it should be broken into separate pieces and landed 
separately with separate tests for each one.
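
As a rough illustration of items (1) and (3), and not one of the tests that landed with the patch, the following sketch shows the source forms a conforming lexer is expected to accept. The variable names are made up for the example.

```javascript
// (3) A \u escape inside an identifier names the same variable as the
// literal character: \u0061 is "a", so this declares `abc`.
const \u0061bc = 1;

// (1) BMP letters outside Latin-1 (here Cyrillic) used as an identifier.
const имя = 2;

// Both variables are reachable under their plain spellings.
const total = abc + имя;

console.log(total);
```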
Comment 8 Alexey Proskuryakov 2005-09-10 13:13:50 PDT
So, this becomes a meta-bug, did I understand you correctly? I have filed separate bugs:

bug 4918 for (1)
bug 4919 for (2)
bug 4920 for (3)
bug 4921 for (5). 

I couldn't find why Cf characters should be skipped, so I didn't file a bug on that.
Comment 9 Darin Adler 2005-09-10 21:25:55 PDT
I was planning to use this for only issue (1), but keeping this as a "meta-bug" seems OK too, even though 
it's not what I had in mind.

The reason we need to skip Cf characters is section 7.1 of the ECMA 262 standard:

"The format control characters can occur anywhere in the source text of an ECMAScript program. These 
characters are removed from the source text before applying the lexical grammar."
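
A minimal sketch of what the quoted rule asks for, written as a preprocessing step over the source text. It assumes an engine with ES2018 Unicode property escapes, and the function name stripFormatControls is hypothetical; it is not the approach taken in the attached patch. As far as I know, later editions of the standard no longer strip these characters, so treat this only as an illustration of the quoted text.

```javascript
// Hypothetical helper: remove every character in Unicode general
// category Cf (format control) before the text is tokenized.
function stripFormatControls(source) {
  return source.replace(/\p{Cf}/gu, "");
}

// U+200D ZERO WIDTH JOINER and U+FEFF (byte order mark) are both Cf.
const raw = "\uFEFFva\u200Dr x = 1;";
const cleaned = stripFormatControls(raw);

console.log(JSON.stringify(cleaned));
```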
Comment 10 Alexey Proskuryakov 2005-09-11 02:19:42 PDT
Filed (4) as bug 4931.
Comment 11 David Kilzer (:ddkilzer) 2006-03-28 19:29:09 PST
See also Bug 8043.
Comment 12 Doug Wright 2006-08-12 11:16:15 PDT
See bug 10370
Comment 13 Gavin Barraclough 2011-06-13 22:39:40 PDT
All related bugs closed, closing umbrella bug.