Bug 4919 - Normalization of character sequences in identifiers
Status: RESOLVED WONTFIX
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: 420+
Hardware: Mac OS X 10.4
: P2 Normal
Assignee: Geoffrey Garen
URL:
Keywords:
Depends on:
Blocks: 4885
Reported: 2005-09-10 12:59 PDT by Alexey Proskuryakov
Modified: 2012-09-25 13:45 PDT

Attachments
test case (714 bytes, text/html)
2005-09-10 13:00 PDT, Alexey Proskuryakov

Description Alexey Proskuryakov 2005-09-10 12:59:47 PDT
Ecma-262 says:
>Two identifiers that are canonically equivalent according to the Unicode
>standard are not equal unless they are represented by the exact same 
>sequence of code points (in other words, conforming
>ECMAScript implementations are only required to do bitwise comparison on
>identifiers) . The intent is that the incoming source text has been converted 
>to normalised form C before it reaches the compiler. 

Well, if the source text must be normalized before it reaches the compiler,
then it shouldn't matter whether decomposed or precomposed forms are used.
Common sense says the same (otherwise, it's too easy to make extremely
hard-to-find mistakes), but this test case shows that it doesn't work this way
in either Safari or Firefox.
Comment 1 Alexey Proskuryakov 2005-09-10 13:00:11 PDT
Created attachment 3847
test case
Comment 2 Eric Seidel (no email) 2010-11-08 11:31:52 PST
Minefield fails too. Is this still considered a bug?
Comment 3 Alexey Proskuryakov 2010-11-08 12:57:41 PST
Likely so. The current situation is just an invitation for IOCCC-style contests.

What does ECMAScript 5 say?
Comment 4 Gavin Barraclough 2011-06-13 22:38:59 PDT
ES5 further clarifies this point, see section 6:

"Conforming ECMAScript implementations are not required to perform any normalization of text, or behave as though they were performing normalization of text, themselves."

The intention seems very clear here - we should not introduce an unreasonable computational burden in lexing.  Sources served to the browser should be in a normalized form.  If this is overly onerous on developers working in languages affected by this requirement, then I'd suggest filing a bug against the ECMA spec.

We are in compliance with the spec here; this bug is invalid.
Comment 5 Alexey Proskuryakov 2011-06-13 23:15:16 PDT
Even if the spec said so, it would have been simply wrong. But it actually avoids answering the question by magically separating "the compiler" from other parts of JavaScript implementation. See e.g. section 7.6: "The intent is that the incoming source text has been converted to normalised form C before it reaches the compiler."

Whether normalization is in JavaScriptCore or in WebCore, treating canonically equivalent strings as equal is a fundamental feature of the Unicode specification.
Comment 6 Gavin Barraclough 2011-06-14 02:34:11 PDT
(In reply to comment #5)
> Even if the spec said so, it would have been simply wrong. But it actually avoids answering the question by magically separating "the compiler" from other parts of JavaScript implementation. See e.g. section 7.6: "The intent is that the incoming source text has been converted to normalised form C before it reaches the compiler."

I think that it is important to read this sentence from the spec in its full context:

"Two IdentifierName that are canonically equivalent according to the Unicode standard are not equal unless they are represented by the exact same sequence of code units (in other words, conforming ECMAScript implementations are only required to do bitwise comparison on IdentifierName values). The intent is that the incoming source text has been converted to normalised form C before it reaches the compiler."

The spec explicitly recognizes that non-canonical input may be passed to the lexer, and explicitly states what the correct behaviour is in this case: the two IdentifierNames are not equal.  This refutes any notion that there is an expectation that the source text should be converted into a normalized form (in fact, this is explicitly stated in chapter 6 - "Conforming ECMAScript implementations are not required to perform any normalisation of text, or behave as though they were performing normalisation of text, themselves.").

It does not seem likely that the ES5 authors' intention in the last sentence of this paragraph was to contradict the behaviour specified in the first two sentences, or in chapter 6.  Rather than a definition of behaviour, this is clearly a statement of intent (it does begin, "The intent is that ...").  My reading is that, by defining the language not to normalize, they intend to encourage people to serve source text in a canonicalized form.

> Whether normalization is in JavaScriptCore or in WebCore, treating canonically equivalent strings as equal is a fundamental feature of the Unicode specification.

This argument seems to be predicated on the assumption that ES5 source text can be treated as a string of Unicode text - which is incorrect.  ES5 source text is explicitly defined to be a sequence of UTF-16 code units (see chapter 6), and equality of identifiers within ES5 source text is explicitly defined to be a bitwise comparison.
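
To make the bitwise comparison concrete, here is a minimal sketch. The helper name codeUnits and the choice of é as the sample character are assumptions made for illustration, and String.prototype.normalize comes from a later edition of the language (ES2015), so it was not available to ES5 engines at the time of this discussion.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as 4-digit hex.
function codeUnits(text) {
  const units = [];
  for (let i = 0; i < text.length; i++) {
    units.push(text.charCodeAt(i).toString(16).padStart(4, "0"));
  }
  return units;
}

const precomposedE = "\u00e9";  // é as a single code unit (NFC form)
const decomposedE = "e\u0301";  // e followed by COMBINING ACUTE ACCENT (NFD form)

// Canonically equivalent per Unicode, but different code unit sequences.
const unitsPrecomposed = codeUnits(precomposedE); // ["00e9"]
const unitsDecomposed = codeUnits(decomposedE);   // ["0065", "0301"]

// A bitwise (code unit) comparison treats them as different...
const equalBitwise = precomposedE === decomposedE;

// ...while comparing after normalization treats them as the same.
const equalAfterNfc =
  precomposedE.normalize("NFC") === decomposedE.normalize("NFC");

console.log(unitsPrecomposed, unitsDecomposed, equalBitwise, equalAfterNfc);
```

The same distinction applies to identifiers: under the bitwise rule, two names are the same only if their code unit lists are identical.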
Comment 7 Gavin Barraclough 2011-06-14 02:34:18 PDT
Also, if you preprocess the source to canonicalize, you may introduce incorrect behaviour!

Consider the following example:

var NAME1 = 1;
var NAME2 = 2;
alert(NAME1 + NAME2);

In a conforming ES5 implementation the result of the add should be 3, provided that NAME1 and NAME2 are two distinct IdentifierNames according to the spec's definition - that is, any two identifiers that are not the exact same sequence of Code Units (even if they are canonically equivalent according to the Unicode standard).  However, if you were to preprocess the source to canonicalize, you may make NAME1 and NAME2 equal, and as a result the add would yield the result 4.
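
A runnable sketch of this scenario follows. It assumes NAME1 is the precomposed é and NAME2 is the decomposed form (any canonically equivalent pair would do), uses String.prototype.normalize (ES2015, so newer than ES5) to stand in for the hypothetical preprocessing step, and uses new Function only as a convenient way to evaluate generated source.

```javascript
// Two canonically equivalent but bitwise-distinct identifier names (assumed for illustration).
const name1 = "\u00e9";   // precomposed é
const name2 = "e\u0301";  // decomposed é

// Build the example program; `return` replaces alert() so the result can be inspected.
const programSource =
  "var " + name1 + " = 1; " +
  "var " + name2 + " = 2; " +
  "return " + name1 + " + " + name2 + ";";

// Evaluated as written: two distinct variables, so 1 + 2.
const resultAsWritten = new Function(programSource)();

// Evaluated after a canonicalizing preprocess: both names collapse into one
// variable, which ends up holding 2, so the sum becomes 2 + 2.
const resultAfterNfc = new Function(programSource.normalize("NFC"))();

console.log(resultAsWritten, resultAfterNfc);
```

In an engine that compares identifiers by code units, the first run should give 3 and the second 4, which is the behaviour change described above.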
Comment 8 Alexey Proskuryakov 2011-06-14 08:59:53 PDT
As long as JavaScript spec claims conformance to Unicode, it simply cannot perform bitwise comparisons on strings.

You certainly don't want "var é=1; alert(é)" to alert "undefined" simply because the source has been edited in different editors, which use different normalization forms. This kind of mistake is nearly impossible to spot, and it's the job of the browser to ensure Unicode conformance by not making a difference between normalization forms.
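
The failure mode can be sketched as follows. The helper typeofAfterDeclaring is hypothetical; it builds a tiny program from two spellings of the name, with new Function standing in for a script that was edited in two different editors, and returns typeof instead of calling alert().

```javascript
const precomposedName = "\u00e9";  // é as one code unit
const decomposedName = "e\u0301";  // e + combining acute accent

// Hypothetical helper: declare a variable under one spelling, read it under another.
function typeofAfterDeclaring(declaredName, readName) {
  const body = "var " + declaredName + " = 1; return typeof " + readName + ";";
  return new Function(body)();
}

// Same spelling on both sides: the variable is found.
const sameForm = typeofAfterDeclaring(precomposedName, precomposedName);

// Different normalization forms: an engine that compares identifiers
// bitwise sees a second, undeclared variable.
const mixedForms = typeofAfterDeclaring(precomposedName, decomposedName);

console.log(sameForm, mixedForms);
```

In an engine that follows the bitwise rule, sameForm should be "number" and mixedForms should be "undefined", even though both programs look identical on screen.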
Comment 9 Gavin Barraclough 2011-06-14 12:15:14 PDT
(In reply to comment #8)
> As long as JavaScript spec claims conformance to Unicode, it simply cannot perform bitwise comparisons on strings.

I'm not aware that the ES spec claims full Unicode conformance - in fact, quite the opposite.  ES, for example, explicitly permits malformed UTF-16 strings with unmatched surrogates.

> You certainly don't want "var é=1; alert(é)" to alert "undefined" simply because the source has been edited in different editors, which use different normalization forms.

I'm afraid that if we want to retain compatibility with the ES5 spec, then é and é are different variables, if they have different Code Unit representations.  The spec is quite deliberate on this matter.

I think it's also worth noting that this is not something simple that could be safely fixed by the browser preprocessing the source, without further language changes.  Consider the situation of a site that serves ES source in a non-canonical form, where that source accesses binary encoded data interpreted as strings and used as object property names, accessed via a source other than the program text (e.g. a script that pulls data containing strings over XHR).  If the source text is in a non-canonical representation and the property names are also in the same non-canonical representation, then this code will work correctly, in a spec defined fashion.  If you preprocess the source to canonicalize, without further (non spec-compliant) changes in object property access, then this script will start to fail (property names accessed via identifiers in the source would no longer match the properties on the object constructed from the loaded data).  Similar problems could be described where identifiers are compared to dynamically composed strings.
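
A minimal sketch of that scenario, under stated assumptions: the JSON payload stands in for data pulled over XHR, the key is the decomposed é, the helper readViaIdentifier is hypothetical, and String.prototype.normalize (ES2015) plays the part of the browser-side preprocessing.

```javascript
// Non-canonical (decomposed) property name, used by both the data and the script.
const rawKey = "e\u0301";

// Stand-in for data fetched at runtime; its key keeps the decomposed form.
const loadedData = JSON.parse(JSON.stringify({ [rawKey]: 42 }));

// Script source that reads the property through an identifier in the same form.
const accessSource = "return data." + rawKey + ";";

// Hypothetical helper: evaluate the script source against the loaded data.
function readViaIdentifier(source) {
  return new Function("data", source)(loadedData);
}

// As served: identifier and property name match code unit for code unit.
const valueAsServed = readViaIdentifier(accessSource);

// After canonicalizing only the source: the identifier becomes precomposed,
// the property name in the data does not, and the lookup misses.
const valueAfterNfc = readViaIdentifier(accessSource.normalize("NFC"));

console.log(valueAsServed, valueAfterNfc);
```

The first lookup should return 42 and the second undefined, which is why normalizing the source alone, without also changing property access, would break a script that currently works.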

To resolve these issues you would have to change property access, or possibly string composition and maybe equality behavior.  None of these steps are things we'd want to consider outside of the ES process.

> This kind of mistake is nearly impossible to spot, and it's the job of the browser to ensure Unicode conformance by not making a difference between normalization forms.

I would respectfully point out that the Unicode standard is not the only standard out there, and it does not usurp and supersede all other standards.  I don't see it as our job to crusade for Unicode conformance at the expense of compliance with any non-compliant specifications.  Instead, I would propose we should work together with the relevant standards bodies to help move towards better Unicode handling.  (And if you are willing to do so, I think you'll find ECMA open to such an approach.  If you've been following the mailing lists you may have seen recent threads discussing adoption of UTF-32 support in ES strings.)
Comment 10 Alexey Proskuryakov 2011-06-14 13:29:16 PDT
> Instead, I would propose we should work together with the relevant standards bodies to help move towards better Unicode handling.

Besides ECMA, both W3C <http://www.w3.org/International/core/> and the Unicode consortium <http://www.unicode.org/reports/tr20/> have been looking into these issues, and it's not clear to me who has the most expertise and authority to resolve them. I do not think that tracking this in our Bugzilla somehow violates the spirit of open collaboration.
Comment 11 Gavin Barraclough 2012-09-25 13:45:28 PDT
Hopefully the ES6 & next specs will bring greater Unicode conformance, but right now our implementation is correct per ES5.