4919 – Normalization of character sequences in identifiers

RESOLVED WONTFIX 4919

https://bugs.webkit.org/show_bug.cgi?id=4919

Alexey Proskuryakov

Reported 2005-09-10 12:59:47 PDT

Ecma-262 says:

> Two identifiers that are canonically equivalent according to the Unicode standard are not equal unless they are represented by the exact same sequence of code points (in other words, conforming ECMAScript implementations are only required to do bitwise comparison on identifiers). The intent is that the incoming source text has been converted to normalised form C before it reaches the compiler.

Well, if the source text must be normalized before it reaches the compiler, then it shouldn't matter whether decomposed or precomposed forms are used... Common sense says the same (otherwise, it's too easy to make extremely hard-to-find mistakes), but this test case shows it doesn't work this way in either Safari or Firefox.
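To illustrate the distinction being reported, here is a minimal sketch (not the attached test case) comparing a precomposed and a decomposed "é". It assumes an engine that provides String.prototype.normalize, which was only added in ES2015 and so was not available when this bug was filed; the variable names are hypothetical.

```javascript
// Two canonically equivalent spellings of "é".
const precomposed = "\u00E9";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
const decomposed = "e\u0301";   // "e" followed by U+0301 COMBINING ACUTE ACCENT

// Bitwise (code unit) comparison, as the spec requires for identifiers.
const bitwiseEqual = precomposed === decomposed;

// Comparison after converting both sides to normalization form C.
const canonicallyEqual =
  precomposed.normalize("NFC") === decomposed.normalize("NFC");

console.log(bitwiseEqual, canonicallyEqual); // false true
```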

Attachments
test case (714 bytes, text/html) 2005-09-10 13:00 PDT, Alexey Proskuryakov

Alexey Proskuryakov

Comment 1 2005-09-10 13:00:11 PDT

Created attachment 3847: test case

Eric Seidel (no email)

Comment 2 2010-11-08 11:31:52 PST

Minefield fails too. Is this still considered a bug?

Alexey Proskuryakov

Comment 3 2010-11-08 12:57:41 PST

Likely so. The current situation is just an invitation for IOCCC-style contests. What does ECMAScript 5 say?

Gavin Barraclough

Comment 4 2011-06-13 22:38:59 PDT

ES5 further clarifies this point, see section 6: "Conforming ECMAScript implementations are not required to perform any normalization of text, or behave as though they were performing normalization of text, themselves."

The intention seems very clear here - we should not introduce an unreasonable computational burden in lexing. Sources served to the browser should be in a normalized form. If this is overly onerous on developers working in languages affected by this requirement then I'd suggest filing a bug against the ECMA spec.

We are in compliance with the spec here, this bug is invalid.

Alexey Proskuryakov

Comment 5 2011-06-13 23:15:16 PDT

Even if the spec said so, it would have been simply wrong. But it actually avoids answering the question by magically separating "the compiler" from other parts of JavaScript implementation. See e.g. section 7.6: "The intent is that the incoming source text has been converted to normalised form C before it reaches the compiler." Whether normalization is in JavaScriptCore or in WebCore, treating canonically equivalent strings as equal is a fundamental feature of the Unicode specification.

Gavin Barraclough

Comment 6 2011-06-14 02:34:11 PDT

(In reply to comment #5)
> Even if the spec said so, it would have been simply wrong. But it actually avoids answering the question by magically separating "the compiler" from other parts of JavaScript implementation. See e.g. section 7.6: "The intent is that the incoming source text has been converted to normalised form C before it reaches the compiler."

I think that it is important to read this sentence from the spec in its full context:

"Two IdentifierName that are canonically equivalent according to the Unicode standard are not equal unless they are represented by the exact same sequence of code units (in other words, conforming ECMAScript implementations are only required to do bitwise comparison on IdentifierName values). The intent is that the incoming source text has been converted to normalised form C before it reaches the compiler."

The spec explicitly recognizes that non-canonical input may be passed to the lexer, and explicitly states what the correct behaviour should be in this case – that the two IdentifierNames are not equal. This refutes any notion that there is an expectation that the source text should be converted into a normalized form (in fact, this is explicitly stated in chapter 6 - "Conforming ECMAScript implementations are not required to perform any normalisation of text, or behave as though they were performing normalisation of text, themselves.").

It does not seem likely that the ES5 authors' intention in the last sentence of this paragraph was to contradict the behaviour specified in the preceding sentence, or in chapter 6. Rather than a definition of behaviour, this is clearly a statement of intent (it does begin, "The intent is that ..."). My reading would be that, by defining the language not to normalize, their intention is to encourage people to serve source text in a canonicalized form.

> Whether normalization is in JavaScriptCore or in WebCore, treating canonically equivalent strings as equal is a fundamental feature of the Unicode specification.

This argument seems to be predicated on the assumption that ES5 source text can be treated as a string of Unicode text - which is incorrect. ES5 source text is explicitly defined to be a sequence of UTF-16 code units (see chapter 6), and equality of identifiers within ES5 source text is explicitly defined to be a bitwise comparison.

Gavin Barraclough

Comment 7 2011-06-14 02:34:18 PDT

Also, if you preprocess the source to canonicalize, you may introduce incorrect behaviour! Consider the following example:

    var NAME1 = 1;
    var NAME2 = 2;
    alert(NAME1 + NAME2);

In a conforming ES5 implementation the result of the add should be 3, provided that NAME1 and NAME2 are two distinct IdentifierNames according to the spec's definition - that is, any two identifiers that are not the exact same sequence of Code Units (even if they are canonically equivalent according to the Unicode standard). However, if you were to preprocess the source to canonicalize, you may make NAME1 and NAME2 equal, and as a result the add would yield the result 4.
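The NAME1/NAME2 scenario above can be sketched concretely. This is only an illustration under assumptions: it picks precomposed "é" (U+00E9) and decomposed "e" + U+0301 as the two identifiers, uses new Function to stand in for the engine's lexer, and uses String.prototype.normalize (ES2015, not available to ES5 implementations) to stand in for a hypothetical canonicalizing preprocessor.

```javascript
// Source text with two canonically equivalent but bitwise-distinct identifiers.
// The string escapes produce real characters before the source is parsed.
const source = "var \u00E9 = 1; var e\u0301 = 2; return \u00E9 + e\u0301;";

// Evaluated as written: two separate variables, 1 + 2.
const asWritten = new Function(source)();

// Evaluated after a (hypothetical) NFC preprocessing pass:
// both identifiers collapse into one variable, 2 + 2.
const afterNfc = new Function(source.normalize("NFC"))();

console.log(asWritten, afterNfc); // 3 4
```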

Alexey Proskuryakov

Comment 8 2011-06-14 08:59:53 PDT

As long as JavaScript spec claims conformance to Unicode, it simply cannot perform bitwise comparisons on strings. You certainly don't want "var é=1; alert(é)" to alert "undefined" simply because the source has been edited in different editors, which use different normalization forms. This kind of mistake is nearly impossible to spot, and it's the job of the browser to ensure Unicode conformance by not making a difference between normalization forms.
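A minimal sketch of the editor mismatch described above, assuming the declaration was saved in precomposed form and the use in decomposed form (the choice of forms and the variable names are illustrative). In engines that compare identifiers bitwise, reading the undeclared decomposed name surfaces as a ReferenceError rather than a value.

```javascript
// Declaration uses precomposed "é"; the read uses decomposed "e" + U+0301.
const source = "var \u00E9 = 1; return e\u0301;";

let outcome;
try {
  outcome = String(new Function(source)());
} catch (err) {
  // The decomposed name was never declared, so the lookup fails.
  outcome = err.name;
}

console.log(outcome); // ReferenceError
```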

Gavin Barraclough

Comment 9 2011-06-14 12:15:14 PDT

(In reply to comment #8)
> As long as JavaScript spec claims conformance to Unicode, it simply cannot perform bitwise comparisons on strings.

I don't know the ES spec to claim full Unicode conformance - in fact, quite the opposite. ES, for example, explicitly permits malformed UTF-16 strings with unmatched surrogates.

> You certainly don't want "var é=1; alert(é)" to alert "undefined" simply because the source has been edited in different editors, which use different normalization forms.

I'm afraid that if we want to retain compatibility with the ES5 spec, then é and é are different variables, if they have different Code Unit representations. The spec is quite deliberate on this matter.

I think it's also worth noting that this is not something simple that could be safely fixed by the browser preprocessing the source, without further language changes. Consider the situation of a site that serves ES source in a non-canonical form, where that source accesses binary encoded data interpreted as strings and used as object property names, accessed via a source other than the program text (e.g. a script that pulls data containing strings over XHR). If the source text is in a non-canonical representation and the property names are also in the same non-canonical representation, then this code will work correctly, in a spec defined fashion. If you preprocess the source to canonicalize, without further (non spec-compliant) changes in object property access, then this script will start to fail (property names accessed via identifiers in the source would no longer match the properties on the object constructed from the loaded data). Similar problems could be described where identifiers are compared to dynamically composed strings. To resolve these issues you would have to change property access, or possibly string composition and maybe equality behavior. None of these steps are things we'd want to consider outside of the ES process.

> This kind of mistake is nearly impossible to spot, and it's the job of the browser to ensure Unicode conformance by not making a difference between normalization forms.

I would respectfully point out that the Unicode standard is not the only standard out there, and it does not usurp and supersede all other standards. I don't see it as our job to crusade for Unicode conformance at the expense of compliance with any non-compliant specifications. Instead, I would propose we should work together with the relevant standards bodies to help move towards better Unicode handling. (And if you are willing to do so, I think you'll find ECMA open to such an approach. If you've been following the mailing lists you may have seen recent threads discussing adoption of UTF-32 support in ES strings.)
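The property-name scenario from this comment can be sketched as follows. Everything here is an assumption made for illustration: the JSON payload stands in for data pulled over XHR, the property name "cafe" + U+0301 is hypothetical, and String.prototype.normalize (ES2015) stands in for a browser-side preprocessing step that only touches the source text.

```javascript
// Loaded data whose property name is in decomposed (non-canonical) form.
const loaded = JSON.parse('{"cafe\\u0301": 42}');

// Program text that accesses the property via an identifier in the same form.
const source = "return data.cafe\u0301;";

// As served: identifier and property name match code unit for code unit.
const asServed = new Function("data", source)(loaded);

// After normalizing only the source: the identifier becomes "caf\u00E9",
// which no longer matches the property name in the loaded data.
const afterNfc = new Function("data", source.normalize("NFC"))(loaded);

console.log(asServed, afterNfc); // 42 undefined
```

Making the second lookup succeed would require normalizing property keys as well, which is the kind of language-level change the comment argues belongs in the ES process.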

Alexey Proskuryakov

Comment 10 2011-06-14 13:29:16 PDT

> Instead, I would propose we should work together with the relevant standards bodies to help move towards better Unicode handling.

Besides ECMA, both W3C <http://www.w3.org/International/core/> and the Unicode consortium <http://www.unicode.org/reports/tr20/> have been looking into these issues, and it's not clear to me who has the most expertise and authority to resolve them. I do not think that tracking this in our Bugzilla somehow violates the spirit of open collaboration.

Gavin Barraclough

Comment 11 2012-09-25 13:45:28 PDT

Hopefully the ES6 and later specs will bring greater Unicode conformance, but right now our implementation is correct per ES5.

Status: RESOLVED
Resolution: WONTFIX
Priority: P2
Severity: Normal
Classification: Unclassified
Version: 420+
Hardware: Mac
OS: OS X 10.4
Product: WebKit
Component: JavaScriptCore
Assignee: Geoffrey Garen
Reported: 2005-09-10 12:59 PDT
Modified: 2012-09-25 13:45 PDT

Blocks: 4885
