Bug 155114 - [ES6] Regular Expression canonicalization tables for Unicode need to be updated to use Unicode CaseFolding.txt
Summary: [ES6] Regular Expression canonicalization tables for Unicode need to be updat...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Michael Saboff
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2016-03-07 08:43 PST by Michael Saboff
Modified: 2016-03-08 10:35 PST (History)
1 user (show)

See Also:


Attachments
Patch (295.45 KB, patch)
2016-03-07 09:06 PST, Michael Saboff
no flags Details | Formatted Diff | Diff
Patch updated to fix CMake builds (295.70 KB, patch)
2016-03-07 09:12 PST, Michael Saboff
no flags Details | Formatted Diff | Diff
Patch (298.27 KB, patch)
2016-03-07 23:10 PST, Michael Saboff
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Saboff 2016-03-07 08:43:28 PST
In the ECMAScript 6.0 standard, the rules for case insensitive matching in regular expressions describes a Canonicalize function that has different behavior depending on whether or not the regular expression is a Unicode regular express, set with the ‘u’ flag.

The canonicalize behavior is table driven with those tables generated by a JavaScript file, YarrCanonicalizeUnicode.js, that is used to YarrCanonicalizeUnicode.js.cpp file.  The semantics for the non-Unicode case are handled properly by the JavaScript generating file, however the Unicode tables cannot easily be handled with JavaScript.  Those semantics reference the Unicode Character Database (UCD) file CaseFolding.txt.
Comment 1 Michael Saboff 2016-03-07 09:03:23 PST
<rdar://problem/24955033>
Comment 2 Michael Saboff 2016-03-07 09:06:26 PST
Created attachment 273183 [details]
Patch
Comment 3 WebKit Commit Bot 2016-03-07 09:08:58 PST
Attachment 273183 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.cpp:51:  UCS2_CANONICALIZATION_SETS is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.cpp:70:  UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:40:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:41:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:42:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:43:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:44:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:54:  UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:58:  UNICODE_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:62:  enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums.  [readability/enum_casing] [4]
Total errors found: 10 in 15 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 4 Michael Saboff 2016-03-07 09:12:57 PST
Created attachment 273185 [details]
Patch updated to fix CMake builds
Comment 5 WebKit Commit Bot 2016-03-07 09:14:19 PST
Attachment 273185 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.cpp:51:  UCS2_CANONICALIZATION_SETS is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.cpp:70:  UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:40:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:41:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:42:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:43:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:44:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:54:  UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:58:  UNICODE_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:62:  enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums.  [readability/enum_casing] [4]
Total errors found: 10 in 15 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 6 Geoffrey Garen 2016-03-07 09:23:03 PST
Comment on attachment 273185 [details]
Patch updated to fix CMake builds

View in context: https://bugs.webkit.org/attachment.cgi?id=273185&action=review

> Source/JavaScriptCore/ChangeLog:10
> +        Extracted out the Unicode canonicalization table creation from YarrCanonicalizeUnicode.js
> +        into a new Python script, generate-unicode-canonical-tables.  That script generates the Unicode
> +        tables as a .h file.

I don't think we want a giant table in a .h file. Do we plan to #include it in multiple places? We'd like to have a single copy of the table in the binary.

> Source/JavaScriptCore/DerivedSources.make:64
> +    YarrCanonicalizeUnicode.h \

Why is one part of canonicalization generated by a manually executed JavaScript program, while another is generated by a build-system-executed Python program? I think this system might be easier to follow if it all happened at once -- either manually or at build time.

> Source/JavaScriptCore/yarr/YarrCanonicalize.js:218
> +// createTables("Unicode", MAX_UNICODE, createUnicodeCanonicalGroups());

What's up with this commented out code?
Comment 7 Michael Saboff 2016-03-07 09:29:07 PST
(In reply to comment #6)
> Comment on attachment 273185 [details]
> Patch updated to fix CMake builds
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=273185&action=review
> 
> > Source/JavaScriptCore/ChangeLog:10
> > +        Extracted out the Unicode canonicalization table creation from YarrCanonicalizeUnicode.js
> > +        into a new Python script, generate-unicode-canonical-tables.  That script generates the Unicode
> > +        tables as a .h file.
> 
> I don't think we want a giant table in a .h file. Do we plan to #include it
> in multiple places? We'd like to have a single copy of the table in the
> binary.

It is only include in one .cpp file.  I could change it to generate a .cpp file if you prefer.

> > Source/JavaScriptCore/DerivedSources.make:64
> > +    YarrCanonicalizeUnicode.h \
> 
> Why is one part of canonicalization generated by a manually executed
> JavaScript program, while another is generated by a build-system-executed
> Python program? I think this system might be easier to follow if it all
> happened at once -- either manually or at build time.

The rules description for Canonicalize (https://tc39.github.io/ecma262/2016/#sec-runtime-semantics-canonicalize-ch) describe using CaseFolding.txt for the Unicode case and "as if by performing the algorithm for String.prototype.toUpperCase" for the non-Unicode case.  For all intents and purposes, the non-Unicode case doesn't need to be generated again unless there are future non-Unicode standards changes.

> > Source/JavaScriptCore/yarr/YarrCanonicalize.js:218
> > +// createTables("Unicode", MAX_UNICODE, createUnicodeCanonicalGroups());
> 
> What's up with this commented out code?

Deleted.
Comment 8 Michael Saboff 2016-03-07 23:10:06 PST
Created attachment 273276 [details]
Patch

Updated to generate the Unicode tables into DerivedSource/JavaScriptCore/YarrCanonicalizeUnicode.cpp.

Renamed yarr/YarrCanonicalizeUnicode.js and yarr/YarrCanonicalizeUnicode.cpp back to their original names, yarr/YarrCanonicalizeUCS2.js and .cpp.

Fixed CMake builds.
Comment 9 WebKit Commit Bot 2016-03-07 23:11:30 PST
Attachment 273276 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:40:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:41:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:42:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:43:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:44:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:54:  UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:58:  UNICODE_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalize.h:62:  enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums.  [readability/enum_casing] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp:49:  UCS2_CANONICALIZATION_SETS is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp:68:  UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
Total errors found: 10 in 15 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 10 Darin Adler 2016-03-08 09:07:44 PST
Comment on attachment 273276 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=273276&action=review

Looks like a great change, specifically about the need to use the case folding table even when ASCII is involved. Are these the toughest test cases, though? Any that shed light on the cases where canonicalization doesn’t do the same thing that comparing against both upper and lower case would?

Not sure there is a Windows problem, by the way. The Windows bot says it’s unable to *apply* the patch, not failing to compile once it’s applied.

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:381
> +                // patterns, Unicode values are never allowed to match against ascii ones.

I suggest ASCII rather than ascii here.

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:382
> +                // For Unicode, we need to check all canonical equvolents of a character.

Typo: equvolents

> Source/JavaScriptCore/yarr/YarrPattern.cpp:77
> +            // Handle ascii cases.

I suggest ASCII rather than ascii. But also, not sure this comment has value.
Comment 11 Michael Saboff 2016-03-08 10:12:54 PST
(In reply to comment #10)
> Comment on attachment 273276 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=273276&action=review
> 
> Looks like a great change, specifically about the need to use the case
> folding table even when ASCII is involved. Are these the toughest test
> cases, though? Any that shed light on the cases where canonicalization
> doesn’t do the same thing that comparing against both upper and lower case
> would?

The test cases that I added are the ones that are "cross-ASCII", that is a non-ASCII character canonicalizes to an ASCII character.  There are existing ignore case tests that handle ASCII, BMP and non-BMP cases.  This patch also passes the Chakra regex-case-folding.js test.

> Not sure there is a Windows problem, by the way. The Windows bot says it’s
> unable to *apply* the patch, not failing to compile once it’s applied.
> 

Agreed.

> > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:381
> > +                // patterns, Unicode values are never allowed to match against ascii ones.
> 
> I suggest ASCII rather than ascii here.
...

I took care of the typos.
Comment 12 Michael Saboff 2016-03-08 10:35:46 PST
Committed r197781: <http://trac.webkit.org/changeset/197781>