Bug 172069

Summary: Implement RegExp Unicode property escapes
Product: WebKit Reporter: Mathias Bynens <mathias>
Component: JavaScriptCoreAssignee: Michael Saboff <msaboff>
Status: RESOLVED FIXED    
Severity: Normal CC: buildbot, fpizlo, jfbastien, keith_miller, mark.lam, mathias, msaboff, rniwa, ryanhaddad, sbarati, webkit-bug-importer, ysuzuki
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
URL: https://github.com/tc39/proposal-regexp-unicode-property-escapes
See Also: https://github.com/Microsoft/ChakraCore/issues/2969
https://bugzilla.mozilla.org/show_bug.cgi?id=1361876
https://bugs.chromium.org/p/v8/issues/detail?id=4743
Attachments:
Description Flags
UCD Datefiles for Patch
none
Main Patch
jfbastien: review+
Combined patch for EWS
none
Patch for EWS with suggested changes and speculative build fixes.
none
Patch for EWS with updated build fixes
buildbot: commit-queue-
Archive of layout-test-results from ews100 for mac-elcapitan
none
Archive of layout-test-results from ews105 for mac-elcapitan-wk2
none
Patch for EWS with test fix none

Description Mathias Bynens 2017-05-13 02:42:15 PDT
This proposal is currently at Stage 3 and seeking implementations: https://github.com/tc39/proposal-regexp-unicode-property-escapes

Tests: https://github.com/tc39/test262/tree/master/test/built-ins/RegExp/property-escapes
Comment 1 Radar WebKit Bug Importer 2017-05-13 20:23:48 PDT
<rdar://problem/32182540>
Comment 2 Michael Saboff 2017-09-27 06:43:09 PDT
Active Radar for the bug is <rdar://problem/33183184>
Comment 3 Michael Saboff 2017-10-06 15:11:54 PDT
Created attachment 323050 [details]
UCD Datefiles for Patch

These are Unicode Database version 10 files that are input for the main patch.
Comment 4 Michael Saboff 2017-10-06 15:12:15 PDT
Created attachment 323051 [details]
Main Patch
Comment 5 Build Bot 2017-10-06 15:14:39 PDT
Attachment 323051 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp:72:  Alphabetical sorting problem.  [build/include_order] [4]
ERROR: Source/JavaScriptCore/yarr/YarrPattern.cpp:1026:  One space before end of line comments  [whitespace/comments] [5]
Total errors found: 2 in 21 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 6 Michael Saboff 2017-10-06 15:16:55 PDT
Created attachment 323053 [details]
Combined patch for EWS

Review the "Main Patch".
Comment 7 Build Bot 2017-10-06 15:19:18 PDT
Attachment 323053 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:738:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:996:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:1135:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:1692:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/ScriptExtensions.txt:38:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/yarr/YarrPattern.cpp:1026:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp:72:  Alphabetical sorting problem.  [build/include_order] [4]
ERROR: Source/JavaScriptCore/ucd/Scripts.txt:18:  Line contains tab character.  [whitespace/tab] [5]
Total errors found: 8 in 31 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 8 Build Bot 2017-10-06 15:29:07 PDT
Attachment 323050 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:738:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:996:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:1135:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:1692:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/Scripts.txt:18:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/ScriptExtensions.txt:38:  Line contains tab character.  [whitespace/tab] [5]
Total errors found: 6 in 10 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 9 JF Bastien 2017-10-06 15:53:39 PDT
Comment on attachment 323051 [details]
Main Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=323051&action=review

r=me

> Source/JavaScriptCore/ChangeLog:22
> +        names to the function index.  Using these hashing tables, we can lookup up a property

up up up up

> Source/JavaScriptCore/Scripts/generateYarrUnicodePropertyTables.py:1
> +#! /usr/bin/python

#!/usr/bin/env python

> Source/JavaScriptCore/Scripts/generateYarrUnicodePropertyTables.py:62
> +// DO NO EDIT! - This file was generated by generateYarrUnicodePropertyTables.py

""" + __file__ + """

> Source/JavaScriptCore/Scripts/generateYarrUnicodePropertyTables.py:117
> +        exit(1)

This is all a lot of error handling for a python script! Such dedication, I'd just `with open(...) as ...:` and call it a day. :)

> Source/JavaScriptCore/Scripts/hasher.py:1
> +#! /usr/bin/python

Ditto on shebang.

> Source/JavaScriptCore/Scripts/hasher.py:27
> +# Boston, MA 02110-1301, USA.

Isn't this all newly translated code?

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:307
> +            size_t high = matches.size() - 1;

Is it ever empty?

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:336
> +            size_t high = ranges.size() - 1;

Ditto.

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:354
> +        const size_t thresholdForBinarySearch = 6;

I guess it's never empty because of this!

Did you choose 6 out of some scientific measurement? Or pulled the number from a Magical Place?

> Source/JavaScriptCore/yarr/YarrParser.h:458
> +                    // m_err should be set tryConsumeUnicodePropertyExpression()

I don't get this comment. "set by" ?

> Source/JavaScriptCore/yarr/YarrPattern.cpp:98
> +                addSortedMatchOrRange(min, max + 1);

Can this ever overflow max? Ditto below for the arithmetic.

> Source/JavaScriptCore/yarr/YarrPattern.cpp:418
>      void atomBuiltInCharacterClass(BuiltInCharacterClassID classID, bool invert)

I find calls to this hard to read because of the boolean param. Could you make it an enum class CharacterClassInvert { Do, Dont }; or something similar?

> Source/JavaScriptCore/yarr/YarrPattern.cpp:1389
> +

Extra

> Source/JavaScriptCore/yarr/YarrPattern.h:50
> +    }

You could just set these values above, and then rely on the compiler auto-generating the default ctor properly.

> Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp:52
> +    ALWAYS_INLINE int entry(WTF::String key) const

const WTF::String &   ?

> Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp:75
> +std::optional<BuiltInCharacterClassID> unicodeMatchPropertyValue(WTF::String unicodePropertyName, WTF::String unicodePropertyValue)

const WTF::String &

> Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp:92
> +std::optional<BuiltInCharacterClassID> unicodeMatchProperty(WTF::String unicodePropertyValue)

const WTF::String &
Comment 10 Michael Saboff 2017-10-06 16:51:21 PDT
Comment on attachment 323051 [details]
Main Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=323051&action=review

>> Source/JavaScriptCore/ChangeLog:22
>> +        names to the function index.  Using these hashing tables, we can lookup up a property
> 
> up up up up

down with "up" - removed.

>> Source/JavaScriptCore/Scripts/generateYarrUnicodePropertyTables.py:1
>> +#! /usr/bin/python
> 
> #!/usr/bin/env python

Fixed

>> Source/JavaScriptCore/Scripts/generateYarrUnicodePropertyTables.py:62
>> +// DO NO EDIT! - This file was generated by generateYarrUnicodePropertyTables.py
> 
> """ + __file__ + """

Fixed

>> Source/JavaScriptCore/Scripts/generateYarrUnicodePropertyTables.py:117
>> +        exit(1)
> 
> This is all a lot of error handling for a python script! Such dedication, I'd just `with open(...) as ...:` and call it a day. :)

I do this since there are several UCD files that must be present for this script to run.

>> Source/JavaScriptCore/Scripts/hasher.py:1
>> +#! /usr/bin/python
> 
> Ditto on shebang.

Fixed

>> Source/JavaScriptCore/Scripts/hasher.py:27
>> +# Boston, MA 02110-1301, USA.
> 
> Isn't this all newly translated code?

I translated it from the Perl and C++ code we already have.  I believe that the license is required.

>> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:307
>> +            size_t high = matches.size() - 1;
> 
> Is it ever empty?

No, see below.

>> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:354
>> +        const size_t thresholdForBinarySearch = 6;
> 
> I guess it's never empty because of this!
> 
> Did you choose 6 out of some scientific measurement? Or pulled the number from a Magical Place?

I tried various numbers, but didn't do rigorous testing before settling on 6.

>> Source/JavaScriptCore/yarr/YarrParser.h:458
>> +                    // m_err should be set tryConsumeUnicodePropertyExpression()
> 
> I don't get this comment. "set by" ?

Yes, "set by".  Since it only sets on error, I changed this to: "tryConsumeUnicodePropertyExpression() will set m_err for a malformed property expression
"

>> Source/JavaScriptCore/yarr/YarrPattern.cpp:98
>> +                addSortedMatchOrRange(min, max + 1);
> 
> Can this ever overflow max? Ditto below for the arithmetic.

Not with in the foreseeable future range of Unicode characters.  The current max is 0x10ffff, well below 2^32-1.

>> Source/JavaScriptCore/yarr/YarrPattern.cpp:418
>>      void atomBuiltInCharacterClass(BuiltInCharacterClassID classID, bool invert)
> 
> I find calls to this hard to read because of the boolean param. Could you make it an enum class CharacterClassInvert { Do, Dont }; or something similar?

This is a pre-existing pattern.  It may be this way since this is a delegate function that is part of a template parameter class.

>> Source/JavaScriptCore/yarr/YarrPattern.cpp:1389
>> +
> 
> Extra

Removed.

>> Source/JavaScriptCore/yarr/YarrPattern.h:50
>> +    }
> 
> You could just set these values above, and then rely on the compiler auto-generating the default ctor properly.

Done

>> Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp:52
>> +    ALWAYS_INLINE int entry(WTF::String key) const
> 
> const WTF::String &   ?

Done

>> Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp:75
>> +std::optional<BuiltInCharacterClassID> unicodeMatchPropertyValue(WTF::String unicodePropertyName, WTF::String unicodePropertyValue)
> 
> const WTF::String &

That breaks the compile as this is called with the result from another function.
Comment 11 Michael Saboff 2017-10-06 17:21:03 PDT
Created attachment 323072 [details]
Patch for EWS with suggested changes and speculative build fixes.
Comment 12 Build Bot 2017-10-06 17:24:26 PDT
Attachment 323072 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:738:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:996:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:1135:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:1692:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/ScriptExtensions.txt:38:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/yarr/YarrPattern.cpp:1026:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp:72:  Alphabetical sorting problem.  [build/include_order] [4]
ERROR: Source/JavaScriptCore/ucd/Scripts.txt:18:  Line contains tab character.  [whitespace/tab] [5]
Total errors found: 8 in 32 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 13 Michael Saboff 2017-10-09 11:48:44 PDT
Created attachment 323194 [details]
Patch for EWS with updated build fixes
Comment 14 Build Bot 2017-10-09 11:52:16 PDT
Attachment 323194 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:738:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:996:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:1135:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:1692:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/ScriptExtensions.txt:38:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/yarr/YarrPattern.cpp:1026:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp:72:  Alphabetical sorting problem.  [build/include_order] [4]
ERROR: Source/JavaScriptCore/ucd/Scripts.txt:18:  Line contains tab character.  [whitespace/tab] [5]
Total errors found: 8 in 32 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 15 Build Bot 2017-10-09 13:09:27 PDT
Comment on attachment 323194 [details]
Patch for EWS with updated build fixes

Attachment 323194 [details] did not pass mac-ews (mac):
Output: http://webkit-queues.webkit.org/results/4802520

New failing tests:
js/regexp-unicode-properties.html
Comment 16 Build Bot 2017-10-09 13:09:29 PDT
Created attachment 323202 [details]
Archive of layout-test-results from ews100 for mac-elcapitan

The attached test failures were seen while running run-webkit-tests on the mac-ews.
Bot: ews100  Port: mac-elcapitan  Platform: Mac OS X 10.11.6
Comment 17 Build Bot 2017-10-09 13:13:06 PDT
Comment on attachment 323194 [details]
Patch for EWS with updated build fixes

Attachment 323194 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-queues.webkit.org/results/4802548

New failing tests:
js/regexp-unicode-properties.html
Comment 18 Build Bot 2017-10-09 13:13:08 PDT
Created attachment 323206 [details]
Archive of layout-test-results from ews105 for mac-elcapitan-wk2

The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews.
Bot: ews105  Port: mac-elcapitan-wk2  Platform: Mac OS X 10.11.6
Comment 19 Michael Saboff 2017-10-09 13:19:47 PDT
Created attachment 323207 [details]
Patch for EWS with test fix

Converted UTF-8 strings in test file to Unicode escapes.
Comment 20 Build Bot 2017-10-09 13:22:18 PDT
Attachment 323207 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:738:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:996:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:1135:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt:1692:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/ucd/ScriptExtensions.txt:38:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Source/JavaScriptCore/yarr/YarrPattern.cpp:1026:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp:72:  Alphabetical sorting problem.  [build/include_order] [4]
ERROR: Source/JavaScriptCore/ucd/Scripts.txt:18:  Line contains tab character.  [whitespace/tab] [5]
Total errors found: 8 in 32 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 21 Michael Saboff 2017-10-09 16:14:48 PDT
Committed r223081: <http://trac.webkit.org/changeset/223081>
Comment 22 Ryan Haddad 2017-10-09 18:12:00 PDT
The following tests were marked as passing, but are failing on the Release test262 bot:

test262.yaml/test262/test/built-ins/RegExp/named-groups/unicode-property-names.js.default-strict: ERROR: Unexpected exit code: 3
test262.yaml/test262/test/built-ins/RegExp/named-groups/unicode-property-names.js.default: ERROR: Unexpected exit code: 3
test262.yaml/test262/test/built-ins/RegExp/named-groups/unicode-references.js.default-strict: ERROR: Unexpected exit code: 3
test262.yaml/test262/test/built-ins/RegExp/named-groups/unicode-references.js.default: ERROR: Unexpected exit code: 3

https://build.webkit.org/builders/Apple%20El%20Capitan%20Release%20Test262%20%28Tests%29/builds/4731
Comment 23 Michael Saboff 2017-10-11 11:41:11 PDT
(In reply to Ryan Haddad from comment #22)
> The following tests were marked as passing, but are failing on the Release
> test262 bot:
> 
> test262.yaml/test262/test/built-ins/RegExp/named-groups/unicode-property-
> names.js.default-strict: ERROR: Unexpected exit code: 3
> test262.yaml/test262/test/built-ins/RegExp/named-groups/unicode-property-
> names.js.default: ERROR: Unexpected exit code: 3
> test262.yaml/test262/test/built-ins/RegExp/named-groups/unicode-references.
> js.default-strict: ERROR: Unexpected exit code: 3
> test262.yaml/test262/test/built-ins/RegExp/named-groups/unicode-references.
> js.default: ERROR: Unexpected exit code: 3
> 
> https://build.webkit.org/builders/
> Apple%20El%20Capitan%20Release%20Test262%20%28Tests%29/builds/4731

Failing tests tracked in https://bugs.webkit.org/show_bug.cgi?id=178177