Bug 154842

Summary: [ES6] Add support for Unicode regular expressions
Product: WebKit Reporter: Michael Saboff <msaboff>
Component: JavaScriptCoreAssignee: Michael Saboff <msaboff>
Status: RESOLVED FIXED    
Severity: Normal CC: buildbot, commit-queue, mathias, rniwa, ryanhaddad
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on: 154863, 154875, 154898    
Bug Blocks: 151598    
Attachments:
Description Flags
Patch
buildbot: commit-queue-
Archive of layout-test-results from ews101 for mac-yosemite
none
Archive of layout-test-results from ews104 for mac-yosemite-wk2
none
Archive of layout-test-results from ews114 for mac-yosemite
none
Updated patch with suggested changes and fixed test
fpizlo: review+, buildbot: commit-queue-
Archive of layout-test-results from ews113 for mac-yosemite none

Description Michael Saboff 2016-02-29 17:00:27 PST
Add support for the ‘u’ flags and the processing of Unicode regular expressions as described in the EcmaScript 6 standard, section 21.2.  See https://tc39.github.io/ecma262/#sec-regexp-regular-expression-objects

<rdar://problem/24898527>
Comment 1 Michael Saboff 2016-02-29 18:36:43 PST
Created attachment 272536 [details]
Patch

This is a complete implementation in the Yarr interpreter.  I plan on adding a few more tests, however I believe that the new functionality is tested.  The test I'm thinking about adding would test code that wasn't changed, like counted, greedy and non-greedy back references.

This also passes the Chakra tests with slightly different results.  Those differences appear to be due to using different versions of the unicode standard.  On Mac OS X, we use 7.0, while the latest version is 8.0.
Comment 2 WebKit Commit Bot 2016-02-29 18:39:25 PST
Attachment 272536 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/yarr/YarrSyntaxChecker.cpp:38:  Missing space inside { }.  [whitespace/braces] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:38:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:39:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:40:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:41:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:42:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:52:  UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:56:  UNICODE_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:60:  enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums.  [readability/enum_casing] [4]
ERROR: Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1231:  One line control clauses should not use braces.  [whitespace/braces] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:51:  UCS2_CANONICALIZATION_SETS is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:70:  UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:527:  UNICODE_CANONICALIZATION_SETS is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:592:  UNICODE_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
Total errors found: 14 in 33 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 3 Build Bot 2016-02-29 19:31:45 PST
Comment on attachment 272536 [details]
Patch

Attachment 272536 [details] did not pass mac-ews (mac):
Output: http://webkit-queues.webkit.org/results/903723

New failing tests:
js/regexp-unicode.html
Comment 4 Build Bot 2016-02-29 19:31:48 PST
Created attachment 272540 [details]
Archive of layout-test-results from ews101 for mac-yosemite

The attached test failures were seen while running run-webkit-tests on the mac-ews.
Bot: ews101  Port: mac-yosemite  Platform: Mac OS X 10.10.5
Comment 5 Build Bot 2016-02-29 19:41:43 PST
Comment on attachment 272536 [details]
Patch

Attachment 272536 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-queues.webkit.org/results/903741

New failing tests:
js/regexp-unicode.html
Comment 6 Build Bot 2016-02-29 19:41:47 PST
Created attachment 272542 [details]
Archive of layout-test-results from ews104 for mac-yosemite-wk2

The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews.
Bot: ews104  Port: mac-yosemite-wk2  Platform: Mac OS X 10.10.5
Comment 7 Build Bot 2016-02-29 19:42:42 PST
Comment on attachment 272536 [details]
Patch

Attachment 272536 [details] did not pass mac-debug-ews (mac):
Output: http://webkit-queues.webkit.org/results/903733

New failing tests:
js/regexp-unicode.html
Comment 8 Build Bot 2016-02-29 19:42:45 PST
Created attachment 272543 [details]
Archive of layout-test-results from ews114 for mac-yosemite

The attached test failures were seen while running run-webkit-tests on the mac-debug-ews.
Bot: ews114  Port: mac-yosemite  Platform: Mac OS X 10.10.5
Comment 9 Darin Adler 2016-03-01 08:57:04 PST
Comment on attachment 272536 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=272536&action=review

> Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:89
> -inline UChar getCanonicalPair(const UCS2CanonicalizationRange* info, UChar ch)
> +    inline UChar32 getCanonicalPair(const CanonicalizationRange* info, UChar32 ch)

Stray leading space got in there.

> Source/JavaScriptCore/yarr/YarrPattern.h:45
> +inline bool isFirstSurrogate(UChar c)
> +{
> +    return (c & 0xdc00) == 0xd800;
> +}
> +
> +inline bool isSecondSurrogate(UChar c)
> +{
> +    return (c & 0xdc00) == 0xdc00;
> +}

It’s not good to add these functions. In WebKit code we normally use the ICU library, and it has macros for this: U16_IS_LEAD and U16_IS_TRAIL. We use those elsewhere in the project. In fact, these functions are not correct; isFirstSurrogate will return true for some non-surrogate values such as 0xF800. We’d be better off using the ICU macro.

> Source/JavaScriptCore/yarr/YarrPattern.h:55
> +inline UChar32 toUnicode(UChar surrogate1, UChar surrogate2)
> +{
> +    ASSERT(isFirstSurrogate(surrogate1));
> +    ASSERT(isSecondSurrogate(surrogate2));
> +    UChar32 result = 0x10000 +((surrogate1 & 0x3ff) << 10) | (surrogate2 & 0x3ff);
> +    ASSERT(result <= UCHAR_MAX_VALUE);
> +
> +    return result;
> +}

Same issue here. ICU’s macro for this is U16_GET_SUPPLEMENTARY. I suggest having our code use the macro directly as we do in many places in WebCore.

> Source/JavaScriptCore/yarr/YarrPattern.h:60
> +inline bool isExtendedUnicode(UChar32 ch)
> +{
> +    return ch >= 0x10000;
> +}

The ICU macro for this is U_IS_BMP, but it’s got a reverse sense so we’d have to use !U_IS_BMP. There’s also U16_LENGTH which is good if you are trying to convert this into either a 1 or a 2 like in the QuantifierGreedy call site, and U_IS_SUPPLEMENTARY, which could also be used in some cases; it’s almost the same thing as !U_IS_BMP.
Comment 10 Michael Saboff 2016-03-01 10:44:29 PST
Found a bug in WebKitTestRunner / DumpRenderTree with the added tests.  Filed https://bugs.webkit.org/show_bug.cgi?id=154863.
Comment 11 Michael Saboff 2016-03-01 11:46:07 PST
Created attachment 272583 [details]
Updated patch with suggested changes and fixed test
Comment 12 WebKit Commit Bot 2016-03-01 11:48:34 PST
Attachment 272583 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/yarr/YarrSyntaxChecker.cpp:38:  Missing space inside { }.  [whitespace/braces] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:38:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:39:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:40:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:41:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:42:  One space before end of line comments  [whitespace/comments] [5]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:52:  UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:56:  UNICODE_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:60:  enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums.  [readability/enum_casing] [4]
ERROR: Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1231:  One line control clauses should not use braces.  [whitespace/braces] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:51:  UCS2_CANONICALIZATION_SETS is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:70:  UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:527:  UNICODE_CANONICALIZATION_SETS is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:592:  UNICODE_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
Total errors found: 14 in 33 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 13 Filip Pizlo 2016-03-01 12:12:05 PST
Comment on attachment 272583 [details]
Updated patch with suggested changes and fixed test

Can has perf data?  I think it would be good to ensure that Octane, SunSpider, and Kraken are unaffected.  They all have some benchmarks that do regexp.  They don't do unicode regexp but this touches a lot of code so it's good to run some numbers.
Comment 14 Build Bot 2016-03-01 12:48:31 PST
Comment on attachment 272583 [details]
Updated patch with suggested changes and fixed test

Attachment 272583 [details] did not pass mac-debug-ews (mac):
Output: http://webkit-queues.webkit.org/results/907418

New failing tests:
js/regexp-unicode.html
Comment 15 Build Bot 2016-03-01 12:48:33 PST
Created attachment 272588 [details]
Archive of layout-test-results from ews113 for mac-yosemite

The attached test failures were seen while running run-webkit-tests on the mac-debug-ews.
Bot: ews113  Port: mac-yosemite  Platform: Mac OS X 10.10.5
Comment 16 Michael Saboff 2016-03-01 13:48:37 PST
(In reply to comment #13)
> Comment on attachment 272583 [details]
> Updated patch with suggested changes and fixed test
> 
> Can has perf data?  I think it would be good to ensure that Octane,
> SunSpider, and Kraken are unaffected.  They all have some benchmarks that do
> regexp.  They don't do unicode regexp but this touches a lot of code so it's
> good to run some numbers.

Here are the perf results.  Basically neutral as expected.

Benchmark report for SunSpider, V8Spider, Octane, Kraken, and AsmBench on MSaboff-Mac-Pro (MacPro6,1).

VMs tested:
"Baseline" at /Volumes/Data/src/webkit.work/WebKitBuild/Release/jsc (r197407)
"UnicodeRegExp" at /Volumes/Data/src/webkit/WebKitBuild/Release/jsc (r197407)

Collected 4 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to
get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                                                 Baseline               UnicodeRegExp                                   
SunSpider:
   3d-cube                                    6.4724+-0.8792            6.3614+-0.0876          might be 1.0174x faster
   3d-morph                                   6.3130+-0.1148     ?      6.5566+-0.4467        ? might be 1.0386x slower
   3d-raytrace                                7.7560+-0.5404     ?      7.8207+-0.3313        ?
   access-binary-trees                        2.6569+-0.0798     ?      2.7383+-0.1564        ? might be 1.0306x slower
   access-fannkuch                            7.6680+-0.2533            7.6600+-0.0965        
   access-nbody                               3.5101+-0.1309            3.4995+-0.1809        
   access-nsieve                              3.8644+-0.2277            3.7520+-0.0856          might be 1.0299x faster
   bitops-3bit-bits-in-byte                   1.4796+-0.0536     ?      1.5477+-0.1197        ? might be 1.0460x slower
   bitops-bits-in-byte                        4.1686+-0.2108            4.0218+-0.2616          might be 1.0365x faster
   bitops-bitwise-and                         2.4490+-0.1286     ?      2.4545+-0.0324        ?
   bitops-nsieve-bits                         3.5945+-0.0594     ?      3.6099+-0.1084        ?
   controlflow-recursive                      2.9412+-0.0595     ?      2.9825+-0.1262        ? might be 1.0140x slower
   crypto-aes                                 5.3172+-0.1918            5.2432+-0.1258          might be 1.0141x faster
   crypto-md5                                 3.2131+-0.0773     ?      3.2817+-0.1208        ? might be 1.0214x slower
   crypto-sha1                                3.0765+-0.1001     ?      3.0995+-0.2107        ?
   date-format-tofte                         11.6428+-0.2000     ^     11.2335+-0.1568        ^ definitely 1.0364x faster
   date-format-xparb                          6.2512+-0.5117            5.9100+-0.2037          might be 1.0577x faster
   math-cordic                                3.7687+-0.3324            3.7471+-0.2365        
   math-partial-sums                          6.4432+-0.4497            6.2180+-0.0607          might be 1.0362x faster
   math-spectral-norm                         2.5435+-0.0952            2.4893+-0.0384          might be 1.0217x faster
   regexp-dna                                 7.5165+-0.2682     ?      7.6465+-0.3764        ? might be 1.0173x slower
   string-base64                              5.2545+-0.1909            5.2020+-0.1971          might be 1.0101x faster
   string-fasta                               7.2119+-0.3853            7.1312+-0.0509          might be 1.0113x faster
   string-tagcloud                           10.2274+-0.2720           10.0006+-0.2291          might be 1.0227x faster
   string-unpack-code                        22.2756+-1.0275           21.7809+-0.0654          might be 1.0227x faster
   string-validate-input                      5.0912+-0.3499     ?      5.1015+-0.2152        ?

   <arithmetic>                               5.8733+-0.0489            5.8112+-0.0380          might be 1.0107x faster

                                                 Baseline               UnicodeRegExp                                   
V8Spider:
   crypto                                    48.3868+-1.4824     ?     48.4963+-1.1098        ?
   deltablue                                 63.8084+-3.2429     ?     66.4055+-3.8843        ? might be 1.0407x slower
   earley-boyer                              50.0785+-2.0052           49.6696+-0.4981        
   raytrace                                  33.5266+-0.7163     ?     34.1307+-1.2938        ? might be 1.0180x slower
   regexp                                    79.0688+-0.4542     ?     79.6226+-1.0321        ?
   richards                                  52.8094+-0.6355     ?     53.4932+-0.2985        ? might be 1.0129x slower
   splay                                     37.4368+-1.1894           37.1642+-0.7933        

   <geometric>                               50.2547+-0.2066     ?     50.7207+-0.6958        ? might be 1.0093x slower

                                                 Baseline               UnicodeRegExp                                   
Octane:
   encrypt                                   0.19533+-0.00095    ?     0.19542+-0.00136       ?
   decrypt                                   3.43146+-0.02663    ?     3.43751+-0.01563       ?
   deltablue                        x2       0.16784+-0.00298    ?     0.16886+-0.00146       ?
   earley                                    0.36664+-0.00535          0.36503+-0.00355       
   boyer                                     5.73080+-0.04207    ?     5.75159+-0.03340       ?
   navier-stokes                    x2       5.46605+-0.02238    ?     5.47148+-0.02738       ?
   raytrace                         x2       1.11384+-0.01471    ?     1.11408+-0.01315       ?
   richards                         x2       0.10730+-0.00328    ?     0.10787+-0.00409       ?
   splay                            x2       0.39980+-0.00460    ?     0.40033+-0.00614       ?
   regexp                           x2      28.09828+-0.34750    ?    28.20451+-0.23852       ?
   pdfjs                            x2      48.37957+-1.13556    ?    48.50616+-0.50640       ?
   mandreel                         x2      53.37880+-0.57019         52.98871+-0.15902       
   gbemu                            x2      35.10795+-1.13756    ?    35.69578+-1.16941       ? might be 1.0167x slower
   closure                                   0.78557+-0.00891    ?     0.79597+-0.00862       ? might be 1.0132x slower
   jquery                                   10.28692+-0.40264         10.26504+-0.08987       
   box2d                            x2      12.58036+-0.23046    ?    12.62000+-0.10726       ?
   zlib                             x2     443.81559+-3.52429        440.90833+-2.70816       
   typescript                       x2     926.65057+-7.86089    ?   931.62952+-10.48346      ?

   <geometric>                               6.59642+-0.06192    ?     6.61303+-0.03362       ? might be 1.0025x slower

                                                 Baseline               UnicodeRegExp                                   
Kraken:
   ai-astar                                  119.603+-6.271      ?     119.866+-1.607         ?
   audio-beat-detection                       62.606+-1.965             60.947+-7.181           might be 1.0272x faster
   audio-dft                                 115.838+-8.254            112.920+-0.812           might be 1.0258x faster
   audio-fft                                  42.684+-0.546      ?      42.704+-0.464         ?
   audio-oscillator                           58.573+-0.372      !      59.853+-0.516         ! definitely 1.0219x slower
   imaging-darkroom                           73.757+-1.174      ?      73.980+-1.306         ?
   imaging-desaturate                         64.638+-1.382             64.232+-0.284         
   imaging-gaussian-blur                      95.134+-1.304             94.534+-0.352         
   json-parse-financial                       48.030+-0.356      ?      48.767+-2.001         ? might be 1.0153x slower
   json-stringify-tinderbox                   29.664+-0.210             29.356+-0.357           might be 1.0105x faster
   stanford-crypto-aes                        48.261+-2.271             46.923+-1.041           might be 1.0285x faster
   stanford-crypto-ccm                        47.622+-4.466      ?      47.958+-2.021         ?
   stanford-crypto-pbkdf2                    119.014+-1.475            116.867+-1.319           might be 1.0184x faster
   stanford-crypto-sha256-iterative           47.277+-0.969             46.415+-0.456           might be 1.0186x faster

   <arithmetic>                               69.479+-0.709             68.952+-0.539           might be 1.0076x faster

                                                 Baseline               UnicodeRegExp                                   
AsmBench:
   bigfib.cpp                               544.2454+-3.9047     ?    546.1682+-4.1859        ?
   cray.c                                   417.8620+-8.8990     ?    420.9171+-6.8331        ?
   dry.c                                    546.8600+-56.2474         536.1193+-40.9941         might be 1.0200x faster
   FloatMM.c                                796.4737+-12.3896         791.5008+-11.2293       
   gcc-loops.cpp                           4640.3712+-31.5267        4598.9880+-35.2771       
   n-body.c                                 981.2210+-5.6146     ?    990.9190+-18.4641       ?
   Quicksort.c                              452.5415+-13.1799         441.5269+-5.6493          might be 1.0249x faster
   stepanov_container.cpp                  3659.0862+-16.2502    ?   3660.7055+-31.3770       ?
   Towers.c                                 301.6083+-2.7183     ?    301.8635+-4.1961        ?

   <geometric>                              847.2058+-6.8675          843.7315+-6.0609          might be 1.0041x faster

                                                 Baseline               UnicodeRegExp                                   
Geomean of preferred means:
   <scaled-result>                           40.9108+-0.0954           40.8240+-0.1612          might be 1.0021x faster
Comment 17 Michael Saboff 2016-03-01 13:50:40 PST
(In reply to comment #14)
> Comment on attachment 272583 [details]
> Updated patch with suggested changes and fixed test
> 
> Attachment 272583 [details] did not pass mac-debug-ews (mac):
> Output: http://webkit-queues.webkit.org/results/907418
> 
> New failing tests:
> js/regexp-unicode.html

This test failure is actually due to an ASSERT that is being hit.  I filed <https://bugs.webkit.org/show_bug.cgi?id=154875> - "ASSERT in platform/graphics/mac/ComplexTextController.cpp::capitalize()".
Comment 18 Michael Saboff 2016-03-01 16:38:51 PST
Committed r197426: <http://trac.webkit.org/changeset/197426>
Comment 19 Ryan Haddad 2016-03-01 17:54:38 PST
This change may have caused an API test failure:

FAIL ContentExtensionTest.ParsingFailures

/Volumes/Data/slave/elcapitan-release/build/Tools/TestWebKitAPI/Tests/WebCore/ContentExtensions.cpp:1531
Value of: parser.addPattern(pattern, false, 0)
  Actual: 4
Expected: status
Which is: 0

<https://build.webkit.org/builders/Apple%20El%20Capitan%20Release%20WK1%20%28Tests%29/builds/3939>
Comment 20 Michael Saboff 2016-03-01 18:05:09 PST
(In reply to comment #19)
> This change may have caused an API test failure:
> 
> FAIL ContentExtensionTest.ParsingFailures
> 
> /Volumes/Data/slave/elcapitan-release/build/Tools/TestWebKitAPI/Tests/
> WebCore/ContentExtensions.cpp:1531
> Value of: parser.addPattern(pattern, false, 0)
>   Actual: 4
> Expected: status
> Which is: 0
> 
> <https://build.webkit.org/builders/
> Apple%20El%20Capitan%20Release%20WK1%20%28Tests%29/builds/3939>

I have a fix in the works.
Comment 21 Michael Saboff 2016-03-01 18:34:48 PST
(In reply to comment #20)
> (In reply to comment #19)
> > This change may have caused an API test failure:
> > 
> > FAIL ContentExtensionTest.ParsingFailures
> > 
> > /Volumes/Data/slave/elcapitan-release/build/Tools/TestWebKitAPI/Tests/
> > WebCore/ContentExtensions.cpp:1531
> > Value of: parser.addPattern(pattern, false, 0)
> >   Actual: 4
> > Expected: status
> > Which is: 0
> > 
> > <https://build.webkit.org/builders/
> > Apple%20El%20Capitan%20Release%20WK1%20%28Tests%29/builds/3939>
> 
> I have a fix in the works.

Tracked in <https://bugs.webkit.org/show_bug.cgi?id=154898>.

Fix working its way through commit queue.
Comment 22 Darin Adler 2016-03-02 09:34:50 PST
Comment on attachment 272583 [details]
Updated patch with suggested changes and fixed test

View in context: https://bugs.webkit.org/attachment.cgi?id=272583&action=review

Missed a few tiny improvements on my first review. I won’t insist on these, but I think they are slightly better:

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:212
> +            if (U16_IS_LEAD(result) && decodeSurrogatePairs && p + 1 < length
> +                && U16_IS_TRAIL(input[p + 1])) {

I think this would read better all on one line.

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:214
> +                if (atEnd())
> +                    return -1;

How can this ever be true? Do we have a test case that covers this?

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:222
> +        int readSurrogatePairChecked(unsigned negativePositionOffest)

Typo here: "offest" instead of offset.

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:232
> +            int first = input[p];
> +            if (U16_IS_LEAD(first) && U16_IS_TRAIL(input[p + 1]))
> +                return U16_GET_SUPPLEMENTARY(first, input[p + 1]);

Unclear why we put "first" into a local variable, but not input[p + 1]. Code would read better if we were consistent.

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:246
> +            int result = input[from];
> +            if (U16_IS_LEAD(result) && decodeSurrogatePairs && from + 1 < length
> +                && U16_IS_TRAIL(input[from + 1])) {
> +                
> +                result = U16_GET_SUPPLEMENTARY(result, input[from + 1]);
> +            }
> +            return result;

If it wasn’t for the decodeSurrogatePairs boolean, we could write this in a cleaner way using an ICU macro:

    UChar32 character;
    U16_GET(input, 0, from, length, character);
    return character;

Or we could write this:

    UChar32 character;
    if (decodeSurrogatePairs)
        U16_GET(input, 0, from, length, character);
    else
        character = input[from];
    return character;

Not 100% sure this is better, but I think it is.

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:322
> +        if (ch & 0x1FFF80) {

Missed this in my previous review. I would suggest instead writing this:

    if (!isASCII(ch)) {

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:439
> +                if (unicode && !U_IS_BMP(term.atom.patternCharacter))
> +                    input.uncheckInput(2);
> +                else
> +                    input.uncheckInput(1);

I think this should just be:

    input.uncheckInput(U16_LENGTH(term.atom.patternCharacter));

I do not think that checking the unicode boolean is necessary or helpful. I am pretty sure that there could never be a patternCharacter >0xFFFF if unicode was false. It’s nice to not have to think about whether the check of U_IS_BMP has the right sense and U16_LENGTH does that for us.

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1270
> +                // Case insensitive matching of unicode charaters are handled as TypeCharacterClass

Typo: charaters

Small grammatical error: "matching is handled" is correct; "matching are handled" is not.

Would be nice to use a period here as in the typical WebKit comment style.

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1293
> +            // Case insensitive matching of unicode charaters are handled as TypeCharacterClass

Ditto.

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1311
> +            // Case insensitive matching of unicode charaters are handled as TypeCharacterClass

Ditto.

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1622
> -            ASSERT(u_tolower(ch) <= 0xFFFF);
> -            ASSERT(u_toupper(ch) <= 0xFFFF);
> +            ASSERT(u_tolower(ch) <= UCHAR_MAX_VALUE);
> +            ASSERT(u_toupper(ch) <= UCHAR_MAX_VALUE);

I don’t think these assertions are valuable any more. The old assertions were asserting something incorrect, but they were valuable in the way they stated an (incorrect) assumption that the code below was making. The new assertions are asserting that ICU isn’t totally broken; it’s a little strange to assert that since I think there’s no super-direct connection with anything the code is assuming. We don’t assert that the passed-in character is <= UCHAR_MAX_VALUE, for example.

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1625
> +            UChar32 lo = u_tolower(ch);
> +            UChar32 hi = u_toupper(ch);

I suspect there are some cases of case folding that this won’t get exactly right (same problem with the old code, of course); I ran into this recently when doing some work on upper/lowercasing in the String class, where comparing against both uppercased and lowercased versions of a character are not the same as the correct case folding. I’ll try to research what test cases we’ll need to add to demonstrate that issue.

> Source/JavaScriptCore/yarr/YarrParser.h:427
> +                    if (!WTF::isASCIIHexDigit(peek()))

No need for WTF:: here.

> Source/JavaScriptCore/yarr/YarrParser.h:430
> +                    codePoint = (codePoint << 4) | WTF::toASCIIHexValue(consume());

No need for WTF:: here.

> Source/JavaScriptCore/yarr/YarrParser.h:432
> +                    if (codePoint > UCHAR_MAX_VALUE)

I’m interested in knowing whether all we want to do is check the maximum value, or if we need the more complete check for a valid Unicode character, U_IS_UNICODE_CHAR. We should make test cases to help us understand that. Examples of values that U_IS_UNICODE_CHAR returns no for:

    D800-DFFF
    FDD0-FDEF
    FFFE-FFFF

We should add test cases for those, and make sure that either they are intentionally allowed, or intentionally not allowed. Right now we are allowing them and I am not are why.

On reflection, thought, If we do want to check U_IS_UNICODE_CHAR, we’ll probably have to do it outside the loop, and keep the UCHAR_MAX_VALUE check inside the loop.

> Source/JavaScriptCore/yarr/YarrParser.h:463
>                  delegate.atomPatternCharacter(u);

Here we allow unpaired surrogates in the pattern. Do we want to allow them? Probably boils down to the same question: Do we need a U_IS_UNICODE_CHAR check here? Need test cases that try to include unpaired surrogates and characters like FFFF and FFFE.

> Source/JavaScriptCore/yarr/YarrPattern.cpp:72
>          if (ch <= 0x7f) {

I suggest using isASCII here.

> Source/JavaScriptCore/yarr/YarrPattern.cpp:111
>          if (lo <= 0x7f) {

I suggest using isASCII here.

> Source/JavaScriptCore/yarr/YarrPattern.cpp:123
>          if (hi <= 0x7f)

I suggest using isASCII here.

> Source/JavaScriptCore/yarr/YarrPattern.cpp:193
> +        addSorted(ch <= 0x7f ? m_matches : m_matchesUnicode, ch);

Again here I would suggest using isASCII:

    addSorted(isASCII(ch) ? m_matches : m_matchesUnicode, ch);

> Source/JavaScriptCore/yarr/YarrPattern.cpp:606
> +                    currentInputPosition += (!U_IS_BMP(term.patternCharacter) ? 2 : 1) * term.quantityCount;

There’s a better ICU macro for this:

    currentInputPosition += U16_LENGTH(term.patternCharacter) * term.quantityCount;

Using that macro makes it clearer we didn’t do the ternary backwards by accident, for example.
Comment 23 Michael Saboff 2016-03-02 17:08:44 PST
(In reply to comment #22)
> Comment on attachment 272583 [details]
> Updated patch with suggested changes and fixed test

Thanks for the suggestions.  Most of these I will take care of if a follow on patch to increase testing of corner cases and tighter parsing mandated by the unicode flag.

Below are my replies to the comments for items that I don't think should change or are more involved.

> View in context:
> https://bugs.webkit.org/attachment.cgi?id=272583&action=review
> 
> Missed a few tiny improvements on my first review. I won’t insist on these,
> but I think they are slightly better:
> 
> > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:214
> > +                if (atEnd())
> > +                    return -1;
> 
> How can this ever be true? Do we have a test case that covers this?

This is a subtle case.  The input is "checked" in that we know the minimum characters that are required to match.  This "if atEnd() then return -1" is a real case and there are many tests for it.  Here is one:
    shouldBeFalse('/(?:a|\u{10123}|b)x/u.test("\u{10123}")');

Here is an overview of why this is needed.  Consider the example regexp that involves multiple alternative where a one alternative is longer than the others.  The one alternative will check a longer input, thus advancing "pos" farther than the other alternatives.  We can read a surrogate pair and try to advance past the end of the input string because of the longer alternative.  With this check that doesn't happen, we backtrack and start with the next alternative.  For the test above, the checked value is 2 as well as the subject string length.  When we go to match the second alternative and read out unicode character, we would call "next()" without this check and crash.  Instead we fail that alternative and go on to the next one.  In this test we never match, but other tests we match via other alternatives.
 
> > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:246
> > +            int result = input[from];
> > +            if (U16_IS_LEAD(result) && decodeSurrogatePairs && from + 1 < length
> > +                && U16_IS_TRAIL(input[from + 1])) {
> > +                
> > +                result = U16_GET_SUPPLEMENTARY(result, input[from + 1]);
> > +            }
> > +            return result;
> 
> If it wasn’t for the decodeSurrogatePairs boolean, we could write this in a
> cleaner way using an ICU macro:
> 
>     UChar32 character;
>     U16_GET(input, 0, from, length, character);
>     return character;
> 
> Or we could write this:
> 
>     UChar32 character;
>     if (decodeSurrogatePairs)
>         U16_GET(input, 0, from, length, character);
>     else
>         character = input[from];
>     return character;
> 
> Not 100% sure this is better, but I think it is.

I'm mulling this change.
 
...
> > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1625
> > +            UChar32 lo = u_tolower(ch);
> > +            UChar32 hi = u_toupper(ch);
> 
> I suspect there are some cases of case folding that this won’t get exactly
> right (same problem with the old code, of course); I ran into this recently
> when doing some work on upper/lowercasing in the String class, where
> comparing against both uppercased and lowercased versions of a character are
> not the same as the correct case folding. I’ll try to research what test
> cases we’ll need to add to demonstrate that issue.

This code probably needs some work.
 
> > Source/JavaScriptCore/yarr/YarrParser.h:432
> > +                    if (codePoint > UCHAR_MAX_VALUE)
> 
> I’m interested in knowing whether all we want to do is check the maximum
> value, or if we need the more complete check for a valid Unicode character,
> U_IS_UNICODE_CHAR. We should make test cases to help us understand that.
> Examples of values that U_IS_UNICODE_CHAR returns no for:
> 
>     D800-DFFF
>     FDD0-FDEF
>     FFFE-FFFF
> 
> We should add test cases for those, and make sure that either they are
> intentionally allowed, or intentionally not allowed. Right now we are
> allowing them and I am not are why.
> 
> On reflection, thought, If we do want to check U_IS_UNICODE_CHAR, we’ll
> probably have to do it outside the loop, and keep the UCHAR_MAX_VALUE check
> inside the loop.

Here we only need to check the maximum value.  The ES6 standard requires that we match dangling surrogates as well as other hex patterns that don't correspond to Unicode code points.

> > Source/JavaScriptCore/yarr/YarrParser.h:463
> >                  delegate.atomPatternCharacter(u);
> 
> Here we allow unpaired surrogates in the pattern. Do we want to allow them?
> Probably boils down to the same question: Do we need a U_IS_UNICODE_CHAR
> check here? Need test cases that try to include unpaired surrogates and
> characters like FFFF and FFFE.

Yes, the standard requires it.
Comment 24 Mathias Bynens 2016-03-30 12:54:34 PDT
Looks like this was a duplicate of https://bugs.webkit.org/show_bug.cgi?id=151597 which is now about solving the bugs in the current implementation.