Add support for the ‘u’ flags and the processing of Unicode regular expressions as described in the EcmaScript 6 standard, section 21.2. See https://tc39.github.io/ecma262/#sec-regexp-regular-expression-objects <rdar://problem/24898527>
Created attachment 272536 [details] Patch This is a complete implementation in the Yarr interpreter. I plan on adding a few more tests, however I believe that the new functionality is tested. The test I'm thinking about adding would test code that wasn't changed, like counted, greedy and non-greedy back references. This also passes the Chakra tests with slightly different results. Those differences appear to be due to using different versions of the unicode standard. On Mac OS X, we use 7.0, while the latest version is 8.0.
Attachment 272536 [details] did not pass style-queue: ERROR: Source/JavaScriptCore/yarr/YarrSyntaxChecker.cpp:38: Missing space inside { }. [whitespace/braces] [5] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:38: One space before end of line comments [whitespace/comments] [5] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:39: One space before end of line comments [whitespace/comments] [5] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:40: One space before end of line comments [whitespace/comments] [5] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:41: One space before end of line comments [whitespace/comments] [5] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:42: One space before end of line comments [whitespace/comments] [5] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:52: UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:56: UNICODE_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:60: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] ERROR: Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1231: One line control clauses should not use braces. [whitespace/braces] [4] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:51: UCS2_CANONICALIZATION_SETS is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:70: UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:527: UNICODE_CANONICALIZATION_SETS is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:592: UNICODE_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] Total errors found: 14 in 33 files If any of these errors are false positives, please file a bug against check-webkit-style.
Comment on attachment 272536 [details] Patch Attachment 272536 [details] did not pass mac-ews (mac): Output: http://webkit-queues.webkit.org/results/903723 New failing tests: js/regexp-unicode.html
Created attachment 272540 [details] Archive of layout-test-results from ews101 for mac-yosemite The attached test failures were seen while running run-webkit-tests on the mac-ews. Bot: ews101 Port: mac-yosemite Platform: Mac OS X 10.10.5
Comment on attachment 272536 [details] Patch Attachment 272536 [details] did not pass mac-wk2-ews (mac-wk2): Output: http://webkit-queues.webkit.org/results/903741 New failing tests: js/regexp-unicode.html
Created attachment 272542 [details] Archive of layout-test-results from ews104 for mac-yosemite-wk2 The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews. Bot: ews104 Port: mac-yosemite-wk2 Platform: Mac OS X 10.10.5
Comment on attachment 272536 [details] Patch Attachment 272536 [details] did not pass mac-debug-ews (mac): Output: http://webkit-queues.webkit.org/results/903733 New failing tests: js/regexp-unicode.html
Created attachment 272543 [details] Archive of layout-test-results from ews114 for mac-yosemite The attached test failures were seen while running run-webkit-tests on the mac-debug-ews. Bot: ews114 Port: mac-yosemite Platform: Mac OS X 10.10.5
Comment on attachment 272536 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=272536&action=review > Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:89 > -inline UChar getCanonicalPair(const UCS2CanonicalizationRange* info, UChar ch) > + inline UChar32 getCanonicalPair(const CanonicalizationRange* info, UChar32 ch) Stray leading space got in there. > Source/JavaScriptCore/yarr/YarrPattern.h:45 > +inline bool isFirstSurrogate(UChar c) > +{ > + return (c & 0xdc00) == 0xd800; > +} > + > +inline bool isSecondSurrogate(UChar c) > +{ > + return (c & 0xdc00) == 0xdc00; > +} It’s not good to add these functions. In WebKit code we normally use the ICU library, and it has macros for this: U16_IS_LEAD and U16_IS_TRAIL. We use those elsewhere in the project. In fact, these functions are not correct; isFirstSurrogate will return true for some non-surrogate values such as 0xF800. We’d be better off using the ICU macro. > Source/JavaScriptCore/yarr/YarrPattern.h:55 > +inline UChar32 toUnicode(UChar surrogate1, UChar surrogate2) > +{ > + ASSERT(isFirstSurrogate(surrogate1)); > + ASSERT(isSecondSurrogate(surrogate2)); > + UChar32 result = 0x10000 +((surrogate1 & 0x3ff) << 10) | (surrogate2 & 0x3ff); > + ASSERT(result <= UCHAR_MAX_VALUE); > + > + return result; > +} Same issue here. ICU’s macro for this is U16_GET_SUPPLEMENTARY. I suggest having our code use the macro directly as we do in many places in WebCore. > Source/JavaScriptCore/yarr/YarrPattern.h:60 > +inline bool isExtendedUnicode(UChar32 ch) > +{ > + return ch >= 0x10000; > +} The ICU macro for this is U_IS_BMP, but it’s got a reverse sense so we’d have to use !U_IS_BMP. There’s also U16_LENGTH which is good if you are trying to convert this into either a 1 or a 2 like in the QuantifierGreedy call site, and U_IS_SUPPLEMENTARY, which could also be used in some cases; it’s almost the same thing as !U_IS_BMP.
Found a bug in WebKitTestRunner / DumpRenderTree with the added tests. Filed https://bugs.webkit.org/show_bug.cgi?id=154863.
Created attachment 272583 [details] Updated patch with suggested changes and fixed test
Attachment 272583 [details] did not pass style-queue: ERROR: Source/JavaScriptCore/yarr/YarrSyntaxChecker.cpp:38: Missing space inside { }. [whitespace/braces] [5] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:38: One space before end of line comments [whitespace/comments] [5] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:39: One space before end of line comments [whitespace/comments] [5] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:40: One space before end of line comments [whitespace/comments] [5] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:41: One space before end of line comments [whitespace/comments] [5] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:42: One space before end of line comments [whitespace/comments] [5] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:52: UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:56: UNICODE_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h:60: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] ERROR: Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1231: One line control clauses should not use braces. [whitespace/braces] [4] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:51: UCS2_CANONICALIZATION_SETS is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:70: UCS2_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:527: UNICODE_CANONICALIZATION_SETS is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] ERROR: Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp:592: UNICODE_CANONICALIZATION_RANGES is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] Total errors found: 14 in 33 files If any of these errors are false positives, please file a bug against check-webkit-style.
Comment on attachment 272583 [details] Updated patch with suggested changes and fixed test Can has perf data? I think it would be good to ensure that Octane, SunSpider, and Kraken are unaffected. They all have some benchmarks that do regexp. They don't do unicode regexp but this touches a lot of code so it's good to run some numbers.
Comment on attachment 272583 [details] Updated patch with suggested changes and fixed test Attachment 272583 [details] did not pass mac-debug-ews (mac): Output: http://webkit-queues.webkit.org/results/907418 New failing tests: js/regexp-unicode.html
Created attachment 272588 [details] Archive of layout-test-results from ews113 for mac-yosemite The attached test failures were seen while running run-webkit-tests on the mac-debug-ews. Bot: ews113 Port: mac-yosemite Platform: Mac OS X 10.10.5
(In reply to comment #13) > Comment on attachment 272583 [details] > Updated patch with suggested changes and fixed test > > Can has perf data? I think it would be good to ensure that Octane, > SunSpider, and Kraken are unaffected. They all have some benchmarks that do > regexp. They don't do unicode regexp but this touches a lot of code so it's > good to run some numbers. Here are the perf results. Basically neutral as expected. Benchmark report for SunSpider, V8Spider, Octane, Kraken, and AsmBench on MSaboff-Mac-Pro (MacPro6,1). VMs tested: "Baseline" at /Volumes/Data/src/webkit.work/WebKitBuild/Release/jsc (r197407) "UnicodeRegExp" at /Volumes/Data/src/webkit/WebKitBuild/Release/jsc (r197407) Collected 4 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. Baseline UnicodeRegExp SunSpider: 3d-cube 6.4724+-0.8792 6.3614+-0.0876 might be 1.0174x faster 3d-morph 6.3130+-0.1148 ? 6.5566+-0.4467 ? might be 1.0386x slower 3d-raytrace 7.7560+-0.5404 ? 7.8207+-0.3313 ? access-binary-trees 2.6569+-0.0798 ? 2.7383+-0.1564 ? might be 1.0306x slower access-fannkuch 7.6680+-0.2533 7.6600+-0.0965 access-nbody 3.5101+-0.1309 3.4995+-0.1809 access-nsieve 3.8644+-0.2277 3.7520+-0.0856 might be 1.0299x faster bitops-3bit-bits-in-byte 1.4796+-0.0536 ? 1.5477+-0.1197 ? might be 1.0460x slower bitops-bits-in-byte 4.1686+-0.2108 4.0218+-0.2616 might be 1.0365x faster bitops-bitwise-and 2.4490+-0.1286 ? 2.4545+-0.0324 ? bitops-nsieve-bits 3.5945+-0.0594 ? 3.6099+-0.1084 ? controlflow-recursive 2.9412+-0.0595 ? 2.9825+-0.1262 ? might be 1.0140x slower crypto-aes 5.3172+-0.1918 5.2432+-0.1258 might be 1.0141x faster crypto-md5 3.2131+-0.0773 ? 3.2817+-0.1208 ? might be 1.0214x slower crypto-sha1 3.0765+-0.1001 ? 3.0995+-0.2107 ? date-format-tofte 11.6428+-0.2000 ^ 11.2335+-0.1568 ^ definitely 1.0364x faster date-format-xparb 6.2512+-0.5117 5.9100+-0.2037 might be 1.0577x faster math-cordic 3.7687+-0.3324 3.7471+-0.2365 math-partial-sums 6.4432+-0.4497 6.2180+-0.0607 might be 1.0362x faster math-spectral-norm 2.5435+-0.0952 2.4893+-0.0384 might be 1.0217x faster regexp-dna 7.5165+-0.2682 ? 7.6465+-0.3764 ? might be 1.0173x slower string-base64 5.2545+-0.1909 5.2020+-0.1971 might be 1.0101x faster string-fasta 7.2119+-0.3853 7.1312+-0.0509 might be 1.0113x faster string-tagcloud 10.2274+-0.2720 10.0006+-0.2291 might be 1.0227x faster string-unpack-code 22.2756+-1.0275 21.7809+-0.0654 might be 1.0227x faster string-validate-input 5.0912+-0.3499 ? 5.1015+-0.2152 ? <arithmetic> 5.8733+-0.0489 5.8112+-0.0380 might be 1.0107x faster Baseline UnicodeRegExp V8Spider: crypto 48.3868+-1.4824 ? 48.4963+-1.1098 ? deltablue 63.8084+-3.2429 ? 66.4055+-3.8843 ? might be 1.0407x slower earley-boyer 50.0785+-2.0052 49.6696+-0.4981 raytrace 33.5266+-0.7163 ? 34.1307+-1.2938 ? might be 1.0180x slower regexp 79.0688+-0.4542 ? 79.6226+-1.0321 ? richards 52.8094+-0.6355 ? 53.4932+-0.2985 ? might be 1.0129x slower splay 37.4368+-1.1894 37.1642+-0.7933 <geometric> 50.2547+-0.2066 ? 50.7207+-0.6958 ? might be 1.0093x slower Baseline UnicodeRegExp Octane: encrypt 0.19533+-0.00095 ? 0.19542+-0.00136 ? decrypt 3.43146+-0.02663 ? 3.43751+-0.01563 ? deltablue x2 0.16784+-0.00298 ? 0.16886+-0.00146 ? earley 0.36664+-0.00535 0.36503+-0.00355 boyer 5.73080+-0.04207 ? 5.75159+-0.03340 ? navier-stokes x2 5.46605+-0.02238 ? 5.47148+-0.02738 ? raytrace x2 1.11384+-0.01471 ? 1.11408+-0.01315 ? richards x2 0.10730+-0.00328 ? 0.10787+-0.00409 ? splay x2 0.39980+-0.00460 ? 0.40033+-0.00614 ? regexp x2 28.09828+-0.34750 ? 28.20451+-0.23852 ? pdfjs x2 48.37957+-1.13556 ? 48.50616+-0.50640 ? mandreel x2 53.37880+-0.57019 52.98871+-0.15902 gbemu x2 35.10795+-1.13756 ? 35.69578+-1.16941 ? might be 1.0167x slower closure 0.78557+-0.00891 ? 0.79597+-0.00862 ? might be 1.0132x slower jquery 10.28692+-0.40264 10.26504+-0.08987 box2d x2 12.58036+-0.23046 ? 12.62000+-0.10726 ? zlib x2 443.81559+-3.52429 440.90833+-2.70816 typescript x2 926.65057+-7.86089 ? 931.62952+-10.48346 ? <geometric> 6.59642+-0.06192 ? 6.61303+-0.03362 ? might be 1.0025x slower Baseline UnicodeRegExp Kraken: ai-astar 119.603+-6.271 ? 119.866+-1.607 ? audio-beat-detection 62.606+-1.965 60.947+-7.181 might be 1.0272x faster audio-dft 115.838+-8.254 112.920+-0.812 might be 1.0258x faster audio-fft 42.684+-0.546 ? 42.704+-0.464 ? audio-oscillator 58.573+-0.372 ! 59.853+-0.516 ! definitely 1.0219x slower imaging-darkroom 73.757+-1.174 ? 73.980+-1.306 ? imaging-desaturate 64.638+-1.382 64.232+-0.284 imaging-gaussian-blur 95.134+-1.304 94.534+-0.352 json-parse-financial 48.030+-0.356 ? 48.767+-2.001 ? might be 1.0153x slower json-stringify-tinderbox 29.664+-0.210 29.356+-0.357 might be 1.0105x faster stanford-crypto-aes 48.261+-2.271 46.923+-1.041 might be 1.0285x faster stanford-crypto-ccm 47.622+-4.466 ? 47.958+-2.021 ? stanford-crypto-pbkdf2 119.014+-1.475 116.867+-1.319 might be 1.0184x faster stanford-crypto-sha256-iterative 47.277+-0.969 46.415+-0.456 might be 1.0186x faster <arithmetic> 69.479+-0.709 68.952+-0.539 might be 1.0076x faster Baseline UnicodeRegExp AsmBench: bigfib.cpp 544.2454+-3.9047 ? 546.1682+-4.1859 ? cray.c 417.8620+-8.8990 ? 420.9171+-6.8331 ? dry.c 546.8600+-56.2474 536.1193+-40.9941 might be 1.0200x faster FloatMM.c 796.4737+-12.3896 791.5008+-11.2293 gcc-loops.cpp 4640.3712+-31.5267 4598.9880+-35.2771 n-body.c 981.2210+-5.6146 ? 990.9190+-18.4641 ? Quicksort.c 452.5415+-13.1799 441.5269+-5.6493 might be 1.0249x faster stepanov_container.cpp 3659.0862+-16.2502 ? 3660.7055+-31.3770 ? Towers.c 301.6083+-2.7183 ? 301.8635+-4.1961 ? <geometric> 847.2058+-6.8675 843.7315+-6.0609 might be 1.0041x faster Baseline UnicodeRegExp Geomean of preferred means: <scaled-result> 40.9108+-0.0954 40.8240+-0.1612 might be 1.0021x faster
(In reply to comment #14) > Comment on attachment 272583 [details] > Updated patch with suggested changes and fixed test > > Attachment 272583 [details] did not pass mac-debug-ews (mac): > Output: http://webkit-queues.webkit.org/results/907418 > > New failing tests: > js/regexp-unicode.html This test failure is actually due to an ASSERT that is being hit. I filed <https://bugs.webkit.org/show_bug.cgi?id=154875> - "ASSERT in platform/graphics/mac/ComplexTextController.cpp::capitalize()".
Committed r197426: <http://trac.webkit.org/changeset/197426>
This change may have caused an API test failure: FAIL ContentExtensionTest.ParsingFailures /Volumes/Data/slave/elcapitan-release/build/Tools/TestWebKitAPI/Tests/WebCore/ContentExtensions.cpp:1531 Value of: parser.addPattern(pattern, false, 0) Actual: 4 Expected: status Which is: 0 <https://build.webkit.org/builders/Apple%20El%20Capitan%20Release%20WK1%20%28Tests%29/builds/3939>
(In reply to comment #19) > This change may have caused an API test failure: > > FAIL ContentExtensionTest.ParsingFailures > > /Volumes/Data/slave/elcapitan-release/build/Tools/TestWebKitAPI/Tests/ > WebCore/ContentExtensions.cpp:1531 > Value of: parser.addPattern(pattern, false, 0) > Actual: 4 > Expected: status > Which is: 0 > > <https://build.webkit.org/builders/ > Apple%20El%20Capitan%20Release%20WK1%20%28Tests%29/builds/3939> I have a fix in the works.
(In reply to comment #20) > (In reply to comment #19) > > This change may have caused an API test failure: > > > > FAIL ContentExtensionTest.ParsingFailures > > > > /Volumes/Data/slave/elcapitan-release/build/Tools/TestWebKitAPI/Tests/ > > WebCore/ContentExtensions.cpp:1531 > > Value of: parser.addPattern(pattern, false, 0) > > Actual: 4 > > Expected: status > > Which is: 0 > > > > <https://build.webkit.org/builders/ > > Apple%20El%20Capitan%20Release%20WK1%20%28Tests%29/builds/3939> > > I have a fix in the works. Tracked in <https://bugs.webkit.org/show_bug.cgi?id=154898>. Fix working its way through commit queue.
Comment on attachment 272583 [details] Updated patch with suggested changes and fixed test View in context: https://bugs.webkit.org/attachment.cgi?id=272583&action=review Missed a few tiny improvements on my first review. I won’t insist on these, but I think they are slightly better: > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:212 > + if (U16_IS_LEAD(result) && decodeSurrogatePairs && p + 1 < length > + && U16_IS_TRAIL(input[p + 1])) { I think this would read better all on one line. > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:214 > + if (atEnd()) > + return -1; How can this ever be true? Do we have a test case that covers this? > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:222 > + int readSurrogatePairChecked(unsigned negativePositionOffest) Typo here: "offest" instead of offset. > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:232 > + int first = input[p]; > + if (U16_IS_LEAD(first) && U16_IS_TRAIL(input[p + 1])) > + return U16_GET_SUPPLEMENTARY(first, input[p + 1]); Unclear why we put "first" into a local variable, but not input[p + 1]. Code would read better if we were consistent. > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:246 > + int result = input[from]; > + if (U16_IS_LEAD(result) && decodeSurrogatePairs && from + 1 < length > + && U16_IS_TRAIL(input[from + 1])) { > + > + result = U16_GET_SUPPLEMENTARY(result, input[from + 1]); > + } > + return result; If it wasn’t for the decodeSurrogatePairs boolean, we could write this in a cleaner way using an ICU macro: UChar32 character; U16_GET(input, 0, from, length, character); return character; Or we could write this: UChar32 character; if (decodeSurrogatePairs) U16_GET(input, 0, from, length, character); else character = input[from]; return character; Not 100% sure this is better, but I think it is. > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:322 > + if (ch & 0x1FFF80) { Missed this in my previous review. I would suggest instead writing this: if (!isASCII(ch)) { > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:439 > + if (unicode && !U_IS_BMP(term.atom.patternCharacter)) > + input.uncheckInput(2); > + else > + input.uncheckInput(1); I think this should just be: input.uncheckInput(U16_LENGTH(term.atom.patternCharacter)); I do not think that checking the unicode boolean is necessary or helpful. I am pretty sure that there could never be a patternCharacter >0xFFFF if unicode was false. It’s nice to not have to think about whether the check of U_IS_BMP has the right sense and U16_LENGTH does that for us. > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1270 > + // Case insensitive matching of unicode charaters are handled as TypeCharacterClass Typo: charaters Small grammatical error: "matching is handled" is correct; "matching are handled" is not. Would be nice to use a period here as in the typical WebKit comment style. > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1293 > + // Case insensitive matching of unicode charaters are handled as TypeCharacterClass Ditto. > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1311 > + // Case insensitive matching of unicode charaters are handled as TypeCharacterClass Ditto. > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1622 > - ASSERT(u_tolower(ch) <= 0xFFFF); > - ASSERT(u_toupper(ch) <= 0xFFFF); > + ASSERT(u_tolower(ch) <= UCHAR_MAX_VALUE); > + ASSERT(u_toupper(ch) <= UCHAR_MAX_VALUE); I don’t think these assertions are valuable any more. The old assertions were asserting something incorrect, but they were valuable in the way they stated an (incorrect) assumption that the code below was making. The new assertions are asserting that ICU isn’t totally broken; it’s a little strange to assert that since I think there’s no super-direct connection with anything the code is assuming. We don’t assert that the passed-in character is <= UCHAR_MAX_VALUE, for example. > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1625 > + UChar32 lo = u_tolower(ch); > + UChar32 hi = u_toupper(ch); I suspect there are some cases of case folding that this won’t get exactly right (same problem with the old code, of course); I ran into this recently when doing some work on upper/lowercasing in the String class, where comparing against both uppercased and lowercased versions of a character are not the same as the correct case folding. I’ll try to research what test cases we’ll need to add to demonstrate that issue. > Source/JavaScriptCore/yarr/YarrParser.h:427 > + if (!WTF::isASCIIHexDigit(peek())) No need for WTF:: here. > Source/JavaScriptCore/yarr/YarrParser.h:430 > + codePoint = (codePoint << 4) | WTF::toASCIIHexValue(consume()); No need for WTF:: here. > Source/JavaScriptCore/yarr/YarrParser.h:432 > + if (codePoint > UCHAR_MAX_VALUE) I’m interested in knowing whether all we want to do is check the maximum value, or if we need the more complete check for a valid Unicode character, U_IS_UNICODE_CHAR. We should make test cases to help us understand that. Examples of values that U_IS_UNICODE_CHAR returns no for: D800-DFFF FDD0-FDEF FFFE-FFFF We should add test cases for those, and make sure that either they are intentionally allowed, or intentionally not allowed. Right now we are allowing them and I am not are why. On reflection, thought, If we do want to check U_IS_UNICODE_CHAR, we’ll probably have to do it outside the loop, and keep the UCHAR_MAX_VALUE check inside the loop. > Source/JavaScriptCore/yarr/YarrParser.h:463 > delegate.atomPatternCharacter(u); Here we allow unpaired surrogates in the pattern. Do we want to allow them? Probably boils down to the same question: Do we need a U_IS_UNICODE_CHAR check here? Need test cases that try to include unpaired surrogates and characters like FFFF and FFFE. > Source/JavaScriptCore/yarr/YarrPattern.cpp:72 > if (ch <= 0x7f) { I suggest using isASCII here. > Source/JavaScriptCore/yarr/YarrPattern.cpp:111 > if (lo <= 0x7f) { I suggest using isASCII here. > Source/JavaScriptCore/yarr/YarrPattern.cpp:123 > if (hi <= 0x7f) I suggest using isASCII here. > Source/JavaScriptCore/yarr/YarrPattern.cpp:193 > + addSorted(ch <= 0x7f ? m_matches : m_matchesUnicode, ch); Again here I would suggest using isASCII: addSorted(isASCII(ch) ? m_matches : m_matchesUnicode, ch); > Source/JavaScriptCore/yarr/YarrPattern.cpp:606 > + currentInputPosition += (!U_IS_BMP(term.patternCharacter) ? 2 : 1) * term.quantityCount; There’s a better ICU macro for this: currentInputPosition += U16_LENGTH(term.patternCharacter) * term.quantityCount; Using that macro makes it clearer we didn’t do the ternary backwards by accident, for example.
(In reply to comment #22) > Comment on attachment 272583 [details] > Updated patch with suggested changes and fixed test Thanks for the suggestions. Most of these I will take care of if a follow on patch to increase testing of corner cases and tighter parsing mandated by the unicode flag. Below are my replies to the comments for items that I don't think should change or are more involved. > View in context: > https://bugs.webkit.org/attachment.cgi?id=272583&action=review > > Missed a few tiny improvements on my first review. I won’t insist on these, > but I think they are slightly better: > > > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:214 > > + if (atEnd()) > > + return -1; > > How can this ever be true? Do we have a test case that covers this? This is a subtle case. The input is "checked" in that we know the minimum characters that are required to match. This "if atEnd() then return -1" is a real case and there are many tests for it. Here is one: shouldBeFalse('/(?:a|\u{10123}|b)x/u.test("\u{10123}")'); Here is an overview of why this is needed. Consider the example regexp that involves multiple alternative where a one alternative is longer than the others. The one alternative will check a longer input, thus advancing "pos" farther than the other alternatives. We can read a surrogate pair and try to advance past the end of the input string because of the longer alternative. With this check that doesn't happen, we backtrack and start with the next alternative. For the test above, the checked value is 2 as well as the subject string length. When we go to match the second alternative and read out unicode character, we would call "next()" without this check and crash. Instead we fail that alternative and go on to the next one. In this test we never match, but other tests we match via other alternatives. > > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:246 > > + int result = input[from]; > > + if (U16_IS_LEAD(result) && decodeSurrogatePairs && from + 1 < length > > + && U16_IS_TRAIL(input[from + 1])) { > > + > > + result = U16_GET_SUPPLEMENTARY(result, input[from + 1]); > > + } > > + return result; > > If it wasn’t for the decodeSurrogatePairs boolean, we could write this in a > cleaner way using an ICU macro: > > UChar32 character; > U16_GET(input, 0, from, length, character); > return character; > > Or we could write this: > > UChar32 character; > if (decodeSurrogatePairs) > U16_GET(input, 0, from, length, character); > else > character = input[from]; > return character; > > Not 100% sure this is better, but I think it is. I'm mulling this change. ... > > Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1625 > > + UChar32 lo = u_tolower(ch); > > + UChar32 hi = u_toupper(ch); > > I suspect there are some cases of case folding that this won’t get exactly > right (same problem with the old code, of course); I ran into this recently > when doing some work on upper/lowercasing in the String class, where > comparing against both uppercased and lowercased versions of a character are > not the same as the correct case folding. I’ll try to research what test > cases we’ll need to add to demonstrate that issue. This code probably needs some work. > > Source/JavaScriptCore/yarr/YarrParser.h:432 > > + if (codePoint > UCHAR_MAX_VALUE) > > I’m interested in knowing whether all we want to do is check the maximum > value, or if we need the more complete check for a valid Unicode character, > U_IS_UNICODE_CHAR. We should make test cases to help us understand that. > Examples of values that U_IS_UNICODE_CHAR returns no for: > > D800-DFFF > FDD0-FDEF > FFFE-FFFF > > We should add test cases for those, and make sure that either they are > intentionally allowed, or intentionally not allowed. Right now we are > allowing them and I am not are why. > > On reflection, thought, If we do want to check U_IS_UNICODE_CHAR, we’ll > probably have to do it outside the loop, and keep the UCHAR_MAX_VALUE check > inside the loop. Here we only need to check the maximum value. The ES6 standard requires that we match dangling surrogates as well as other hex patterns that don't correspond to Unicode code points. > > Source/JavaScriptCore/yarr/YarrParser.h:463 > > delegate.atomPatternCharacter(u); > > Here we allow unpaired surrogates in the pattern. Do we want to allow them? > Probably boils down to the same question: Do we need a U_IS_UNICODE_CHAR > check here? Need test cases that try to include unpaired surrogates and > characters like FFFF and FFFE. Yes, the standard requires it.
Looks like this was a duplicate of https://bugs.webkit.org/show_bug.cgi?id=151597 which is now about solving the bugs in the current implementation.