RESOLVED FIXED180874
YARR: JIT RegExps with back references
https://bugs.webkit.org/show_bug.cgi?id=180874
Summary YARR: JIT RegExps with back references
Michael Saboff
Reported 2017-12-15 11:30:58 PST
Currently, patterns with back references, e.g. /(a)\1/, are not supported in the Yarr JIT. We should add support for some or all back references. It should be straightforward to support back references without case folding. JIT support for case folding with 8 bit characters shouldn't be to hard as the case folding table is a reasonable size. JIT support of 16-bit and/or Unicode patterns will require calling out to C++ for case folding.
Attachments
Patch (26.38 KB, patch)
2018-08-31 14:55 PDT, Michael Saboff
no flags
Updated pPatch with speculative build fixes (28.16 KB, patch)
2018-09-01 09:03 PDT, Michael Saboff
fpizlo: review+
Radar WebKit Bug Importer
Comment 1 2017-12-15 11:35:50 PST
Michael Saboff
Comment 2 2018-08-31 14:55:44 PDT
Michael Saboff
Comment 3 2018-09-01 09:03:48 PDT
Created attachment 348713 [details] Updated pPatch with speculative build fixes
EWS Watchlist
Comment 4 2018-09-01 09:06:07 PDT
Attachment 348713 [details] did not pass style-queue: ERROR: Source/JavaScriptCore/yarr/YarrJIT.cpp:3818: Multi line control clauses should use braces. [whitespace/braces] [4] ERROR: Source/JavaScriptCore/runtime/RegExp.cpp:308: Multi line control clauses should use braces. [whitespace/braces] [4] ERROR: Source/JavaScriptCore/runtime/RegExp.cpp:368: Multi line control clauses should use braces. [whitespace/braces] [4] Total errors found: 3 in 13 files If any of these errors are false positives, please file a bug against check-webkit-style.
Michael Saboff
Comment 5 2018-09-04 10:23:11 PDT
Comment on attachment 348713 [details] Updated pPatch with speculative build fixes View in context: https://bugs.webkit.org/attachment.cgi?id=348713&action=review > Source/JavaScriptCore/assembler/MacroAssemblerX86Common.h:1288 > + I'll revert this whitespace change.
Michael Saboff
Comment 6 2018-09-04 14:19:00 PDT
Saam Barati
Comment 7 2018-09-04 14:51:26 PDT
Comment on attachment 348713 [details] Updated pPatch with speculative build fixes View in context: https://bugs.webkit.org/attachment.cgi?id=348713&action=review LGTM too. It would be nice to add some more tests for some of the trickier tests in the JIT. > Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp:548 > + 0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0x39c, 0xb6, 0xb7, 0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf, Any idea what 0x39c is? > Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp:552 > + 0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xf7, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0x178 ditto about 0x178? Can you add tests for these? > Source/JavaScriptCore/yarr/YarrJIT.cpp:1135 > + readCharacterDontDecodeSurrogates(0, patternCharacter, patternIndex); > + readCharacterDontDecodeSurrogates(m_checkedOffset - term->inputPosition, character); Indentation is wrong. > Source/JavaScriptCore/yarr/YarrJIT.cpp:1181 > + load32(Address(output, (subpatternId << 1) * sizeof(int)), patternIndex); > + load32(Address(output, ((subpatternId << 1) + 1) * sizeof(int)), patternTemp); Why "<< 1" everywhere instead of "* 2"? > Source/JavaScriptCore/yarr/YarrJIT.cpp:1263 > + load32(Address(output, (subpatternId << 1) * sizeof(int)), patternIndex); > + load32(Address(output, ((subpatternId << 1) + 1) * sizeof(int)), patternTemp); Why not just use TimesFour for these addresses instead of "* sizeof(int)"? Wouldn't we end up with better instruction encoding?
Michael Saboff
Comment 8 2018-09-04 16:21:17 PDT
Comment on attachment 348713 [details] Updated pPatch with speculative build fixes View in context: https://bugs.webkit.org/attachment.cgi?id=348713&action=review >> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp:548 >> + 0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0x39c, 0xb6, 0xb7, 0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf, > > Any idea what 0x39c is? That is the appropriate toUpper() result for 0xb5. >> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp:552 >> + 0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xf7, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0x178 > > ditto about 0x178? > > Can you add tests for these? That is the correct toUpper() result for 0xff. I'll add tests for these two cases. >> Source/JavaScriptCore/yarr/YarrJIT.cpp:1135 >> + readCharacterDontDecodeSurrogates(m_checkedOffset - term->inputPosition, character); > > Indentation is wrong. I'll land a fix for this. >> Source/JavaScriptCore/yarr/YarrJIT.cpp:1181 >> + load32(Address(output, ((subpatternId << 1) + 1) * sizeof(int)), patternTemp); > > Why "<< 1" everywhere instead of "* 2"? Following the convention already in this file. >> Source/JavaScriptCore/yarr/YarrJIT.cpp:1263 >> + load32(Address(output, ((subpatternId << 1) + 1) * sizeof(int)), patternTemp); > > Why not just use TimesFour for these addresses instead of "* sizeof(int)"? Wouldn't we end up with better instruction encoding? The (subpatternId << 1) * sizeof(int)) becomes an offset of subpattern * 2 * 4. The instruction coding is fine (see last two instructions): 7:OpTerm BackReference pattern #1 {0,...} non-greedy ... 0x29db3e40904e: jmp 0x29db3e4090ba 0x29db3e409053: mov 0x8(%rcx), %r9d 0x29db3e409057: mov 0xc(%rcx), %r10d
Michael Saboff
Comment 9 2018-09-04 17:59:05 PDT
(In reply to Michael Saboff from comment #8) > Comment on attachment 348713 [details] > >> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp:552 > >> + 0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xf7, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0x178 > > > > ditto about 0x178? > > > > Can you add tests for these? New tests tracked in https://bugs.webkit.org/show_bug.cgi?id=189291.
Note You need to log in before you can comment on or make changes to this bug.