Bug 180874 - YARR: JIT RegExps with back references
Summary: YARR: JIT RegExps with back references
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: Other
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Michael Saboff
URL:
Keywords: InRadar
Depends on: 189291
Blocks: 179230
  Show dependency treegraph
 
Reported: 2017-12-15 11:30 PST by Michael Saboff
Modified: 2018-09-04 17:59 PDT (History)
10 users (show)

See Also:


Attachments
Patch (26.38 KB, patch)
2018-08-31 14:55 PDT, Michael Saboff
no flags Details | Formatted Diff | Diff
Updated pPatch with speculative build fixes (28.16 KB, patch)
2018-09-01 09:03 PDT, Michael Saboff
fpizlo: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Saboff 2017-12-15 11:30:58 PST
Currently, patterns with back references, e.g. /(a)\1/, are not supported in the Yarr JIT.  We should add support for some or all back references.  It should be straightforward to support back references without case folding.  JIT support for case folding with 8 bit characters shouldn't be to hard as the case folding table is a reasonable size.  JIT support of 16-bit and/or Unicode patterns will require calling out to C++ for case folding.
Comment 1 Radar WebKit Bug Importer 2017-12-15 11:35:50 PST
<rdar://problem/36078115>
Comment 2 Michael Saboff 2018-08-31 14:55:44 PDT
Created attachment 348676 [details]
Patch
Comment 3 Michael Saboff 2018-09-01 09:03:48 PDT
Created attachment 348713 [details]
Updated pPatch with speculative build fixes
Comment 4 EWS Watchlist 2018-09-01 09:06:07 PDT
Attachment 348713 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/yarr/YarrJIT.cpp:3818:  Multi line control clauses should use braces.  [whitespace/braces] [4]
ERROR: Source/JavaScriptCore/runtime/RegExp.cpp:308:  Multi line control clauses should use braces.  [whitespace/braces] [4]
ERROR: Source/JavaScriptCore/runtime/RegExp.cpp:368:  Multi line control clauses should use braces.  [whitespace/braces] [4]
Total errors found: 3 in 13 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 5 Michael Saboff 2018-09-04 10:23:11 PDT
Comment on attachment 348713 [details]
Updated pPatch with speculative build fixes

View in context: https://bugs.webkit.org/attachment.cgi?id=348713&action=review

> Source/JavaScriptCore/assembler/MacroAssemblerX86Common.h:1288
> +

I'll revert this whitespace change.
Comment 6 Michael Saboff 2018-09-04 14:19:00 PDT
Committed r235636: <https://trac.webkit.org/changeset/235636>
Comment 7 Saam Barati 2018-09-04 14:51:26 PDT
Comment on attachment 348713 [details]
Updated pPatch with speculative build fixes

View in context: https://bugs.webkit.org/attachment.cgi?id=348713&action=review

LGTM too. It would be nice to add some more tests for some of the trickier tests in the JIT.

> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp:548
> +    0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0x39c, 0xb6, 0xb7, 0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf,

Any idea what 0x39c is?

> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp:552
> +    0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xf7, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0x178

ditto about 0x178?

Can you add tests for these?

> Source/JavaScriptCore/yarr/YarrJIT.cpp:1135
> +            readCharacterDontDecodeSurrogates(0, patternCharacter, patternIndex);
> +            readCharacterDontDecodeSurrogates(m_checkedOffset - term->inputPosition, character);

Indentation is wrong.

> Source/JavaScriptCore/yarr/YarrJIT.cpp:1181
> +            load32(Address(output, (subpatternId << 1) * sizeof(int)), patternIndex);
> +            load32(Address(output, ((subpatternId << 1) + 1) * sizeof(int)), patternTemp);

Why "<< 1" everywhere instead of "* 2"?

> Source/JavaScriptCore/yarr/YarrJIT.cpp:1263
> +            load32(Address(output, (subpatternId << 1) * sizeof(int)), patternIndex);
> +            load32(Address(output, ((subpatternId << 1) + 1) * sizeof(int)), patternTemp);

Why not just use TimesFour for these addresses instead of "* sizeof(int)"? Wouldn't we end up with better instruction encoding?
Comment 8 Michael Saboff 2018-09-04 16:21:17 PDT
Comment on attachment 348713 [details]
Updated pPatch with speculative build fixes

View in context: https://bugs.webkit.org/attachment.cgi?id=348713&action=review

>> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp:548
>> +    0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0x39c, 0xb6, 0xb7, 0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf,
> 
> Any idea what 0x39c is?

That is the appropriate toUpper() result for 0xb5.

>> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp:552
>> +    0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xf7, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0x178
> 
> ditto about 0x178?
> 
> Can you add tests for these?

That is the correct toUpper() result for 0xff.

I'll add tests for these two cases.

>> Source/JavaScriptCore/yarr/YarrJIT.cpp:1135
>> +            readCharacterDontDecodeSurrogates(m_checkedOffset - term->inputPosition, character);
> 
> Indentation is wrong.

I'll land a fix for this.

>> Source/JavaScriptCore/yarr/YarrJIT.cpp:1181
>> +            load32(Address(output, ((subpatternId << 1) + 1) * sizeof(int)), patternTemp);
> 
> Why "<< 1" everywhere instead of "* 2"?

Following the convention already in this file.

>> Source/JavaScriptCore/yarr/YarrJIT.cpp:1263
>> +            load32(Address(output, ((subpatternId << 1) + 1) * sizeof(int)), patternTemp);
> 
> Why not just use TimesFour for these addresses instead of "* sizeof(int)"? Wouldn't we end up with better instruction encoding?

The (subpatternId << 1) * sizeof(int)) becomes an offset of subpattern * 2 * 4. The instruction coding is fine (see last two instructions):
   7:OpTerm BackReference pattern #1 {0,...} non-greedy
        ...
        0x29db3e40904e: jmp 0x29db3e4090ba
        0x29db3e409053: mov 0x8(%rcx), %r9d
        0x29db3e409057: mov 0xc(%rcx), %r10d
Comment 9 Michael Saboff 2018-09-04 17:59:05 PDT
(In reply to Michael Saboff from comment #8)
> Comment on attachment 348713 [details]

> >> Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp:552
> >> +    0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xf7, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0x178
> > 
> > ditto about 0x178?
> > 
> > Can you add tests for these?

New tests tracked in https://bugs.webkit.org/show_bug.cgi?id=189291.