RESOLVED FIXED 50295
Regexp JIT'ed Code Contains Branches to Branches for Backtracking
https://bugs.webkit.org/show_bug.cgi?id=50295
Summary Regexp JIT'ed Code Contains Branches to Branches for Backtracking
Michael Saboff
Reported 2010-11-30 16:47:43 PST
The regular expression JIT compiler currently generates code one term at a time. A term may contain backtracking logic to handle trying another alternative of a parenthetical subexpression. The generated code often contains sequences of branches to branches. There are also branches around some logic. If that logic was moved some of those jumps could be eliminated. This is one of the issues covered by <rdar://problem/8511767>. Here is a symbolic version of the code generated for /(a|A)(b|B)/ push %rbp mov %rsp,%rbp push %rbx mov %esi,(%rcx) sub $0x10,%rsp add $0x2,%esi cmp %edx,%esi ja nextChar loop: mov %rsi,%rax add $0xfffffffffffffffe,%eax mov %eax,0x8(%rcx) cmpw $0x61,-0x4(%rdi,%rsi,2) jne tryAlt1A mov $tryAlt1A,%r11 mov %r11,(%rsp) jmpq term1Match tryAlt1A: cmpw $0x41,-0x4(%rdi,%rsi,2) jne term1AFail // Jump to a jump mov $term1AFail,%r11 mov %r11,(%rsp) jmpq term1Match // Jump around two other jumps term1AFail: jmpq term1Fail backTrackTerm1Select: jmpq *(%rsp) term1Match: mov %rsi,%rax add $0xffffffffffffffff,%eax mov %eax,0xc(%rcx) jmpq startTerm2 // Jump around "fail" code backTrackTerm1: jmpq backTrackTerm1Select // Jump to an indirect jump term1Fail: movl $0xffffffff,0x8(%rcx) jmpq nextChar startTerm2: mov %rsi,%rax add $0xffffffffffffffff,%eax mov %eax,0x10(%rcx) cmpw $0x62,-0x2(%rdi,%rsi,2) jne tryAlt2B mov $tryAlt2B,%r11 mov %r11,0x8(%rsp) jmpq term2Match tryAlt2B: cmpw $0x42,-0x2(%rdi,%rsi,2) jne term2BFail mov $term2BFail,%r11 mov %r11,0x8(%rsp) jmpq term2Match // Jump around jumps term2BFail: jmpq term2Fail backTrackTerm2Select: jmpq *0x8(%rsp) // Unused code term2Match: mov %esi,0x14(%rcx) jmpq matchSuccess // Jump around "fail" logic jmpq backTrackTerm2Select // Jump to an indirect jump - also not used!! term2Fail: movl $0xffffffff,0x10(%rcx) jmpq backTrackTerm1 // << Jump to a jump to an indirect jump matchSuccess: add $0x10,%rsp mov (%rcx),%eax mov %esi,0x4(%rcx) pop %rbx pop %rbp retq nextChar: mov %rsi,%rax sub $0x1,%eax mov %eax,(%rcx) add $0x1,%esi cmp %edx,%esi jbe loop add $0x10,%rsp mov $0xffffffff,%eax pop %rbx pop %rbp retq
Attachments
Patch to remove most paren related branches to branches (81.23 KB, patch)
2010-11-30 17:03 PST, Michael Saboff
no flags
Updated Proposed Patch (89.62 KB, patch)
2010-12-01 14:41 PST, Michael Saboff
no flags
Updated patch addressing reviewers concerns and fixing a bug. (89.91 KB, patch)
2010-12-02 15:48 PST, Michael Saboff
barraclough: review+
commit-queue: commit-queue-
Patch with updates for latest top of trunk. (89.78 KB, patch)
2010-12-03 10:18 PST, Michael Saboff
no flags
Michael Saboff
Comment 1 2010-11-30 17:03:51 PST
Created attachment 75232 [details] Patch to remove most paren related branches to branches This patch propagates branches from within parentheses backtracking code to later alternatives. It also emits the parentheses failure code at the end of the routine. It handles indirect branches as backtracking destinations and emits a trampoline to an indirect branch when needed. Some cleanups were done as part of this patch. Specifically the invertOrCapture flag was separated into two. The code generated by this patch for the same regular expression in the description is: push %rbp mov %rsp,%rbp push %rbx mov %esi,(%rcx) sub $0x10,%rsp add $0x2,%esi cmp %edx,%esi ja nextChar loop: mov %rsi,%rax add $0xfffffffffffffffe,%eax mov %eax,0x8(%rcx) cmpw $0x61,-0x4(%rdi,%rsi,2) jne tryAlt1A mov $tryAlt1A,%r11 mov %r11,(%rsp) jmpq term1Match tryAlt1A: cmpw $0x41,-0x4(%rdi,%rsi,2) jne term1Fail mov $term1Fail,%r11 mov %r11,(%rsp) jmpq term1Match term1Match: mov %rsi,%rax add $0xffffffffffffffff,%eax mov %eax,0xc(%rcx) mov %rsi,%rax add $0xffffffffffffffff,%eax mov %eax,0x10(%rcx) cmpw $0x62,-0x2(%rdi,%rsi,2) jne tryAlt2B mov $tryAlt2B,%r11 mov %r11,0x8(%rsp) jmpq term2Match tryAlt2B: cmpw $0x42,-0x2(%rdi,%rsi,2) jne term2Fail mov $term2Fail,%r11 mov %r11,0x8(%rsp) jmpq term2Match term2Match: mov %esi,0x14(%rcx) add $0x10,%rsp mov (%rcx),%eax mov %esi,0x4(%rcx) pop %rbx pop %rbp retq nextChar: mov %rsi,%rax sub $0x1,%eax mov %eax,(%rcx) add $0x1,%esi cmp %edx,%esi jbe loop: add $0x10,%rsp mov $0xffffffff,%eax pop %rbx pop %rbp retq term2Fail: movl $0xffffffff,0x10(%rcx) jmpq *(%rsp) term1Fail: movl $0xffffffff,0x8(%rcx) jmpq nextChar
Sam Weinig
Comment 2 2010-12-01 08:34:09 PST
Comment on attachment 75232 [details] Patch to remove most paren related branches to branches View in context: https://bugs.webkit.org/attachment.cgi?id=75232&action=review > JavaScriptCore/yarr/RegexJIT.cpp:295 > + struct OffsetHashTraits : GenericHashTraits<int> { > + static const bool emptyValueIsZero = false; > + static const bool needsDestruction = false; > + static int emptyValue() { return -1; } > + }; As discussed in person, I think it would be better to store the stack offset as a uint32_t and use UnsignedWithZeroKeyHashTraits<uint32_t> from HashTraits.h instead.
Michael Saboff
Comment 3 2010-12-01 14:41:01 PST
Created attachment 75323 [details] Updated Proposed Patch This patch addresses the concern raised by the prior patch and now uses UnsignedWithZeroKeyHashTraits<uint32_t>. Fixed the jump to next instruction in the success case. Fixed a bug in the updated fast/regex/pcre tests. Added ChangeLogs.
Michael Saboff
Comment 4 2010-12-02 10:28:53 PST
Comment on attachment 75323 [details] Updated Proposed Patch working on a bug. Will post an updated patch when available.
Gavin Barraclough
Comment 5 2010-12-02 13:38:41 PST
Hi Michael, The patch generally looks fine, few small issues. (1) You use the variable name 'reGen', but our style guidelines do not allow abbreviations. I suggest renaming to 'generator'. (2) You add a couple of methods with names prefixed 'have' to BacktrackDestination – throughout JSC we tend to use 'has' for these kinds of method names. Though objectively equivalent to each other, for consistency I think we should stick to 'has'. (3) You add a method used() to DataLabelPtr – in the long run, I'm not sure this is safe, so I'd rather not take this step at this point. From a security perspective, we'll want to harder the assembler against integer overflow. I think logically in the assembler we want all 32bit values to be valid (otherwise converting an offset to a label would require a guard). So we should not be stealing bits from these offsets. I'm aware that in label we already do so, but I think we should consider that a bug to be fixed, rather than a design pattern to copy. (On the other hand I'm fine with exposing the isSet() method from the assembler). Rather than stealing bits from DataLabelPtr, there are bits you could steal from within BacktrackDestination – BacktrackType doesn't need to be using a full 32-bits, you could steal from here, and this would reduce the size of the patch. (4) Also in BacktrackDestination - setDataLabelPtr. I think this method could do with a slightly more descriptive name, that describes more why you are calling it rather than what it does. The DataLabelPtr is recording a store instruction that we be be repatched to write the address to be jumped to to resume this disjunction when matching, right? Errk, not sure how succinct and descriptive a name we can come up with for this, I think setBacktrackStoreLocation would be an improvement, feel free to do better. (5) Finally, in these cases the extra braces around the else statements should be removed. 996 } else { 997 m_backtrack.setBacktrackJumpList(&jumpsToNext); 998 } 999 } else { 1000 nextBacktrackFallThrough = false; 1001 } 1504 } else if (state.isLastAlternative()) { 1505 propogateBacktrack = true; 1506 } cheers, G.
Michael Saboff
Comment 6 2010-12-02 15:48:49 PST
Created attachment 75425 [details] Updated patch addressing reviewers concerns and fixing a bug. Fixed a bug where non fixed sized paren backtrack handling was going to the outside of the paren code instead of the inside the paren code and a related issue were some nested paren handling wasn't saving the right backtrack address on the stack. Added test cases for this fix. Addressed the concerns raised from the prior patch. For point 3, changed the dataLabelPtr.isUsed() to .isSet(), which is more in keeping with the logic. For point 4, changed BacktrackDestination::setDataLabelPtr to setBacktrackDataLabel.
Gavin Barraclough
Comment 7 2010-12-02 16:52:21 PST
Comment on attachment 75425 [details] Updated patch addressing reviewers concerns and fixing a bug. Looks great Michael.
WebKit Commit Bot
Comment 8 2010-12-02 22:18:28 PST
Comment on attachment 75425 [details] Updated patch addressing reviewers concerns and fixing a bug. Rejecting patch 75425 from commit-queue. Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=abarth-cq-sl', 'apply-attachment', '--force-clean', '--non-interactive', 75425]" exit_code: 2 Last 500 characters of output: et -15 lines). patching file JavaScriptCore/yarr/RegexPattern.h Hunk #5 succeeded at 166 with fuzz 2. patching file LayoutTests/ChangeLog Hunk #1 succeeded at 1 with fuzz 3. patching file LayoutTests/fast/regex/parentheses-expected.txt patching file LayoutTests/fast/regex/parentheses.html patching file LayoutTests/fast/regex/script-tests/parentheses.js Failed to run "[u'/Users/abarth/git/webkit-queue/WebKitTools/Scripts/svn-apply', u'--reviewer', u'Gavin Barraclough', u'--force']" exit_code: 1 Full output: http://queues.webkit.org/results/6804018
Michael Saboff
Comment 9 2010-12-03 10:18:59 PST
Created attachment 75507 [details] Patch with updates for latest top of trunk.
WebKit Commit Bot
Comment 10 2010-12-03 14:48:40 PST
Comment on attachment 75507 [details] Patch with updates for latest top of trunk. Clearing flags on attachment: 75507 Committed r73307: <http://trac.webkit.org/changeset/73307>
WebKit Commit Bot
Comment 11 2010-12-03 14:48:48 PST
All reviewed patches have been landed. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.