RESOLVED FIXED 228810
[JSC] Yarr's character tracking for BoyerMoore search should be more precise
https://bugs.webkit.org/show_bug.cgi?id=228810
Summary [JSC] Yarr's character tracking for BoyerMoore search should be more precise
Yusuke Suzuki
Reported 2021-08-04 20:34:33 PDT
[JSC] Yarr's character tracking for BoyerMoore search should be more precise
Attachments
Patch (14.19 KB, patch)
2021-08-04 20:37 PDT, Yusuke Suzuki
no flags
Patch (14.44 KB, patch)
2021-08-04 23:31 PDT, Yusuke Suzuki
saam: review+
Yusuke Suzuki
Comment 1 2021-08-04 20:37:00 PDT
Yusuke Suzuki
Comment 2 2021-08-04 23:31:44 PDT
Saam Barati
Comment 3 2021-08-05 15:14:32 PDT
Comment on attachment 434965 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434965&action=review r=me > Source/JavaScriptCore/ChangeLog:8 > + We should track candidate characters without masking up to 2 so let's fix this first sentence to be more clear like you suggested on slack. > Source/JavaScriptCore/ChangeLog:12 > + We also remove m_table giving up. This is because m_table can be used "We also remove m_table giving up...." => "We also add support for m_table because m_table can be used" > Source/JavaScriptCore/ChangeLog:17 > + To make `\s` work on Char8 case, we use Char8 / Char16 information > + effectively in BoyreMoore search bitmap construction. how so? It'd be helpful to share more detail here, like, what makes it "effective" > Source/JavaScriptCore/yarr/YarrJIT.cpp:2414 > + auto [map, characters] = op.m_bmInfo->createCandidateBitmap(beginIndex, endIndex); nit: "characters" is a generic name given this context. Similar suggestion to naming the class. maybe "charactersFastPath" or something? > Source/JavaScriptCore/yarr/YarrJIT.cpp:2429 > UChar32 character2 = 0xff; > - if (mapCount == 2) { > - character2 = map.findBit(character1 + 1, true); > - ASSERT(character2 != BoyerMooreBitmap::Map::size()); > - } > - dataLogLnIf(Options::verboseRegExpCompilation(), "Found 1-or-2 characters lookahead character:(0x", hex(character1), "),character2:(", hex(character2), "),isMaskEffective:(", isMaskEffective,"),range:[", beginIndex, ", ", endIndex, ")"); > + if (characters.size() > 1) > + character2 = characters.at(1); > + dataLogLnIf(Options::verboseRegExpCompilation(), "Found ", characters.size(), " characters lookahead character:(0x", hex(character1), "),character2:(", hex(character2), "),range:[", beginIndex, ", ", endIndex, ")"); Can all of the "character2", "characters.size() > 1" just be moved into "characters.size() > 1" branch below? I think that'd make things a lot more clear. No need to set to 0xff, etc. > Source/JavaScriptCore/yarr/YarrJIT.h:63 > +class BoyerMooreCharacterCandidates { this name feels weirdly generic. I wonder if we can go with "BoyerMooreFastCase" or "BoyerMooreFastCandidates" or something similar that's just a bit more descriptive of what we're aiming for? Since this isn't really a comprehensive list of "character candidates".
Saam Barati
Comment 4 2021-08-05 15:14:52 PDT
Comment on attachment 434965 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434965&action=review > Source/JavaScriptCore/yarr/YarrJIT.h:145 > + // Early return since characters are sorted. can we debug assert they're sorted?
Yusuke Suzuki
Comment 5 2021-08-06 16:55:13 PDT
Comment on attachment 434965 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434965&action=review Thanks >> Source/JavaScriptCore/ChangeLog:8 >> + We should track candidate characters without masking up to 2 so > > let's fix this first sentence to be more clear like you suggested on slack. Fixed. >> Source/JavaScriptCore/ChangeLog:12 >> + We also remove m_table giving up. This is because m_table can be used > > "We also remove m_table giving up...." => "We also add support for m_table because m_table can be used" Fixed. >> Source/JavaScriptCore/ChangeLog:17 >> + effectively in BoyreMoore search bitmap construction. > > how so? It'd be helpful to share more detail here, like, what makes it "effective" Fixed. >> Source/JavaScriptCore/yarr/YarrJIT.cpp:2414 >> + auto [map, characters] = op.m_bmInfo->createCandidateBitmap(beginIndex, endIndex); > > nit: "characters" is a generic name given this context. Similar suggestion to naming the class. maybe "charactersFastPath" or something? Fixed. >> Source/JavaScriptCore/yarr/YarrJIT.cpp:2429 >> + dataLogLnIf(Options::verboseRegExpCompilation(), "Found ", characters.size(), " characters lookahead character:(0x", hex(character1), "),character2:(", hex(character2), "),range:[", beginIndex, ", ", endIndex, ")"); > > Can all of the "character2", "characters.size() > 1" just be moved into "characters.size() > 1" branch below? I think that'd make things a lot more clear. No need to set to 0xff, etc. Fixed. >> Source/JavaScriptCore/yarr/YarrJIT.h:63 >> +class BoyerMooreCharacterCandidates { > > this name feels weirdly generic. I wonder if we can go with "BoyerMooreFastCase" or "BoyerMooreFastCandidates" or something similar that's just a bit more descriptive of what we're aiming for? Since this isn't really a comprehensive list of "character candidates". Fixed. >> Source/JavaScriptCore/yarr/YarrJIT.h:145 >> + // Early return since characters are sorted. > > can we debug assert they're sorted? Added.
Yusuke Suzuki
Comment 6 2021-08-06 17:03:18 PDT
Radar WebKit Bug Importer
Comment 7 2021-08-06 17:04:22 PDT
Yusuke Suzuki
Comment 8 2021-08-06 17:26:36 PDT
Note You need to log in before you can comment on or make changes to this bug.