[JSC] Yarr's character tracking for BoyerMoore search should be more precise
Created attachment 434963 [details] Patch
Created attachment 434965 [details] Patch
Comment on attachment 434965 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434965&action=review r=me > Source/JavaScriptCore/ChangeLog:8 > + We should track candidate characters without masking up to 2 so let's fix this first sentence to be more clear like you suggested on slack. > Source/JavaScriptCore/ChangeLog:12 > + We also remove m_table giving up. This is because m_table can be used "We also remove m_table giving up...." => "We also add support for m_table because m_table can be used" > Source/JavaScriptCore/ChangeLog:17 > + To make `\s` work on Char8 case, we use Char8 / Char16 information > + effectively in BoyreMoore search bitmap construction. how so? It'd be helpful to share more detail here, like, what makes it "effective" > Source/JavaScriptCore/yarr/YarrJIT.cpp:2414 > + auto [map, characters] = op.m_bmInfo->createCandidateBitmap(beginIndex, endIndex); nit: "characters" is a generic name given this context. Similar suggestion to naming the class. maybe "charactersFastPath" or something? > Source/JavaScriptCore/yarr/YarrJIT.cpp:2429 > UChar32 character2 = 0xff; > - if (mapCount == 2) { > - character2 = map.findBit(character1 + 1, true); > - ASSERT(character2 != BoyerMooreBitmap::Map::size()); > - } > - dataLogLnIf(Options::verboseRegExpCompilation(), "Found 1-or-2 characters lookahead character:(0x", hex(character1), "),character2:(", hex(character2), "),isMaskEffective:(", isMaskEffective,"),range:[", beginIndex, ", ", endIndex, ")"); > + if (characters.size() > 1) > + character2 = characters.at(1); > + dataLogLnIf(Options::verboseRegExpCompilation(), "Found ", characters.size(), " characters lookahead character:(0x", hex(character1), "),character2:(", hex(character2), "),range:[", beginIndex, ", ", endIndex, ")"); Can all of the "character2", "characters.size() > 1" just be moved into "characters.size() > 1" branch below? I think that'd make things a lot more clear. No need to set to 0xff, etc. > Source/JavaScriptCore/yarr/YarrJIT.h:63 > +class BoyerMooreCharacterCandidates { this name feels weirdly generic. I wonder if we can go with "BoyerMooreFastCase" or "BoyerMooreFastCandidates" or something similar that's just a bit more descriptive of what we're aiming for? Since this isn't really a comprehensive list of "character candidates".
Comment on attachment 434965 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434965&action=review > Source/JavaScriptCore/yarr/YarrJIT.h:145 > + // Early return since characters are sorted. can we debug assert they're sorted?
Comment on attachment 434965 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434965&action=review Thanks >> Source/JavaScriptCore/ChangeLog:8 >> + We should track candidate characters without masking up to 2 so > > let's fix this first sentence to be more clear like you suggested on slack. Fixed. >> Source/JavaScriptCore/ChangeLog:12 >> + We also remove m_table giving up. This is because m_table can be used > > "We also remove m_table giving up...." => "We also add support for m_table because m_table can be used" Fixed. >> Source/JavaScriptCore/ChangeLog:17 >> + effectively in BoyreMoore search bitmap construction. > > how so? It'd be helpful to share more detail here, like, what makes it "effective" Fixed. >> Source/JavaScriptCore/yarr/YarrJIT.cpp:2414 >> + auto [map, characters] = op.m_bmInfo->createCandidateBitmap(beginIndex, endIndex); > > nit: "characters" is a generic name given this context. Similar suggestion to naming the class. maybe "charactersFastPath" or something? Fixed. >> Source/JavaScriptCore/yarr/YarrJIT.cpp:2429 >> + dataLogLnIf(Options::verboseRegExpCompilation(), "Found ", characters.size(), " characters lookahead character:(0x", hex(character1), "),character2:(", hex(character2), "),range:[", beginIndex, ", ", endIndex, ")"); > > Can all of the "character2", "characters.size() > 1" just be moved into "characters.size() > 1" branch below? I think that'd make things a lot more clear. No need to set to 0xff, etc. Fixed. >> Source/JavaScriptCore/yarr/YarrJIT.h:63 >> +class BoyerMooreCharacterCandidates { > > this name feels weirdly generic. I wonder if we can go with "BoyerMooreFastCase" or "BoyerMooreFastCandidates" or something similar that's just a bit more descriptive of what we're aiming for? Since this isn't really a comprehensive list of "character candidates". Fixed. >> Source/JavaScriptCore/yarr/YarrJIT.h:145 >> + // Early return since characters are sorted. > > can we debug assert they're sorted? Added.
Committed r280744 (240330@main): <https://commits.webkit.org/240330@main>
<rdar://problem/81638511>
Committed r280745 (240331@main): <https://commits.webkit.org/240331@main>