Bug 228810 - [JSC] Yarr's character tracking for BoyerMoore search should be more precise
Summary: [JSC] Yarr's character tracking for BoyerMoore search should be more precise
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2021-08-04 20:34 PDT by Yusuke Suzuki
Modified: 2021-08-06 17:26 PDT (History)
7 users (show)

See Also:


Attachments
Patch (14.19 KB, patch)
2021-08-04 20:37 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (14.44 KB, patch)
2021-08-04 23:31 PDT, Yusuke Suzuki
saam: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Yusuke Suzuki 2021-08-04 20:34:33 PDT
[JSC] Yarr's character tracking for BoyerMoore search should be more precise
Comment 1 Yusuke Suzuki 2021-08-04 20:37:00 PDT
Created attachment 434963 [details]
Patch
Comment 2 Yusuke Suzuki 2021-08-04 23:31:44 PDT
Created attachment 434965 [details]
Patch
Comment 3 Saam Barati 2021-08-05 15:14:32 PDT
Comment on attachment 434965 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434965&action=review

r=me

> Source/JavaScriptCore/ChangeLog:8
> +        We should track candidate characters without masking up to 2 so

let's fix this first sentence to be more clear like you suggested on slack.

> Source/JavaScriptCore/ChangeLog:12
> +        We also remove m_table giving up. This is because m_table can be used

"We also remove m_table giving up...." => "We also add support for m_table because m_table can be used"

> Source/JavaScriptCore/ChangeLog:17
> +        To make `\s` work on Char8 case, we use Char8 / Char16 information
> +        effectively in BoyreMoore search bitmap construction.

how so? It'd be helpful to share more detail here, like, what makes it "effective"

> Source/JavaScriptCore/yarr/YarrJIT.cpp:2414
> +                        auto [map, characters] = op.m_bmInfo->createCandidateBitmap(beginIndex, endIndex);

nit: "characters" is a generic name given this context. Similar suggestion to naming the class. maybe "charactersFastPath" or something?

> Source/JavaScriptCore/yarr/YarrJIT.cpp:2429
>                              UChar32 character2 = 0xff;
> -                            if (mapCount == 2) {
> -                                character2 = map.findBit(character1 + 1, true);
> -                                ASSERT(character2 != BoyerMooreBitmap::Map::size());
> -                            }
> -                            dataLogLnIf(Options::verboseRegExpCompilation(), "Found 1-or-2 characters lookahead character:(0x", hex(character1), "),character2:(", hex(character2), "),isMaskEffective:(", isMaskEffective,"),range:[", beginIndex, ", ", endIndex, ")");
> +                            if (characters.size() > 1)
> +                                character2 = characters.at(1);
> +                            dataLogLnIf(Options::verboseRegExpCompilation(), "Found ", characters.size(), " characters lookahead character:(0x", hex(character1), "),character2:(", hex(character2), "),range:[", beginIndex, ", ", endIndex, ")");

Can all of the "character2", "characters.size() > 1" just be moved into "characters.size() > 1" branch below? I think that'd make things a lot more clear. No need to set to 0xff, etc.

> Source/JavaScriptCore/yarr/YarrJIT.h:63
> +class BoyerMooreCharacterCandidates {

this name feels weirdly generic. I wonder if we can go with "BoyerMooreFastCase" or "BoyerMooreFastCandidates" or something similar that's just a bit more descriptive of what we're aiming for? Since this isn't really a comprehensive list of "character candidates".
Comment 4 Saam Barati 2021-08-05 15:14:52 PDT
Comment on attachment 434965 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434965&action=review

> Source/JavaScriptCore/yarr/YarrJIT.h:145
> +            // Early return since characters are sorted.

can we debug assert they're sorted?
Comment 5 Yusuke Suzuki 2021-08-06 16:55:13 PDT
Comment on attachment 434965 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434965&action=review

Thanks

>> Source/JavaScriptCore/ChangeLog:8
>> +        We should track candidate characters without masking up to 2 so
> 
> let's fix this first sentence to be more clear like you suggested on slack.

Fixed.

>> Source/JavaScriptCore/ChangeLog:12
>> +        We also remove m_table giving up. This is because m_table can be used
> 
> "We also remove m_table giving up...." => "We also add support for m_table because m_table can be used"

Fixed.

>> Source/JavaScriptCore/ChangeLog:17
>> +        effectively in BoyreMoore search bitmap construction.
> 
> how so? It'd be helpful to share more detail here, like, what makes it "effective"

Fixed.

>> Source/JavaScriptCore/yarr/YarrJIT.cpp:2414
>> +                        auto [map, characters] = op.m_bmInfo->createCandidateBitmap(beginIndex, endIndex);
> 
> nit: "characters" is a generic name given this context. Similar suggestion to naming the class. maybe "charactersFastPath" or something?

Fixed.

>> Source/JavaScriptCore/yarr/YarrJIT.cpp:2429
>> +                            dataLogLnIf(Options::verboseRegExpCompilation(), "Found ", characters.size(), " characters lookahead character:(0x", hex(character1), "),character2:(", hex(character2), "),range:[", beginIndex, ", ", endIndex, ")");
> 
> Can all of the "character2", "characters.size() > 1" just be moved into "characters.size() > 1" branch below? I think that'd make things a lot more clear. No need to set to 0xff, etc.

Fixed.

>> Source/JavaScriptCore/yarr/YarrJIT.h:63
>> +class BoyerMooreCharacterCandidates {
> 
> this name feels weirdly generic. I wonder if we can go with "BoyerMooreFastCase" or "BoyerMooreFastCandidates" or something similar that's just a bit more descriptive of what we're aiming for? Since this isn't really a comprehensive list of "character candidates".

Fixed.

>> Source/JavaScriptCore/yarr/YarrJIT.h:145
>> +            // Early return since characters are sorted.
> 
> can we debug assert they're sorted?

Added.
Comment 6 Yusuke Suzuki 2021-08-06 17:03:18 PDT
Committed r280744 (240330@main): <https://commits.webkit.org/240330@main>
Comment 7 Radar WebKit Bug Importer 2021-08-06 17:04:22 PDT
<rdar://problem/81638511>
Comment 8 Yusuke Suzuki 2021-08-06 17:26:36 PDT
Committed r280745 (240331@main): <https://commits.webkit.org/240331@main>