Bug 228613 - [JSC] Yarr BoyerMoore search should support character-class
Summary: [JSC] Yarr BoyerMoore search should support character-class
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2021-07-29 15:12 PDT by Yusuke Suzuki
Modified: 2021-08-03 13:25 PDT (History)
8 users (show)

See Also:


Attachments
Patch (13.42 KB, patch)
2021-07-29 23:45 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (15.11 KB, patch)
2021-07-29 23:52 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (15.38 KB, patch)
2021-07-30 12:53 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (19.65 KB, patch)
2021-07-30 13:14 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (19.87 KB, patch)
2021-07-30 13:22 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (19.83 KB, patch)
2021-07-30 18:26 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (22.04 KB, patch)
2021-07-30 19:40 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (22.05 KB, patch)
2021-07-31 13:56 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (23.78 KB, patch)
2021-08-02 10:55 PDT, Yusuke Suzuki
ews-feeder: commit-queue-
Details | Formatted Diff | Diff
Patch (23.56 KB, patch)
2021-08-02 11:12 PDT, Yusuke Suzuki
saam: review+
ews-feeder: commit-queue-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Yusuke Suzuki 2021-07-29 15:12:51 PDT
...
Comment 1 Yusuke Suzuki 2021-07-29 23:45:13 PDT
Created attachment 434606 [details]
Patch
Comment 2 Yusuke Suzuki 2021-07-29 23:52:00 PDT
Created attachment 434607 [details]
Patch
Comment 3 Yusuke Suzuki 2021-07-30 12:53:44 PDT
Created attachment 434655 [details]
Patch
Comment 4 Yusuke Suzuki 2021-07-30 13:14:35 PDT
Created attachment 434659 [details]
Patch
Comment 5 Yusuke Suzuki 2021-07-30 13:22:01 PDT
Created attachment 434661 [details]
Patch
Comment 6 Yusuke Suzuki 2021-07-30 18:26:20 PDT
Created attachment 434681 [details]
Patch
Comment 7 Yusuke Suzuki 2021-07-30 19:40:48 PDT
Created attachment 434682 [details]
Patch
Comment 8 Yusuke Suzuki 2021-07-30 19:42:56 PDT
Comment on attachment 434682 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434682&action=review

> Source/JavaScriptCore/yarr/YarrJIT.cpp:2462
>                          if (!m_pattern.m_body->m_hasFixedSize) {

Previously, this path was dead code since we run this only when `disjunction->m_hasFixedSize` is true. After removing that restriction, I found the bug that we didn't update that when the match is failed, that's the above change.
Comment 9 Yusuke Suzuki 2021-07-31 13:56:40 PDT
Created attachment 434704 [details]
Patch
Comment 10 Yusuke Suzuki 2021-08-02 10:55:45 PDT
Created attachment 434770 [details]
Patch
Comment 11 Yusuke Suzuki 2021-08-02 11:12:03 PDT
Created attachment 434773 [details]
Patch
Comment 12 Saam Barati 2021-08-02 15:50:14 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

> Source/JavaScriptCore/yarr/YarrJIT.cpp:3974
> +                    if (!characterClass.m_rangesUnicode.isEmpty())
> +                        bmInfo.addRanges(cursor, characterClass.m_rangesUnicode);
> +                    if (!characterClass.m_matchesUnicode.isEmpty())
> +                        bmInfo.addCharacters(cursor, characterClass.m_matchesUnicode);

don't we check that the regex isn't unicode before collecting BM info?

> Source/JavaScriptCore/yarr/YarrJIT.h:111
> +            if (static_cast<unsigned>(range.end - range.begin + 1) >= mapSize) {

why isn't this looking at m_count + (range.end-range.begin+1)?

> Source/JavaScriptCore/yarr/YarrJIT.h:122
> +        m_count = mapSize;

we don't need to set the actual bits?
Comment 13 Yusuke Suzuki 2021-08-02 16:00:25 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

>> Source/JavaScriptCore/yarr/YarrJIT.cpp:3974
>> +                        bmInfo.addCharacters(cursor, characterClass.m_matchesUnicode);
> 
> don't we check that the regex isn't unicode before collecting BM info?

unicode() flag is whether we decode surrogate-pairs. This is different concept from including unicode in RegExp, so RegExp can include non-ASCII characters without unicode flag.

>> Source/JavaScriptCore/yarr/YarrJIT.h:111
>> +            if (static_cast<unsigned>(range.end - range.begin + 1) >= mapSize) {
> 
> why isn't this looking at m_count + (range.end-range.begin+1)?

It is possible that these characters overlaps with the already included characters in the bitmap. In that case, `m_count + (range.end-range.begin+1)` is too restrictive.
On the other hand, if the range exceeds the mapSize, then we can definitely say "this range does not fit in mapSize".

>> Source/JavaScriptCore/yarr/YarrJIT.h:122
>> +        m_count = mapSize;
> 
> we don't need to set the actual bits?

This map-count value is used to check whether we should care this bitmap in `findBestCharacterSequence`. So by setting it to mapSize, we can say "the collected characters are not sure, but this map is not useful".
Comment 14 Saam Barati 2021-08-02 16:06:01 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

r=me

>>> Source/JavaScriptCore/yarr/YarrJIT.h:122
>>> +        m_count = mapSize;
>> 
>> we don't need to set the actual bits?
> 
> This map-count value is used to check whether we should care this bitmap in `findBestCharacterSequence`. So by setting it to mapSize, we can say "the collected characters are not sure, but this map is not useful".

can we static assert that the limit is <= mapSize?
Comment 15 Yusuke Suzuki 2021-08-02 16:31:20 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

>>>> Source/JavaScriptCore/yarr/YarrJIT.h:122
>>>> +        m_count = mapSize;
>>> 
>>> we don't need to set the actual bits?
>> 
>> This map-count value is used to check whether we should care this bitmap in `findBestCharacterSequence`. So by setting it to mapSize, we can say "the collected characters are not sure, but this map is not useful".
> 
> can we static assert that the limit is <= mapSize?

Yes, done :)
Comment 16 Yusuke Suzuki 2021-08-02 16:43:24 PDT
Committed r280570 (240194@main): <https://commits.webkit.org/240194@main>
Comment 17 Radar WebKit Bug Importer 2021-08-02 16:44:25 PDT
<rdar://problem/81435189>
Comment 18 Darin Adler 2021-08-02 17:40:06 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

> Source/JavaScriptCore/yarr/YarrJIT.h:98
> +        for (UChar character : characters)

Why is this narrowing the UChar32 to UChar?
Comment 19 Yusuke Suzuki 2021-08-02 18:05:48 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

>> Source/JavaScriptCore/yarr/YarrJIT.h:98
>> +        for (UChar character : characters)
> 
> Why is this narrowing the UChar32 to UChar?

Oops, I'll change it. (but this is not an issue since we mask this character with 0x7f anyway in `add`.
Comment 20 Yusuke Suzuki 2021-08-02 18:11:38 PDT
Committed r280577 (240199@main): <https://commits.webkit.org/240199@main>
Comment 21 Darin Adler 2021-08-03 13:25:30 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

>>> Source/JavaScriptCore/yarr/YarrJIT.h:98
>>> +        for (UChar character : characters)
>> 
>> Why is this narrowing the UChar32 to UChar?
> 
> Oops, I'll change it. (but this is not an issue since we mask this character with 0x7f anyway in `add`.

Glad this didn’t matter. It’s an example of why I like auto so much.