Bug 228613

Summary: [JSC] Yarr BoyerMoore search should support character-class
Product: WebKit Reporter: Yusuke Suzuki <ysuzuki>
Component: JavaScriptCoreAssignee: Yusuke Suzuki <ysuzuki>
Status: RESOLVED FIXED    
Severity: Normal CC: darin, ews-watchlist, keith_miller, mark.lam, msaboff, saam, tzagallo, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
ews-feeder: commit-queue-
Patch saam: review+, ews-feeder: commit-queue-

Description Yusuke Suzuki 2021-07-29 15:12:51 PDT
...
Comment 1 Yusuke Suzuki 2021-07-29 23:45:13 PDT
Created attachment 434606 [details]
Patch
Comment 2 Yusuke Suzuki 2021-07-29 23:52:00 PDT
Created attachment 434607 [details]
Patch
Comment 3 Yusuke Suzuki 2021-07-30 12:53:44 PDT
Created attachment 434655 [details]
Patch
Comment 4 Yusuke Suzuki 2021-07-30 13:14:35 PDT
Created attachment 434659 [details]
Patch
Comment 5 Yusuke Suzuki 2021-07-30 13:22:01 PDT
Created attachment 434661 [details]
Patch
Comment 6 Yusuke Suzuki 2021-07-30 18:26:20 PDT
Created attachment 434681 [details]
Patch
Comment 7 Yusuke Suzuki 2021-07-30 19:40:48 PDT
Created attachment 434682 [details]
Patch
Comment 8 Yusuke Suzuki 2021-07-30 19:42:56 PDT
Comment on attachment 434682 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434682&action=review

> Source/JavaScriptCore/yarr/YarrJIT.cpp:2462
>                          if (!m_pattern.m_body->m_hasFixedSize) {

Previously, this path was dead code since we run this only when `disjunction->m_hasFixedSize` is true. After removing that restriction, I found the bug that we didn't update that when the match is failed, that's the above change.
Comment 9 Yusuke Suzuki 2021-07-31 13:56:40 PDT
Created attachment 434704 [details]
Patch
Comment 10 Yusuke Suzuki 2021-08-02 10:55:45 PDT
Created attachment 434770 [details]
Patch
Comment 11 Yusuke Suzuki 2021-08-02 11:12:03 PDT
Created attachment 434773 [details]
Patch
Comment 12 Saam Barati 2021-08-02 15:50:14 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

> Source/JavaScriptCore/yarr/YarrJIT.cpp:3974
> +                    if (!characterClass.m_rangesUnicode.isEmpty())
> +                        bmInfo.addRanges(cursor, characterClass.m_rangesUnicode);
> +                    if (!characterClass.m_matchesUnicode.isEmpty())
> +                        bmInfo.addCharacters(cursor, characterClass.m_matchesUnicode);

don't we check that the regex isn't unicode before collecting BM info?

> Source/JavaScriptCore/yarr/YarrJIT.h:111
> +            if (static_cast<unsigned>(range.end - range.begin + 1) >= mapSize) {

why isn't this looking at m_count + (range.end-range.begin+1)?

> Source/JavaScriptCore/yarr/YarrJIT.h:122
> +        m_count = mapSize;

we don't need to set the actual bits?
Comment 13 Yusuke Suzuki 2021-08-02 16:00:25 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

>> Source/JavaScriptCore/yarr/YarrJIT.cpp:3974
>> +                        bmInfo.addCharacters(cursor, characterClass.m_matchesUnicode);
> 
> don't we check that the regex isn't unicode before collecting BM info?

unicode() flag is whether we decode surrogate-pairs. This is different concept from including unicode in RegExp, so RegExp can include non-ASCII characters without unicode flag.

>> Source/JavaScriptCore/yarr/YarrJIT.h:111
>> +            if (static_cast<unsigned>(range.end - range.begin + 1) >= mapSize) {
> 
> why isn't this looking at m_count + (range.end-range.begin+1)?

It is possible that these characters overlaps with the already included characters in the bitmap. In that case, `m_count + (range.end-range.begin+1)` is too restrictive.
On the other hand, if the range exceeds the mapSize, then we can definitely say "this range does not fit in mapSize".

>> Source/JavaScriptCore/yarr/YarrJIT.h:122
>> +        m_count = mapSize;
> 
> we don't need to set the actual bits?

This map-count value is used to check whether we should care this bitmap in `findBestCharacterSequence`. So by setting it to mapSize, we can say "the collected characters are not sure, but this map is not useful".
Comment 14 Saam Barati 2021-08-02 16:06:01 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

r=me

>>> Source/JavaScriptCore/yarr/YarrJIT.h:122
>>> +        m_count = mapSize;
>> 
>> we don't need to set the actual bits?
> 
> This map-count value is used to check whether we should care this bitmap in `findBestCharacterSequence`. So by setting it to mapSize, we can say "the collected characters are not sure, but this map is not useful".

can we static assert that the limit is <= mapSize?
Comment 15 Yusuke Suzuki 2021-08-02 16:31:20 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

>>>> Source/JavaScriptCore/yarr/YarrJIT.h:122
>>>> +        m_count = mapSize;
>>> 
>>> we don't need to set the actual bits?
>> 
>> This map-count value is used to check whether we should care this bitmap in `findBestCharacterSequence`. So by setting it to mapSize, we can say "the collected characters are not sure, but this map is not useful".
> 
> can we static assert that the limit is <= mapSize?

Yes, done :)
Comment 16 Yusuke Suzuki 2021-08-02 16:43:24 PDT
Committed r280570 (240194@main): <https://commits.webkit.org/240194@main>
Comment 17 Radar WebKit Bug Importer 2021-08-02 16:44:25 PDT
<rdar://problem/81435189>
Comment 18 Darin Adler 2021-08-02 17:40:06 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

> Source/JavaScriptCore/yarr/YarrJIT.h:98
> +        for (UChar character : characters)

Why is this narrowing the UChar32 to UChar?
Comment 19 Yusuke Suzuki 2021-08-02 18:05:48 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

>> Source/JavaScriptCore/yarr/YarrJIT.h:98
>> +        for (UChar character : characters)
> 
> Why is this narrowing the UChar32 to UChar?

Oops, I'll change it. (but this is not an issue since we mask this character with 0x7f anyway in `add`.
Comment 20 Yusuke Suzuki 2021-08-02 18:11:38 PDT
Committed r280577 (240199@main): <https://commits.webkit.org/240199@main>
Comment 21 Darin Adler 2021-08-03 13:25:30 PDT
Comment on attachment 434773 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review

>>> Source/JavaScriptCore/yarr/YarrJIT.h:98
>>> +        for (UChar character : characters)
>> 
>> Why is this narrowing the UChar32 to UChar?
> 
> Oops, I'll change it. (but this is not an issue since we mask this character with 0x7f anyway in `add`.

Glad this didn’t matter. It’s an example of why I like auto so much.