228613 – [JSC] Yarr BoyerMoore search should support character-class

RESOLVED FIXED 228613

[JSC] Yarr BoyerMoore search should support character-class

https://bugs.webkit.org/show_bug.cgi?id=228613

Summary [JSC] Yarr BoyerMoore search should support character-class

Yusuke Suzuki

Reported 2021-07-29 15:12:51 PDT

...

Attachments
Patch (13.42 KB, patch) 2021-07-29 23:45 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (15.11 KB, patch) 2021-07-29 23:52 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (15.38 KB, patch) 2021-07-30 12:53 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (19.65 KB, patch) 2021-07-30 13:14 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (19.87 KB, patch) 2021-07-30 13:22 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (19.83 KB, patch) 2021-07-30 18:26 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (22.04 KB, patch) 2021-07-30 19:40 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (22.05 KB, patch) 2021-07-31 13:56 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (23.78 KB, patch) 2021-08-02 10:55 PDT, Yusuke Suzuki	ews-feeder: commit-queue-	Details Formatted Diff Diff
Patch (23.56 KB, patch) 2021-08-02 11:12 PDT, Yusuke Suzuki	saam: review+ ews-feeder: commit-queue-	Details Formatted Diff Diff
Show Obsolete (9) View All Add attachment proposed patch, testcase, etc.

Yusuke Suzuki

Comment 1 2021-07-29 23:45:13 PDT

Created attachment 434606 [details] Patch

Yusuke Suzuki

Comment 2 2021-07-29 23:52:00 PDT

Created attachment 434607 [details] Patch

Yusuke Suzuki

Comment 3 2021-07-30 12:53:44 PDT

Created attachment 434655 [details] Patch

Yusuke Suzuki

Comment 4 2021-07-30 13:14:35 PDT

Created attachment 434659 [details] Patch

Yusuke Suzuki

Comment 5 2021-07-30 13:22:01 PDT

Created attachment 434661 [details] Patch

Yusuke Suzuki

Comment 6 2021-07-30 18:26:20 PDT

Created attachment 434681 [details] Patch

Yusuke Suzuki

Comment 7 2021-07-30 19:40:48 PDT

Created attachment 434682 [details] Patch

Yusuke Suzuki

Comment 8 2021-07-30 19:42:56 PDT

Comment on attachment 434682 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434682&action=review > Source/JavaScriptCore/yarr/YarrJIT.cpp:2462 > if (!m_pattern.m_body->m_hasFixedSize) { Previously, this path was dead code since we run this only when `disjunction->m_hasFixedSize` is true. After removing that restriction, I found the bug that we didn't update that when the match is failed, that's the above change.

Yusuke Suzuki

Comment 9 2021-07-31 13:56:40 PDT

Created attachment 434704 [details] Patch

Yusuke Suzuki

Comment 10 2021-08-02 10:55:45 PDT

Created attachment 434770 [details] Patch

Yusuke Suzuki

Comment 11 2021-08-02 11:12:03 PDT

Created attachment 434773 [details] Patch

Saam Barati

Comment 12 2021-08-02 15:50:14 PDT

Comment on attachment 434773 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review > Source/JavaScriptCore/yarr/YarrJIT.cpp:3974 > + if (!characterClass.m_rangesUnicode.isEmpty()) > + bmInfo.addRanges(cursor, characterClass.m_rangesUnicode); > + if (!characterClass.m_matchesUnicode.isEmpty()) > + bmInfo.addCharacters(cursor, characterClass.m_matchesUnicode); don't we check that the regex isn't unicode before collecting BM info? > Source/JavaScriptCore/yarr/YarrJIT.h:111 > + if (static_cast<unsigned>(range.end - range.begin + 1) >= mapSize) { why isn't this looking at m_count + (range.end-range.begin+1)? > Source/JavaScriptCore/yarr/YarrJIT.h:122 > + m_count = mapSize; we don't need to set the actual bits?

Yusuke Suzuki

Comment 13 2021-08-02 16:00:25 PDT

Comment on attachment 434773 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review >> Source/JavaScriptCore/yarr/YarrJIT.cpp:3974 >> + bmInfo.addCharacters(cursor, characterClass.m_matchesUnicode); > > don't we check that the regex isn't unicode before collecting BM info? unicode() flag is whether we decode surrogate-pairs. This is different concept from including unicode in RegExp, so RegExp can include non-ASCII characters without unicode flag. >> Source/JavaScriptCore/yarr/YarrJIT.h:111 >> + if (static_cast<unsigned>(range.end - range.begin + 1) >= mapSize) { > > why isn't this looking at m_count + (range.end-range.begin+1)? It is possible that these characters overlaps with the already included characters in the bitmap. In that case, `m_count + (range.end-range.begin+1)` is too restrictive. On the other hand, if the range exceeds the mapSize, then we can definitely say "this range does not fit in mapSize". >> Source/JavaScriptCore/yarr/YarrJIT.h:122 >> + m_count = mapSize; > > we don't need to set the actual bits? This map-count value is used to check whether we should care this bitmap in `findBestCharacterSequence`. So by setting it to mapSize, we can say "the collected characters are not sure, but this map is not useful".

Saam Barati

Comment 14 2021-08-02 16:06:01 PDT

Comment on attachment 434773 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review r=me >>> Source/JavaScriptCore/yarr/YarrJIT.h:122 >>> + m_count = mapSize; >> >> we don't need to set the actual bits? > > This map-count value is used to check whether we should care this bitmap in `findBestCharacterSequence`. So by setting it to mapSize, we can say "the collected characters are not sure, but this map is not useful". can we static assert that the limit is <= mapSize?

Yusuke Suzuki

Comment 15 2021-08-02 16:31:20 PDT

Comment on attachment 434773 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review >>>> Source/JavaScriptCore/yarr/YarrJIT.h:122 >>>> + m_count = mapSize; >>> >>> we don't need to set the actual bits? >> >> This map-count value is used to check whether we should care this bitmap in `findBestCharacterSequence`. So by setting it to mapSize, we can say "the collected characters are not sure, but this map is not useful". > > can we static assert that the limit is <= mapSize? Yes, done :)

Yusuke Suzuki

Comment 16 2021-08-02 16:43:24 PDT

Committed r280570 (240194@main): <https://commits.webkit.org/240194@main>

Radar WebKit Bug Importer

Comment 17 2021-08-02 16:44:25 PDT

<rdar://problem/81435189>

Darin Adler

Comment 18 2021-08-02 17:40:06 PDT

Yusuke Suzuki

Comment 19 2021-08-02 18:05:48 PDT

Comment on attachment 434773 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review >> Source/JavaScriptCore/yarr/YarrJIT.h:98 >> + for (UChar character : characters) > > Why is this narrowing the UChar32 to UChar? Oops, I'll change it. (but this is not an issue since we mask this character with 0x7f anyway in `add`.

Yusuke Suzuki

Comment 20 2021-08-02 18:11:38 PDT

Committed r280577 (240199@main): <https://commits.webkit.org/240199@main>

Darin Adler

Comment 21 2021-08-03 13:25:30 PDT

Comment on attachment 434773 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=434773&action=review >>> Source/JavaScriptCore/yarr/YarrJIT.h:98 >>> + for (UChar character : characters) >> >> Why is this narrowing the UChar32 to UChar? > > Oops, I'll change it. (but this is not an issue since we mask this character with 0x7f anyway in `add`. Glad this didn’t matter. It’s an example of why I like auto so much.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component JavaScriptCore

Assignee

Yusuke Suzuki

Reported

2021-07-29 15:12 PDT

Modified

2021-08-03 13:25 PDT History

CC List

8 users Show

URL

Keywords InRadar

Depends on

Blocks