Bug 180537 - YARR: Coalesce constructed character classes
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: Other
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Michael Saboff
URL:
Keywords: InRadar
Depends on:
Blocks: 179230
Reported: 2017-12-07 11:34 PST by Michael Saboff
Modified: 2022-02-27 23:30 PST

Attachments
Patch (15.08 KB, patch)
2017-12-07 12:35 PST, Michael Saboff
jfbastien: review+

Description Michael Saboff 2017-12-07 11:34:40 PST
Currently, when we construct a character class like [abcde], we end up with a check for each character instead of a single check for the range a..e.  It is also common for RegExps to be written with something like [\s\S] when the programmer really wanted a . with the newly added 's' (dotAll) flag.  In that case we perform lots of individual character and range checks.

Instead we should coalesce characters and ranges when constructing a character class to reduce the resulting checks.
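
The idea can be illustrated with a minimal sketch. This is not the code from the patch: `CharRange` and `coalesce` are hypothetical names, single characters are modeled as one-element ranges, and the sketch assumes every range holds valid code points (0..0x10ffff). It sorts the ranges and merges any that overlap or sit directly next to each other, so [abcde] ends up as the single range a..e.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a character range; not YARR's actual type.
struct CharRange {
    char32_t begin;
    char32_t end; // inclusive
};

// Sorts the ranges by their first character, then merges ranges that
// overlap or are directly adjacent. A single character is a range with
// begin == end.
std::vector<CharRange> coalesce(std::vector<CharRange> ranges)
{
    std::sort(ranges.begin(), ranges.end(), [](const CharRange& a, const CharRange& b) {
        return a.begin < b.begin;
    });

    std::vector<CharRange> result;
    for (const CharRange& range : ranges) {
        assert(range.begin <= range.end);
        // Merge when this range starts no later than one past the end of
        // the previously emitted range.
        if (!result.empty() && range.begin <= result.back().end + 1)
            result.back().end = std::max(result.back().end, range.end);
        else
            result.push_back(range);
    }
    return result;
}
```

Under the same assumptions, a class such as [\s\S] would collapse to one range covering every code point, because the pieces contributed by \s and \S are adjacent after sorting.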
Comment 1 Radar WebKit Bug Importer 2017-12-07 11:35:39 PST
<rdar://problem/35914557>
Comment 2 Michael Saboff 2017-12-07 12:35:54 PST
Created attachment 328716
Patch
Comment 3 EWS Watchlist 2017-12-07 12:38:23 PST
Attachment 328716 did not pass style-queue:


ERROR: Source/JavaScriptCore/yarr/YarrPattern.cpp:406:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Total errors found: 1 in 5 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 4 JF Bastien 2017-12-08 09:30:03 PST
Comment on attachment 328716
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=328716&action=review

I'm not an expert in this code, but looks fine overall. Minor comments.

> Source/JavaScriptCore/yarr/YarrPattern.cpp:286
> +                    if (pos + index > 0 && matches[pos + index - 1] == ch - 1) {

pos and index are both unsigned, so this is just checking that it's non-zero? Or was the intent to capture wraparound as well?

> Source/JavaScriptCore/yarr/YarrPattern.cpp:358
> +        // each iteration of the loop we will either remove something from the list, or break the loop.

Break the loop, or just break out of it?

> Source/JavaScriptCore/yarr/YarrPattern.cpp:407
> +            && m_rangesUnicode[0].begin == 0x80 && m_rangesUnicode[0].end == 0x10ffff)

I don't get the Unicode range comparison. That's the general non-ASCII range, can the user specify invalid codepoint ranges?

Or put another way, when is this range *not* the Unicode range?
Comment 5 Michael Saboff 2017-12-08 10:15:09 PST
Comment on attachment 328716
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=328716&action=review

>> Source/JavaScriptCore/yarr/YarrPattern.cpp:286
>> +                    if (pos + index > 0 && matches[pos + index - 1] == ch - 1) {
> 
> pos and index are both unsigned, so this is just checking that it's non-zero? Or was the intent to capture wraparound as well?

Just checking that it's non-zero.  Because character values are limited to the range 0..0x10ffff, the sum can't get close to wrapping around, even if there were one character per range.
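
A small sketch of why the non-zero check matters for unsigned indices. The helper name `previousIsAdjacent` and its parameters are hypothetical and only mirror the shape of the quoted condition; the bounds check against `matches.size()` is an addition for safety in the sketch, not something taken from the patch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: true when the entry just before `position` is the
// immediate predecessor of `ch`.
bool previousIsAdjacent(const std::vector<char32_t>& matches, std::size_t position, char32_t ch)
{
    // For an unsigned value, `position > 0` is the same as "position is
    // non-zero". It guards `position - 1`, which would otherwise wrap
    // around to a huge index when position is 0.
    return position > 0
        && position <= matches.size()
        && matches[position - 1] == ch - 1;
}
```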

>> Source/JavaScriptCore/yarr/YarrPattern.cpp:358
>> +        // each iteration of the loop we will either remove something from the list, or break the loop.
> 
> Break the loop, or just break out of it?

Break *out of* the loop.

>> Source/JavaScriptCore/yarr/YarrPattern.cpp:407
>> +            && m_rangesUnicode[0].begin == 0x80 && m_rangesUnicode[0].end == 0x10ffff)
> 
> I don't get the Unicode range comparison. That's the general non-ASCII range, can the user specify invalid codepoint ranges?
> 
> Or put another way, when is this range *not* the Unicode range?

This checks that this character class matches every possible character.
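
A rough sketch of what such a "matches everything" test could look like once ranges have been coalesced. Only the 0x80..0x10ffff comparison comes from the quoted line; the `CharClass` layout, the separate ASCII list, and the 0x00..0x7f comparison are assumptions made for illustration.

```cpp
#include <vector>

// Hypothetical stand-in types; not YARR's actual data structures.
struct CharRange {
    char32_t begin;
    char32_t end; // inclusive
};

struct CharClass {
    std::vector<char32_t> singleMatches;  // individual characters not folded into a range
    std::vector<CharRange> asciiRanges;   // assumed to hold coalesced ranges within 0x00..0x7f
    std::vector<CharRange> unicodeRanges; // assumed to hold coalesced ranges within 0x80..0x10ffff
};

// True when the class consists of exactly one full ASCII range and one
// full non-ASCII range, with no leftover single characters.
bool matchesAnyCharacter(const CharClass& c)
{
    return c.singleMatches.empty()
        && c.asciiRanges.size() == 1
        && c.asciiRanges[0].begin == 0 && c.asciiRanges[0].end == 0x7f
        && c.unicodeRanges.size() == 1
        && c.unicodeRanges[0].begin == 0x80 && c.unicodeRanges[0].end == 0x10ffff;
}
```

The comparison only becomes a cheap, reliable test because coalescing has already reduced a class like [\s\S] to these two full ranges; without that step the same set of characters could be spread over many entries.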
Comment 6 Michael Saboff 2017-12-08 10:27:20 PST
Committed r225683: <https://trac.webkit.org/changeset/225683>