RESOLVED FIXED 180537
YARR: Coalesce constructed character classes
https://bugs.webkit.org/show_bug.cgi?id=180537
Michael Saboff
Reported 2017-12-07 11:34:40 PST
Currently, when we construct a character class like [abcde], we end up with a check for each character instead of a single check for the range a..e. It is also common for RegExps to be written with something like [\s\S] when the programmer really wanted a . with the newly added 's' (dotAll) flag. In that case we perform many individual character and range checks. Instead, we should coalesce characters and ranges when constructing a character class, to reduce the number of resulting checks.
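As a rough illustration of the coalescing described above (not the code from the attached patch), the sketch below merges single characters and ranges into the smallest sorted list of ranges. The `CharRange` type and the `coalesce` function are hypothetical names, and the sort-then-merge approach is an assumption chosen for brevity; the review comments on the patch suggest it merges adjacent characters as they are added to a sorted list.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a character-class entry: an inclusive range.
// A single character 'a' is represented as {'a', 'a'}.
struct CharRange {
    char32_t begin;
    char32_t end;
};

// Merge overlapping or adjacent ranges so that, for example, the five
// entries for [abcde] collapse into the single range a..e.
std::vector<CharRange> coalesce(std::vector<CharRange> input)
{
    std::sort(input.begin(), input.end(), [](const CharRange& a, const CharRange& b) {
        return a.begin < b.begin;
    });

    std::vector<CharRange> result;
    for (const CharRange& range : input) {
        // Code points never exceed 0x10ffff, so "end + 1" cannot wrap around.
        if (!result.empty() && range.begin <= result.back().end + 1)
            result.back().end = std::max(result.back().end, range.end);
        else
            result.push_back(range);
    }
    return result;
}
```

With this sketch, a class such as [abcde] needs one range comparison at match time instead of five character comparisons.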
Attachments
Patch (15.08 KB, patch)
2017-12-07 12:35 PST, Michael Saboff
jfbastien: review+
Radar WebKit Bug Importer
Comment 1 2017-12-07 11:35:39 PST
Michael Saboff
Comment 2 2017-12-07 12:35:54 PST
EWS Watchlist
Comment 3 2017-12-07 12:38:23 PST
Attachment 328716 did not pass style-queue:

ERROR: Source/JavaScriptCore/yarr/YarrPattern.cpp:406: Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons. [readability/comparison_to_zero] [5]
Total errors found: 1 in 5 files

If any of these errors are false positives, please file a bug against check-webkit-style.
JF Bastien
Comment 4 2017-12-08 09:30:03 PST
Comment on attachment 328716
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=328716&action=review

I'm not an expert in this code, but it looks fine overall. Minor comments.

> Source/JavaScriptCore/yarr/YarrPattern.cpp:286
> + if (pos + index > 0 && matches[pos + index - 1] == ch - 1) {

pos and index are both unsigned, so this is just checking that it's non-zero? Or was the intent to capture wraparound as well?

> Source/JavaScriptCore/yarr/YarrPattern.cpp:358
> + // each iteration of the loop we will either remove something from the list, or break the loop.

Break the loop, or just break out of it?

> Source/JavaScriptCore/yarr/YarrPattern.cpp:407
> + && m_rangesUnicode[0].begin == 0x80 && m_rangesUnicode[0].end == 0x10ffff)

I don't get the Unicode range comparison. That's the general non-ASCII range; can the user specify invalid codepoint ranges?

Or put another way, when is this range *not* the Unicode range?
Michael Saboff
Comment 5 2017-12-08 10:15:09 PST
Comment on attachment 328716
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=328716&action=review

>> Source/JavaScriptCore/yarr/YarrPattern.cpp:286
>> + if (pos + index > 0 && matches[pos + index - 1] == ch - 1) {
>
> pos and index are both unsigned, so this is just checking that it's non-zero? Or was the intent to capture wraparound as well?

Just checking that it's non-zero. Given the range of character values (0..0x10ffff), we can't get close to wrapping around even if there were one character per range.

>> Source/JavaScriptCore/yarr/YarrPattern.cpp:358
>> + // each iteration of the loop we will either remove something from the list, or break the loop.
>
> Break the loop, or just break out of it?

Break *out of* the loop.

>> Source/JavaScriptCore/yarr/YarrPattern.cpp:407
>> + && m_rangesUnicode[0].begin == 0x80 && m_rangesUnicode[0].end == 0x10ffff)
>
> I don't get the Unicode range comparison. That's the general non-ASCII range; can the user specify invalid codepoint ranges?
>
> Or put another way, when is this range *not* the Unicode range?

This checks that this character class matches every possible character.
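To make the last answer concrete, here is a minimal sketch of an "is this class equivalent to any character" test. Only the non-ASCII half of the condition (0x80..0x10ffff) appears in the quoted patch line; the `CharClass` layout, the ASCII half (0..0x7f), the empty-list checks, and the name `matchesAnyCharacter` are assumptions made for illustration. The idea is that once ranges have been coalesced, a class such as [\s\S] would reduce to exactly one ASCII range and one non-ASCII range that together cover every code point.

```cpp
#include <vector>

// Hypothetical inclusive range of code points.
struct CharRange {
    char32_t begin;
    char32_t end;
};

// Hypothetical character class, assuming separate ASCII and non-ASCII lists.
struct CharClass {
    std::vector<char32_t> matches;         // individual ASCII characters
    std::vector<CharRange> ranges;         // ASCII ranges
    std::vector<char32_t> matchesUnicode;  // individual non-ASCII characters
    std::vector<CharRange> rangesUnicode;  // non-ASCII ranges
};

// True when the coalesced class covers 0..0x7f and 0x80..0x10ffff,
// that is, every possible character.
bool matchesAnyCharacter(const CharClass& characterClass)
{
    return characterClass.matches.empty()
        && characterClass.matchesUnicode.empty()
        && characterClass.ranges.size() == 1
        // Written as !begin rather than begin == 0, in line with the
        // comparison_to_zero style error reported above.
        && !characterClass.ranges[0].begin
        && characterClass.ranges[0].end == 0x7f
        && characterClass.rangesUnicode.size() == 1
        && characterClass.rangesUnicode[0].begin == 0x80
        && characterClass.rangesUnicode[0].end == 0x10ffff;
}
```

A check like this only works after coalescing: without it, the same set of characters could be spread over many entries and would not compare equal to the two full ranges.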
Michael Saboff
Comment 6 2017-12-08 10:27:20 PST