RESOLVED FIXED275279
[JSC] Use immediate bit-vectors for character class matching in YarrJIT
https://bugs.webkit.org/show_bug.cgi?id=275279
Summary [JSC] Use immediate bit-vectors for character class matching in YarrJIT
David Degazio
Reported 2024-06-07 15:06:08 PDT
Currently YarrJIT is relatively naive when it comes to matching character classes, as far as I can tell (ignoring some special cases, i.e. unifying ASCII letters using a bit flip, lookup tables for spaces) there are pretty much two ways we match characters: - Binary search over the codepoint values, which is reasonably effective in general for lots of contiguous ranges, but relatively branchy. - Sequential character-by-character equality checks (!) for sets of non-range character matches. The latter especially is super naive, if we have a set of N discontiguous characters we are potentially doing N independent branch-if-equal checks... I think we can improve this by exploiting the fact that lots of character checks are quite close together. Even considering unicode, my guess is it's pretty likely for ranges in a character class to be close to each other in the codepoint range, or for there to be clusters of codepoints in general within a few tens of each other. In these cases, we can do a single range-check, a subtract, and a bit-vector test to see if a character is within some small set, even if the set is sparse. If we keep the sets small enough, we can also avoid any (data) memory overhead by directly materializing the set into a register.
Attachments
Radar WebKit Bug Importer
Comment 1 2024-06-07 15:06:19 PDT
David Degazio
Comment 2 2024-06-11 15:34:37 PDT
EWS
Comment 3 2024-06-27 12:56:21 PDT
Committed 280425@main (34b0b047bb64): <https://commits.webkit.org/280425@main> Reviewed commits have been landed. Closing PR #29727 and removing active labels.
Note You need to log in before you can comment on or make changes to this bug.