Bug 291923

Summary: [Yarr] Improve reading of Surrogate Pairs in Unicode Regular Expressions
Product: WebKit Reporter: Michael Saboff <msaboff>
Component: New Bugs Assignee: Michael Saboff <msaboff>
Status: RESOLVED FIXED    
Severity: Normal CC: webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: Other   
Hardware: Unspecified   
OS: Unspecified   

Michael Saboff
Reported 2025-04-22 16:54:07 PDT
Currently we create a helper to read possible surrogate pairs. That helper reads a single 16-bit character and checks whether it is a surrogate. If it is a leading surrogate, the helper reads a second character to see whether that is a trailing surrogate. If so, we construct a non-BMP character and return it. That helper is generated at the end of every RegExp's JIT'ed code.

There are a few optimizations we can make:
1. If possible, we can load 32 bits and check whether the two characters read form a valid surrogate pair. If so, we convert the pair and return.
2. We can reduce the number of branches in the hot paths.
3. We can turn the helper into a thunk that is created when needed, reducing the JIT footprint when multiple Unicode RegExps have been compiled.
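
As a rough illustration of the first two optimizations, here is a minimal C++ sketch of the decoding logic. It is not the Yarr JIT code, which emits machine code rather than calling C++ functions. All function names here (isLead, isTrail, combine, readCodePointSimple, readCodePointWide) are hypothetical, and the wide variant assumes a little-endian target, so the first UTF-16 code unit lands in the low 16 bits of the 32-bit load.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration; none of these names come from Yarr.
constexpr bool isLead(char16_t c) { return (c & 0xFC00) == 0xD800; }
constexpr bool isTrail(char16_t c) { return (c & 0xFC00) == 0xDC00; }

// Combine a valid lead/trail pair into a non-BMP code point.
constexpr char32_t combine(char16_t lead, char16_t trail)
{
    return 0x10000 + ((char32_t(lead) - 0xD800) << 10) + (char32_t(trail) - 0xDC00);
}

// Baseline approach: read one 16-bit unit, then conditionally read a second.
char32_t readCodePointSimple(const char16_t* s, std::size_t index, std::size_t length)
{
    char16_t first = s[index];
    if (isLead(first) && index + 1 < length && isTrail(s[index + 1]))
        return combine(first, s[index + 1]);
    return first;
}

// Wide approach: when two units are available, load both with one 32-bit read
// and test lead and trail together with a single mask-and-compare.
// ASSUMPTION: little-endian layout (first unit in the low half of the word).
char32_t readCodePointWide(const char16_t* s, std::size_t index, std::size_t length)
{
    if (index + 1 < length) {
        std::uint32_t both;
        std::memcpy(&both, s + index, sizeof(both)); // safe unaligned 32-bit load
        if ((both & 0xFC00FC00u) == 0xDC00D800u)
            return combine(char16_t(both & 0xFFFF), char16_t(both >> 16));
        return both & 0xFFFF; // not a pair: return the first unit unchanged
    }
    return s[index]; // last unit in the string: no room for a pair
}
```

For the input { u'a', 0xD83D, 0xDE00, 0xD83D }, both functions return the same values at every index. The wide variant decides the valid-pair case with one comparison instead of two separate surrogate checks.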
Attachments
Radar WebKit Bug Importer
Comment 1 2025-04-22 16:54:40 PDT
Michael Saboff
Comment 2 2025-04-22 17:48:43 PDT
EWS
Comment 3 2025-04-23 22:21:24 PDT
Committed 294046@main (aaee2a6f166a): <https://commits.webkit.org/294046@main> Reviewed commits have been landed. Closing PR #44394 and removing active labels.