Bug 16207

Summary: JavaScript regular expressions should match UTF-16 code units rather than characters
Product: WebKit
Reporter: Darin Adler <darin>
Component: JavaScriptCore
Assignee: Darin Adler <darin>
Severity: Minor
CC: eric
Priority: P3
Version: 528+ (Nightly build)
Hardware: Mac
OS: OS X 10.4
Attachment: patch, speeds up SunSpider (flags: aroben: review+)

Description Darin Adler 2007-11-30 07:02:13 PST
Testing against other browsers indicates that, to match their behavior, the JavaScript regular expression code needs to treat a surrogate pair as two "characters" (two UTF-16 code units) rather than as a single character.

This is good news in a way: removing the UTF-16 smarts from most of the regular expression engine is an easy way to make it faster.
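
To illustrate the difference between the two views of a subject string, here is a minimal C++ sketch. It is not taken from the patch; the helper names (isHighSurrogate, isLowSurrogate, combineSurrogates, lengthInCodeUnits, lengthInCodePoints) are hypothetical and exist only for this example. Under code-unit matching a surrogate pair counts as two "characters", while under code-point matching it counts as one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; not names from the patch.

// True if `u` is a UTF-16 high (leading) surrogate.
inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if `u` is a UTF-16 low (trailing) surrogate.
inline bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a well-formed surrogate pair into the code point it encodes.
inline char32_t combineSurrogates(char16_t high, char16_t low) {
    assert(isHighSurrogate(high) && isLowSurrogate(low));
    return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10)
                   + (static_cast<char32_t>(low) - 0xDC00);
}

// Code-unit view: every 16-bit unit is one "character".
inline std::size_t lengthInCodeUnits(const std::u16string& s) {
    return s.size();
}

// Code-point view: a well-formed surrogate pair is one character.
// An unpaired surrogate is counted as one character on its own.
inline std::size_t lengthInCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++count) {
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1]))
            ++i; // skip the trailing half of the pair
    }
    return count;
}
```

For a string holding U+1D11E between two ASCII letters, lengthInCodeUnits reports 4 and lengthInCodePoints reports 3. The code-unit view needs no per-character decoding, which is presumably where the speedup mentioned above comes from.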
Comment 1 Darin Adler 2007-11-30 07:08:54 PST
Created attachment 17606
patch, speeds up SunSpider
Comment 2 Adam Roben (:aroben) 2007-11-30 10:08:37 PST
Comment on attachment 17606
patch, speeds up SunSpider

 2425                                 d = *++ptr;

The precedence here seems correct, but potentially confusing. Maybe *(++ptr) would be better?
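
As a small sketch of the equivalence in question (the function names are hypothetical and not from the patch): the prefix ++ and unary * operators group right to left, so *++ptr is parsed as *(++ptr). Both forms advance the pointer first and then read the element it now points to.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only; not names from the patch.

// Advances the pointer, then reads the element it now points to.
inline int readAfterAdvance(const int*& ptr) {
    assert(ptr != nullptr);
    return *++ptr; // parsed as *(++ptr)
}

// Same behavior with the grouping spelled out.
inline int readAfterAdvanceExplicit(const int*& ptr) {
    assert(ptr != nullptr);
    return *(++ptr);
}
```

Starting from the first element of { 10, 20, 30 }, both functions return 20 and leave the pointer on the second element, so the parentheses only change readability.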

 757                 int c = *stack.currentFrame->args.subjectPtr++;

Again, parentheses might make it clearer what precedence you're expecting here (and in the other instances of this expression).
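
A minimal sketch of the post-increment case (hypothetical function names, not from the patch): postfix ++ binds tighter than unary *, so *ptr++ is parsed as *(ptr++). It reads the current element and then advances the pointer. The form a reader might confuse it with, (*ptr)++, increments the pointed-to value instead and leaves the pointer alone.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only; not names from the patch.

// Reads the current element, then advances the pointer.
inline int readThenAdvance(const int*& ptr) {
    assert(ptr != nullptr);
    return *ptr++; // parsed as *(ptr++)
}

// Same behavior with the grouping spelled out.
inline int readThenAdvanceExplicit(const int*& ptr) {
    assert(ptr != nullptr);
    return *(ptr++);
}

// Different behavior: returns the old value and increments the pointee.
inline int incrementPointee(int* ptr) {
    assert(ptr != nullptr);
    return (*ptr)++;
}
```

Starting from the first element of { 10, 20 }, readThenAdvance returns 10 and leaves the pointer on the second element, whereas incrementPointee on { 5 } returns 5 and changes the stored value to 6.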

 1640                                 if (stack.currentFrame->args.subjectPtr >= md.end_subject || isNewline(*stack.currentFrame->args.subjectPtr))

Why did you leave the comparison with md.end_subject here but not elsewhere?

Comment 3 Darin Adler 2007-11-30 10:55:00 PST
Committed revision 28243.