Bug 16207 - JavaScript regular expressions should match UTF-16 code units rather than characters
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: 528+ (Nightly build)
Hardware: Mac OS X 10.4
Importance: P3 Minor
Assignee: Darin Adler
Depends on:
Reported: 2007-11-30 07:02 PST by Darin Adler
Modified: 2007-11-30 10:55 PST

patch, speeds up SunSpider (64.63 KB, patch)
2007-11-30 07:08 PST, Darin Adler
aroben: review+

Description Darin Adler 2007-11-30 07:02:13 PST
Testing with other browsers indicates that, to match their behavior, the JavaScript regular expression code needs to treat a surrogate pair as two "characters" rather than a single character.

This is good news in a way, because it's an easy way to make the regular expression engine faster, by removing the UTF-16 smarts from most of the engine.
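To make the distinction concrete, here is a minimal sketch of the two ways of counting "characters" in a UTF-16 subject string. It is not taken from the patch; the helper names (isHighSurrogate, isLowSurrogate, countByCodeUnit, countByCodePoint, checkExamples) are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; none of these names come from the patch.

// True if u is a UTF-16 high (leading) surrogate, U+D800..U+DBFF.
constexpr bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if u is a UTF-16 low (trailing) surrogate, U+DC00..U+DFFF.
constexpr bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Code-unit view: every 16-bit unit is one "character".
// No decoding is needed, which is why this view is cheaper.
std::size_t countByCodeUnit(const std::u16string& s)
{
    return s.size();
}

// Code-point view: a well-formed surrogate pair counts as one character.
// A lone surrogate is counted as one character on its own.
std::size_t countByCodePoint(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1]))
            ++i; // skip the low half of the pair
        ++count;
    }
    return count;
}

void checkExamples()
{
    // U+1D11E (musical symbol G clef) is encoded in UTF-16 as D834 DD1E.
    const std::u16string clef{ char16_t(0xD834), char16_t(0xDD1E) };
    assert(countByCodeUnit(clef) == 2);  // two "characters" in the code-unit view
    assert(countByCodePoint(clef) == 1); // one character in the code-point view
}
```

Under the code-unit view, a pattern that consumes exactly one "character" would not consume the whole pair, which is the behavior the description attributes to other browsers.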
Comment 1 Darin Adler 2007-11-30 07:08:54 PST
Created attachment 17606
patch, speeds up SunSpider
Comment 2 Adam Roben (:aroben) 2007-11-30 10:08:37 PST
Comment on attachment 17606
patch, speeds up SunSpider

 2425                                 d = *++ptr;

The precedence here seems correct, but potentially confusing. Maybe *(++ptr) would be better?

 757                 int c = *stack.currentFrame->args.subjectPtr++;

Again, parentheses might make it clearer what precedence you're expecting here (and in the other instances of this expression).
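For reference, a small sketch of the precedence in question. It is not part of the patch, and the function names are hypothetical. It checks that the bare and parenthesized spellings of both expressions behave identically, so the suggestion would affect readability only.

```cpp
#include <cassert>

// Hypothetical demonstration functions; not code from the patch.

// *++p and *(++p): advance the pointer first, then read through it.
bool prefixFormsAgree()
{
    const int data[] = { 10, 20, 30 };
    const int* p = data;
    const int* q = data;
    int a = *++p;   // p now points at data[1]; a is 20
    int b = *(++q); // identical behavior with explicit parentheses
    return a == 20 && b == 20 && p == q && p == data + 1;
}

// *p++ and *(p++): read through the old pointer value, then advance.
// Postfix ++ binds tighter than unary *, so it is the pointer that is
// incremented, not the pointed-to value (that would be (*p)++).
bool postfixFormsAgree()
{
    const int data[] = { 10, 20, 30 };
    const int* p = data;
    const int* q = data;
    int a = *p++;   // a is 10; p now points at data[1]
    int b = *(q++); // identical behavior with explicit parentheses
    return a == 10 && b == 10 && p == q && p == data + 1;
}
```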

 1640                                 if (stack.currentFrame->args.subjectPtr >= md.end_subject || isNewline(*stack.currentFrame->args.subjectPtr))

Why did you leave the comparison with md.end_subject here but not elsewhere?

Comment 3 Darin Adler 2007-11-30 10:55:00 PST
Committed revision 28243.