Bug 10370 - RegExp fails to match non-ASCII characters against [\S\s]
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: 420+
Hardware: All OS X 10.4
Importance: P2 Major
Assignee: Alexey Proskuryakov
URL: http://www.dougweb.org/bugzilla/safar...
Keywords: HasReduction
Duplicates: 14877, 15224
Depends on:
Reported: 2006-08-12 09:51 PDT by Doug Wright
Modified: 2007-11-04 03:20 PST

Reduced test case (703 bytes, text/html)
2006-08-12 19:20 PDT, Mark Rowe (bdash)
a more complete test case (1.06 KB, text/html)
2007-09-20 05:53 PDT, Alexey Proskuryakov
a more complete test case (1.45 KB, text/html)
2007-09-22 13:07 PDT, Alexey Proskuryakov
proposed fix (68.01 KB, patch)
2007-09-29 02:41 PDT, Alexey Proskuryakov
darin: review+

Description Doug Wright 2006-08-12 09:51:19 PDT
See testcase. The 2nd alert() should display a few lines of text, but instead displays null because the regexp scanner has barfed upon encountering ’.
Comment 1 Mark Rowe (bdash) 2006-08-12 19:19:19 PDT
Confirmed with WebKit ToT and 418.8.  The character in question is Unicode "RIGHT SINGLE QUOTATION MARK".  Reduction forthcoming.
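
A minimal sketch of the reported failure, not the attached test case itself. The variable names and the sample text are made up for illustration. Since `[\S\s]` is the union of a class and its complement, a conforming engine should match the whole string, including U+2019:

```javascript
// Hypothetical sample text containing U+2019 RIGHT SINGLE QUOTATION MARK.
const sampleText = "It\u2019s a test\nwith two lines";

// [\S\s] is the usual idiom for "any character, including newlines".
const anyCharRun = /^[\S\s]+$/;

const match = anyCharRun.exec(sampleText);

// A conforming engine matches the entire string; the affected builds
// returned null at this point, as described in the report.
console.log(match === null ? "null" : "matched " + match[0].length + " characters");
```

If the result is null, the engine is treating U+2019 as belonging to neither `\S` nor `\s` inside the character class.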
Comment 2 Mark Rowe (bdash) 2006-08-12 19:20:11 PDT
Created attachment 10008
Reduced test case
Comment 3 Alexey Proskuryakov 2007-09-19 03:53:07 PDT
*** Bug 15224 has been marked as a duplicate of this bug. ***
Comment 4 Alexey Proskuryakov 2007-09-19 03:56:13 PDT
As seen in bug 15224, this affects all non-ASCII characters, and causes problems in prototype.js. Looks like a very important bug to me.
Comment 5 Alexey Proskuryakov 2007-09-20 05:53:11 PDT
Created attachment 16333
a more complete test case

Tests other regex special characters, too. Passes in Firefox, and mostly passes in IE7, which apparently doesn't treat Unicode whitespace characters as such.
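
The kind of checks described here could look roughly like the following. This is a sketch and not the content of the attachment: the helper `matchesWhole`, the list of classes, and the sample characters are all assumptions chosen for illustration. It checks that each "escape plus its complement" class matches arbitrary characters, and that `\s` accepts a few Unicode whitespace characters that ECMAScript defines as whitespace or line terminators:

```javascript
// Hypothetical helper: does the pattern match the whole one-character string?
function matchesWhole(pattern, ch) {
  return new RegExp("^" + pattern + "$").test(ch);
}

// Each class is the union of an escape and its complement,
// so each should match any single character.
const universalClasses = ["[\\S\\s]", "[\\D\\d]", "[\\W\\w]"];

// Arbitrary samples: ASCII letter, U+2019, U+00E9, U+4E2D.
const samples = ["a", "\u2019", "\u00E9", "\u4E2D"];

const classResults = universalClasses.map((cls) =>
  samples.every((ch) => matchesWhole(cls, ch))
);

// NBSP, LINE SEPARATOR, PARAGRAPH SEPARATOR: all matched by \s in ECMAScript.
const unicodeSpaces = ["\u00A0", "\u2028", "\u2029"];
const spaceResults = unicodeSpaces.map((ch) => /\s/.test(ch));

console.log(classResults, spaceResults);
```

An engine that does not treat Unicode whitespace as `\s` would report `false` entries in the second array while still passing the first.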
Comment 6 Alexey Proskuryakov 2007-09-22 02:17:25 PDT
This issue is also present in original PCRE 6.1 and 7.4. From comments in code, I'm not sure what the intended behavior for Perl is, but the fact that \S and [\S] work differently surely looks like a bug.
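
The inconsistency between `\S` and `[\S]` can be probed with a small comparison. The sketch below is written in JavaScript rather than against the PCRE API, and the sample characters are an arbitrary assumption:

```javascript
// Non-whitespace sample characters, chosen arbitrarily for illustration.
const nonSpaceSamples = ["x", "\u2019", "\u00FC", "\u0416"];

// For each character, compare the bare escape with the same escape inside a class.
const disagreements = nonSpaceSamples.filter(
  (ch) => /^\S$/.test(ch) !== /^[\S]$/.test(ch)
);

// An empty list means \S and [\S] behave identically for these inputs.
console.log(disagreements.length === 0 ? "consistent" : disagreements);
```

Any character left in `disagreements` is one the engine classifies differently depending on whether the escape appears inside brackets.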
Comment 7 Alexey Proskuryakov 2007-09-22 13:07:17 PDT
Created attachment 16349
a more complete test case

Added a test for a closely related issue from <http://bugs.exim.org/show_bug.cgi?id=580>. That bug was recently fixed, see

svn diff -r218:219 svn://tahini.csx.cam.ac.uk/pcre

I'm going to file the problem with [\S] to PCRE bugzilla soon.
Comment 8 Alexey Proskuryakov 2007-09-22 13:19:12 PDT
(In reply to comment #7)
> svn diff -r218:219 svn://tahini.csx.cam.ac.uk/pcre

I've just found that there's a ViewVC for PCRE: http://vcs.pcre.org/viewvc?view=rev&revision=219

> I'm going to file the problem with [\S] to PCRE bugzilla soon.

Comment 9 Alexey Proskuryakov 2007-09-29 02:41:59 PDT
Created attachment 16446
proposed fix

This is based on an approach suggested by Philip Hazel, and on his fix for \S{2} vs. \S\S bug.

I think this fix is important enough to go to trunk.
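
For reference, the `\S{2}` vs. `\S\S` discrepancy mentioned above can be checked along these lines. The thread does not spell out which inputs triggered it, so the two-character sample below is an assumption, and the check is written in JavaScript rather than against PCRE directly:

```javascript
// Two non-ASCII, non-whitespace characters (arbitrary sample).
const pair = "\u2019\u00E9";

// The quantified form and the written-out form should be equivalent.
const quantified = /^\S{2}$/.test(pair);
const repeated = /^\S\S$/.test(pair);

console.log(quantified === repeated ? "equivalent" : "differ");
```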
Comment 10 Darin Adler 2007-10-01 16:54:02 PDT
Comment on attachment 16446
proposed fix

Comment 11 Alexey Proskuryakov 2007-10-02 21:42:27 PDT
Committed revision 25958 (feature branch).
Comment 12 Alexey Blinov 2007-10-08 00:36:38 PDT
Hi. Is the feature branch available as a nightly build of WebKit (http://nightly.webkit.org/), or do I have to compile it myself?
Comment 13 Mark Rowe (bdash) 2007-10-08 00:39:30 PDT
Nightly builds of the feature branch are available at http://nightly.webkit.org/builds/overview/feature-branch.
Comment 14 Alexey Proskuryakov 2007-11-04 03:20:57 PST
*** Bug 14877 has been marked as a duplicate of this bug. ***