Bug 16696 - JSCRE fails fails to match Acid3 regexp
Summary: JSCRE fails fails to match Acid3 regexp
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Darin Adler
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-01-01 01:46 PST by Eric Seidel (no email)
Modified: 2008-01-02 22:59 PST (History)
4 users (show)

See Also:


Attachments
patch (11.30 KB, patch)
2008-01-02 00:27 PST, Darin Adler
ggaren: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Eric Seidel (no email) 2008-01-01 01:46:04 PST
JSCRE fails fails to match Acid3 regexp

This is the assert we fail:
assert(x == 4, "/(\\3)(\\1)(a)/ failed to match 'cat'");

I'm really not sure why that assert is valid in the first place.

    function () {
      // test 86: Regular Expressions
      var ok = true;
      try {
        eval("/TA[])]/.exec('TA]')");
        // JS regexps aren't like Perl regexps, if their character
        // classes start with a ] that means they're empty. So this
        // is a syntax error; if we get here it's a bug.
        ok = false;
      } catch (e) { }
      assert(ok, "orphaned bracket not considered parse error in regular expression literal");
      try {
        if (eval("/[]/.exec('')"))
          ok = false;
      } catch (e) {
        ok = false;
      }
      assert(ok, "/[]/ either failed to parse or matched something");
      var x = /(\3)(\1)(a)/.exec('cat');
      assert(x == 4, "/(\\3)(\\1)(a)/ failed to match 'cat'");
      assert(x.length == 4, "/(\\3)(\\1)(a)/ failed to return four components");
      assert(x[0] == "a", "/(\\3)(\\1)(a)/ failed to find 'a' in 'cat'");
      assert(x[1] == "", "/(\\3)(\\1)(a)/ failed to find '' in 'cat' as first part");
      assert(x[2] == "", "/(\\3)(\\1)(a)/ failed to find '' in 'cat' as second part");
      assert(x[3] == "a", "/(\\3)(\\1)(a)/ failed to find 'a' in 'cat' as third part");
      return 6;
    },
Comment 1 Darin Adler 2008-01-01 09:49:33 PST
There may be a problem with the test here; it looks to me like that expression requires three "a" characters in a row, so the test string would need to be "caaat".

But also, our regular expression engine has a bug (and IE shares the bug, which is one reason I didn't fix it), where we don't recognize the \3 as a backreference because it's a forward reference. At the time we process it we don't know that there are 3 capturing brackets in the expression, so we treat it as an octal escape. To fix that we'll need to scan the regular expression for the number of brackets first before we do other computations on it. A simpler test case is just:

    /\1(a)/.exec('aa')

Which will return null in Safari, but should return "a,a".
Comment 2 Eric Seidel (no email) 2008-01-01 11:11:37 PST
CCing Ian just to make sure he sees this.
Comment 3 Ian 'Hixie' Hickson 2008-01-01 14:29:46 PST
Forward references to capture brackets that haven't been seen yet, in ES4, are defined to always match the empty string. So this:
   /a(\2)(x)a/
...matches only "axa", not "axxa", and on a positive match always returns ['axa', '', 'x', 'a'].

FWIW, ES4 regexps are not Perl-compatible in this and many other edge cases.
Comment 4 Darin Adler 2008-01-01 14:43:55 PST
(In reply to comment #3)
> Forward references to capture brackets that haven't been seen yet, in ES4, are defined to always match the empty string. So this:
>    /a(\2)(x)a/
> ...matches only "axa", not "axxa", and on a positive match always returns
> ['axa', '', 'x', 'a'].

OK, I see.

I believe I made a fix to our code to handle that correctly a while back.

Then I guess it's just the issue that our regular expression code doesn't recognize the forward reference as a capture at all, because of the quirk of treating them as octal escapes if they are higher than the number of capturing brackets. And the fact that the number of capturing brackets is based on the point in the parsing that we're at. I think a fix requires another pass over the regular expression to count the capturing brackets.
Comment 5 Eric Seidel (no email) 2008-01-01 15:07:14 PST
The specific supporting text according to Hixie:

http://bclary.com/2004/11/07/#a-15.10.2.9
Step 8. sub-steps 2-3
Comment 6 Ian 'Hixie' Hickson 2008-01-01 15:16:27 PST
Er, I meant ES3.

DecimalEscape in regexps (\0 to \9) are only ever backreferences in ES3. However, you still to raise a SyntaxError exception if the digit exceeds the number of captures in the regexp. But I suppose you don't need a second pass for that.
Comment 7 Darin Adler 2008-01-01 23:25:20 PST
(In reply to comment #6)
> DecimalEscape in regexps (\0 to \9) are only ever backreferences in ES3.

That's true. The ES3 B.1.2 appendix only lists octal escape sequences as an extension for string literals, not regular expressions.

But both Gecko and IE support octal escape sequences as well as backreferences, and I believe they use the number of capturing parentheses in the expression as a cut-off.

For example, try this expression:

    /\1/.exec(String.fromCharCode(1)).toString().charCodeAt(0)

In Gecko this returns the value "1".

> However, you still to raise a SyntaxError exception if the digit exceeds the
> number of captures in the regexp. But I suppose you don't need a second pass
> for that.

I suspect the octal escape sequences are used in real-life web pages. So I do think I need an extra pass over the regular expression.
Comment 8 Darin Adler 2008-01-01 23:28:20 PST
Maciej, Geoff, this is a flaw in ECMAScript that we should get fixed for version 4. It doesn't specify the octal escape sequences for regular expressions, only string literals.
Comment 9 Darin Adler 2008-01-01 23:51:33 PST
(In reply to comment #0)
>       assert(x == 4, "/(\\3)(\\1)(a)/ failed to match 'cat'");

I don't understand how this could be right. The value of x is an array, not the number 4.

>       assert(x.length == 4, "/(\\3)(\\1)(a)/ failed to return four components");
>       assert(x[0] == "a", "/(\\3)(\\1)(a)/ failed to find 'a' in 'cat'");
>       assert(x[1] == "", "/(\\3)(\\1)(a)/ failed to find '' in 'cat' as first part");
>       assert(x[2] == "", "/(\\3)(\\1)(a)/ failed to find '' in 'cat' as second part");
>       assert(x[3] == "a", "/(\\3)(\\1)(a)/ failed to find 'a' in 'cat' as third part");

We seem to pass all of these. I suspect this is just a mistake in the test.
Comment 10 Darin Adler 2008-01-02 00:10:45 PST
We were passing all of this in Safari 3, but now it is broken. Patch on its way.
Comment 11 Darin Adler 2008-01-02 00:27:47 PST
Created attachment 18235 [details]
patch
Comment 12 Geoffrey Garen 2008-01-02 15:28:38 PST
Comment on attachment 18235 [details]
patch

r=me

I wish the PCRE tests in fast/regexp were somehow separate from the rest.
Comment 13 Darin Adler 2008-01-02 22:59:04 PST
Committed revision 29110.