Bug 24166 - Regex: capture groups don't match on text longer than 49991 chars
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: 528+ (Nightly build)
Hardware: PC Windows XP
Importance: P2 Normal
Assignee: Nobody
Reported: 2009-02-25 11:47 PST by Nico Kaiser
Modified: 2010-07-28 01:03 PDT
Description Nico Kaiser 2009-02-25 11:47:29 PST
If a regex uses capture groups, it won't match if any of the resulting captures is longer than 49991 characters.

If you try this regex:

/start(.)*end/i

on this text

startXXXXXXend

then it will match until you put more than 49991 "X" between "start" and "end".

You can easily test it on http://regexpal.com/
It works in Firefox but not in Safari or Chrome.

This bug only affects regexes that contain groups. If you change the regex to:

/start.*end/i

then it will match.
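
For anyone without a test page at hand, the report above can be reproduced in a few lines of script. This is only a sketch: the helper name `makeInput` and the chosen count of 49992 are assumptions for illustration. On an engine without the bug, both patterns should match.

```javascript
// Build "start" + n times "X" + "end". Array.join is used so the snippet
// also runs on older engines that lack String.prototype.repeat.
function makeInput(n) {
  return "start" + new Array(n + 1).join("X") + "end";
}

// One more "X" than the limit given in the report (assumed boundary).
var input = makeInput(49992);

// Pattern with a capture group: reported NOT to match in affected WebKit builds.
var withGroup = /start(.)*end/i.test(input);

// Same pattern without the group: reported to match everywhere.
var withoutGroup = /start.*end/i.test(input);

console.log("with capture group:   ", withGroup);
console.log("without capture group:", withoutGroup);
```

If the first line prints `false` while the second prints `true`, the engine shows the behaviour described here.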
Comment 1 Alexey Proskuryakov 2009-02-26 10:48:36 PST
Duplicate of bug 18327?
Comment 2 Darin Adler 2009-02-26 11:12:20 PST
The title of this bug is misleading. There's no group longer than 49991 characters here. The group is one character long.
Comment 3 Nico Kaiser 2009-02-26 11:58:34 PST
(In reply to comment #2)
> The title of this bug is misleading. There's no group longer than 49991
> characters here. The group is one character long.
> 

I changed the title. Hope it's more precise now.
Comment 4 Darin Adler 2009-02-26 13:13:41 PST
The result of the capture group will only be the last character matched. The group is only going to capture a single character.
Comment 5 Nico Kaiser 2009-02-26 13:59:52 PST
(In reply to comment #4)
> The result of the capture group will only be the last character matched. The
> group is only going to capture a single character.
> 

OK. Bad example. The point is that you'll get NO MATCH at all using WebKit if the text the group should match is longer than 49991 chars.

Change the regex to:
/start(.*)end/i

Now you should get the whole XXXX as the result of the group. But as with the example above, it won't match with WebKit...
Comment 6 Gavin Barraclough 2009-02-26 23:55:55 PST
As a quick fix you may want to try rewriting (.)* as (?:.*(.))?

I think there is a good chance that this will be faster on most regex engines, and also this may well work on shipping Safari.

(I'm afraid I don't have a 49991 character input string lying around to check).

G.
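
To make the suggested rewrite concrete, here is a small sketch that compares group 1 of both forms on a few short inputs. The sample strings are made up for illustration, and the check says nothing about speed or about behaviour on very long inputs in any particular engine.

```javascript
// Original form and the suggested rewrite from this comment.
var original = /start(.)*end/i;
var rewritten = /start(?:.*(.))?end/i;

// Hypothetical sample inputs, chosen only to exercise a few cases.
var samples = ["startXXXXXXend", "startabcend", "startend", "no match here"];

var results = samples.map(function (text) {
  var a = original.exec(text);
  var b = rewritten.exec(text);
  return {
    text: text,
    // Group 1 of each pattern; null when the pattern does not match at all,
    // undefined when it matches but the group did not participate.
    original: a ? a[1] : null,
    rewritten: b ? b[1] : null
  };
});

results.forEach(function (r) {
  console.log(r.text, "->", r.original, "/", r.rewritten,
              r.original === r.rewritten ? "(same)" : "(DIFFERENT)");
});
```

In both forms group 1 ends up holding only the last character before "end", which is why the rewrite can stand in for `(.)*` here.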
Comment 7 Nico Kaiser 2009-02-27 02:31:43 PST
I'm getting confused...

It seems that not only the length of the input string matters, but also how you write the group condition...

I provided an example page:
http://mad-d-sign.de/webkit/regex.html

You will see that Test1 and Test2 have slightly different group conditions. While Test1 returns only the last "X" in the first (and only) group, Test2 returns all the "X" characters in the first group.

Both regexes work on the 49997-character input string.

Test3 and Test4 again use the two different regex conditions, but on a 49998-character input string.

Test3 does NOT match while Test4 does.

In non-WebKit browsers all 4 tests match.

I really hope it's now clear what I wanted to show.
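
The four tests can be approximated with the sketch below. The exact patterns and inputs on the example page are not quoted in this report, so everything here is an assumption: the two patterns are taken from the earlier comments, the input is assumed to be "start", a run of "X", then "end", and the stated lengths are assumed to count the whole input string. The helper `makeInputOfLength` is hypothetical.

```javascript
// Build an input of exactly `total` characters: "start" + X... + "end".
function makeInputOfLength(total) {
  var fill = total - "start".length - "end".length;
  return "start" + new Array(fill + 1).join("X") + "end";
}

// Assumed stand-ins for the two group conditions on the test page.
var patterns = {
  lastChar: /start(.)*end/i, // group keeps only the last "X"
  wholeRun: /start(.*)end/i  // group keeps the whole run of "X"
};

var lengths = [49997, 49998];
var report = [];

lengths.forEach(function (len) {
  var input = makeInputOfLength(len);
  Object.keys(patterns).forEach(function (name) {
    var m = patterns[name].exec(input);
    report.push({
      length: len,
      pattern: name,
      matched: m !== null,
      // Length of group 1, or 0 when there is no match or no capture.
      captured: m && m[1] !== undefined ? m[1].length : 0
    });
  });
});

report.forEach(function (row) {
  console.log(row.length, row.pattern, "matched:", row.matched, "captured chars:", row.captured);
});
```

On an engine without the bug, all four rows should report a match.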
Comment 8 Nico Kaiser 2009-02-28 09:16:15 PST
Here's another test case that is a bit more realistic and very close to the situation I was investigating when I found this issue.

http://www.mad-d-sign.de/webkit/regex2.html

It's interesting. The character limit seems to be different again here. We have some text enclosed by <object> tags. Within it are two groups we want to match.

Test 1 will match and find the two groups. Group 1 includes all the "X" characters and group 2 just the word "sometext".

In Test 2 there is just one more "X" included. Test 2 will not match.

The text in Test 3 is exactly the same as in Test 2, but the regex searches for only one group: the one with the "X"s. Test 3 will match!

So the additional group and the long text break Test 2.

Please note: Test 2 will also match if I remove the question mark to do a greedy search. But in my case there could be more object tags in my text, like this:

<object>XXXsometext</object><object>XXXsometext</object><object>XXXsometext</object>

That's why I need the non-greedy switch. Otherwise I would only get one match including the whole text.
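
The difference between the greedy and the non-greedy form on such a text can be sketched as follows. The actual regex from the test page is not quoted in this report, so the two patterns below are assumed stand-ins with two groups each, and `collect` is a hypothetical helper.

```javascript
// The sample text from this comment: three <object> blocks in a row.
var text = "<object>XXXsometext</object>" +
           "<object>XXXsometext</object>" +
           "<object>XXXsometext</object>";

// Run a global, case-insensitive search and collect every match with its groups.
function collect(pattern, input) {
  var re = new RegExp(pattern, "gi");
  var out = [];
  var m;
  while ((m = re.exec(input)) !== null) {
    out.push({ match: m[0], group1: m[1], group2: m[2] });
  }
  return out;
}

// Non-greedy: group 1 stops at the first "sometext", giving one match per block.
var lazy = collect("<object>(.*?)(sometext)</object>", text);

// Greedy: group 1 runs up to the last "sometext", giving one match over everything.
var greedy = collect("<object>(.*)(sometext)</object>", text);

console.log("non-greedy matches:", lazy.length);
console.log("greedy matches:    ", greedy.length);
```

With the non-greedy quantifier each `<object>` block is found separately. With the greedy one the single match spans the whole text, which is the outcome described above.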
Comment 9 Nico Kaiser 2010-07-27 14:19:00 PDT
Checked again after a long time. Seems to be fixed in Chrome 5.0.375.125! All tests pass now.

Nice!
Comment 10 David Kilzer (:ddkilzer) 2010-07-27 14:56:04 PDT
This now works on Safari 5 on Mac OS X 10.6.4 as well.  Marking as RESOLVED/FIXED.
Comment 11 Darin Adler 2010-07-27 15:16:53 PDT
Testing with Chrome makes no sense. This is a bug in code that Chrome does not use.
Comment 12 Nico Kaiser 2010-07-28 01:03:22 PDT
As mentioned in the first post, this issue existed in Chrome too when this bug was filed. Since I did not watch this topic for quite a long time, I don't know since when it has been gone in Chrome or Safari. I'm just glad it's fixed.