<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://bugs.webkit.org/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4.1"
          urlbase="https://bugs.webkit.org/"
          
          maintainer="admin@webkit.org"
>

    <bug>
          <bug_id>24166</bug_id>
          
          <creation_ts>2009-02-25 11:47:29 -0800</creation_ts>
          <short_desc>Regex: capture groups don&apos;t match on text longer than 49991 chars</short_desc>
          <delta_ts>2010-07-28 01:03:22 -0700</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WebKit</product>
          <component>JavaScriptCore</component>
          <version>528+ (Nightly build)</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows XP</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>Normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>0</everconfirmed>
          <reporter name="Nico Kaiser">nk111</reporter>
          <assigned_to name="Nobody">webkit-unassigned</assigned_to>
          <cc>barraclough</cc>
          <cc>darin</cc>
          <cc>ddkilzer</cc>
          <cc>ggaren</cc>
          <cc>nk111</cc>
          

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>110995</commentid>
    <comment_count>0</comment_count>
    <who name="Nico Kaiser">nk111</who>
    <bug_when>2009-02-25 11:47:29 -0800</bug_when>
    <thetext>If a regex uses capture groups, it won&apos;t match if any of the resulting captures is longer than 49991 characters.

If you try this regex:

/start(.)*end/i

on this text

startXXXXXXend

then it will match until you put more than 49991 &quot;X&quot; between &quot;start&quot; and &quot;end&quot;.

You can easily test it on http://regexpal.com/
It will work with Firefox but not with Safari or Chrome.
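
A minimal sketch of the same test as a script, assuming the input is built as "start", then N copies of "X", then "end". The helper name buildInput and the use of String.prototype.repeat are illustrative assumptions, not what regexpal does:

```javascript
// Hypothetical helper: builds "start", then `count` copies of "X", then "end".
function buildInput(count) {
  return "start" + "X".repeat(count) + "end";
}

// The regex from this report, with a capture group inside the repetition.
const grouped = /start(.)*end/i;

// 49991 is the largest count reported to match in the affected builds;
// 49992 is one character past it.
const resultAtLimit = grouped.test(buildInput(49991));
const resultPastLimit = grouped.test(buildInput(49992));

console.log("49991 X:", resultAtLimit);
console.log("49992 X:", resultPastLimit);
```

An engine without the limit should report a match for both inputs; in the affected WebKit builds the second input reports no match.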

This bug only affects regexes with groups in them. If you change the regex to:

/start.*end/i

then it will match.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>111225</commentid>
    <comment_count>1</comment_count>
    <who name="Alexey Proskuryakov">ap</who>
    <bug_when>2009-02-26 10:48:36 -0800</bug_when>
    <thetext>Duplicate of bug 18327?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>111237</commentid>
    <comment_count>2</comment_count>
    <who name="Darin Adler">darin</who>
    <bug_when>2009-02-26 11:12:20 -0800</bug_when>
    <thetext>The title of this bug is misleading. There&apos;s no group longer than 49991 characters here. The group is one character long.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>111262</commentid>
    <comment_count>3</comment_count>
    <who name="Nico Kaiser">nk111</who>
    <bug_when>2009-02-26 11:58:34 -0800</bug_when>
    <thetext>(In reply to comment #2)
&gt; The title of this bug is misleading. There&apos;s no group longer than 49991
&gt; characters here. The group is one character long.
&gt; 

I changed the title. Hope it&apos;s more precise now.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>111291</commentid>
    <comment_count>4</comment_count>
    <who name="Darin Adler">darin</who>
    <bug_when>2009-02-26 13:13:41 -0800</bug_when>
    <thetext>The result of the capture group will only be the last character matched. The group is only going to capture a single character.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>111295</commentid>
    <comment_count>5</comment_count>
    <who name="Nico Kaiser">nk111</who>
    <bug_when>2009-02-26 13:59:52 -0800</bug_when>
    <thetext>(In reply to comment #4)
&gt; The result of the capture group will only be the last character matched. The
&gt; group is only going to capture a single character.
&gt; 

OK, bad example. The point is that you&apos;ll get NO MATCH at all in WebKit if the text the group should match is longer than 49991 chars.

Change the regex to:
/start(.*)end/i

Now you should get the whole XXXX as the result of the group. But as with the example above, it won&apos;t match in WebKit...
</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>111399</commentid>
    <comment_count>6</comment_count>
    <who name="Gavin Barraclough">barraclough</who>
    <bug_when>2009-02-26 23:55:55 -0800</bug_when>
    <thetext>As a quick fix you may want to try rewriting (.)* as (?:.*(.))?

I think there is a good chance that this will be faster on most regex engines, and also this may well work on shipping Safari.

(I&apos;m afraid I don&apos;t have a 49991 character input string lying around to check).
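
A small sketch comparing the two forms on a short input. The sample string is made up for illustration, and this says nothing about whether the rewrite avoids the length limit:

```javascript
// Original form: the group is repeated, so it ends up holding the last character.
const original = /start(.)*end/i;
// Suggested rewrite: one greedy run, then a single capture of the final character.
const rewritten = /start(?:.*(.))?end/i;

const sample = "startABCend";
const originalMatch = original.exec(sample);
const rewrittenMatch = rewritten.exec(sample);

// Both forms match the whole string and capture "C" in group 1.
console.log(originalMatch[0], originalMatch[1]);
console.log(rewrittenMatch[0], rewrittenMatch[1]);

// With nothing between "start" and "end", group 1 is undefined in both forms.
const emptyOriginal = original.exec("startend");
const emptyRewritten = rewritten.exec("startend");
console.log(emptyOriginal[1], emptyRewritten[1]);
```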

G.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>111422</commentid>
    <comment_count>7</comment_count>
    <who name="Nico Kaiser">nk111</who>
    <bug_when>2009-02-27 02:31:43 -0800</bug_when>
    <thetext>I&apos;m getting confused...

It seems that not only the length of the input string matters but also how you write the group condition...

I provided an example page:
http://mad-d-sign.de/webkit/regex.html

You will see that Test1 and Test2 have slightly different group conditions. While Test1 will return only the last &quot;X&quot; in the first (and only) group, Test2 will return all the &quot;X&quot;s in the first group.

Both regexes work on the 49997-character input string.

Test3 and Test4 again use the two different regex conditions but on a 49998-character input string.

Test3 does NOT match while Test4 does.

In non-WebKit browsers all 4 tests match.
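
A rough sketch of the four tests as a script. The two patterns are assumed to be the ones from the earlier comments, and the stated lengths are assumed to count the whole string including "start" and "end"; the example page itself may differ on both points. The names makeInput and runTest are hypothetical:

```javascript
// Assumed patterns, taken from the earlier comments; the page itself may differ.
const lastCharGroup = /start(.)*end/i; // Test1 and Test3
const wholeRunGroup = /start(.*)end/i; // Test2 and Test4

// Assumption: the stated length counts the whole string, including "start" and "end".
function makeInput(totalLength) {
  const fillerLength = totalLength - "start".length - "end".length;
  return "start" + "X".repeat(fillerLength) + "end";
}

function runTest(name, regex, totalLength) {
  const match = regex.exec(makeInput(totalLength));
  // Report the length of group 1 rather than dumping about 50000 characters.
  const summary = match === null ? "no match" : "group 1 length " + match[1].length;
  console.log(name + ": " + summary);
  return match;
}

const test1 = runTest("Test1", lastCharGroup, 49997);
const test2 = runTest("Test2", wholeRunGroup, 49997);
const test3 = runTest("Test3", lastCharGroup, 49998);
const test4 = runTest("Test4", wholeRunGroup, 49998);
```

In an engine without the limit all four calls return a match; in the affected builds Test3 is the one that comes back as no match.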

I really hope now it&apos;s clear what I wanted to show.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>111625</commentid>
    <comment_count>8</comment_count>
    <who name="Nico Kaiser">nk111</who>
    <bug_when>2009-02-28 09:16:15 -0800</bug_when>
    <thetext>Here&apos;s another testcase which is a bit more realistic and very close to the situation which I was investigating when I found this issue.

http://www.mad-d-sign.de/webkit/regex2.html

It&apos;s interesting. The character limit seems to be different again here. We have some text enclosed by &lt;object&gt; tags. Within it are two groups we want to match.

Test 1 will match and find the two groups. Group 1 includes all the &quot;X&quot; and group 2 just the word &quot;sometext&quot;.

In Test2 there is just one more &quot;X&quot; included. Test 2 will not match.

The text in test 3 is exactly the same as in test 2. But the regex searches for only one group, the one with the &quot;X&quot;s. Test 3 will match!

So the additional group and the long text break test 2.

Please note: Test 2 will also match if I remove the question mark to do a greedy search. But in my case there could be more object tags in my text, like this:

&lt;object&gt;XXXsometext&lt;/object&gt;&lt;object&gt;XXXsometext&lt;/object&gt;&lt;object&gt;XXXsometext&lt;/object&gt;

That&apos;s why I need the non-greedy quantifier. Otherwise I would only get one match including the whole text.
</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>256832</commentid>
    <comment_count>9</comment_count>
    <who name="Nico Kaiser">nk111</who>
    <bug_when>2010-07-27 14:19:00 -0700</bug_when>
    <thetext>Checked again after a long time. Seems to be fixed in Chrome 5.0.375.125! All tests pass now.

Nice!</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>256862</commentid>
    <comment_count>10</comment_count>
    <who name="David Kilzer (:ddkilzer)">ddkilzer</who>
    <bug_when>2010-07-27 14:56:04 -0700</bug_when>
    <thetext>This now works on Safari 5 on Mac OS X 10.6.4 as well.  Marking as RESOLVED/FIXED.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>256882</commentid>
    <comment_count>11</comment_count>
    <who name="Darin Adler">darin</who>
    <bug_when>2010-07-27 15:16:53 -0700</bug_when>
    <thetext>Testing with Chrome makes no sense. This is a bug in code that Chrome does not use.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>257060</commentid>
    <comment_count>12</comment_count>
    <who name="Nico Kaiser">nk111</who>
    <bug_when>2010-07-28 01:03:22 -0700</bug_when>
    <thetext>As mentioned in the first post, this issue existed in Chrome too when this bug was filed. Since I did not watch this topic for quite a long time, I don&apos;t know when it went away in Chrome or Safari. I&apos;m just glad it&apos;s fixed.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>