<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://bugs.webkit.org/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4.1"
          urlbase="https://bugs.webkit.org/"
          
          maintainer="admin@webkit.org"
>

    <bug>
          <bug_id>40347</bug_id>
          
          <creation_ts>2010-06-08 19:51:44 -0700</creation_ts>
          <short_desc>Fix fast/parser/entity-surrogate-pairs.html in HTML5 parser</short_desc>
          <delta_ts>2012-11-27 12:52:53 -0800</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WebKit</product>
          <component>DOM</component>
          <version>528+ (Nightly build)</version>
          <rep_platform>Other</rep_platform>
          <op_sys>OS X 10.5</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>WONTFIX</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>Normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          <blocked>39259</blocked>
          <everconfirmed>1</everconfirmed>
          <reporter name="Tony Gentilcore">tonyg</reporter>
          <assigned_to name="Tony Gentilcore">tonyg</assigned_to>
          <cc>abarth</cc>
    
    <cc>ap</cc>
    
    <cc>eric</cc>
    
    <cc>ian</cc>
    
    <cc>mitz</cc>
    
    <cc>webkit.review.bot</cc>
          

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>235622</commentid>
    <comment_count>0</comment_count>
    <who name="Tony Gentilcore">tonyg</who>
    <bug_when>2010-06-08 19:51:44 -0700</bug_when>
    <thetext>Fix fast/parser/entity-surrogate-pairs.html in HTML5 parser</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235623</commentid>
    <comment_count>1</comment_count>
      <attachid>58206</attachid>
    <who name="Tony Gentilcore">tonyg</who>
    <bug_when>2010-06-08 19:55:02 -0700</bug_when>
    <thetext>Created attachment 58206
Patch</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235624</commentid>
    <comment_count>2</comment_count>
      <attachid>58206</attachid>
    <who name="Adam Barth">abarth</who>
    <bug_when>2010-06-08 19:57:55 -0700</bug_when>
    <thetext>Comment on attachment 58206
Patch

WebCore/html/HTML5Lexer.cpp:101
 +      if (value == 0 || value &gt; 0x10FFFF)
Did I just dream up the condition on the right?  I thought I got it out of the spec....</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235625</commentid>
    <comment_count>3</comment_count>
    <who name="WebKit Review Bot">webkit.review.bot</who>
    <bug_when>2010-06-08 20:00:22 -0700</bug_when>
    <thetext>Attachment 58206 did not pass style-queue:

Failed to run &quot;[&apos;WebKitTools/Scripts/check-webkit-style&apos;, &apos;--no-squash&apos;]&quot; exit_code: 1
WebCore/html/HTML5Lexer.cpp:101:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Total errors found: 1 in 3 files


If any of these errors are false positives, please file a bug against check-webkit-style.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235626</commentid>
    <comment_count>4</comment_count>
    <who name="Tony Gentilcore">tonyg</who>
    <bug_when>2010-06-08 20:29:23 -0700</bug_when>
    <thetext>(In reply to comment #2)
&gt; (From update of attachment 58206 [details])
&gt; WebCore/html/HTML5Lexer.cpp:101
&gt;  +      if (value == 0 || value &gt; 0x10FFFF)
&gt; Did I just dream up the condition on the right?  I thought I got it out of the spec....

That condition is here for sure:
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream

And it is enforced in InputStreamPreprocessor::peek(). However, I can&apos;t find the part of the spec that talks about entity processing and enforcing it here certainly prevents surrogate pair entities from working.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235691</commentid>
    <comment_count>5</comment_count>
    <who name="Eric Seidel (no email)">eric</who>
    <bug_when>2010-06-09 00:45:14 -0700</bug_when>
    <thetext>Fixing the test is good.  I think our suragate pair handling is still confused.

The spec:
All U+0000 NULL characters and code points in the range U+D800 to U+DFFF in the input must be replaced by U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters and code points are parse errors.

That seems wrong.  I guess I need to read the HTML5 spec more closely.  U+D800–U+DBFF  is the range used by the leading surrogate.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235693</commentid>
    <comment_count>6</comment_count>
    <who name="Eric Seidel (no email)">eric</who>
    <bug_when>2010-06-09 00:49:07 -0700</bug_when>
    <thetext>I think the spec is written in terms of characters, so I&apos;m confused by the use of &quot;code points&quot; in &quot;All U+0000 NULL characters and code points in the range U+D800 to U+DFFF&quot;.  But I think the &quot;code points&quot; bit is just a mistake.

I think that this pre-processing should already have been done at a layer below the InputStreamPreprocessor, so we can just hack out that code.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235836</commentid>
    <comment_count>7</comment_count>
    <who name="Tony Gentilcore">tonyg</who>
    <bug_when>2010-06-09 09:55:11 -0700</bug_when>
    <thetext>(In reply to comment #6)
&gt; I think the spec is written in terms of characters, so I&apos;m confused by the use of &quot;code points&quot; in &quot;All U+0000 NULL characters and code points in the range U+D800 to U+DFFF&quot;.  But I think the &quot;code points&quot; bit is just a mistake.
&gt; 
&gt; I think that this pre-processing should already have been done at a layer below the InputStreamPreprocessor, so we can just hack out that code.

Yes. Code below InputStreamPreprocessor is handling surrogate pairs. That&apos;s why they work when the filter in legalEntityFor() is removed.

Adam, let me know if this is off, but my assessment was that your checks in legalEntityFor were based on the spec in #preprocessing-the-input-stream. The bug was that those checks should only be enforced in peek() rather than prior to adding the entities to the stream.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235837</commentid>
    <comment_count>8</comment_count>
      <attachid>58206</attachid>
    <who name="Adam Barth">abarth</who>
    <bug_when>2010-06-09 09:59:05 -0700</bug_when>
    <thetext>Comment on attachment 58206
Patch

Ok.  I&apos;m still not sure I understand the issue 100%, but I&apos;m starting to be convinced.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235842</commentid>
    <comment_count>9</comment_count>
    <who name="Tony Gentilcore">tonyg</who>
    <bug_when>2010-06-09 10:02:44 -0700</bug_when>
    <thetext>(In reply to comment #8)
&gt; (From update of attachment 58206 [details])
&gt; Ok.  I&apos;m still not sure I understand the issue 100%, but I&apos;m starting to be convinced.

I&apos;ve flipped the cq? flag. 

I&apos;m assuming:
1. This is an appropriate style violation.
2. There isn&apos;t another part of the spec that I&apos;m missing which legalEntityFor() was based upon (ie. it was based on #preprocessing-the-input-stream).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235850</commentid>
    <comment_count>10</comment_count>
    <who name="Adam Barth">abarth</who>
    <bug_when>2010-06-09 10:11:20 -0700</bug_when>
    <thetext>http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references

[[
Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER.
]]</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235854</commentid>
    <comment_count>11</comment_count>
    <who name="Tony Gentilcore">tonyg</who>
    <bug_when>2010-06-09 10:16:08 -0700</bug_when>
    <thetext>(In reply to comment #10)
&gt; http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references
&gt; 
&gt; [[
&gt; Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER.
&gt; ]]

OK. So that means surrogate pair entities are expressly forbidden in HTML5. Since they were supported before, that leaves a compatibility question.

How do we handle these LayoutTests which test for legacy behavior that is expressly forbidden in HTML5?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235857</commentid>
    <comment_count>12</comment_count>
    <who name="Adam Barth">abarth</who>
    <bug_when>2010-06-09 10:20:39 -0700</bug_when>
    <thetext>We note them in the spreadsheet.  Can you test for how other browser (especially IE, Firefox, and Firefox nightly) behave?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235864</commentid>
    <comment_count>13</comment_count>
    <who name="Tony Gentilcore">tonyg</who>
    <bug_when>2010-06-09 10:42:54 -0700</bug_when>
    <thetext>(In reply to comment #12)
&gt; We note them in the spreadsheet.  Can you test for how other browser (especially IE, Firefox, and Firefox nightly) behave?

We are following the spec, so I&apos;m marking this issue as WONTFIX, and I&apos;ve noted it in the spreadsheet.

In answer to your question, here&apos;s how various browsers perform on entity-surrogate-pairs.html:
IE8: PASS
Firefox 3.6: FAIL
Minefield: FAIL
WK legacy: PASS
WK HTML5: FAIL

It is interesting that comment in the test says &quot;Firefox allows these&quot;. I suppose that means FF did at some point but that has now been removed. It is a better scary for compat that IE8 allows surrogate pair entities.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235873</commentid>
    <comment_count>14</comment_count>
    <who name="Adam Barth">abarth</who>
    <bug_when>2010-06-09 10:52:39 -0700</bug_when>
    <thetext>We should record all this information in a doc and send it to the HTML5 WG once we&apos;ve got everything sorted out.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235877</commentid>
    <comment_count>15</comment_count>
    <who name="Tony Gentilcore">tonyg</who>
    <bug_when>2010-06-09 10:59:20 -0700</bug_when>
    <thetext>(In reply to comment #14)
&gt; We should record all this information in a doc and send it to the HTML5 WG once we&apos;ve got everything sorted out.

Presumably, we are going to end up with a spreadsheet of only lines that have an &quot;N&quot; in needs code change. Those will be the list of issues we&apos;ll have to resolve with the spec. I&apos;ve added this bug number there so we can reference it when we prepare the list of issues.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235966</commentid>
    <comment_count>16</comment_count>
    <who name="Alexey Proskuryakov">ap</who>
    <bug_when>2010-06-09 13:25:29 -0700</bug_when>
    <thetext>See bug 6446 for why this was done (no known Web compatibility issues, just compatibility with other browsers).

But thinking about this now, I like our forgiving behavior. There are processes on the Web that blindly encode all non-ASCII &quot;characters&quot; in input stream as numeric entities. I don&apos;t have a specific example. but depending on their definition of &quot;character&quot;, they can end up with exactly this. And I&apos;m not aware of any downside in supporting such split entities.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>235992</commentid>
    <comment_count>17</comment_count>
    <who name="Adam Barth">abarth</who>
    <bug_when>2010-06-09 13:57:05 -0700</bug_when>
    <thetext>I&apos;ve added your comment to our new document that&apos;s tracking these issues:

https://docs.google.com/document/edit?id=1as5xYjyMSCph4960iz0-Kb7hZKf_L6f2vts57NMcVBI&amp;hl=en

Our current plan is to have the new parser 100% match the spec.  If we want to do something differently, we probably want to change the spec and then move the implementation to match the updated spec.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>777140</commentid>
    <comment_count>18</comment_count>
    <who name="Alexey Proskuryakov">ap</who>
    <bug_when>2012-11-27 12:52:53 -0800</bug_when>
    <thetext>For future reference: &lt;rdar://problem/12747226&gt;and &lt;rdar://problem/10447418&gt; track two regressions from switching to the new behavior.</thetext>
  </long_desc>
      
          <attachment
              isobsolete="0"
              ispatch="1"
              isprivate="0"
          >
            <attachid>58206</attachid>
            <date>2010-06-08 19:55:02 -0700</date>
            <delta_ts>2010-06-09 10:51:33 -0700</delta_ts>
            <desc>Patch</desc>
            <filename>bug-40347-20100608195500.patch</filename>
            <type>text/plain</type>
            <size>2056</size>
            <attacher name="Tony Gentilcore">tonyg</attacher>
            
              <data encoding="base64">ZGlmZiAtLWdpdCBhL1dlYkNvcmUvQ2hhbmdlTG9nIGIvV2ViQ29yZS9DaGFuZ2VMb2cKaW5kZXgg
ZGJhNDJkYWM4NjhmMDYxNjkwZGRlZWJkZmE3ZDc4Y2RiMThkNjFlNS4uNjJjYzRkNGVkNTkxOTc2
ZjM2ODQ0N2M5N2RjNGU5YzA5MWFmMmJiNCAxMDA2NDQKLS0tIGEvV2ViQ29yZS9DaGFuZ2VMb2cK
KysrIGIvV2ViQ29yZS9DaGFuZ2VMb2cKQEAgLTEsMyArMSwxNyBAQAorMjAxMC0wNi0wOCAgVG9u
eSBHZW50aWxjb3JlICA8dG9ueWdAY2hyb21pdW0ub3JnPgorCisgICAgICAgIFJldmlld2VkIGJ5
IE5PQk9EWSAoT09QUyEpLgorCisgICAgICAgIEZpeCBmYXN0L3BhcnNlci9lbnRpdHktc3Vycm9n
YXRlLXBhaXJzLmh0bWwgaW4gSFRNTDUgcGFyc2VyCisgICAgICAgIGh0dHBzOi8vYnVncy53ZWJr
aXQub3JnL3Nob3dfYnVnLmNnaT9pZD00MDM0NworCisgICAgICAgIE5vIG5ldyB0ZXN0cyBiZWNh
dXNlIGNvdmVyZWQgYnkgZmFzdC9wYXJzZXIvZW50aXR5LXN1cnJvZ2F0ZS1wYWlycy5odG1sCisK
KyAgICAgICAgKiBodG1sL0hUTUw1TGV4ZXIuY3BwOgorICAgICAgICAoV2ViQ29yZTo6SFRNTE5h
bWVzOjpsZWdhbEVudGl0eUZvcik6CisgICAgICAgICogaHRtbC9IVE1MNUxleGVyLmg6CisgICAg
ICAgIChXZWJDb3JlOjpIVE1MNUxleGVyOjpJbnB1dFN0cmVhbVByZXByb2Nlc3Nvcjo6cGVlayk6
CisKIDIwMTAtMDYtMDggIEtlbm5ldGggUnVzc2VsbCAgPGtickBnb29nbGUuY29tPgogCiAgICAg
ICAgIFVucmV2aWV3ZWQsIGJ1aWxkIGZpeC4KZGlmZiAtLWdpdCBhL1dlYkNvcmUvaHRtbC9IVE1M
NUxleGVyLmNwcCBiL1dlYkNvcmUvaHRtbC9IVE1MNUxleGVyLmNwcAppbmRleCBmYWJlYWU0OWRh
YjEzMDhiMGQyMzZiM2VkM2NjYzc2NTE2ZTg3NzUyLi4xNmFjMzAzMTU4MDE1ODg5NjlkMzdjMmNh
ZWVlZDBhMTYwMjk2NTgyIDEwMDY0NAotLS0gYS9XZWJDb3JlL2h0bWwvSFRNTDVMZXhlci5jcHAK
KysrIGIvV2ViQ29yZS9odG1sL0hUTUw1TGV4ZXIuY3BwCkBAIC05OCw4ICs5OCw3IEBAIGlubGlu
ZSBVQ2hhciBhZGp1c3RFbnRpdHkodW5zaWduZWQgdmFsdWUpCiAKIGlubGluZSB1bnNpZ25lZCBs
ZWdhbEVudGl0eUZvcih1bnNpZ25lZCB2YWx1ZSkKIHsKLSAgICAvLyBGSVhNRTogQSBudW1iZXIg
b2Ygc3BlY2lmaWMgZW50aXR5IHZhbHVlcyBnZW5lcmF0ZSBwYXJzZSBlcnJvcnMuCi0gICAgaWYg
KHZhbHVlID09IDAgfHwgdmFsdWUgPiAweDEwRkZGRiB8fCAodmFsdWUgPj0gMHhEODAwICYmIHZh
bHVlIDw9IDB4REZGRikpCisgICAgaWYgKHZhbHVlID09IDAgfHwgdmFsdWUgPiAweDEwRkZGRikK
ICAgICAgICAgcmV0dXJuIDB4RkZGRDsKICAgICBpZiAodmFsdWUgPCAweEZGRkYpCiAgICAgICAg
IHJldHVybiBhZGp1c3RFbnRpdHkodmFsdWUpOwpkaWZmIC0tZ2l0IGEvV2ViQ29yZS9odG1sL0hU
TUw1TGV4ZXIuaCBiL1dlYkNvcmUvaHRtbC9IVE1MNUxleGVyLmgKaW5kZXggZDlkNzU5YmYwMmNk
NGE1NmNiM2IyYTAwZmUyMTE3YTFkYTQxMGVjNS4uOWJjYTgzOWJmZGY1NTk1NGRkNDI0N2UwMmJj
NzhkMDgwNjZmMjEzZCAxMDA2NDQKLS0tIGEvV2ViQ29yZS9odG1sL0hUTUw1TGV4ZXIuaAorKysg
Yi9XZWJDb3JlL2h0bWwvSFRNTDVMZXhlci5oCkBAIC0xNjEsNiArMTYxLDcgQEAgbmFtZXNwYWNl
IFdlYkNvcmUgewogICAgICAgICAgICAgICAgICAgICBtX3NraXBOZXh0TmV3TGluZSA9IGZhbHNl
OwogICAgICAgICAgICAgICAgICAgICBpZiAobV9uZXh0SW5wdXRDaGFyYWN0ZXIgPT0gJ1wwJyB8
fCAobV9uZXh0SW5wdXRDaGFyYWN0ZXIgPj0gMHhEODAwICYmIG1fbmV4dElucHV0Q2hhcmFjdGVy
IDw9IDB4REZGRikpCiAgICAgICAgICAgICAgICAgICAgICAgICBtX25leHRJbnB1dENoYXJhY3Rl
ciA9IDB4RkZGRDsKKyAgICAgICAgICAgICAgICAgICAgLy8gRklYTUU6IEEgbnVtYmVyIG9mIHNw
ZWNpZmljIGNoYXJhY3RlciB2YWx1ZXMgc2hvdWxkIGdlbmVyYXRlIHBhcnNlIGVycm9ycy4KICAg
ICAgICAgICAgICAgICB9CiAgICAgICAgICAgICAgICAgcmV0dXJuIHRydWU7CiAgICAgICAgICAg
ICB9Cg==
</data>

          </attachment>
      

    </bug>

</bugzilla>