Bug 14945

Summary: An ampersand ("&") appearing in a document is treated as a fatal error (instead of a non-fatal error)
Product: WebKit Reporter: Robert Burns <robburns1>
Component: XMLAssignee: Nobody <webkit-unassigned>
Status: RESOLVED WONTFIX    
Severity: Normal CC: ap, robburns1
Priority: P2    
Version: 523.x (Safari 3)   
Hardware: Mac   
OS: OS X 10.4   
Attachments:
Description Flags
reduced test case encountered on the web none

Description Robert Burns 2007-08-11 22:51:47 PDT
XML lists the following fatal errors (http://www.w3.org/TR/xml/#dt-fatal):
 • Well-fromedness constraint violation (http://www.w3.org/TR/xml/#dt-wellformed)
 • Encoding declaration errors (http://www.w3.org/TR/xml/#dt-fatal)
   - entity in the wrong encoding
   - an encoding declaration not at the beginning of an entity
   - whenever the encoding cannot be processed
 • And under forbidden (http://www.w3.org/TR/xml/#forbidden):
   -  appearance of a reference to an unparsed entity, except in the EntityValue in an entity declaration.
   -  the appearance of any character or general-entity reference in the DTD except within an EntityValue or AttValue.
   - a reference to an external entity in an attribute value.

There is no mention, in the list of fatal errors, of character entity references (general entity references), except in an XML DTD. So even errant general entities are not part of the fatal error definition.  No other errors are fatal and therefore: "Conforming software may detect and report an error and may recover from it" (http://www.w3.org/TR/xml/#dt-error).

On the other hand the recommendation says:

"The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section" (http://www.w3.org/TR/xml/#dt-chardata)

However, this is not a fatal error. Again, the recommendation says: "Conforming software may detect and report an error and may recover from it" This means that WebKit may report the stray ampersand, but it does not have to. Since it can recover from the error it should.

These sorts of bugs give XML the reputation for having more draconian error handling than it actually has. I may file a separate bug on the issue of general entities
Comment 1 Robert Burns 2007-08-11 22:59:24 PDT
Created attachment 15937 [details]
reduced test case encountered on the web

This is a reduction of an actual error encountered on the web. One of the ampersands in the URL fails to escape properly (missing the ending semi-colon). WebKit treats this as a fatal error.
Comment 2 Robert Burns 2007-08-11 23:24:58 PDT
More related to my planned follow-up bug on unknown character entity references, undeclared (unknown) character entity references are only a well-formedness constraint violation (a fatal error) for standalone='yes' documents. For standalone='no' documents these are a validity constraint violation (a non-fatal error) (see: http://www.w3.org/TR/xml/#sec-references).
Comment 3 Robert Burns 2007-08-12 16:27:23 PDT
reported follow-up bug on unknown character references as bug#14945. Since WebKit is not a validating application, I think the approach to take with reporting (and then recovering from) a stray ampersand &, would be to simply throw an exception without changing the DOM tree or the rendering. WebKit could
even treat the stray & as an &amp; as its method of recovery, as long as it reports the error.
Comment 4 Robert Burns 2007-08-12 19:29:23 PDT
^^^^

The related followup bug regarding unknown character entities (e.g., &somecharacter;) should be bug#14952
Comment 5 Maciej Stachowiak 2007-08-14 16:30:04 PDT
1) We should check what other XML processors do for this error.
2) I believe this particular document violates the well-formedness constraints. Its the fact that there is a semicolon before the next & that violates well-formedness, not the fact that the entity is not declared. So I think this bug is invalid, but I don't have time to study the spec further right now.
Comment 6 Maciej Stachowiak 2007-08-14 16:31:25 PDT
"Its the fact that there is a semicolon before the next & that violates well-formedness"

I meant

"It's the fact that there is *no* semicolon before the next & that violates well-formedness"

Comment 7 Robert Burns 2007-08-14 17:34:40 PDT
(In reply to comment #5)
> 1) We should check what other XML processors do for this error.
> 2) I believe this particular document violates the well-formedness constraints.
> Its the fact that there is [no[ semicolon before the next & that violates
> well-formedness, not the fact that the entity is not declared. So I think this
> bug is invalid, but I don't have time to study the spec further right now.


The reason I so carefully went through the spec and the well-formedness rules. Here are the well-formedness constraints:

well-formedness constraint
[Definition: A rule which applies to all well-formed XML documents. Violations of well-formedness constraints are fatal errors.]

Well-formedness constraint: PEs in Internal Subset
In the internal DTD subset, parameter-entity references must not occur within markup declarations; they may occur where markup declarations can occur. (This does not apply to references that occur in external parameter entities or to the external subset.)

Well-formedness constraint: External Subset
The external subset, if any, must match the production for extSubset.

Well-formedness constraint: PE Between Declarations
The replacement text of a parameter entity reference in a DeclSep must match the production extSubsetDecl.

Well-formedness constraint: Element Type Match
The Name in an element's end-tag must match the element type in the start-tag.

Well-formedness constraint: Unique Att Spec
An attribute name must not appear more than once in the same start-tag or empty-element tag.

Well-formedness constraint: No External Entity References
Attribute values must not contain direct or indirect entity references to external entities.

Well-formedness constraint: No < in Attribute Values
The replacement text of any entity referred to directly or indirectly in an attribute value must not contain a <.

Well-formedness constraint: Legal Character
Characters referred to using character references must match the production for Char.

Well-formedness constraint: Entity Declared
In a document without any DTD, a document with only an internal DTD subset which contains no parameter entity references, or a document with "standalone='yes'", for an entity reference that does not occur within the external subset or a parameter entity, the Name given in the entity reference must match that in an entity declaration that does not occur within the external subset or a parameter entity, except that well-formed documents need not declare any of the following entities: amp, lt, gt, apos, quot. The declaration of a general entity must precede any reference to it which appears in a default value in an attribute-list declaration.

Well-formedness constraint: Parsed Entity
An entity reference must not contain the name of an unparsed entity. Unparsed entities may be referred to only in attribute values declared to be of type ENTITY or ENTITIES.

Well-formedness constraint: No Recursion
A parsed entity must not contain a recursive reference to itself, either directly or indirectly.

Well-formedness constraint: In DTD
Parameter-entity references must not appear outside the DTD.

There are other mentions of well-formedness, but nothing I see says anything about this violating well-formedness.

What the spec does say abou this is (as I quoted above):

"The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section"

On the issue of other implementations, I think this is a problem with other implementations as well. There's no damage caused by WebKit being less draconian than the other implementations. The XML recommendation never intended this level of draconian error-handling.

What could WebKit possibly be solving by processing these errors as fatal errors (when the XML recommendation doesn't call for that). The XML recommendation has this reputation of being draconian yet the implementations felt the need to be even more draconian? What's up with that?
Comment 8 Maciej Stachowiak 2007-08-14 18:13:51 PDT
Section 2.1 gives this definition:

[Definition: A textual object is a well-formed XML document if:]

* Taken as a whole, it matches the production labeled document.
* It meets all the well-formedness constraints given in this specification.
* Each of the parsed entities which is referenced directly or indirectly within the document is well-formed.

I believe the production you pasted does not match the production labelled document. I say this because the problem is in an attribute, where the AttValue production would apply:

[9]   	EntityValue	   ::=   	'"' ([^%&"] | PEReference | Reference)* '"'
|  "'" ([^%&'] | PEReference | Reference)* "'"
[10]   	AttValue	   ::=   	'"' ([^<&"] | Reference)* '"'
|  "'" ([^<&'] | Reference)* "'"

Note that & is not allowed in an attribute value except in a Reference. The production for Reference is:

[67]	       Reference	   ::=   	 EntityRef | CharRef
[68]   	EntityRef	   ::=   	'&' Name ';'
[66]   	CharRef	   ::=   	'&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'

A Reference can be an EntityRef or a CharRef. Either way, it must start with & and end with ;, and cannot contain an & in the middle. Thus, the attribute value here:

href='http://www.nytimes.com/2007/08/12/books/review/Hitchens-t.html?_r=2&amp;adxnnl=1&ampladxnnlx=1186840914-OUgjhcnZejswml3KgknPNg&amp;pagewanted=all'

Does not match the AttrValue production, and as a result the document as a whole does not match the document production, and thus it is not well-formed. Resolving as INVALID.
Comment 9 Robert Burns 2007-08-14 20:01:56 PDT
(In reply to comment #8)
> Section 2.1 gives this definition:
> 
> [Definition: A textual object is a well-formed XML document if:]
> 
> * Taken as a whole, it matches the production labeled document.
> * It meets all the well-formedness constraints given in this specification.
> * Each of the parsed entities which is referenced directly or indirectly within
> the document is well-formed.
> 
> I believe the production you pasted does not match the production labelled
> document. I say this because the problem is in an attribute, where the AttValue
> production would apply:
> 
> [9]     EntityValue        ::=          '"' ([^%&"] | PEReference | Reference)*
> '"'
> |  "'" ([^%&'] | PEReference | Reference)* "'"
> [10]    AttValue           ::=          '"' ([^<&"] | Reference)* '"'
> |  "'" ([^<&'] | Reference)* "'"
> 
> Note that & is not allowed in an attribute value except in a Reference. The
> production for Reference is:
> 
> [67]           Reference           ::=           EntityRef | CharRef
> [68]    EntityRef          ::=          '&' Name ';'
> [66]    CharRef    ::=          '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
> 
> A Reference can be an EntityRef or a CharRef. Either way, it must start with &
> and end with ;, and cannot contain an & in the middle. Thus, the attribute
> value here:
> 
> href='http://www.nytimes.com/2007/08/12/books/review/Hitchens-t.html?_r=2&amp;adxnnl=1&ampladxnnlx=1186840914-OUgjhcnZejswml3KgknPNg&amp;pagewanted=all'
> 
> Does not match the AttrValue production, and as a result the document as a
> whole does not match the document production, and thus it is not well-formed.
> Resolving as INVALID.
> 

I don't think that the example unambiguously includes a charRef with an ampersand inside it. That would be a call the parser would have to make. It may be that the WebKit parser, right now, is geared toward making assumptions that lead it in that direction. It may also be the case that it would be difficult to parse it in another way. However, it is a stretch to read the spec as requiring that an XML processor treat this text in that particular way. In fact, I would say that the spec provides XML processors with an easy way out of this in that, once another ampersand is reached, the parser can assume its no longer part of the previous character reference.

Specific character references are inherently a layer on top of the XML processing. As long as the amerpsand doesn't interfere with producing an unambiguous DOM tree (and I don't think it does), then there should be no reason that the sequence following the ampersand cannot be compared to the known character references.

Again, this is an opportunity to improve XML handling in WebKit. If this bug is addressed, then WebKit will not break on pages that other implementations break on (other bug reports should be filed with those implementations for interoperability). WebKit could prominently display an error on the page: bug black bug graphic if you prefer. However, there's no reason WebKit cannot recover from this error and display the remainder of the page.

I fail to see any down-side.

Reopening to allow further discussion.
Comment 10 Robert Burns 2007-08-14 20:14:48 PDT
There is a trade-off between this bug and the other one I filed: bug#14952. If the parser is going to try to handle unknown references then once an & enters the parser it needs to treat it as a potential character reference. Once the parser reaches a semi-colon it can close the character reference (known or unknown). If it reaches another &, ', ", closing-tag, opening-tag or EOF, then there is some flexibility on how it treats the string from the initial ampersand and this point. It can either decide to close the character reference at that point (and pass it to determine if its a known or unknown character reference), or it can treat it as a misused & with no character reference at all.

After passing the string to a processor to determine whether the string represents a known character reference, that step could either replace the string with: 

1) a character referenced by the character reference
2) the Unicode replacement character
3)a more glaring markup that really draws attention to the error (if it occurs in an attribute value, the markup could replace the entire element or the start of the element)
4) the literal string passed on unchanged (with now a literal & as if it were a &amp; followed by the remainder of the string)

There's no reason to treat this as a fatal-error whatsoever.
Comment 11 Robert Burns 2007-08-14 23:24:00 PDT
(In reply to comment #8)

OK, now I see what you're saying, Maciej. There are no (2) well-formedness constraint violations nor any (3) parsed entity ill-formedness. However, the document does not meet the (1) document production because  if you follow through document production chain (http://www.w3.org/TR/xml/#sec-well-formed) 

document -> element -> STag -> Attribute -> AttValue

you'll find the following BNF notation:

[10]   	AttValue	   ::=   	'"' ([^<&"] | Reference)* '"'
                                                |  "'" ([^<&'] | Reference)* "'

where the [^<&'] indicates the exclusion of '<', '&', and '"'.

Following the other chaing:
document -> element -> content -> CharData

reveals:

[14]   	CharData	   ::=   	[^<&]* - ([^<&]* ']]>' [^<&]*)


excluding <', '&', and '"' from the element's content too.

This would override the earlier statement in the recommendation where the inclusion of a '&' or '<' were only non-fatal errors. I'm satisfied this goes against the recommendation. 

Dispatch this bug however you'd like.
Comment 12 David Kilzer (:ddkilzer) 2007-08-15 06:09:40 PDT
Re-resolving per Comment #11.

Comment 13 Robert Burns 2007-08-17 23:04:33 PDT
I'm leaving this resolved, but I don't think the issue has been adequately explored. I wouldn't suggest a recommendation should include error-recovery algorithms for this situation. However for an implementation, this is one of those areas where going against the spec, may not have negative ramifications. Recovering from an errant & character in an XML document does not have any tree construction implications. Even if the reference to an external DTD included more complex entities than simply a character (which is how this is used mostly for references to characters), this would simply mean some portion of the transcluded content was missing. None of the rest of the tree can be effect by this. That's different than any well-formedness constraint violation (which this is not), where the remainder of the tree is effected (hence the requirement to treat it as a fatal error).

More importantly, this does nothing to break content for non-errant documents. For authors who use proper syntax throughout the document, all of their references will be properly recognized even if noon-fatal-error handling is instituted for stray & characters.

I may be missing some serious consequences here, but I fail to see how treating stray & characters as non-fatal errros will break XML processing in general (or at all). Clearly marking the error on the page (even in an offensively annoying way like a big ugly bug icon) will provide the feedback necessary for authors to recognize the error, yet still allow the user to view the contents of an errant page.

It would be best to deal with this bug by brainstorming about how non-fatal-error handling here would break anything. How would it be bad for users of WebKit? The issue shouldn't be about how much we like draconian error handling on a personal level. It also shouldn't be a slavish devotion to the XML recommendation. Maybe WebKit doesn't want to be the first implementation to break from a recommendation, and that doesn't sound like a bad principle to follow. But it is worth discussing and making that explicit if that is indeed the case.

The proposed Project Goals state:

"We value real-world web compatibility, standards compliance, stability, performance, security, portability, usability and relative ease of understanding and modifying the code "

The real-world also has errant & characters: even in XML. Though perhaps XML is a sacred cow when it comes to compliance. My only concern would be that treating errant & characters as non-fatal-errors would have unintended consequences: in other words some effects to the document tree beyond the isolated area around the & character.
Comment 14 Robert Burns 2007-08-17 23:31:32 PDT
reopening to change disposition
Comment 15 Robert Burns 2007-08-17 23:32:49 PDT
Changing this to wontfix from invalid, since it is not an invalid bug. If we're not going this route, its more an indication that we won't fix this: on principle.