Bug 14952 - Unknown character reference (general reference) mistakenly treated as a fatal-error rather than a non-fatal error
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: XML
Version: 523.x (Safari 3)
Hardware: Mac OS X 10.4
Importance: P2 Major
Assignee: Nobody
Reported: 2007-08-12 16:17 PDT by Robert Burns
Modified: 2007-08-19 04:46 PDT
CC List: 2 users

Attachments
A small XHTML file including the made up character reference &somecharacter; (387 bytes, application/xhtml+xml)
2007-08-12 16:20 PDT, Robert Burns

Description Robert Burns 2007-08-12 16:17:04 PDT
XML lists the following fatal errors (http://www.w3.org/TR/xml/#dt-fatal):
 • Well-formedness constraint violation (http://www.w3.org/TR/xml/#dt-wellformed)
 • Encoding declaration errors (http://www.w3.org/TR/xml/#dt-fatal)
   - entity in the wrong encoding
   - an encoding declaration not at the beginning of an entity
   - whenever the encoding cannot be processed
 • And under forbidden (http://www.w3.org/TR/xml/#forbidden):
   -  appearance of a reference to an unparsed entity, except in the EntityValue in an entity declaration.
   -  the appearance of any character or general-entity reference in the DTD except within an EntityValue or AttValue.
   - a reference to an external entity in an attribute value.

The list of fatal errors makes no mention of character entity references (general entity references), except in an XML DTD. So errant general entities are not part of the fatal error definition. No other errors are fatal, so the general rule for errors applies: "Conforming software may detect and report an error and may recover from it" (http://www.w3.org/TR/xml/#dt-error).

On the other hand, the recommendation says:

Unknown or undeclared character entity references are a well-formedness constraint violation (a fatal error) only for standalone='yes' documents. For standalone='no' documents, they are instead a validity constraint violation (a non-fatal error) (see: http://www.w3.org/TR/xml/#sec-references).

However, validity constraint violations are not fatal errors. Again, the recommendation says: "Conforming software may detect and report an error and may recover from it." This means that WebKit may report the unknown reference, but it is not even required to do so. Since the recommendation allows WebKit to recover from the error, I think it should. Replacing the unknown reference with the Unicode replacement character (U+FFFD) would probably be the most correct approach. No other error reporting should be necessary, as the replacement character is sufficient to indicate that an error has occurred.
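
To make the proposed recovery concrete, here is a minimal sketch in Python. It is not WebKit code: the function name `recover_unknown_references`, the `declared` parameter, and the regular expression (a rough approximation of the XML Name production) are all assumptions made for illustration. Declared and predefined references are expanded, and anything unknown becomes U+FFFD instead of stopping the parse.

```python
import re

# The five entities XML predefines; everything else must be declared.
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "apos": "'", "quot": '"'}

# Rough approximation of "&Name;" (assumption: not the full XML Name grammar).
_NAMED_REFERENCE = re.compile(r"&([A-Za-z_:][\w.:-]*);")

REPLACEMENT_CHARACTER = "\ufffd"


def recover_unknown_references(text, declared=None):
    """Expand known entity references; replace unknown ones with U+FFFD."""
    known = dict(PREDEFINED)
    if declared:
        known.update(declared)

    def substitute(match):
        # Unknown names fall back to the replacement character.
        return known.get(match.group(1), REPLACEMENT_CHARACTER)

    return _NAMED_REFERENCE.sub(substitute, text)
```

With this sketch, the made-up reference &somecharacter; from the attachment would render as a single replacement character while the rest of the document is processed normally.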

These sorts of bugs give XML a reputation for more draconian error handling than it actually has. I may file a separate bug on the issue of general entities (this is also related to bug#14945).
Comment 1 Robert Burns 2007-08-12 16:20:29 PDT
Created attachment 15942
A small XHTML file including the made up character reference &somecharacter;

Crashes on Safari Beta 3.0.3 (522.12.1)
Comment 2 Robert Burns 2007-08-12 16:26:03 PDT
Reported follow-up bug on unknown character references as bug#14945. Since WebKit is not a validating application, I think the approach to take with reporting (and then recovering from) a stray ampersand (&) would be to simply throw an exception without changing the DOM tree or the rendering. WebKit could even treat the stray & as a literal & as its method of recovery, as long as it reports the error.
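
A minimal Python sketch of that recovery, for illustration only. The name `recover_stray_ampersands` and the regular expression are assumptions, not anything from WebKit. An ampersand that does not start a syntactically complete reference is kept as a literal & (written back as &amp;), and its offset is returned so the caller can report the error.

```python
import re

# Matches an "&" that does NOT begin "&Name;", "&#123;" or "&#x1F;".
# Assumption: the Name pattern is an approximation of the XML grammar.
_STRAY_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def recover_stray_ampersands(markup):
    """Return (recovered markup, offsets of stray ampersands to report)."""
    offsets = [match.start() for match in _STRAY_AMPERSAND.finditer(markup)]
    recovered = _STRAY_AMPERSAND.sub("&amp;", markup)
    return recovered, offsets
```

The returned offsets are what a caller would use to report the error; the recovered markup can then be parsed as usual.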
Comment 3 Robert Burns 2007-08-12 19:28:18 PDT
^^^^^
That comment is supposed to describe the related bug where WebKit processes a fatal error when encountering a stray & character. Again, that's bug#14945.
Comment 4 David Kilzer (:ddkilzer) 2007-08-12 21:18:28 PDT
Confirmed that Safari 3 Public Beta v. 3.0.3 (522.12.1) with original WebKit on Mac OS X 10.4.10 (8R218) crashes.

However, using a local debug build of WebKit r25028 with Safari 3 Public Beta v. 3.0.3 (522.12.1), this does NOT crash.
Comment 5 Robert Burns 2007-08-12 21:47:05 PDT
(In reply to comment #4)
 
> However, using a local debug build of WebKit r25028 with Safari 3 Public Beta
> v. 3.0.3 (522.12.1), this does NOT crash.
> 

I should have been clearer in the bug subject line. I meant this bug to be about the treatment as a fatal error. Is that confirmed or was it just a crash issue?
Comment 6 David Kilzer (:ddkilzer) 2007-08-12 22:46:13 PDT
(In reply to comment #5)
> I should have been clearer in the bug subject line. I meant this bug to be
> about the treatment as a fatal error. Is that confirmed or was it just a crash
> issue?

I was confirming the crash specifically.  I haven't read the quoted specs, but I'm sure that will be sorted out in time.

Comment 7 Robert Burns 2007-08-12 22:52:37 PDT
(In reply to comment #4)

Using the nightly build from today, 2007-08-13, the bug (treatment of an unknown character entity as a fatal error) is confirmed. Just to reiterate, an unknown character entity is not a well-formedness constraint violation. It is a validity constraint violation. It is therefore not a fatal error and should not be treated as one. Quoting this again:

"Conforming software may detect and report an error and may recover from it" (http://www.w3.org/TR/xml/#dt-error). 

There is no requirement to even report this error ("may detect and report"). The Unicode replacement character should be sufficient notification to web developers that there is a problem. I suspect this leniency is built into the spec so that this type of draconian error-handling wouldn't show up in implementations.

Just for contrast, the fatal error rule states:

"Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way)."

However, this is not a fatal error, except when processing a standalone='yes' document.
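
The distinction argued in this comment can be sketched as a small decision function. This is an illustrative Python sketch of the reading given in this report, not of any parser's actual logic. The names `is_standalone` and `classify_undeclared_reference` are hypothetical, and a missing standalone declaration is assumed to mean standalone='no'.

```python
import re

# Finds standalone="yes|no" inside a leading XML declaration.
_STANDALONE = re.compile(
    r"""<\?xml\s[^>]*?\bstandalone\s*=\s*(["'])(yes|no)\1"""
)


def is_standalone(document):
    """True only if the XML declaration says standalone='yes'."""
    match = _STANDALONE.match(document)
    return bool(match) and match.group(2) == "yes"


def classify_undeclared_reference(document):
    """Classify an undeclared entity reference per the reading in this report.

    "fatal"       -> well-formedness violation; normal processing must stop.
    "recoverable" -> validity violation; the processor may report and recover.
    """
    return "fatal" if is_standalone(document) else "recoverable"
```

Only the "fatal" case would require the processor to stop passing data to the application in the normal way; the "recoverable" case leaves room for recovery such as substituting U+FFFD.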
Comment 8 Robert Burns 2007-08-19 04:46:24 PDT
The following section also seems relevant to this bug:

"4.4.3 Included If Validating (http://www.w3.org/TR/xml/#include-if-valid)

"When an XML processor recognizes a reference to a parsed entity, in order to validate the document, the processor must include its replacement text. If the entity is external, and the processor is not attempting to validate the XML document, the processor may, but need not, include the entity's replacement text. If a non-validating processor does not include the replacement text, it must inform the application that it recognized, but did not read, the entity.

"This rule is based on the recognition that the automatic inclusion provided by the SGML and XML entity mechanism, primarily designed to support modularity in authoring, is not necessarily appropriate for other applications, in particular document browsing. Browsers, for example, when encountering an external parsed entity reference, might choose to provide a visual indication of the entity's presence and retrieve it for display only on demand."