Bug 53375

Summary: XML serialization should properly encode whitespace in attribute values (space character treated as newline with XSLT stylesheet when run using XSLTProcessor script API, but not when using xml-stylesheet pi)
Product: WebKit Reporter: Martin Honnen <martin.honnen>
Component: XML Assignee: Nobody <webkit-unassigned>
Status: NEW ---    
Severity: Normal CC: ahmad.saleem792, ap, cdumez, plakroon
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
URL: http://home.arcor.de/martin.honnen/safariBugs/test2011012802.xhtml

Description Martin Honnen 2011-01-29 04:38:28 PST
Over on Stack Overflow (http://stackoverflow.com/questions/4822883/wrong-webkit-whitespace-handling-when-calling-xslt-transformation-through-javascr) someone reported a problem with WebKit browsers and whitespace treatment in an XSLT stylesheet when it is run with the XSLTProcessor script API. A space character in the input document is treated as a newline character by XSLT/XPath functions like contains or substring-after.

This bug report tries to isolate the problem with a test case:

The XML document at http://home.arcor.de/martin.honnen/safariBugs/test2011012802.xml has a <test>test string</test> element containing the words "test" and "string" separated by a single space " " character.
The document links to the stylesheet http://home.arcor.de/martin.honnen/safariBugs/test2011012802Xsl.xml which simply checks whether the "test" element contains a newline character, with the XPath expression contains(., '&#10;'). That XPath expression obviously should return false as there is no newline character in the input element.

When the XML document linking to the stylesheet is loaded in a browser window of a Webkit browser, the XPath expression correctly returns false.

However, when running the XSLT transformation with JavaScript and the XSLTProcessor API, as in the test case http://home.arcor.de/martin.honnen/safariBugs/test2011012802.xhtml, WebKit browsers return true for the XPath expression "contains(., '&#10;')". Other XPath expressions like substring-after(., '&#10;') also treat the space character as a newline.

Other browsers like Firefox or Opera don't exhibit this behaviour.

The problem happens for me with Safari 5.0.3 on Windows XP as well as with today's WebKit nightly download.
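
For readers who cannot reach the linked test files, the script-API path described above can be sketched roughly as follows. This is only an illustration: the source and stylesheet strings are hypothetical stand-ins for the linked documents (their exact content is not reproduced in this report), and the function name containsNewlineViaXsltProcessor is made up for the example. It uses browser APIs, so it only runs in a browser context.

```javascript
// Hypothetical stand-ins for the linked test files; the originals are not shown in this report.
const sourceXml = '<test>test string</test>';
const stylesheetXml =
  '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">' +
  '<xsl:template match="/test">' +
  // &#10; is a character reference, so the parsed select attribute holds a real LF.
  '<result><xsl:value-of select="contains(., \'&#10;\')"/></result>' +
  '</xsl:template>' +
  '</xsl:stylesheet>';

// Runs the transformation through the XSLTProcessor script API and returns
// the text of the <result> element.
function containsNewlineViaXsltProcessor() {
  const parser = new DOMParser();
  const source = parser.parseFromString(sourceXml, 'application/xml');
  const stylesheet = parser.parseFromString(stylesheetXml, 'application/xml');
  const processor = new XSLTProcessor();
  processor.importStylesheet(stylesheet);
  const result = processor.transformToDocument(source);
  return result.documentElement.textContent;
}

// Only attempt the transformation where the browser APIs exist.
if (typeof XSLTProcessor !== 'undefined' && typeof DOMParser !== 'undefined') {
  // The report expects "false" here; affected WebKit builds reportedly gave "true".
  console.log(containsNewlineViaXsltProcessor());
}
```') throw new Error('unexpected source');
</test>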
Comment 1 Alexey Proskuryakov 2011-01-29 21:39:33 PST
This is an interesting one. I'm not even sure if there's a bug to fix, although the behavior is certainly unexpected.

The issue is that other browsers apply XSL transformations to existing DOM trees, violating the spec. WebKit's XSLTProcessor serializes DOM trees to create XML documents (and XSL stylesheets)  for transformation.

Now, serializing LF in attribute value doesn't produce a character reference like &#10; - you can see it by opening <http://home.arcor.de/martin.honnen/safariBugs/test2011012802Xsl.xml> in Firefox or WebKit, and executing the following script in browser address bar:

javascript:alert((new XMLSerializer).serializeToString(document.documentElement))

And parsing XML with an actual line feed embedded produces a space in DOM tree - again, we match Firefox here.
Comment 2 Martin Honnen 2011-01-30 07:53:52 PST
(In reply to comment #1)

> The issue is that other browsers apply XSL transformations to existing DOM trees, violating the spec. WebKit's XSLTProcessor serializes DOM trees to create XML documents (and XSL stylesheets)  for transformation.

Which spec is violated by transforming a DOM tree with XSLT? Are you saying that WebKit serializes the DOM node passed to importStylesheet to follow some spec? Which one?
And if you have chosen to serialize the DOM tree, shouldn't that happen in a way so that the result round-trips correctly and the meaning of a stylesheet is not changed?

> Now, serializing LF in attribute value doesn't produce a character reference like &#10; - you can see it by opening <http://home.arcor.de/martin.honnen/safariBugs/test2011012802Xsl.xml> in Firefox or WebKit, and executing the following script in browser address bar:
> 
> javascript:alert((new XMLSerializer).serializeToString(document.documentElement))
> 
> And parsing XML with an actual line feed embedded produces a space in DOM tree - again, we match Firefox here.


To make it clear, the XSLT stylesheet contains a numeric character reference '&#10;' in an attribute value so any compliant XML parser should not normalize that to a space, rather the attribute value should contain a LF character as http://www.w3.org/TR/xml/#AVNormalize says "For a character reference, append the referenced character to the normalized value".
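
The difference between a character reference and a literal line feed can be illustrated with a small sketch of attribute-value normalization. This is a simplified model written for this discussion, not code from any parser: the function name normalizeAttributeValue is hypothetical, it assumes a CDATA-type attribute, and it resolves only numeric character references (named entity references are left untouched).

```javascript
// Simplified sketch of XML 1.0 attribute-value normalization for a raw
// attribute literal. Assumption: only numeric character references are
// resolved; named entity references are passed through unchanged.
function normalizeAttributeValue(raw) {
  // Line-end handling comes first: CRLF and lone CR become LF.
  const input = raw.replace(/\r\n?/g, '\n');
  let out = '';
  let i = 0;
  while (i < input.length) {
    const ref = /^&#(x[0-9a-fA-F]+|[0-9]+);/.exec(input.slice(i));
    if (ref) {
      // Character reference: append the referenced character as-is.
      const body = ref[1];
      const codePoint = body[0] === 'x'
        ? parseInt(body.slice(1), 16)
        : parseInt(body, 10);
      out += String.fromCodePoint(codePoint);
      i += ref[0].length;
    } else if (input[i] === ' ' || input[i] === '\n' || input[i] === '\t') {
      // Literal whitespace: append a single space character.
      out += ' ';
      i += 1;
    } else {
      out += input[i];
      i += 1;
    }
  }
  return out;
}

const fromReference = normalizeAttributeValue('Line 1&#10;Line 2');
const fromLiteral = normalizeAttributeValue('Line 1\nLine 2');
console.log(JSON.stringify(fromReference)); // "Line 1\nLine 2"
console.log(JSON.stringify(fromLiteral));   // "Line 1 Line 2"
```

So a serializer that writes the LF literally hands the next parser an attribute that normalizes to a space, which would explain the behaviour seen in the test case.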

And I think WebKit's and Firefox's DOM trees both contain a LF and not a space for the 'select' attribute so at that stage the stylesheet is fine.

If WebKit serializes the stylesheet for further processing with its XSLT processor then that step should not change the meaning of the stylesheet. I don't think the behaviour of XMLSerializer in Firefox should be used as an argument. If DOMParser and XMLSerializer results in Firefox or WebKit don't round-trip then they are not suitable for serializing a DOM node with an XSLT stylesheet to an XML document, if that is deemed necessary for executing the stylesheet.

Other DOMParser/XMLSerializer implementations do round-trip a numeric character reference '&#10;', for instance in Opera (tested with 11.01) the code

var xml = '<test att1="Line 1&#10;Line 2"/>';
var doc = new DOMParser().parseFromString(xml, 'application/xml');
new XMLSerializer().serializeToString(doc)

gives the result

<test att1="Line 1&#xa;Line 2"/>

so there it is ensured that the meaning is not changed (only the lexical representation changes from '&#10;' to '&#xa;' but that does not change the semantics of the document).

I am not sure whether XMLSerializer/DOMParser have ever been specified but there are XML serialization specifications, for instance http://www.w3.org/TR/xslt-xquery-serialization/ in "5 XML Output Method" says "characters MUST be output as character references, to ensure that they survive the round trip through serialization and parsing. (...) while CR, NL, TAB, NEL and LINE SEPARATOR characters in attribute nodes MUST be output respectively as "&#xD;", "&#xA;", "&#x9;", "&#x85;", and "&#x2028;", or their equivalents".

To summarize, I don't think serializing a stylesheet should change its meaning, thus if WebKit needs to serialize the DOM node passed to importStylesheet then it should do so in a way that the meaning of the stylesheet is not changed. For that a linefeed in an attribute value must be serialized as a character reference i.e. either '&#xA;' or '&#10;'.
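
As a rough illustration of what such a serialization step could look like, here is a minimal sketch of attribute-value escaping that follows the character list quoted above from the serialization spec. The function name escapeAttributeValue is hypothetical and this is not WebKit's serializer code, just the idea under those assumptions.

```javascript
// Sketch: escape an attribute value so that whitespace characters survive
// a serialize/parse round trip. The set of characters follows the list
// quoted from the XSLT/XQuery serialization spec (CR, NL, TAB, NEL,
// LINE SEPARATOR), plus the usual markup characters.
function escapeAttributeValue(value) {
  return value
    .replace(/&/g, '&amp;')   // must run first so later references are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/"/g, '&quot;')
    .replace(/\r/g, '&#xD;')
    .replace(/\n/g, '&#xA;')
    .replace(/\t/g, '&#x9;')
    .replace(/\u0085/g, '&#x85;')
    .replace(/\u2028/g, '&#x2028;');
}

const serialized = '<test att1="' + escapeAttributeValue('Line 1\nLine 2') + '"/>';
console.log(serialized); // <test att1="Line 1&#xA;Line 2"/>
```

A plain space is left alone, so only the characters that attribute-value normalization would otherwise turn into a space are written as references.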
Comment 3 Martin Honnen 2011-01-30 10:41:10 PST
Further on the argument that the XSLT stylesheet in a DOM is serialized, and that both Mozilla and WebKit lose certain whitespace characters like a linefeed or a tab with XMLSerializer: Mozilla has a bug report https://bugzilla.mozilla.org/show_bug.cgi?id=398272 on that. In a discussion at http://groups.google.com/group/mozilla.dev.tech.xml/browse_thread/thread/c0ad1ff4066d23fe Jonas Sicking and Boris Zbarsky acknowledge that as a bug.
Comment 4 Alexey Proskuryakov 2011-01-30 13:57:49 PST
> Which spec is violated by transforming a DOM tree with XSLT?

The XSLT spec, <http://www.w3.org/TR/xslt>. The spec is written in terms of XML documents, not browser DOM trees. An XML document is text, and DOM isn't even a good data model for XPath and XSLT, which work with XML Infoset.

Obviously, an implementation can do anything internally, as long as there is no observable difference.

> If WebKit serializes the stylesheet for further processing with its XSLT processor then that
> step should not change the meaning of the stylesheet. I don't think the behaviour
> of XMLSerializer in Firefox should be used as an argument.

Well, XMLSerializer is a non-standard Mozilla extension, so we should be very cautious about introducing intentional incompatibilities with Firefox.

It is true that we don't necessarily have to use the same serialization algorithm in XMLSerializer and in XSLTProcessor. But that's highly desirable in practice, as anything else would be horribly confusing.

Sounds like the best course of action would be for both Mozilla and us to change XML serialization to follow <http://www.w3.org/TR/xslt-xquery-serialization/>. I'll ask some Mozilla folks what they think about <https://bugzilla.mozilla.org/show_bug.cgi?id=398272>.
Comment 5 Alexey Proskuryakov 2011-02-02 13:23:03 PST
HTML spec has some early support for XMLSerializer now, referencing its own XML fragment serialization algorithm.

I now think that we should just fix XML serialization to properly encode whitespace.
Comment 6 Peter Kroon 2012-07-16 07:11:31 PDT
(In reply to comment #5)
> HTML spec has some early support for XMLSerializer now, referencing its own XML fragment serialization algorithm.
> 
> I now think that we should just fix XML serialization to properly encode whitespace.

What's the status on this?

[1] still fails.
[2] fails as well

[1]https://bug398272.bugzilla.mozilla.org/attachment.cgi?id=508260
[2]http://jsfiddle.net/4VCQN/1/

When will this be fixed?
Thanks,
Peter
Comment 7 Ahmad Saleem 2023-05-18 14:39:41 PDT
(In reply to Peter Kroon from comment #6)
> (In reply to comment #5)
> > HTML spec has some early support for XMLSerializer now, referencing its own XML fragment serialization algorithm.
> > 
> > I now think that we should just fix XML serialization to properly encode whitespace.
> 
> What's the status on this?
> 
> [1] still fails.
> [2] fails as well
> 
> [1]https://bug398272.bugzilla.mozilla.org/attachment.cgi?id=508260
> [2]http://jsfiddle.net/4VCQN/1/
> 
> When will this be fixed?
> Thanks,
> Peter

Based on testcase from Mozilla:

https://bug398272.bmoattachments.org/attachment.cgi?id=508260 <-- passes in Safari 16.5, Chrome Canary 115 and Firefox Nightly 115.

Similarly, Mozilla marked the bug as 'RESOLVED INVALID'.

https://bugzilla.mozilla.org/show_bug.cgi?id=398272#c13