Bug 107459

Summary:

WebKit should support decoding multi-byte entities in XML content

Product:

WebKit

Reporter:

Martin Robinson <mrobinson>

Component:

XML

Assignee:

Martin Robinson <mrobinson>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

abarth, ap, buildbot, dglazkov, eric, ian, ojan.autocc, rniwa, webkit.review.bot

Priority:

Version:

528+ (Nightly build)

Hardware:

Unspecified

OS:

Unspecified

See Also:

https://bugs.webkit.org/show_bug.cgi?id=162215

Bug Depends on:

74826

Bug Blocks:

Attachments:

Description	Flags
Patch	none
Patch	none
Patch	none
Patch for landing	none

Martin Robinson

Reported 2013-01-21 09:23:24 PST

Right now the set of entities WebKit can decode in XML is the same as the set that can be decoded in HTML, except for entities that resolve to code points not in the BMP and entities that resolve to multiple code points. This bug is about removing the assumption that an entity necessarily resolves to a single UChar from the code.

Attachments
Patch (9.57 KB, patch) 2013-01-21 09:32 PST, Martin Robinson	no flags	Details Formatted Diff Diff
Patch (10.59 KB, patch) 2013-01-21 23:11 PST, Martin Robinson	no flags	Details Formatted Diff Diff
Patch (11.30 KB, patch) 2013-01-22 21:04 PST, Martin Robinson	no flags	Details Formatted Diff Diff
Patch for landing (11.37 KB, patch) 2013-01-23 08:42 PST, Martin Robinson	no flags	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Martin Robinson

Comment 1 2013-01-21 09:32:16 PST

Created attachment 183800 [details] Patch

Darin Adler

Comment 2 2013-01-21 17:09:50 PST

Comment on attachment 183800 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=183800&action=review Code changes looks like it will function OK, but it’s not using String and CString the way they should be. It’s worth it to spend a little more time on this to get rid of all the needless memory allocation. > Source/WebCore/html/parser/HTMLEntityParser.cpp:153 > +static void appendUChar32ToString(UChar32 value, String& string) > +{ > + if (U_IS_BMP(value)) { > + UChar character = static_cast<UChar>(value); > + ASSERT(character == value); > + string.append(character); > + return; > + } > + string.append(U16_LEAD(value)); > + string.append(U16_TRAIL(value)); > +} This has terrible performance! We’ll allocate a new StringImpl for each call to String::append. We should not have code like this anywhere in WebKit. When we need to build up a string we need to use StringBuilder or a Vector. If the string is always short and fixed size then we can use a local UChar array and construct a string from it. > Source/WebCore/html/parser/HTMLEntityParser.h:37 > +String decodeNamedEntity(const char*); Returning a String from this function is not a good idea for performance; we don’t want to do memory allocation here. Since this string can only be a few characters long, then I suggest using a fixed size buffer of some sort instead. > Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1184 > - UChar c = decodeNamedEntity(reinterpret_cast<const char*>(name)); > - if (!c) > + String decodedEntity = decodeNamedEntity(reinterpret_cast<const char*>(name)); > + if (decodedEntity.isNull()) > return 0; > > - CString value = String(&c, 1).utf8(); > - ASSERT(value.length() < 5); > + CString value = decodedEntity.utf8(); > + ASSERT(value.length() < 9); For both the old and new code, doing this memory allocation is bad for performance. We can convert from UTF-16 to UTF-8 without creating String and CString objects, which result in memory allocation. We should use some short fixed size arrays for this work.

Adam Barth

Comment 3 2013-01-21 23:01:59 PST

Comment on attachment 183800 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=183800&action=review > Source/WebCore/html/parser/HTMLEntityParser.h:-37 > -// FIXME: Move the XML parser to an entity decoding function works for non-BMP characters! The intent behind this FIXME was to delete this function. The XML parser shouldn't be leaning on the HTML parser for this work.

Adam Barth

Comment 4 2013-01-21 23:06:56 PST

Comment on attachment 183800 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=183800&action=review > Source/WebCore/html/parser/HTMLEntityParser.h:34 > bool consumeHTMLEntity(SegmentedString&, StringBuilder& decodedEntity, bool& notEnoughCharacters, UChar additionalAllowedCharacter = '\0'); Notice that consumeHTMLEntity has a signature that allows for a much more efficient implementation. The right fix here is to make an efficient version for the XML parser (and to disentangle the XML parser code from the HTML parser code).

Martin Robinson

Comment 5 2013-01-21 23:11:13 PST

Created attachment 183894 [details] Patch

Martin Robinson

Comment 6 2013-01-21 23:13:11 PST

Comment on attachment 183894 [details] Patch Removing review flag after Adam's comment.

Martin Robinson

Comment 7 2013-01-21 23:17:02 PST

(In reply to comment #4) > (From update of attachment 183800 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=183800&action=review > > > Source/WebCore/html/parser/HTMLEntityParser.h:34 > > bool consumeHTMLEntity(SegmentedString&, StringBuilder& decodedEntity, bool& notEnoughCharacters, UChar additionalAllowedCharacter = '\0'); > > Notice that consumeHTMLEntity has a signature that allows for a much more efficient implementation. The right fix here is to make an efficient version for the XML parser (and to disentangle the XML parser code from the HTML parser code). Okay. So the patch that I just posted has a more efficient design, but still uses the entity table generated for the HTML5 parser. XHTML and XML support a much smaller set of entities than HTML, I believe. I wrote this patch because it seemed like having XML support just the BMP entities from HTML5 was doubly-wrong. What would you like to see for this code ultimately? Should I generate a separate table with just the entities from XML, try to extract out the entity table from the HTML5 parser, or stick with something like my latest patch?

Adam Barth

Comment 8 2013-01-21 23:20:04 PST

Comment on attachment 183894 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=183894&action=review > Source/WebCore/html/parser/HTMLEntityParser.h:37 > +size_t decodeNamedEntityToUCharArray(const char*, UChar result[4]); This function doesn't actually look too bad. This patch might make sense as an incremental improvement. It looks like this function depends on HTMLEntitySearch, which I guess is needed if XHTML supports the HTML named entities. That means we probably won't be able to fully disentangle XML parsing from HTML parsing.

Adam Barth

Comment 9 2013-01-21 23:21:50 PST

> The XML parser shouldn't be leaning on the HTML parser for this work. ^^^ It looks like this comment is bogus. XHTML needs to use at least some of the HTML named entities.

Adam Barth

Comment 10 2013-01-21 23:24:06 PST

> Okay. So the patch that I just posted has a more efficient design, but still uses the entity table generated for the HTML5 parser. XHTML and XML support a much smaller set of entities than HTML, I believe. We should double-check that. I know XML has a very small list, but I wonder if XHTML doesn't have the full set of HTML named entities. > I wrote this patch because it seemed like having XML support just the BMP entities from HTML5 was doubly-wrong. Yes. :) > What would you like to see for this code ultimately? Should I generate a separate table with just the entities from XML, try to extract out the entity table from the HTML5 parser, or stick with something like my latest patch? We should figure out what the correct behavior is according to the specs. If we're supposed to use the full HTML named entity table for XHTML, then something like your current patch looks reasonable. If not, we'll want to do some more work to get the right set of named entities.

Martin Robinson

Comment 11 2013-01-21 23:56:04 PST

(In reply to comment #10) (By the way, thanks for the insightful comments and reviews, Darin and Adam) > > What would you like to see for this code ultimately? Should I generate a separate table with just the entities from XML, try to extract out the entity table from the HTML5 parser, or stick with something like my latest patch? > > We should figure out what the correct behavior is according to the specs. If we're supposed to use the full HTML named entity table for XHTML, then something like your current patch looks reasonable. If not, we'll want to do some more work to get the right set of named entities. Is it also the case that the XML parser might be used to parse SVG or MathML documents? The only entity list I can find for XHTML1 includes a small subset of the ones in HTML5: http://www.w3.org/TR/2010/REC-xhtml-modularization-20100729/schema_module_defs.html#a_smodule_XHTML_Character_Entities I cannot find a similar list for XHTML2. On the other hand, the MathML spec uses a non-BMP character entity as an example: http://www.w3.org/TR/2010/REC-MathML3-20101021/chapter7.html#chars.BMP-SMP Perhaps the appropriate spec here is the Entities one: http://www.w3.org/TR/xml-entity-names/ This spec lists a very large set of names, including many not in the BMP.

Adam Barth

Comment 12 2013-01-22 00:12:39 PST

> Is it also the case that the XML parser might be used to parse SVG or MathML documents? Yes, but I'm not sure whether that impacts the named entities that we need. > The only entity list I can find for XHTML1 includes a small subset of the ones in HTML5: http://www.w3.org/TR/2010/REC-xhtml-modularization-20100729/schema_module_defs.html#a_smodule_XHTML_Character_Entities I cannot find a similar list for XHTML2. On the other hand, the MathML spec uses a non-BMP character entity as an example: http://www.w3.org/TR/2010/REC-MathML3-20101021/chapter7.html#chars.BMP-SMP > > Perhaps the appropriate spec here is the Entities one: http://www.w3.org/TR/xml-entity-names/ This spec lists a very large set of names, including many not in the BMP. We can ask Hixie. I'm sure he'll know.

Adam Barth

Comment 13 2013-01-22 00:13:00 PST

@Hixie: What named entities should we support in XHTML? The same ones we support in HTML?

Build Bot

Comment 14 2013-01-22 00:39:56 PST

Comment on attachment 183894 [details] Patch Attachment 183894 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/16037457 New failing tests: fast/parser/entities-in-xhtml.xhtml

WebKit Review Bot

Comment 15 2013-01-22 01:23:38 PST

Comment on attachment 183894 [details] Patch Attachment 183894 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/16043434 New failing tests: fast/parser/entities-in-xhtml.xhtml

WebKit Review Bot

Comment 16 2013-01-22 01:58:44 PST

Comment on attachment 183894 [details] Patch Attachment 183894 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/16037493 New failing tests: fast/parser/entities-in-xhtml.xhtml

Ryosuke Niwa

Comment 17 2013-01-22 10:52:08 PST

Comment on attachment 183894 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=183894&action=review > Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1200 > + size_t numberOfCodePoints = decodeNamedEntityToUCharArray(reinterpret_cast<const char*>(name), utf16DecodedEntity); > + if (!numberOfCodePoints) > + return 0; > + > + ASSERT(numberOfCodePoints <= 4); > + size_t entityLengthInUTF8 = convertUTF16EntityToUTF8(utf16DecodedEntity, numberOfCodePoints, reinterpret_cast<char*>(sharedXHTMLEntityResult)); > + if (!entityLengthInUTF8) It's unfortunate that we have to do two conversions here.

Martin Robinson

Comment 18 2013-01-22 10:59:37 PST

Comment on attachment 183894 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=183894&action=review >> Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1200 >> + if (!entityLengthInUTF8) > > It's unfortunate that we have to do two conversions here. That's true. Perhaps a future patch could also pre-calculate or cache a table of resolved entities in UTF-8. It's probably a bit out-of-scope for this patch though (assuming this turns out be something we want). Note that there's only one character set conversion here. The first "decode" is merely a map from the name of the entity to the UTF-16 code points.

Ian 'Hixie' Hickson

Comment 19 2013-01-22 12:34:49 PST

The HTML spec defines how named entities in XML are to be parsed here: http://whatwg.org/html/#parsing-xhtml-documents HTH.

Adam Barth

Comment 20 2013-01-22 12:46:45 PST

The spec gives a data URL containing a pile of entities. We should compare that list to the list of HTML named entities. (The spec seems to say they're the same.)

Martin Robinson

Comment 21 2013-01-22 12:55:37 PST

(In reply to comment #20) > The spec gives a data URL containing a pile of entities. We should compare that list to the list of HTML named entities. (The spec seems to say they're the same.) I compared the list in the link to the list of named entities in Source/WebCore/html/parser/HTMLEntityNames.in. They appear to be identical. I guess it makes sense to fix my patch and reupload.

Adam Barth

Comment 22 2013-01-22 12:59:19 PST

SGTM

Build Bot

Comment 23 2013-01-22 20:46:36 PST

Comment on attachment 183894 [details] Patch Attachment 183894 [details] did not pass mac-wk2-ews (mac-wk2): Output: http://queues.webkit.org/results/16045946 New failing tests: fast/parser/entities-in-xhtml.xhtml

Martin Robinson

Comment 24 2013-01-22 21:04:24 PST

Created attachment 184125 [details] Patch

Adam Barth

Comment 25 2013-01-22 22:06:21 PST

Comment on attachment 184125 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=184125&action=review This looks good, modulo the code point / code unit naming question. rniwa is right that it would be better to go straight to UTF-8, but we can leave that for a future patch. > Source/WebCore/html/parser/HTMLEntityParser.cpp:169 > + size_t numberOfCodePoints = appendUChar32ToUCharArray(search.mostRecentMatch()->firstValue, result); numberOfCodePoints -> numberOfCodeUnits, right? > Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1182 > + WTF::Unicode::ConversionResult conversionResult = WTF::Unicode::convertUTF16ToUTF8(&utf16Entity, > + utf16Entity + numberOfCodePoints, &target, target + 9); Can we get this 9 as sizeof sharedXHTMLEntityResult ?

Martin Robinson

Comment 26 2013-01-22 22:11:34 PST

Comment on attachment 184125 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=184125&action=review >> Source/WebCore/html/parser/HTMLEntityParser.cpp:169 >> + size_t numberOfCodePoints = appendUChar32ToUCharArray(search.mostRecentMatch()->firstValue, result); > > numberOfCodePoints -> numberOfCodeUnits, right? numberOfCodeUnits is definitely what I was trying to right here. Thanks for finding that. >> Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1182 >> + utf16Entity + numberOfCodePoints, &target, target + 9); > > Can we get this 9 as sizeof sharedXHTMLEntityResult ? I could probably use WTF_ARRAY_LENGTH here, indeed.

Martin Robinson

Comment 27 2013-01-23 08:42:58 PST

Created attachment 184245 [details] Patch for landing

WebKit Review Bot

Comment 28 2013-01-23 16:28:02 PST

Comment on attachment 184245 [details] Patch for landing Clearing flags on attachment: 184245 Committed r140610: <http://trac.webkit.org/changeset/140610>

WebKit Review Bot

Comment 29 2013-01-23 16:28:07 PST

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.