Bug 107459

Summary: WebKit should support decoding multi-byte entities in XML content
Product: WebKit Reporter: Martin Robinson <mrobinson>
Component: XMLAssignee: Martin Robinson <mrobinson>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, ap, buildbot, dglazkov, eric, ian, ojan.autocc, rniwa, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=162215
Bug Depends on: 74826    
Bug Blocks:    
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch for landing none

Description Martin Robinson 2013-01-21 09:23:24 PST
Right now the set of entities WebKit can decode in XML is the same as the set that can be decoded in HTML, except for entities that resolve to code points not in the BMP and entities that resolve to multiple code points. This bug is about removing the assumption that an entity necessarily resolves to a single UChar from the code.
Comment 1 Martin Robinson 2013-01-21 09:32:16 PST
Created attachment 183800 [details]
Patch
Comment 2 Darin Adler 2013-01-21 17:09:50 PST
Comment on attachment 183800 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=183800&action=review

Code changes looks like it will function OK, but it’s not using String and CString the way they should be. It’s worth it to spend a little more time on this to get rid of all the needless memory allocation.

> Source/WebCore/html/parser/HTMLEntityParser.cpp:153
> +static void appendUChar32ToString(UChar32 value, String& string)
> +{
> +    if (U_IS_BMP(value)) {
> +        UChar character = static_cast<UChar>(value);
> +        ASSERT(character == value);
> +        string.append(character);
> +        return;
> +    }
> +    string.append(U16_LEAD(value));
> +    string.append(U16_TRAIL(value));
> +}

This has terrible performance! We’ll allocate a new StringImpl for each call to String::append. We should not have code like this anywhere in WebKit. When we need to build up a string we need to use StringBuilder or a Vector. If the string is always short and fixed size then we can use a local UChar array and construct a string from it.

> Source/WebCore/html/parser/HTMLEntityParser.h:37
> +String decodeNamedEntity(const char*);

Returning a String from this function is not a good idea for performance; we don’t want to do memory allocation here. Since this string can only be a few characters long, then I suggest using a fixed size buffer of some sort instead.

> Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1184
> -    UChar c = decodeNamedEntity(reinterpret_cast<const char*>(name));
> -    if (!c)
> +    String decodedEntity = decodeNamedEntity(reinterpret_cast<const char*>(name));
> +    if (decodedEntity.isNull())
>          return 0;
>  
> -    CString value = String(&c, 1).utf8();
> -    ASSERT(value.length() < 5);
> +    CString value = decodedEntity.utf8();
> +    ASSERT(value.length() < 9);

For both the old and new code, doing this memory allocation is bad for performance. We can convert from UTF-16 to UTF-8 without creating String and CString objects, which result in memory allocation. We should use some short fixed size arrays for this work.
Comment 3 Adam Barth 2013-01-21 23:01:59 PST
Comment on attachment 183800 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=183800&action=review

> Source/WebCore/html/parser/HTMLEntityParser.h:-37
> -// FIXME: Move the XML parser to an entity decoding function works for non-BMP characters!

The intent behind this FIXME was to delete this function.  The XML parser shouldn't be leaning on the HTML parser for this work.
Comment 4 Adam Barth 2013-01-21 23:06:56 PST
Comment on attachment 183800 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=183800&action=review

> Source/WebCore/html/parser/HTMLEntityParser.h:34
>  bool consumeHTMLEntity(SegmentedString&, StringBuilder& decodedEntity, bool& notEnoughCharacters, UChar additionalAllowedCharacter = '\0');

Notice that consumeHTMLEntity has a signature that allows for a much more efficient implementation.  The right fix here is to make an efficient version for the XML parser (and to disentangle the XML parser code from the HTML parser code).
Comment 5 Martin Robinson 2013-01-21 23:11:13 PST
Created attachment 183894 [details]
Patch
Comment 6 Martin Robinson 2013-01-21 23:13:11 PST
Comment on attachment 183894 [details]
Patch

Removing review flag after Adam's comment.
Comment 7 Martin Robinson 2013-01-21 23:17:02 PST
(In reply to comment #4)
> (From update of attachment 183800 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=183800&action=review
> 
> > Source/WebCore/html/parser/HTMLEntityParser.h:34
> >  bool consumeHTMLEntity(SegmentedString&, StringBuilder& decodedEntity, bool& notEnoughCharacters, UChar additionalAllowedCharacter = '\0');
> 
> Notice that consumeHTMLEntity has a signature that allows for a much more efficient implementation.  The right fix here is to make an efficient version for the XML parser (and to disentangle the XML parser code from the HTML parser code).

Okay. So the patch that I just posted has a more efficient design, but still uses the entity table generated for the HTML5 parser.  XHTML and XML support a much smaller set of entities than HTML, I believe.  I wrote this patch because it seemed like having XML support just the BMP entities from HTML5 was doubly-wrong.  What would you like to see for this code ultimately? Should I generate a separate table with just the entities from XML, try to extract out the entity table from the HTML5 parser, or stick with something like my latest patch?
Comment 8 Adam Barth 2013-01-21 23:20:04 PST
Comment on attachment 183894 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=183894&action=review

> Source/WebCore/html/parser/HTMLEntityParser.h:37
> +size_t decodeNamedEntityToUCharArray(const char*, UChar result[4]);

This function doesn't actually look too bad.  This patch might make sense as an incremental improvement.  It looks like this function depends on HTMLEntitySearch, which I guess is needed if XHTML supports the HTML named entities.  That means we probably won't be able to fully disentangle XML parsing from HTML parsing.
Comment 9 Adam Barth 2013-01-21 23:21:50 PST
> The XML parser shouldn't be leaning on the HTML parser for this work.

^^^ It looks like this comment is bogus.  XHTML needs to use at least some of the HTML named entities.
Comment 10 Adam Barth 2013-01-21 23:24:06 PST
> Okay. So the patch that I just posted has a more efficient design, but still uses the entity table generated for the HTML5 parser.  XHTML and XML support a much smaller set of entities than HTML, I believe.

We should double-check that.  I know XML has a very small list, but I wonder if XHTML doesn't have the full set of HTML named entities.

> I wrote this patch because it seemed like having XML support just the BMP entities from HTML5 was doubly-wrong.

Yes.  :)

> What would you like to see for this code ultimately? Should I generate a separate table with just the entities from XML, try to extract out the entity table from the HTML5 parser, or stick with something like my latest patch?

We should figure out what the correct behavior is according to the specs.  If we're supposed to use the full HTML named entity table for XHTML, then something like your current patch looks reasonable.  If not, we'll want to do some more work to get the right set of named entities.
Comment 11 Martin Robinson 2013-01-21 23:56:04 PST
(In reply to comment #10)

(By the way, thanks for the insightful comments and reviews, Darin and Adam)

> > What would you like to see for this code ultimately? Should I generate a separate table with just the entities from XML, try to extract out the entity table from the HTML5 parser, or stick with something like my latest patch?
> 
> We should figure out what the correct behavior is according to the specs.  If we're supposed to use the full HTML named entity table for XHTML, then something like your current patch looks reasonable.  If not, we'll want to do some more work to get the right set of named entities.

Is it also the case that the XML parser might be used to parse SVG or MathML documents? The only entity list I can find for XHTML1 includes a small subset of the ones in HTML5: http://www.w3.org/TR/2010/REC-xhtml-modularization-20100729/schema_module_defs.html#a_smodule_XHTML_Character_Entities I cannot find a similar list for XHTML2. On the other hand,  the MathML spec uses a non-BMP character entity as an example: http://www.w3.org/TR/2010/REC-MathML3-20101021/chapter7.html#chars.BMP-SMP

Perhaps the appropriate spec here is the Entities one: http://www.w3.org/TR/xml-entity-names/ This spec lists a very large set of names, including many not in the BMP.
Comment 12 Adam Barth 2013-01-22 00:12:39 PST
> Is it also the case that the XML parser might be used to parse SVG or MathML documents?

Yes, but I'm not sure whether that impacts the named entities that we need.

> The only entity list I can find for XHTML1 includes a small subset of the ones in HTML5: http://www.w3.org/TR/2010/REC-xhtml-modularization-20100729/schema_module_defs.html#a_smodule_XHTML_Character_Entities I cannot find a similar list for XHTML2. On the other hand,  the MathML spec uses a non-BMP character entity as an example: http://www.w3.org/TR/2010/REC-MathML3-20101021/chapter7.html#chars.BMP-SMP
> 
> Perhaps the appropriate spec here is the Entities one: http://www.w3.org/TR/xml-entity-names/ This spec lists a very large set of names, including many not in the BMP.

We can ask Hixie.  I'm sure he'll know.
Comment 13 Adam Barth 2013-01-22 00:13:00 PST
@Hixie: What named entities should we support in XHTML?  The same ones we support in HTML?
Comment 14 Build Bot 2013-01-22 00:39:56 PST
Comment on attachment 183894 [details]
Patch

Attachment 183894 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/16037457

New failing tests:
fast/parser/entities-in-xhtml.xhtml
Comment 15 WebKit Review Bot 2013-01-22 01:23:38 PST
Comment on attachment 183894 [details]
Patch

Attachment 183894 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/16043434

New failing tests:
fast/parser/entities-in-xhtml.xhtml
Comment 16 WebKit Review Bot 2013-01-22 01:58:44 PST
Comment on attachment 183894 [details]
Patch

Attachment 183894 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/16037493

New failing tests:
fast/parser/entities-in-xhtml.xhtml
Comment 17 Ryosuke Niwa 2013-01-22 10:52:08 PST
Comment on attachment 183894 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=183894&action=review

> Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1200
> +    size_t numberOfCodePoints = decodeNamedEntityToUCharArray(reinterpret_cast<const char*>(name), utf16DecodedEntity);
> +    if (!numberOfCodePoints)
> +        return 0;
> +
> +    ASSERT(numberOfCodePoints <= 4);
> +    size_t entityLengthInUTF8 = convertUTF16EntityToUTF8(utf16DecodedEntity, numberOfCodePoints, reinterpret_cast<char*>(sharedXHTMLEntityResult));
> +    if (!entityLengthInUTF8)

It's unfortunate that we have to do two conversions here.
Comment 18 Martin Robinson 2013-01-22 10:59:37 PST
Comment on attachment 183894 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=183894&action=review

>> Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1200
>> +    if (!entityLengthInUTF8)
> 
> It's unfortunate that we have to do two conversions here.

That's true. Perhaps a future patch could also pre-calculate or cache a table of resolved entities in UTF-8. It's probably a bit out-of-scope for this patch though (assuming this turns out be something we want).

Note that there's only one character set conversion here. The first "decode" is merely a map from the name of the entity to the UTF-16 code points.
Comment 19 Ian 'Hixie' Hickson 2013-01-22 12:34:49 PST
The HTML spec defines how named entities in XML are to be parsed here:

   http://whatwg.org/html/#parsing-xhtml-documents

HTH.
Comment 20 Adam Barth 2013-01-22 12:46:45 PST
The spec gives a data URL containing a pile of entities.  We should compare that list to the list of HTML named entities.  (The spec seems to say they're the same.)
Comment 21 Martin Robinson 2013-01-22 12:55:37 PST
(In reply to comment #20)
> The spec gives a data URL containing a pile of entities.  We should compare that list to the list of HTML named entities.  (The spec seems to say they're the same.)

I compared the list in the link to the list of named entities in Source/WebCore/html/parser/HTMLEntityNames.in. They appear to be identical. I guess it makes sense to fix my patch and reupload.
Comment 22 Adam Barth 2013-01-22 12:59:19 PST
SGTM
Comment 23 Build Bot 2013-01-22 20:46:36 PST
Comment on attachment 183894 [details]
Patch

Attachment 183894 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://queues.webkit.org/results/16045946

New failing tests:
fast/parser/entities-in-xhtml.xhtml
Comment 24 Martin Robinson 2013-01-22 21:04:24 PST
Created attachment 184125 [details]
Patch
Comment 25 Adam Barth 2013-01-22 22:06:21 PST
Comment on attachment 184125 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=184125&action=review

This looks good, modulo the code point / code unit naming question.  rniwa is right that it would be better to go straight to UTF-8, but we can leave that for a future patch.

> Source/WebCore/html/parser/HTMLEntityParser.cpp:169
> +    size_t numberOfCodePoints = appendUChar32ToUCharArray(search.mostRecentMatch()->firstValue, result);

numberOfCodePoints -> numberOfCodeUnits, right?

> Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1182
> +    WTF::Unicode::ConversionResult conversionResult = WTF::Unicode::convertUTF16ToUTF8(&utf16Entity,
> +        utf16Entity + numberOfCodePoints, &target, target + 9);

Can we get this 9 as sizeof sharedXHTMLEntityResult ?
Comment 26 Martin Robinson 2013-01-22 22:11:34 PST
Comment on attachment 184125 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=184125&action=review

>> Source/WebCore/html/parser/HTMLEntityParser.cpp:169
>> +    size_t numberOfCodePoints = appendUChar32ToUCharArray(search.mostRecentMatch()->firstValue, result);
> 
> numberOfCodePoints -> numberOfCodeUnits, right?

numberOfCodeUnits is definitely what I was trying to right here. Thanks for finding that.

>> Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1182
>> +        utf16Entity + numberOfCodePoints, &target, target + 9);
> 
> Can we get this 9 as sizeof sharedXHTMLEntityResult ?

I could probably use WTF_ARRAY_LENGTH here, indeed.
Comment 27 Martin Robinson 2013-01-23 08:42:58 PST
Created attachment 184245 [details]
Patch for landing
Comment 28 WebKit Review Bot 2013-01-23 16:28:02 PST
Comment on attachment 184245 [details]
Patch for landing

Clearing flags on attachment: 184245

Committed r140610: <http://trac.webkit.org/changeset/140610>
Comment 29 WebKit Review Bot 2013-01-23 16:28:07 PST
All reviewed patches have been landed.  Closing bug.