Right now the set of entities WebKit can decode in XML is the same as the set that can be decoded in HTML, except for entities that resolve to code points not in the BMP and entities that resolve to multiple code points. This bug is about removing the assumption that an entity necessarily resolves to a single UChar from the code.
Created attachment 183800 [details] Patch
Comment on attachment 183800 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=183800&action=review Code changes looks like it will function OK, but it’s not using String and CString the way they should be. It’s worth it to spend a little more time on this to get rid of all the needless memory allocation. > Source/WebCore/html/parser/HTMLEntityParser.cpp:153 > +static void appendUChar32ToString(UChar32 value, String& string) > +{ > + if (U_IS_BMP(value)) { > + UChar character = static_cast<UChar>(value); > + ASSERT(character == value); > + string.append(character); > + return; > + } > + string.append(U16_LEAD(value)); > + string.append(U16_TRAIL(value)); > +} This has terrible performance! We’ll allocate a new StringImpl for each call to String::append. We should not have code like this anywhere in WebKit. When we need to build up a string we need to use StringBuilder or a Vector. If the string is always short and fixed size then we can use a local UChar array and construct a string from it. > Source/WebCore/html/parser/HTMLEntityParser.h:37 > +String decodeNamedEntity(const char*); Returning a String from this function is not a good idea for performance; we don’t want to do memory allocation here. Since this string can only be a few characters long, then I suggest using a fixed size buffer of some sort instead. > Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1184 > - UChar c = decodeNamedEntity(reinterpret_cast<const char*>(name)); > - if (!c) > + String decodedEntity = decodeNamedEntity(reinterpret_cast<const char*>(name)); > + if (decodedEntity.isNull()) > return 0; > > - CString value = String(&c, 1).utf8(); > - ASSERT(value.length() < 5); > + CString value = decodedEntity.utf8(); > + ASSERT(value.length() < 9); For both the old and new code, doing this memory allocation is bad for performance. We can convert from UTF-16 to UTF-8 without creating String and CString objects, which result in memory allocation. We should use some short fixed size arrays for this work.
Comment on attachment 183800 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=183800&action=review > Source/WebCore/html/parser/HTMLEntityParser.h:-37 > -// FIXME: Move the XML parser to an entity decoding function works for non-BMP characters! The intent behind this FIXME was to delete this function. The XML parser shouldn't be leaning on the HTML parser for this work.
Comment on attachment 183800 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=183800&action=review > Source/WebCore/html/parser/HTMLEntityParser.h:34 > bool consumeHTMLEntity(SegmentedString&, StringBuilder& decodedEntity, bool& notEnoughCharacters, UChar additionalAllowedCharacter = '\0'); Notice that consumeHTMLEntity has a signature that allows for a much more efficient implementation. The right fix here is to make an efficient version for the XML parser (and to disentangle the XML parser code from the HTML parser code).
Created attachment 183894 [details] Patch
Comment on attachment 183894 [details] Patch Removing review flag after Adam's comment.
(In reply to comment #4) > (From update of attachment 183800 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=183800&action=review > > > Source/WebCore/html/parser/HTMLEntityParser.h:34 > > bool consumeHTMLEntity(SegmentedString&, StringBuilder& decodedEntity, bool& notEnoughCharacters, UChar additionalAllowedCharacter = '\0'); > > Notice that consumeHTMLEntity has a signature that allows for a much more efficient implementation. The right fix here is to make an efficient version for the XML parser (and to disentangle the XML parser code from the HTML parser code). Okay. So the patch that I just posted has a more efficient design, but still uses the entity table generated for the HTML5 parser. XHTML and XML support a much smaller set of entities than HTML, I believe. I wrote this patch because it seemed like having XML support just the BMP entities from HTML5 was doubly-wrong. What would you like to see for this code ultimately? Should I generate a separate table with just the entities from XML, try to extract out the entity table from the HTML5 parser, or stick with something like my latest patch?
Comment on attachment 183894 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=183894&action=review > Source/WebCore/html/parser/HTMLEntityParser.h:37 > +size_t decodeNamedEntityToUCharArray(const char*, UChar result[4]); This function doesn't actually look too bad. This patch might make sense as an incremental improvement. It looks like this function depends on HTMLEntitySearch, which I guess is needed if XHTML supports the HTML named entities. That means we probably won't be able to fully disentangle XML parsing from HTML parsing.
> The XML parser shouldn't be leaning on the HTML parser for this work. ^^^ It looks like this comment is bogus. XHTML needs to use at least some of the HTML named entities.
> Okay. So the patch that I just posted has a more efficient design, but still uses the entity table generated for the HTML5 parser. XHTML and XML support a much smaller set of entities than HTML, I believe. We should double-check that. I know XML has a very small list, but I wonder if XHTML doesn't have the full set of HTML named entities. > I wrote this patch because it seemed like having XML support just the BMP entities from HTML5 was doubly-wrong. Yes. :) > What would you like to see for this code ultimately? Should I generate a separate table with just the entities from XML, try to extract out the entity table from the HTML5 parser, or stick with something like my latest patch? We should figure out what the correct behavior is according to the specs. If we're supposed to use the full HTML named entity table for XHTML, then something like your current patch looks reasonable. If not, we'll want to do some more work to get the right set of named entities.
(In reply to comment #10) (By the way, thanks for the insightful comments and reviews, Darin and Adam) > > What would you like to see for this code ultimately? Should I generate a separate table with just the entities from XML, try to extract out the entity table from the HTML5 parser, or stick with something like my latest patch? > > We should figure out what the correct behavior is according to the specs. If we're supposed to use the full HTML named entity table for XHTML, then something like your current patch looks reasonable. If not, we'll want to do some more work to get the right set of named entities. Is it also the case that the XML parser might be used to parse SVG or MathML documents? The only entity list I can find for XHTML1 includes a small subset of the ones in HTML5: http://www.w3.org/TR/2010/REC-xhtml-modularization-20100729/schema_module_defs.html#a_smodule_XHTML_Character_Entities I cannot find a similar list for XHTML2. On the other hand, the MathML spec uses a non-BMP character entity as an example: http://www.w3.org/TR/2010/REC-MathML3-20101021/chapter7.html#chars.BMP-SMP Perhaps the appropriate spec here is the Entities one: http://www.w3.org/TR/xml-entity-names/ This spec lists a very large set of names, including many not in the BMP.
> Is it also the case that the XML parser might be used to parse SVG or MathML documents? Yes, but I'm not sure whether that impacts the named entities that we need. > The only entity list I can find for XHTML1 includes a small subset of the ones in HTML5: http://www.w3.org/TR/2010/REC-xhtml-modularization-20100729/schema_module_defs.html#a_smodule_XHTML_Character_Entities I cannot find a similar list for XHTML2. On the other hand, the MathML spec uses a non-BMP character entity as an example: http://www.w3.org/TR/2010/REC-MathML3-20101021/chapter7.html#chars.BMP-SMP > > Perhaps the appropriate spec here is the Entities one: http://www.w3.org/TR/xml-entity-names/ This spec lists a very large set of names, including many not in the BMP. We can ask Hixie. I'm sure he'll know.
@Hixie: What named entities should we support in XHTML? The same ones we support in HTML?
Comment on attachment 183894 [details] Patch Attachment 183894 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/16037457 New failing tests: fast/parser/entities-in-xhtml.xhtml
Comment on attachment 183894 [details] Patch Attachment 183894 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/16043434 New failing tests: fast/parser/entities-in-xhtml.xhtml
Comment on attachment 183894 [details] Patch Attachment 183894 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/16037493 New failing tests: fast/parser/entities-in-xhtml.xhtml
Comment on attachment 183894 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=183894&action=review > Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1200 > + size_t numberOfCodePoints = decodeNamedEntityToUCharArray(reinterpret_cast<const char*>(name), utf16DecodedEntity); > + if (!numberOfCodePoints) > + return 0; > + > + ASSERT(numberOfCodePoints <= 4); > + size_t entityLengthInUTF8 = convertUTF16EntityToUTF8(utf16DecodedEntity, numberOfCodePoints, reinterpret_cast<char*>(sharedXHTMLEntityResult)); > + if (!entityLengthInUTF8) It's unfortunate that we have to do two conversions here.
Comment on attachment 183894 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=183894&action=review >> Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1200 >> + if (!entityLengthInUTF8) > > It's unfortunate that we have to do two conversions here. That's true. Perhaps a future patch could also pre-calculate or cache a table of resolved entities in UTF-8. It's probably a bit out-of-scope for this patch though (assuming this turns out be something we want). Note that there's only one character set conversion here. The first "decode" is merely a map from the name of the entity to the UTF-16 code points.
The HTML spec defines how named entities in XML are to be parsed here: http://whatwg.org/html/#parsing-xhtml-documents HTH.
The spec gives a data URL containing a pile of entities. We should compare that list to the list of HTML named entities. (The spec seems to say they're the same.)
(In reply to comment #20) > The spec gives a data URL containing a pile of entities. We should compare that list to the list of HTML named entities. (The spec seems to say they're the same.) I compared the list in the link to the list of named entities in Source/WebCore/html/parser/HTMLEntityNames.in. They appear to be identical. I guess it makes sense to fix my patch and reupload.
SGTM
Comment on attachment 183894 [details] Patch Attachment 183894 [details] did not pass mac-wk2-ews (mac-wk2): Output: http://queues.webkit.org/results/16045946 New failing tests: fast/parser/entities-in-xhtml.xhtml
Created attachment 184125 [details] Patch
Comment on attachment 184125 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=184125&action=review This looks good, modulo the code point / code unit naming question. rniwa is right that it would be better to go straight to UTF-8, but we can leave that for a future patch. > Source/WebCore/html/parser/HTMLEntityParser.cpp:169 > + size_t numberOfCodePoints = appendUChar32ToUCharArray(search.mostRecentMatch()->firstValue, result); numberOfCodePoints -> numberOfCodeUnits, right? > Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1182 > + WTF::Unicode::ConversionResult conversionResult = WTF::Unicode::convertUTF16ToUTF8(&utf16Entity, > + utf16Entity + numberOfCodePoints, &target, target + 9); Can we get this 9 as sizeof sharedXHTMLEntityResult ?
Comment on attachment 184125 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=184125&action=review >> Source/WebCore/html/parser/HTMLEntityParser.cpp:169 >> + size_t numberOfCodePoints = appendUChar32ToUCharArray(search.mostRecentMatch()->firstValue, result); > > numberOfCodePoints -> numberOfCodeUnits, right? numberOfCodeUnits is definitely what I was trying to right here. Thanks for finding that. >> Source/WebCore/xml/parser/XMLDocumentParserLibxml2.cpp:1182 >> + utf16Entity + numberOfCodePoints, &target, target + 9); > > Can we get this 9 as sizeof sharedXHTMLEntityResult ? I could probably use WTF_ARRAY_LENGTH here, indeed.
Created attachment 184245 [details] Patch for landing
Comment on attachment 184245 [details] Patch for landing Clearing flags on attachment: 184245 Committed r140610: <http://trac.webkit.org/changeset/140610>
All reviewed patches have been landed. Closing bug.