RESOLVED FIXED 74826
WebKit should support HTML entities that expand to more than one character
https://bugs.webkit.org/show_bug.cgi?id=74826
Summary WebKit should support HTML entities that expand to more than one character
Adam Barth
Reported 2011-12-18 16:22:30 PST
WebKit should support HTML entities that expand to more than one character
Attachments
Patch (23.03 KB, patch)
2011-12-18 16:30 PST, Adam Barth
no flags
Patch for landing (22.45 KB, patch)
2011-12-18 22:53 PST, Adam Barth
no flags
Patch for landing (22.40 KB, patch)
2011-12-19 09:13 PST, Adam Barth
no flags
Adam Barth
Comment 1 2011-12-18 16:30:15 PST
Darin Adler
Comment 2 2011-12-18 17:49:13 PST
Comment on attachment 119787 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=119787&action=review > Source/WebCore/html/parser/HTMLEntityParser.cpp:160 > + if (U16_LENGTH(entityValue) != 1 || search.mostRecentMatch()->rightValue) { It’s fine to have one of these checks here at runtime. But the other should be an assertion. There’s no need to check both. Of course, you plan to delete this whole function eventually, I assume. > Source/WebCore/html/parser/HTMLEntityParser.h:37 > +// FIXME: Move the XML parser to an entity decoding function that actually works! The “that actually works” here is unnecessarily mysterious. You could say something specific like “that works for non-BMP characters”. > Source/WebCore/html/parser/HTMLEntityTable.h:39 > + UChar32 leftValue; > + UChar32 rightValue; Using left and right here is curious. I would have called these first and second or something that implies ordering rather than direction. > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:382 > + UChar32 entityValue = search.mostRecentMatch()->leftValue; > + // FIXME: We need to account for rightValue if any XML entities are longer > + // than one unicode character. > + ASSERT(!search.mostRecentMatch()->rightValue); What prevents this assertion from firing? Is there a test case demonstrating that? > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:385 > if (entityValue <= 0xFFFF) > appendToText(reinterpret_cast<UChar*>(&entityValue), 1); > else { You can see given the code above that this else is dead code. This code is in a strange state. And the reinterpret_cast to UChar* makes the code little-endian-specific. That is not good!
Adam Barth
Comment 3 2011-12-18 17:55:00 PST
Thanks for the review. I'll go through your comments carefully before landing. I can answer one of your questions quickly though: (In reply to comment #2) > (From update of attachment 119787 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=119787&action=review > > > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:382 > > + UChar32 entityValue = search.mostRecentMatch()->leftValue; > > + // FIXME: We need to account for rightValue if any XML entities are longer > > + // than one unicode character. > > + ASSERT(!search.mostRecentMatch()->rightValue); > > What prevents this assertion from firing? Is there a test case demonstrating that? This code is part of ENABLE(NEWXML), which was an intern project this past summer. Currently it's impossible to execute this code without enabling NEWXML. It might be better to change this to an ASSERT_NOT_REACHED to remind whoever tries to enable this code in the future that this issue needs to be addressed (in addition to the other remaining issues with the new XML parser).
Adam Barth
Comment 4 2011-12-18 22:52:43 PST
> Of course, you plan to delete this whole function eventually, I assume. Yes. The function is deeply broken and needs to be deleted. > > Source/WebCore/html/parser/HTMLEntityParser.h:37 > > +// FIXME: Move the XML parser to an entity decoding function that actually works! > > The “that actually works” here is unnecessarily mysterious. You could say something specific like “that works for non-BMP characters”. Fixed. > > Source/WebCore/html/parser/HTMLEntityTable.h:39 > > + UChar32 leftValue; > > + UChar32 rightValue; > > Using left and right here is curious. I would have called these first and second or something that implies ordering rather than direction. Done. > > Source/WebCore/xml/parser/XMLTreeBuilder.cpp:385 > > if (entityValue <= 0xFFFF) > > appendToText(reinterpret_cast<UChar*>(&entityValue), 1); > > else { > > You can see given the code above that this else is dead code. This code is in a strange state. > > And the reinterpret_cast to UChar* makes the code little-endian-specific. That is not good! I've added an ASSERT_NOT_REACHED() to this function to ensure that future developers working on bringing up the XML parser read these comments. I've also added your notes as comments.
Adam Barth
Comment 5 2011-12-18 22:53:59 PST
Created attachment 119818 [details] Patch for landing
WebKit Review Bot
Comment 6 2011-12-19 00:40:01 PST
Comment on attachment 119818 [details] Patch for landing Rejecting attachment 119818 [details] from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '-..." exit_code: 2 Last 500 characters of output: git/webkit-commit-queue/Source/WebKit/chromium/ui --revision 114686 --non-interactive --force --accept theirs-conflict --ignore-externals' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium' 45>At revision 114686. ________ running '/usr/bin/python tools/clang/scripts/update.py --mac-only' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium' ________ running '/usr/bin/python gyp_webkit' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium' Updating webkit projects from gyp files... Full output: http://queues.webkit.org/results/10930530
Adam Barth
Comment 7 2011-12-19 09:13:46 PST
Created attachment 119869 [details] Patch for landing
WebKit Review Bot
Comment 8 2011-12-19 10:30:40 PST
Comment on attachment 119869 [details] Patch for landing Clearing flags on attachment: 119869 Committed r103246: <http://trac.webkit.org/changeset/103246>
WebKit Review Bot
Comment 9 2011-12-19 10:30:45 PST
All reviewed patches have been landed. Closing bug.
Mathias Bynens
Comment 10 2012-05-24 23:32:06 PDT
With this change, http://mathias.html5.org/tests/named-character-references/ only lists a single error: > &AElig doesn’t match Æ (Even though the entry in Source/WebCore/html/parser/HTMLEntityNames.in looks alright: http://trac.webkit.org/browser/trunk/Source/WebCore/html/parser/HTMLEntityNames.in#L2) See bug 87465 for that.
Note You need to log in before you can comment on or make changes to this bug.