RESOLVED FIXED 4301
Support HTML entities on pages parsed as XHTML
https://bugs.webkit.org/show_bug.cgi?id=4301
Summary Support HTML entities on pages parsed as XHTML
Denis Defreyne
Reported 2005-08-06 11:25:39 PDT
OVERVIEW DESCRIPTION: Some entities in documents parsed using WebKit's XML parser are not recognised. Examples of unrecognised entities include 'copy' and 'raquo'. STEPS TO REPRODUCE: Load an XML page containing entities such as 'copy' and 'raquo'. ACTUAL RESULTS: The entities are not displayed. The top of the page shows the following error message: This page contains the following errors: error on line 7 at column 56: Entity 'copy' not defined Below is a rendering of the page up to the first error. EXPECTED RESULTS: The entities should show up and there should be no error message. ADDITIONAL BUILDS: This bug can be reproduced in 412 and up. I believe I encountered this bug in 312 as well, but I'm not 100% sure. ADDITIONAL INFORMATION: Darin said: "we broke it by moving from expat to libxml2".
Attachments
First pass at fixing this (7.22 KB, patch)
2005-10-17 00:53 PDT, Eric Seidel (no email)
no flags
Updated patch that handles named entities correctly (3.56 KB, patch)
2005-11-01 05:19 PST, Mark Rowe (bdash)
no flags
Patch with assertion to ensure data will fit in static buffer (3.63 KB, patch)
2005-11-01 16:36 PST, Mark Rowe (bdash)
eric: review-
Updated patch including layout test + comment change (367.93 KB, patch)
2005-12-28 07:41 PST, Mark Rowe (bdash)
eric: review+
Patch addressing darin's comments. (7.17 KB, patch)
2005-12-29 15:21 PST, Eric Seidel (no email)
no flags
now including kentities.h (9.00 KB, patch)
2005-12-29 15:27 PST, Eric Seidel (no email)
no flags
Mark Rowe (bdash)
Comment 1 2005-08-06 22:16:05 PDT
Confirmed with ToT.
Eric Seidel (no email)
Comment 2 2005-09-08 00:01:07 PDT
This one is mine.
Eric Seidel (no email)
Comment 3 2005-10-05 15:44:31 PDT
This is fixed in TOT.
Eric Seidel (no email)
Comment 4 2005-10-05 16:10:23 PDT
Oops.
Eric Seidel (no email)
Comment 5 2005-10-05 16:23:58 PDT
We're only supposed to support these entities: http://www.w3.org/TR/REC-xml/#sec-predefined-ent this xml file seems to depend on an html entity "copy", which xml as far as I can tell, is not supposed to support: http://www.w3.org/TR/REC-html40/sgml/entities.html
Eric Seidel (no email)
Comment 6 2005-10-16 21:36:51 PDT
*** Bug 3861 has been marked as a duplicate of this bug. ***
Eric Seidel (no email)
Comment 7 2005-10-16 21:47:03 PDT
I was incorrect. XHTML documents require us to support HTML entities. So this is a valid bug: http://www.w3.org/TR/xhtml1/#h-A2
Eric Seidel (no email)
Comment 8 2005-10-17 00:53:40 PDT
Created attachment 4378 [details] First pass at fixing this Ok, so I took a stab at fixing this tonight. Unfortunately there are some other minor changes for XSLTProcessor support included in this patch. Named entities still don't work correctly (I'm not sure why?), but numerical entities work just fine. I'll hopefully get around to finishing this later this week, unless somebody beats me to it.
Eric Seidel (no email)
Comment 9 2005-10-17 00:54:52 PDT
FYI I've been using this *fabulous* set of test pages: http://www.w3.org/People/mimasa/test/xhtml/entities/
Mark Rowe (bdash)
Comment 10 2005-11-01 05:19:36 PST
Created attachment 4550 [details] Updated patch that handles named entities correctly The thing that was preventing named entities from working was that the values were being passed through as utf-16 while libxml was expecting utf-8. It's also worth noting that character references ("numerical entities") are handled internally by libxml so getXHTMLEntity doesn't need to handle them.
Eric Seidel (no email)
Comment 11 2005-11-01 11:37:21 PST
Comment on attachment 4550 [details] Updated patch that handles named entities correctly Patch looks great. You should add an ASSERT(value.length() < 5) to make sure we don't screw ourselves by changing the entity lookup code some day.
Mark Rowe (bdash)
Comment 12 2005-11-01 16:36:11 PST
Created attachment 4556 [details] Patch with assertion to ensure data will fit in static buffer
Eric Seidel (no email)
Comment 13 2005-11-01 16:43:02 PST
Comment on attachment 4556 [details] Patch with assertion to ensure data will fit in static buffer Looks great. r=me. Darin or mjs could look at it, but I don't think they're gonna have anything else to say.
Eric Seidel (no email)
Comment 14 2005-11-01 16:45:58 PST
Comment on attachment 4556 [details] Patch with assertion to ensure data will fit in static buffer Oops. Almost forgot. This needs a test case. I suggest just landing the w3c test page w/ text results.
Eric Seidel (no email)
Comment 15 2005-11-01 17:05:33 PST
As darin pointed out, the code comment should be changed: Using a global variable entity and marking it XML_INTERNAL_PREDEFINED_ENTITY is a hack to avoid malloc/free. However using a global variable like this could cause trouble if libxml implementation details were to change.
Eric Seidel (no email)
Comment 16 2005-11-30 01:37:54 PST
Mark and I sorta left this one hanging...
Mark Rowe (bdash)
Comment 17 2005-12-28 07:41:45 PST
Created attachment 5337 [details] Updated patch including layout test + comment change
Darin Adler
Comment 18 2005-12-28 10:40:45 PST
Comment on attachment 5337 [details] Updated patch including layout test + comment change I think getXHTMLEntity should use memcpy instead of a loop to copy the data.
Eric Seidel (no email)
Comment 19 2005-12-28 21:38:56 PST
Comment on attachment 5337 [details] Updated patch including layout test + comment change I agree with darin, it shoudl be changed to memcpy. I'll change it as I land. r=me.
Darin Adler
Comment 20 2005-12-29 09:02:21 PST
Comment on attachment 5337 [details] Updated patch including layout test + comment change Including "kentities.c" like this results in two copies of the HTML entities table. It would be much better to instead find a way to share a single copy of the table.
Eric Seidel (no email)
Comment 21 2005-12-29 15:21:37 PST
Created attachment 5358 [details] Patch addressing darin's comments.
Eric Seidel (no email)
Comment 22 2005-12-29 15:27:52 PST
Created attachment 5359 [details] now including kentities.h
Eric Seidel (no email)
Comment 23 2005-12-29 15:49:08 PST
Addressed darin's concerns. Landed.
Ian 'Hixie' Hickson
Comment 24 2005-12-29 16:53:36 PST
TESTCASE: http://www.hixie.ch/tests/adhoc/xml/parsing/010.xml You should get a well-formedness error on that page. If you see the word "fail" then the test has failed (and that's a regression).
Ian 'Hixie' Hickson
Comment 25 2005-12-29 16:56:55 PST
Filed regression as bug 6290.
Note You need to log in before you can comment on or make changes to this bug.