Bug 4301 - Support HTML entities on pages parsed as XHTML
Summary: Support HTML entities on pages parsed as XHTML
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 420+
Hardware: Mac OS X 10.4
: P2 Normal
Assignee: Eric Seidel (no email)
URL: http://amonre.org/pub/entity.xml
Keywords:
Depends on: 6290
Blocks: 4979 6054
  Show dependency treegraph
 
Reported: 2005-08-06 11:25 PDT by Denis Defreyne
Modified: 2005-12-29 16:56 PST (History)
4 users (show)

See Also:


Attachments
First pass at fixing this (7.22 KB, patch)
2005-10-17 00:53 PDT, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Updated patch that handles named entities correctly (3.56 KB, patch)
2005-11-01 05:19 PST, Mark Rowe (bdash)
no flags Details | Formatted Diff | Diff
Patch with assertion to ensure data will fit in static buffer (3.63 KB, patch)
2005-11-01 16:36 PST, Mark Rowe (bdash)
eric: review-
Details | Formatted Diff | Diff
Updated patch including layout test + comment change (367.93 KB, patch)
2005-12-28 07:41 PST, Mark Rowe (bdash)
eric: review+
Details | Formatted Diff | Diff
Patch addressing darin's comments. (7.17 KB, patch)
2005-12-29 15:21 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
now including kentities.h (9.00 KB, patch)
2005-12-29 15:27 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Denis Defreyne 2005-08-06 11:25:39 PDT
OVERVIEW DESCRIPTION:

Some entities in documents parsed using WebKit's XML parser are not recognised. Examples of 
unrecognised entities include 'copy' and 'raquo'.

STEPS TO REPRODUCE:

Load an XML page containing entities such as 'copy' and 'raquo'.

ACTUAL RESULTS:

The entities are not displayed. The top of the page shows the following error message:

This page contains the following errors:
error on line 7 at column 56: Entity 'copy' not defined
Below is a rendering of the page up to the first error.

EXPECTED RESULTS:

The entities should show up and there should be no error message.

ADDITIONAL BUILDS:

This bug can be reproduced in 412 and up. I believe I encountered this bug in 312 as well, but I'm not 
100% sure.

ADDITIONAL INFORMATION:

Darin said: "we broke it by moving from expat to libxml2".
Comment 1 Mark Rowe (bdash) 2005-08-06 22:16:05 PDT
Confirmed with ToT.
Comment 2 Eric Seidel (no email) 2005-09-08 00:01:07 PDT
This one is mine.
Comment 3 Eric Seidel (no email) 2005-10-05 15:44:31 PDT
This is fixed in TOT.
Comment 4 Eric Seidel (no email) 2005-10-05 16:10:23 PDT
Oops.
Comment 5 Eric Seidel (no email) 2005-10-05 16:23:58 PDT
We're only supposed to support these entities:
http://www.w3.org/TR/REC-xml/#sec-predefined-ent

this xml file seems to depend on an html entity "copy", which xml as far as I can tell, is not supposed to 
support:
http://www.w3.org/TR/REC-html40/sgml/entities.html
Comment 6 Eric Seidel (no email) 2005-10-16 21:36:51 PDT
*** Bug 3861 has been marked as a duplicate of this bug. ***
Comment 7 Eric Seidel (no email) 2005-10-16 21:47:03 PDT
I was incorrect.  XHTML documents require us to support HTML entities.  So this is a valid bug:

http://www.w3.org/TR/xhtml1/#h-A2
Comment 8 Eric Seidel (no email) 2005-10-17 00:53:40 PDT
Created attachment 4378 [details]
First pass at fixing this

Ok, so I took a stab at fixing this tonight.  Unfortunately there are some
other minor changes for XSLTProcessor support included in this patch.  Named
entities still don't work correctly (I'm not sure why?), but numerical entities
work just fine.  I'll hopefully get around to finishing this later this week,
unless somebody beats me to it.
Comment 9 Eric Seidel (no email) 2005-10-17 00:54:52 PDT
FYI I've been using this *fabulous* set of test pages:
http://www.w3.org/People/mimasa/test/xhtml/entities/
Comment 10 Mark Rowe (bdash) 2005-11-01 05:19:36 PST
Created attachment 4550 [details]
Updated patch that handles named entities correctly

The thing that was preventing named entities from working was that the values
were being passed through as utf-16 while libxml was expecting utf-8.  It's
also worth noting that character references ("numerical entities") are handled
internally by libxml so getXHTMLEntity doesn't need to handle them.
Comment 11 Eric Seidel (no email) 2005-11-01 11:37:21 PST
Comment on attachment 4550 [details]
Updated patch that handles named entities correctly

Patch looks great.  You should add an ASSERT(value.length() < 5) to make sure
we don't screw ourselves by changing the entity lookup code some day.
Comment 12 Mark Rowe (bdash) 2005-11-01 16:36:11 PST
Created attachment 4556 [details]
Patch with assertion to ensure data will fit in static buffer
Comment 13 Eric Seidel (no email) 2005-11-01 16:43:02 PST
Comment on attachment 4556 [details]
Patch with assertion to ensure data will fit in static buffer

Looks great.  r=me.  Darin or mjs could look at it, but I don't think they're
gonna have anything else to say.
Comment 14 Eric Seidel (no email) 2005-11-01 16:45:58 PST
Comment on attachment 4556 [details]
Patch with assertion to ensure data will fit in static buffer

Oops.  Almost forgot.  This needs a test case.	I suggest just landing the w3c
test page w/ text results.
Comment 15 Eric Seidel (no email) 2005-11-01 17:05:33 PST
As darin pointed out, the code comment should be changed:

Using a global variable entity and marking it XML_INTERNAL_PREDEFINED_ENTITY is a hack to avoid 
malloc/free. However using a global variable like this could cause trouble if libxml implementation details 
were to change.
Comment 16 Eric Seidel (no email) 2005-11-30 01:37:54 PST
Mark and I sorta left this one hanging...
Comment 17 Mark Rowe (bdash) 2005-12-28 07:41:45 PST
Created attachment 5337 [details]
Updated patch including layout test + comment change
Comment 18 Darin Adler 2005-12-28 10:40:45 PST
Comment on attachment 5337 [details]
Updated patch including layout test + comment change

I think getXHTMLEntity should use memcpy instead of a loop to copy the data.
Comment 19 Eric Seidel (no email) 2005-12-28 21:38:56 PST
Comment on attachment 5337 [details]
Updated patch including layout test + comment change

I agree with darin, it shoudl be changed to memcpy.  I'll change it as I land. 
r=me.
Comment 20 Darin Adler 2005-12-29 09:02:21 PST
Comment on attachment 5337 [details]
Updated patch including layout test + comment change

Including "kentities.c" like this results in two copies of the HTML entities
table. It would be much better to instead find a way to share a single copy of
the table.
Comment 21 Eric Seidel (no email) 2005-12-29 15:21:37 PST
Created attachment 5358 [details]
Patch addressing darin's comments.
Comment 22 Eric Seidel (no email) 2005-12-29 15:27:52 PST
Created attachment 5359 [details]
now including kentities.h
Comment 23 Eric Seidel (no email) 2005-12-29 15:49:08 PST
Addressed darin's concerns.  Landed.
Comment 24 Ian 'Hixie' Hickson 2005-12-29 16:53:36 PST
TESTCASE: http://www.hixie.ch/tests/adhoc/xml/parsing/010.xml
You should get a well-formedness error on that page. If you see the word "fail"
then the test has failed (and that's a regression).
Comment 25 Ian 'Hixie' Hickson 2005-12-29 16:56:55 PST
Filed regression as bug 6290.