Bug 201545

Summary: [GTK] Fails to render page with inconsistent content type
Product: WebKit Reporter: Xi Ruoyao <xry111>
Component: WebKitGTK Assignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, bfulgham, bugs-noreply, clopez, lantw44, mcatanzaro, pierre.labastie, simon.fraser, zalan
Priority: P2    
Version: WebKit Local Build   
Hardware: PC   
OS: Linux   
URL: https://gitlab.gnome.org/GNOME/epiphany/issues/910
See Also: https://bugs.webkit.org/show_bug.cgi?id=201295
https://bugs.webkit.org/show_bug.cgi?id=202321

Description Xi Ruoyao 2019-09-06 08:51:50 PDT
From https://gitlab.gnome.org/GNOME/epiphany/issues/910, and http://wiki.linuxfromscratch.org/blfs/ticket/12481:

```
curl -o index.html http://www.linuxfromscratch.org/blfs/view/systemd/index.html
epiphany index.html
```

The following error is produced:

```
This page contains the following errors:
error on line 7 at column 12: Encoding error
Below is a rendering of the page up to the first error.
```

However the following works fine:

```
epiphany http://www.linuxfromscratch.org/blfs/view/systemd/index.html
```

WebKitGTK MiniBrowser also has the same issue so this may be a WebKit bug.  Chrome renders the page correctly.
Comment 1 Michael Catanzaro 2019-09-06 09:20:54 PDT
WebKit is picking the wrong encoding here (UTF-8 I think). This page renders fine when forcing ISO 8859-1 via API request. The document does specify ISO 8859-1 with a meta charset attribute.
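As a minimal illustration of the mismatch described in this comment, the sketch below decodes a hypothetical ISO 8859-1 byte string both ways. The sample bytes and the helper name `decode_or_none` are assumptions made for illustration; they are not taken from the LFS page or from WebKit.

```python
# Hypothetical sample: "café" encoded as ISO 8859-1, where "é" is the single byte 0xE9.
SAMPLE = b"caf\xe9"


def decode_or_none(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xE9 starts a multi-byte sequence in UTF-8, so a strict UTF-8 decoder rejects it.
as_utf8 = decode_or_none(SAMPLE, "utf-8")
# The same bytes are valid ISO 8859-1.
as_latin1 = decode_or_none(SAMPLE, "iso-8859-1")
```

A strict decoder that assumes UTF-8 has no valid reading of such bytes, which would be consistent with the "Encoding error" message in the description.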
Comment 2 Alexey Proskuryakov 2019-09-06 09:27:50 PDT
I think that the root cause is that it's erroneously handled as XML. If it's .html, it should be HTML. It's even served as text/html from the server (not that it has any effect after downloading of course).

Probably a dupe of bug 201295.
Comment 3 Michael Catanzaro 2019-09-06 11:00:11 PDT
Hm, I think this bug is different though.

Bug #201295 is an HTML file being processed as SVG by mistake. I can see shared-mime-info thinks it's an SVG file, not an HTML file at all.

But this bug is an XHTML file encoded with ISO-8859-1 being decoded as UTF-8 by mistake. shared-mime-info detects it as XHTML, so there's no mistake about content type. afaik it's normal for XHTML files to use the .html file extension.
Comment 4 Alexey Proskuryakov 2019-09-06 13:08:44 PDT
I don't think that we should call this document XHTML. It has an .html extension, and it's served by the original server as text/html. Clearly no one expects it to be valid XML.
Comment 5 Michael Catanzaro 2019-09-06 13:50:45 PDT
(In reply to Alexey Proskuryakov from comment #4)
> I don't think that we should call this document XHTML. It has an .html
> extension, and it's served by the original server as text/html.

Ah, I suppose that is a linuxfromscratch.org bug, yes.

(In reply to Alexey Proskuryakov from comment #4)
> Clearly no
> one expects it to be valid XML.

Hm I dunno, this works as a local file in Chrome (and Firefox, though I don't consider Firefox a useful example here since it does its own custom charset detection). And like I said, WebKit handles it just fine if we manually set the charset to ISO-8859-1.
Comment 6 Alexey Proskuryakov 2019-09-06 15:13:20 PDT
> Hm I dunno, this works as a local file in Chrome

Doesn't Chrome parse the document as HTML anyway?
Comment 7 Michael Catanzaro 2019-09-06 16:28:34 PDT
Whether Chrome treats it as HTML or XHTML, the document displays fine in Chrome. (Same for Firefox.)
Comment 8 Pierre Labastie 2019-09-11 05:09:20 PDT
(In reply to Alexey Proskuryakov from comment #4)
> I don't think that we should call this document XHTML. It has an .html
> extension, and it's served by the original server as text/html. Clearly no
> one expects it to be valid XML.

but the meta has content="application/xhtml+xml; charset=iso-8859-1". Isn't this enough?
Comment 9 Alexey Proskuryakov 2019-09-11 09:17:04 PDT
> Whether Chrome treats it as HTML or XHTML, the document displays fine in Chrome. (Same for Firefox.)

That's the point of this bug - it's about incorrect content type being used in Gtk, not about XHTML parsing.

If you change this file's extension to .xhtml, then all browsers start hitting this error. With .html, it's only Epiphany, because it incorrectly parses the file as XML.

> but the meta has content="application/xhtml+xml; charset=iso-8859-1". Isn't this enough?

Not with XML, see <https://www.w3.org/International/questions/qa-html-encoding-declarations> (or relevant specs):

"An XHTML5 document is served as XML and has XML syntax. XML parsers do not recognise the encoding declarations in meta elements. They only recognise the XML declaration."
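The point quoted above can be reproduced with any conforming XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` (backed by expat), not WebKit's parser, and a hypothetical miniature document rather than the actual LFS page; the names `BODY`, `XML_DECL`, `parse_paragraph` and `try_parse` are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical ISO-8859-1 document: the only encoding hint is the meta element.
BODY = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    b'<meta http-equiv="Content-Type" '
    b'content="application/xhtml+xml; charset=iso-8859-1" />'
    b"</head><body><p>caf\xe9</p></body></html>"
)
XML_DECL = b'<?xml version="1.0" encoding="iso-8859-1"?>'


def parse_paragraph(data: bytes) -> str:
    """Parse the bytes as XML and return the text of the first <p> element."""
    root = ET.fromstring(data)
    return root.find(".//x:p", {"x": XHTML_NS}).text


def try_parse(data: bytes):
    """Return the paragraph text, or None if the XML parser reports a fatal error."""
    try:
        return parse_paragraph(data)
    except ET.ParseError:
        return None


# Without an XML declaration the parser assumes UTF-8 and ignores the meta element,
# so the 0xE9 byte is a fatal error.
without_decl = try_parse(BODY)
# With the XML declaration the same bytes parse cleanly.
with_decl = try_parse(XML_DECL + BODY)
```

This mirrors the behavior reported here: the meta element alone does not help once the file is parsed as XML, while an XML declaration does.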
Comment 10 Michael Catanzaro 2019-09-11 12:08:14 PDT
OK, that makes sense. So it's three separate problems:

 * The document http://www.linuxfromscratch.org/blfs/view/systemd/index.html is broken for attempting to declare a custom encoding in an XHTML document. The XHTML document is invalid.
 * The website http://www.linuxfromscratch.org is further broken since it uses the HTML content type for an XHTML document. (It so happens that using the wrong content type is required to avoid the first problem.)
 * Finally, WebKitGTK is (arguably) ill-advised in sniffing the document contents to determine content type. It seems Safari and other browsers consider only the file extension, .html. WebKitGTK should aim to match the behavior of other browsers.
Comment 11 Michael Catanzaro 2019-09-11 12:15:29 PDT
Well, maybe not; that page says:

XHTML 1.x served as text/html: Also needs the pragma directive for full conformance with HTML 4.01, rather than the charset attribute. You do not need to use the XML declaration, since the file is being served as HTML.

XHTML 1.x served as XML: Use the encoding declaration of the XML declaration on the first line of the page. Ensure there is nothing before it, including spaces (although a byte-order mark is OK).
Comment 12 Alexey Proskuryakov 2019-09-11 13:47:46 PDT
>  * The document http://www.linuxfromscratch.org/blfs/view/systemd/index.html is broken for attempting to declare a custom encoding in an XHTML document. The XHTML document is invalid.

I didn't check if this particular document is valid XHTML other than having an incorrect character encoding. It's not very important for two reasons:

1. One error is enough to break everything in XML.

2. There are lots of HTML documents on the web that have pieces of XML in them, but would be very broken in many ways were they ever to be parsed as XML. It's OK, and HTML5 defines error handling that prevents any functional differences between browsers.

>  * The website http://www.linuxfromscratch.org is further broken since it uses the HTML content type for an XHTML document. (It so happens that using the wrong content type is required to avoid the first problem.)

We should never ask websites to move from HTML to XHTML. That would regress behavior for customers (no incremental rendering, very high chance of catastrophic failure of this sort).

So server behavior is correct and desirable.

>  * Finally, WebKitGTK is (arguably) ill-advised in sniffing the document contents to determine content type. It seems Safari and other browsers consider only the file extension, .html. WebKitGTK should aim to match the behavior of other browsers.

Yes.
Comment 13 Pierre Labastie 2019-09-12 02:00:56 PDT
Not sure it has something to do with the bug, but:

on debian sid, epiphany displays the page correctly.
on lfs/blfs, it doesn't.

And if I run:
xdg-mime query filetype index.html
I get:

text/html on debian
application/xhtml+xml on lfs/blfs

So looks like webkit somehow uses the mime database, while other browsers use a different method for guessing the file type.
Comment 14 Pierre Labastie 2019-09-12 03:25:42 PDT
FWIW, the difference between debian and lfs is shared-mime-info version:
In 1.10, .html and .htm are not recognized as extensions (glob) for xhtml,
while this has been added for 1.12 (https://gitlab.freedesktop.org/xdg/shared-mime-info/commit/8ae13a589577e9bda12fb16465a03cd81b1cd349)

The conclusion is that shared-mime-info should enforce the presence of an <?xml?> header for recognizing the file as application/xhtml+xml.

The index.html file from lfs seems to be valid if served as text/html (see Comment 11, but I am not sure the http-equiv content should be application/xhtml+xml). How a local file is "served" is up to the browser, I suppose. But the right guess here is text/html, and if using the mime database for that, IMO the mime database should never allow a file not starting with <?xml ... ?> to be considered application/xhtml+xml.

Well, at lfs, we have added the <?xml?> header, and all is fine anyway...
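
The rule proposed in this comment can be sketched as follows. This is a hypothetical heuristic written only to illustrate the proposal; it is not how shared-mime-info or WebKitGTK actually decide, and the function name `guess_local_type` and the fallback type are assumptions.

```python
XHTML = "application/xhtml+xml"
HTML = "text/html"
UTF8_BOM = b"\xef\xbb\xbf"


def guess_local_type(filename: str, head: bytes) -> str:
    """Guess a MIME type from the file name and the first bytes of the file.

    Hypothetical rule: a .html/.htm file is only treated as XHTML when it
    starts with an XML declaration (a leading byte-order mark is tolerated).
    """
    name = filename.lower()
    if name.endswith(".xhtml"):
        return XHTML
    if name.endswith((".html", ".htm")):
        if head.startswith(UTF8_BOM):
            head = head[len(UTF8_BOM):]
        # Nothing else may precede the declaration, not even whitespace.
        return XHTML if head.startswith(b"<?xml") else HTML
    return "application/octet-stream"  # assumed fallback for everything else


# A page like the original index.html: DOCTYPE first, no XML declaration.
plain = guess_local_type("index.html", b"<!DOCTYPE html PUBLIC ")
# The same name after an XML declaration has been added.
declared = guess_local_type("index.html", b'<?xml version="1.0"?>')
```

Under this rule the original file would be guessed as text/html, and only the version with the added declaration would be treated as application/xhtml+xml.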
Comment 15 Carlos Alberto Lopez Perez 2019-10-11 08:37:52 PDT
(In reply to Pierre Labastie from comment #14)
> FWIW, the difference between debian and lfs is shared-mime-info version:
> In 1.10, .html and .htm are not recognized as extensions (glob) for xhtml,
> while this has been added for 1.12
> (https://gitlab.freedesktop.org/xdg/shared-mime-info/commit/
> 8ae13a589577e9bda12fb16465a03cd81b1cd349)
> 

Indeed, I think 8ae13a589577e9bda12fb16465a03cd81b1cd349 causes this issue, but AFAIK this has been fixed in shared-mime-info master already (not sure if in any release post 1.12). Can you check whether that is the case? See also bug 202321
Comment 16 Michael Catanzaro 2019-10-11 08:55:57 PDT
(In reply to Carlos Alberto Lopez Perez from comment #15)
> Indeed, I think 8ae13a589577e9bda12fb16465a03cd81b1cd349 causes this issue,
> but AFAIK this has been fixed in shared-mime-info master already (not sure
> if in any release post 1.12). Can you check whether that is the case? See also bug
> 202321

You're right, this is fixed in 1.14.