Bug 232471

Summary:

REGRESSION(r284521): [SOUP] Several tests parsed as HTML because of ".html" extensions on other platforms are now parsed as XHTML because they "sniff" as XHTML (and some are invalid XHTML!)

Product:

WebKit

Reporter:

Diego Pino <dpino>

Component:

WebKit2

Assignee:

Carlos Alberto Lopez Perez <clopez>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

ap, berto, cgarcia, crzwdjk, darin, kkinnunen, lmoura, mcatanzaro, terry, webkit-bug-importer

Priority:

Keywords:

InRadar

Version:

WebKit Nightly Build

Hardware:

Unspecified

OS:

Unspecified

See Also:

https://bugs.webkit.org/show_bug.cgi?id=232386
https://bugs.webkit.org/show_bug.cgi?id=232913

Attachments:

Description	Flags
Patch	ap: review-

Diego Pino

Reported 2021-10-29 00:24:17 PDT

Several tests written in XHTML are not valid XHTML

Attachments
Patch (20.84 KB, patch) 2021-10-29 00:25 PDT, Diego Pino	ap: review-	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

Diego Pino

Comment 1 2021-10-29 00:25:22 PDT

Created attachment 442790 [details] Patch

Darin Adler

Comment 2 2021-10-29 11:48:55 PDT

Comment on attachment 442790 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=442790&action=review > LayoutTests/ChangeLog:3 > + Several tests written in XHTML are not valid XHTML What does "written in XHTML" mean here? As far as I can tell, none of these are XHTML. They have ".html" extensions causing them to be processed as HTML, not XHTML, and the first one last least starts with: <!DOCTYPE html> That means HTML 5, not XHTML.

Alexey Proskuryakov

Comment 3 2021-10-30 11:34:05 PDT

Comment on attachment 442790 [details] Patch Marking r-, it doesn’t seem like there is anything that needs to be done here. I don’t think that we want to remove xmlns either, we are not striving to have our tests be absolutely minimal or even pass validation in general.

Diego Pino

Comment 4 2021-10-31 18:38:37 PDT

(In reply to Alexey Proskuryakov from comment #3) > Comment on attachment 442790 [details] > Patch > > Marking r-, it doesn’t seem like there is anything that needs to be done > here. > > I don’t think that we want to remove xmlns either, we are not striving to > have our tests be absolutely minimal or even pass validation in general. There has been a change recently in WebKitGTK/WPE (https://trac.webkit.org/changeset/284521/webkit), that makes several tests fail because they're now detected as MIME type XHTML. The tests were marked as failure in webkit.org/b/232386. It seems to me other WebKit ports use the file extension to determine whether a file is HTML or XHTML. These tests that are failing all have .html as extension, but they claim to be XHTML (through a DTD header or xmlns). Having these tests fixed as valid XHTML makes several tests pass in WebKitGTK/WPE, and it doesn't hurt any other WebKit ports. An alternatively solution, and possibly better, it's to actually drop the DTD header and xmlns declaration in these failing tests as they do not need to be XHTML documents.

Alexey Proskuryakov

Comment 5 2021-11-01 08:53:20 PDT

While this is not what these particular tests are explicitly intended for, it is important that we do test how .html files of this form are parsed by WebKit. There is a lot of content on the web that might be misinterpreted as XHTML due to having some pieces of XML syntax it it, but is parsed as HTML. I don't quite understand why content sniffing for local .html files is desirable for the Gtk port. Does any other browser engine behave like that, failing to display local .html files if there are vestiges of XHTML in them? My understanding is that this Gtk behavior is an exception, not the norm. If this is truly what Gtk needs, then WebKitTestRunner probably should disable content sniffing for most tests, possibly enabling it for some Gtk-specific tests that explicitly verify that it works, not the other way around.

Michael Catanzaro

Comment 6 2021-11-01 10:07:08 PDT

(In reply to Alexey Proskuryakov from comment #5) > I don't quite understand why content sniffing for local .html files is > desirable for the Gtk port. Does any other browser engine behave like that, > failing to display local .html files if there are vestiges of XHTML in them? > My understanding is that this Gtk behavior is an exception, not the norm. I agree it's certainly not desirable. Our desired behavior is "do exactly what Safari does." That said, we cannot simply revert r284521 (the most recent commit that broke these tests) because that commit itself fixes a regression that is more severe. Looking into how Apple ports behave, it looks pretty similar: NetworkDataTaskCocoa.mm decides whether to content sniff or not, and if so it uses CFNetwork to do that. It's possible that CFNetwork is just stricter about detecting XHTML than shared-mime-info is. But also, NetworkDataTaskCocoa.mm disables content sniffing altogether for local files. I don't know why. I guess we should do that, but I'm not sure if making the same change for the soup backend would actually be enough to fix our tests, though, because the code we changed here was NetworkDataTaskSoup::didGetFileInfo which (a) is calling WebCore::ResourceResponse::setMIMEType, something NetworkDataTaskCocoa does not do at all, and (b) is clearly not controlled by whether shouldContentSniff is enabled or not, it seems to be entirely unrelated to the content sniffing setting, even though it is content sniffing. This needs a closer look to figure out what the desired behavior really is.

Michael Catanzaro

Comment 7 2021-11-01 10:13:54 PDT

(In reply to Alexey Proskuryakov from comment #5) > While this is not what these particular tests are explicitly intended for, > it is important that we do test how .html files of this form are parsed by > WebKit. There is a lot of content on the web that might be misinterpreted as > XHTML due to having some pieces of XML syntax it it, but is parsed as HTML. I am actually OK with Diego's commit fixing these tests, because testing HTML vs. XHTML is not the intended purpose of these tests. They seem to be failure to copy/paste the right boilerplate, which is worth fixing. That said, if we do fix these tests, then we should also add new tests that fail in the same ways that these tests are currently failing now, so that we can continue testing to ensure different WebKit ports implement the same behavior, like Alexey suggests.

Alexey Proskuryakov

Comment 8 2021-11-01 16:12:37 PDT

These tests are intended to be parsed as HTML, and always have been. If anything, removing xmlns would be small cleanup, but again, we usually don't strive to clean up tests too much - the real web is messy.

Radar WebKit Bug Importer

Comment 9 2021-11-05 00:25:18 PDT

<rdar://problem/85049435>

Michael Catanzaro

Comment 10 2021-11-09 13:15:01 PST

(Darin, your new title for this issue is actually more like the opposite of what's going on. I've retitled it.)

Michael Catanzaro

Comment 11 2021-11-10 06:14:45 PST

*** Bug 232386 has been marked as a duplicate of this bug. ***

Michael Catanzaro

Comment 12 2023-05-03 07:07:17 PDT

*** Bug 239597 has been marked as a duplicate of this bug. ***

Michael Catanzaro

Comment 13 2023-05-03 07:07:23 PDT

*** Bug 256253 has been marked as a duplicate of this bug. ***

Michael Catanzaro

Comment 14 2023-05-03 07:13:35 PDT

It seems we do give higher weight to the file extension than to the content: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/issues/98 But this fails because we don't consider them both at the same time. NetworkDataTaskSoup::didGetFileInfo first tries to guess the MIME type for the content, and only if that fails does it guess the MIME type for the filename. Any changes here risk bringing back bug #230797 so caution is required. What we do here will also affect how we approach bug #252263.

Michael Catanzaro

Comment 15 2023-05-03 07:17:00 PDT

(In reply to Michael Catanzaro from comment #14) > What we do here will also affect how we approach bug #252263. Actually, in that bug I proposed using g_content_type_guess() to replace xdg_mime_get_mime_type_from_file_name(). g_content_type_guess() allows us to pass both the file extension and the contents at the same time so that they can both be considered. That *might* solve everything. Maybe.

Carlos Alberto Lopez Perez

Comment 16 2025-11-10 17:15:49 PST

Pull request: https://github.com/WebKit/WebKit/pull/53712

EWS

Comment 17 2025-11-11 11:37:14 PST

Committed 302855@main (08c24bea8030): <https://commits.webkit.org/302855@main> Reviewed commits have been landed. Closing PR #53712 and removing active labels.

Note You need to log in before you can comment on or make changes to this bug.