| Summary: | REGRESSION(r284521): [SOUP] Several tests parsed as HTML because of ".html" extensions on other platforms are now parsed as XHTML because they "sniff" as XHTML (and some are invalid XHTML!) | ||||||
|---|---|---|---|---|---|---|---|
| Product: | WebKit | Reporter: | Diego Pino <dpino> | ||||
| Component: | WebKit2 | Assignee: | Diego Pino <dpino> | ||||
| Status: | NEW --- | ||||||
| Severity: | Normal | CC: | ap, berto, cgarcia, crzwdjk, darin, kkinnunen, lmoura, mcatanzaro, terry, webkit-bug-importer | ||||
| Priority: | P2 | Keywords: | InRadar | ||||
| Version: | WebKit Nightly Build | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| See Also: |
https://bugs.webkit.org/show_bug.cgi?id=232386 https://bugs.webkit.org/show_bug.cgi?id=232913 |
||||||
| Attachments: |
|
||||||
|
Description
Diego Pino
2021-10-29 00:24:17 PDT
Created attachment 442790 [details]
Patch
Comment on attachment 442790 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=442790&action=review > LayoutTests/ChangeLog:3 > + Several tests written in XHTML are not valid XHTML What does "written in XHTML" mean here? As far as I can tell, none of these are XHTML. They have ".html" extensions causing them to be processed as HTML, not XHTML, and the first one last least starts with: <!DOCTYPE html> That means HTML 5, not XHTML. Comment on attachment 442790 [details]
Patch
Marking r-, it doesn’t seem like there is anything that needs to be done here.
I don’t think that we want to remove xmlns either, we are not striving to have our tests be absolutely minimal or even pass validation in general.
(In reply to Alexey Proskuryakov from comment #3) > Comment on attachment 442790 [details] > Patch > > Marking r-, it doesn’t seem like there is anything that needs to be done > here. > > I don’t think that we want to remove xmlns either, we are not striving to > have our tests be absolutely minimal or even pass validation in general. There has been a change recently in WebKitGTK/WPE (https://trac.webkit.org/changeset/284521/webkit), that makes several tests fail because they're now detected as MIME type XHTML. The tests were marked as failure in webkit.org/b/232386. It seems to me other WebKit ports use the file extension to determine whether a file is HTML or XHTML. These tests that are failing all have .html as extension, but they claim to be XHTML (through a DTD header or xmlns). Having these tests fixed as valid XHTML makes several tests pass in WebKitGTK/WPE, and it doesn't hurt any other WebKit ports. An alternatively solution, and possibly better, it's to actually drop the DTD header and xmlns declaration in these failing tests as they do not need to be XHTML documents. While this is not what these particular tests are explicitly intended for, it is important that we do test how .html files of this form are parsed by WebKit. There is a lot of content on the web that might be misinterpreted as XHTML due to having some pieces of XML syntax it it, but is parsed as HTML. I don't quite understand why content sniffing for local .html files is desirable for the Gtk port. Does any other browser engine behave like that, failing to display local .html files if there are vestiges of XHTML in them? My understanding is that this Gtk behavior is an exception, not the norm. If this is truly what Gtk needs, then WebKitTestRunner probably should disable content sniffing for most tests, possibly enabling it for some Gtk-specific tests that explicitly verify that it works, not the other way around. (In reply to Alexey Proskuryakov from comment #5) > I don't quite understand why content sniffing for local .html files is > desirable for the Gtk port. Does any other browser engine behave like that, > failing to display local .html files if there are vestiges of XHTML in them? > My understanding is that this Gtk behavior is an exception, not the norm. I agree it's certainly not desirable. Our desired behavior is "do exactly what Safari does." That said, we cannot simply revert r284521 (the most recent commit that broke these tests) because that commit itself fixes a regression that is more severe. Looking into how Apple ports behave, it looks pretty similar: NetworkDataTaskCocoa.mm decides whether to content sniff or not, and if so it uses CFNetwork to do that. It's possible that CFNetwork is just stricter about detecting XHTML than shared-mime-info is. But also, NetworkDataTaskCocoa.mm disables content sniffing altogether for local files. I don't know why. I guess we should do that, but I'm not sure if making the same change for the soup backend would actually be enough to fix our tests, though, because the code we changed here was NetworkDataTaskSoup::didGetFileInfo which (a) is calling WebCore::ResourceResponse::setMIMEType, something NetworkDataTaskCocoa does not do at all, and (b) is clearly not controlled by whether shouldContentSniff is enabled or not, it seems to be entirely unrelated to the content sniffing setting, even though it is content sniffing. This needs a closer look to figure out what the desired behavior really is. (In reply to Alexey Proskuryakov from comment #5) > While this is not what these particular tests are explicitly intended for, > it is important that we do test how .html files of this form are parsed by > WebKit. There is a lot of content on the web that might be misinterpreted as > XHTML due to having some pieces of XML syntax it it, but is parsed as HTML. I am actually OK with Diego's commit fixing these tests, because testing HTML vs. XHTML is not the intended purpose of these tests. They seem to be failure to copy/paste the right boilerplate, which is worth fixing. That said, if we do fix these tests, then we should also add new tests that fail in the same ways that these tests are currently failing now, so that we can continue testing to ensure different WebKit ports implement the same behavior, like Alexey suggests. These tests are intended to be parsed as HTML, and always have been. If anything, removing xmlns would be small cleanup, but again, we usually don't strive to clean up tests too much - the real web is messy. (Darin, your new title for this issue is actually more like the opposite of what's going on. I've retitled it.) *** Bug 232386 has been marked as a duplicate of this bug. *** *** Bug 239597 has been marked as a duplicate of this bug. *** *** Bug 256253 has been marked as a duplicate of this bug. *** It seems we do give higher weight to the file extension than to the content: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/issues/98 But this fails because we don't consider them both at the same time. NetworkDataTaskSoup::didGetFileInfo first tries to guess the MIME type for the content, and only if that fails does it guess the MIME type for the filename. Any changes here risk bringing back bug #230797 so caution is required. What we do here will also affect how we approach bug #252263. (In reply to Michael Catanzaro from comment #14) > What we do here will also affect how we approach bug #252263. Actually, in that bug I proposed using g_content_type_guess() to replace xdg_mime_get_mime_type_from_file_name(). g_content_type_guess() allows us to pass both the file extension and the contents at the same time so that they can both be considered. That *might* solve everything. Maybe. |