NEW 184444
[GTK] webkit_web_view_load_html() garbages linked CSS content
https://bugs.webkit.org/show_bug.cgi?id=184444
Milan Crha
Reported 2018-04-10 01:14:16 PDT
Created attachment 337598: wk2-css.c

This looks like bug #127481, but that one is supposed to be fixed. Maybe something was missed. My test fails with 2.18.6 and with a git checkout at git-svn-id: http://svn.webkit.org/repository/webkit/trunk@230032 268f45cc-cd09-0410-ab3c-d52691b4dbfc.

The trick here is to have the HTML content contain UTF-8 letters; it doesn't matter which. Using
webkit_web_view_load_bytes (web_view, bytes, NULL, NULL, "file://");
on exactly the same content of loading that with
webkit_web_view_load_uri (WEBKIT_WEB_VIEW (wk), "file:///tmp/a.html");
doesn't exhibit the issue; only
webkit_web_view_load_html (web_view, html, "file://");
misbehaves.

It doesn't change anything whether the HTML content contains
<meta http-equiv="content-type" content="text/html; charset=utf-8">
or not; all that matters is whether the content uses only ASCII letters or also non-ASCII (UTF-8) ones.

Attached is my test program, which references the webview.css file provided by the 'evolution' program/package. Its first line contains the command line to compile and run it. Some more details are shown in the dialog itself.
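The trigger described above is whether the HTML passed to webkit_web_view_load_html() is pure ASCII or not. As a minimal sketch in plain C (this is not code from the attached test program, and has_non_ascii is a hypothetical helper name), a reproducer could check which case a given buffer falls into like this:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if any byte in buf has its high bit set, i.e. the buffer
 * is not pure 7-bit ASCII. In valid UTF-8, every non-ASCII character is
 * encoded using only such high-bit bytes. */
static bool has_non_ascii(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)buf[i] & 0x80)
            return true;
    }
    return false;
}
```

Per the report, content for which this returns false loads its linked CSS correctly, while content for which it returns true shows the garbled stylesheet.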
Attachments
wk2-css.c (3.74 KB, text/plain)
2018-04-10 01:14 PDT, Milan Crha
no flags
trvial patch (just to show it) (1.42 KB, text/plain)
2018-04-18 01:25 PDT, Milan Crha
no flags
Jérémy Lal
Comment 1 2018-04-10 01:41:38 PDT
Did you try webkit_web_view_load_bytes (web_view, bytes, "text/html", "utf-8", "file://"); ?
Milan Crha
Comment 2 2018-04-10 02:13:38 PDT
Well, I wrote that I did, and I wrote that it _doesn't exhibit the issue_. Okay, there's a typo which might confuse you: the text "content of loading" should be "content or loading". I hope it's clearer now.

Anyway, from bug #127481, it looks like WebKit derives its expectation about a file's content encoding from the parent file's content encoding. I think that's a bad idea. I do not see any reason why I could not have an HTML page in iso-8859-2 which links a CSS file in UTF-8, or even a main document in UTF-16 with the CSS file in UTF-8 or ASCII. Imagine localized HTML pages in various encodings, all sharing the same CSS file.
Milan Crha
Comment 3 2018-04-10 02:25:10 PDT
I would go with the comment in WebPageProxy.cpp
// FIXME: Get rid of loadHTMLString and just use loadData instead.
but it doesn't cover loadAlternateHTMLString(), thus such a change would
a) require twice as much memory (due to copying the content into a GBytes structure),
b) be incomplete anyway (due to this alternate function).

I have a trivial patch for webkit_web_view_load_html() and webkit_web_view_load_plain_text(), if you are interested.
Michael Catanzaro
Comment 4 2018-04-17 19:55:14 PDT
WebKit doesn't try to guess your file encoding, like Firefox does, because in practice that is probabilistic and going to fail regularly for encodings that matter in practice (e.g. GB-18030). So instead you have to specify it manually if you want something other than the default, which is ISO-8859-1 for web compat. Sadly you cannot feed UTF-8 into WebKit and expect that to work without declaring the encoding.

HTML is the easiest case, because HTML allows you to specify the encoding as part of the document. It's when you're trying to use text files that you're really going to have a hard time, as your only options there are (a) HTTP headers (doesn't work for local files), or (b) API request (webkit_web_view_load_bytes()).

(In reply to Milan Crha from comment #3)
> I would go with the comment in WebPageproxy.cpp
> // FIXME: Get rid of loadHTMLString and just use loadData instead.
> but it doesn't cover loadAlternateHTMLString(), thus such change would
> a) require twice as memory (due to copying the content into a GBytes
> structure),
> b) it would be incomplete anyway (due to this alternate function).

What does this have to do with the bug?

> I have a trivial patch for webkit_web_view_load_html() and
> webkit_web_view_load_plain_text(), if you are interested.

It'd at least be good to see.
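To illustrate what "you cannot feed UTF-8 and expect that to work" means with an ISO-8859-1 default, here is a small sketch in plain C. It is not WebKit code, and latin1_to_utf8 is a hypothetical helper. It treats every input byte as one ISO-8859-1 character, which is what a Latin-1 decoder does with undeclared UTF-8 input, and writes the result back out as UTF-8 so the effect can be printed:

```c
#include <stddef.h>

/* Interprets each input byte as one ISO-8859-1 character and writes the
 * resulting text as UTF-8 into out. out must have room for 2 * len + 1
 * bytes. Returns the number of bytes written, excluding the final NUL. */
static size_t latin1_to_utf8(const unsigned char *in, size_t len, char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            /* ASCII maps to itself in both encodings. */
            out[n++] = (char)in[i];
        } else {
            /* U+0080..U+00FF need two bytes in UTF-8. */
            out[n++] = (char)(0xC0 | (in[i] >> 6));
            out[n++] = (char)(0x80 | (in[i] & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

Feeding it the two UTF-8 bytes of "é" (0xC3 0xA9) yields the four bytes 0xC3 0x83 0xC2 0xA9, i.e. the two characters "Ã©", while pure ASCII input passes through unchanged.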
Milan Crha
Comment 5 2018-04-18 01:25:43 PDT
Created attachment 338201: trvial patch (just to show it)

(In reply to Michael Catanzaro from comment #4)
> WebKit doesn't try to guess your file encoding, ...

Maybe I did not use the word 'encoding' properly. If bug #127481 is right, and it seems it is, then WebKit has some expectation about the "encodings" of an HTML file and its CSS files, which is wrong from my point of view. The above test proves it, and it's all about the HTML and its CSS sub-file.

I'll try to rephrase, but I'm afraid it won't help much. The above wk2-css.c loads an HTML document which contains:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
and
<link type="text/css" rel="stylesheet" href="file:///usr/.....css">
using the webkit_web_view_load_html() function. The webview itself also has utf-8 set as its default encoding. Whether the .css file is loaded properly depends solely on the actual content of the HTML file, which is wrong from my point of view. When the HTML file contains non-ASCII letters, the .css file is read as UTF-16 (thus it looks like garbage); when the HTML file contains only ASCII letters, the .css file is read as UTF-8 (or some other ASCII-compatible encoding; it's not that important which one, because there's only ASCII in the file).

I can mix ASCII HTML with UTF-16 CSS, the same as UTF-8/UTF-16 HTML with ASCII CSS; there should not be any issue with it. Furthermore, I believe most (if not all) UTF-16 files contain the Unicode byte order mark (0xFEFF), thus it's easily detectable that the file is in UTF-16. When the marker is not there, you can add a bit more heuristics, but even then I'd expect the CSS to be in the default encoding if no other is passed by the caller.

> What does this have to do with the bug?

Everything and nothing. loadData() can specify the encoding, which loadHTML/loadPlainText cannot. And you said you want to be explicit about encodings.

> It'd at least be good to see.

Here you are. It uses UTF-8 encoding, but it could use the default encoding from the WebKitSettings; it depends on whether you'd want to extend the documentation for the two functions too. I do not think the patch is good for production, though, due to the reasons in comment #3.
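Two points from this comment can be sketched in plain C: checking for the 0xFEFF marker at the start of a file, and why an ASCII stylesheet read as UTF-16 looks like garbage. This is only an illustration under those assumptions, not WebKit code; detect_utf16_bom, utf16le_unit and the Utf16Bom type are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE } Utf16Bom;

/* Looks for U+FEFF encoded at the start of the buffer. Note that FF FE is
 * also the prefix of the UTF-32 little-endian BOM, which this sketch does
 * not try to tell apart. */
static Utf16Bom detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return BOM_UTF16_LE;
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return BOM_UTF16_BE;
    }
    return BOM_NONE;
}

/* Reads the index-th 16-bit code unit as little-endian UTF-16. The caller
 * must ensure that the buffer holds at least 2 * index + 2 bytes. */
static uint16_t utf16le_unit(const unsigned char *buf, size_t index)
{
    return (uint16_t)(buf[2 * index] | (buf[2 * index + 1] << 8));
}
```

For an ASCII stylesheet starting with "body", detect_utf16_bom() reports BOM_NONE, and pairing the bytes as UTF-16 turns 'b' (0x62) and 'o' (0x6F) into the single code unit 0x6F62, a CJK ideograph. That matches the kind of garbage described above.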
Michael Catanzaro
Comment 6 2018-04-18 09:19:31 PDT
Hmmm... gchar* strings are always UTF-8 in GLib APIs, always; that's a GLib convention, not a WebKit convention. So WebKit treats them as such, and the caller is responsible for converting to UTF-8 before calling webkit_web_view_load_html() or webkit_web_view_load_plain_text(). This usually works, because applications can usually assume that files on disk are always UTF-8. But it breaks down badly for HTML and CSS files with WebKit.

The WebKit API assumes the input is UTF-8 and converts from UTF-8 unconditionally. But then it goes ahead and treats it as ISO-8859-1 by default, or as whatever other encoding is specified in WebKitSettings. And WebKit uses UTF-16 for everything internally (WTF::String), so if anything is being treated as UTF-16, that's probably a bug related to a missing conversion somewhere.

So while on one hand it seems natural that input should be UTF-8, it does seem more than a little unfriendly to ignore the default encoding that was set with WebKitSettings. Carlos, what do you think?
Milan Crha
Comment 7 2018-04-18 09:50:49 PDT
(In reply to Michael Catanzaro from comment #6)
> gchar* strings are always UTF-8 in GLib APIs, always, that's a GLib
> convention

Not general gchar * strings/variables; see for example: https://developer.gnome.org/glib/2.54/glib-Character-Set-Conversion.html#g-filename-to-utf8

But I agree with you, it's a reasonable expectation that a gchar * holds a byte-oriented encoding, and I'd go even further and expect, in the case of webkit_web_view_load_html() and webkit_web_view_load_plain_text() and the one for alternative representation, that the passed-in string is in UTF-8, or better, in the WebKitSettings' default-encoding. It won't fix the issue of mixed HTML and CSS on its own, but it's a good start.

See the mentioned bug #127481, where the guessing of the CSS encoding from the actual content of the HTML is possibly done (aka WebPage::loadString() according to the patch attached there, but that patch is more than 4 years old now, so the code could have changed heavily in the meantime).