Bug 166322 - [GTK] Consider probabilistically guessing text encoding if unspecified by document
Status: RESOLVED WONTFIX
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKitGTK
Version: WebKit Nightly Build
Hardware: PC Linux
Importance: P2 Normal
Assignee: Nobody
Reported: 2016-12-21 08:07 PST by Michael Catanzaro
Modified: 2016-12-21 13:28 PST
Description Michael Catanzaro 2016-12-21 08:07:00 PST
If neither an HTML document nor its HTTP headers specify a text encoding, we should try to detect the encoding probabilistically like Firefox does, instead of just getting it wrong. Unlike Firefox, we probably don't want to have to maintain our own encoding detectors. We already depend on ICU, so let's use ICU's character set detection API. [1]

[1] http://userguide.icu-project.org/conversion/detection
Comment 1 Alexey Proskuryakov 2016-12-21 09:27:19 PST
WebKit supports this (see WKPreferencesSetEncodingDetectorEnabled), so clients can enable if desired.

With charset detection, one needs to either buffer a lot of data before rendering a page, which impacts performance, or to switch encodings in the middle, which is crazy. Even so, the result is probabilistic. I don't recommend enabling charset auto-detection.

RESOLVED/INVALID, as this already exists.
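
The buffering trade-off described in this comment can be illustrated with a small Python sketch. The detector here is a deliberately naive stand-in (strict UTF-8, else ISO 8859-1), not ICU's detector or anything WebKit ships, and the page bytes and the 512-byte buffer size are made up for the example. It shows how a guess made on a short prefix can disagree with a guess made on the whole document.

```python
def guess_encoding(sample: bytes) -> str:
    """Toy detector (an assumption for illustration, not ICU):
    strict UTF-8 if the sample decodes cleanly, otherwise ISO 8859-1."""
    try:
        sample.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"


# Hypothetical unlabelled page: a long ASCII-only head, then one
# non-ASCII word encoded as ISO 8859-1 near the end.
page = (
    b"<html><head><title>News</title></head><body>"
    + b"<p>plain ascii text</p>" * 40
    + "São Paulo".encode("iso-8859-1")
    + b"</body></html>"
)

# Guess after buffering only the first 512 bytes (all ASCII so far).
early_guess = guess_encoding(page[:512])

# Guess after buffering the entire document.
late_guess = guess_encoding(page)

print(early_guess, late_guess)  # the two guesses differ
```

A renderer that commits to the early guess has to either re-decode the page once the later bytes arrive or keep the wrong encoding; one that waits for the late guess delays rendering.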
Comment 2 Michael Catanzaro 2016-12-21 10:02:06 PST
Moving to GTK then, as we don't have an API for this right now.

(In reply to comment #1)
> With charset detection, one needs to either buffer a lot of data before
> rendering a page, which impacts performance, or to switch encodings in the
> middle, which is crazy. Even so, the result is probabilistic. I don't
> recommend enabling charset auto-detection.

We should investigate to see what Firefox is doing.
Comment 3 Alexey Proskuryakov 2016-12-21 10:08:35 PST
In Firefox, the user has to manually choose which language to detect encoding for, and also, there are heuristics based on browsing history.
Comment 4 Michael Catanzaro 2016-12-21 10:13:26 PST
Also Alexey, could you tell me what encoding Safari uses if it's unspecified? I presume it's either ISO 8859-1 or UTF-8? In WebKitGTK+ we use ISO 8859-1, and I think we can't change it because using UTF-8 breaks some websites (e.g. some Brazilian sites; I think Gustavo can provide an example link).

(In reply to comment #3)
> In Firefox, the user has to manually choose which language to detect
> encoding for, and also, there are heuristics based on browsing history.

Hmmm. [1] is displayed properly by Firefox, at least for me, but not in Epiphany. (Scroll down a bit to see broken names.) There is no initial byte order mark (U+FEFF) to indicate the right Unicode encoding, so either Firefox is able to detect the right encoding probabilistically, or it must assume UTF-8 by default.

[1] http://ftp-nyc.osuosl.org/pub/gnome/core/3.23/3.23.3/NEWS
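
For reference, the two possibilities raised in this comment (a byte order mark, or a UTF-8 default) can be sketched in Python. This is only an illustration under assumptions: `decode_unlabelled` is a hypothetical helper, the sample name is invented rather than taken from the NEWS file, and nothing here claims to be what Firefox or WebKit actually does.

```python
import codecs
from typing import Tuple


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 encoding of U+FEFF (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_unlabelled(data: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Hypothetical helper: honour a BOM, else try strict UTF-8,
    else use the legacy fallback. Returns (text, encoding label)."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8 (BOM)"
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback), fallback


# Invented sample: a UTF-8 encoded name served without a BOM or charset label.
sample = "José Müller".encode("utf-8")

text, label = decode_unlabelled(sample)
mojibake = sample.decode("iso-8859-1")  # what a fixed ISO 8859-1 default yields

print(has_utf8_bom(sample), label)
print(text)
print(mojibake)
```

A fixed ISO 8859-1 default never fails to decode, which is why the broken names show up as garbage rather than as an error.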
Comment 5 Alexey Proskuryakov 2016-12-21 10:29:54 PST
Safari picks a default encoding based on the user's primary language. On the Mac, there is a preference to change the default, and a menu option to override the encoding for the current page. On iOS, it's not configurable at all.

This GNOME changelog page isn't displayed correctly in Firefox for me; it uses a Cyrillic encoding instead of UTF-8. I can't even figure out why. I don't see a preference for that anywhere, so perhaps they are falling back on the system language's encoding too.

A funny observation - the page only displays correctly if I choose Japanese auto-detection, not if I choose Russian or Ukrainian.
Comment 6 Michael Catanzaro 2016-12-21 13:28:03 PST
Alexey has convinced me that this is a bad idea.

Maybe we should consider setting a default encoding based on locale, but I'm not really confident in that either; it sounds like a recipe for locale-specific bugs. That could be a topic for another bug report, another day. Thanks for the tips anyway.
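
As a rough sketch of the locale-based default floated above, and of why it invites locale-specific bugs: the table below is a small illustrative assumption, not Safari's or WebKitGTK's actual mapping, and `default_encoding_for_locale` is a hypothetical helper. The same unlabelled bytes decode differently depending on the user's locale.

```python
# Illustrative language -> legacy encoding table (an assumption, not a real
# browser's mapping). Unlisted languages get the ISO 8859-1 default that
# WebKitGTK+ is described as using in this thread.
LEGACY_DEFAULTS = {
    "ru": "windows-1251",
    "ja": "shift_jis",
    "ko": "euc-kr",
    "tr": "windows-1254",
    "el": "iso-8859-7",
}


def default_encoding_for_locale(locale_name: str, fallback: str = "iso-8859-1") -> str:
    """Hypothetical helper: map a locale such as 'ru_RU.UTF-8' to a default."""
    language = locale_name.replace("-", "_").split("_")[0].lower()
    return LEGACY_DEFAULTS.get(language, fallback)


# The same unlabelled bytes, as a legacy Russian page might send them.
page_bytes = "Привет".encode("windows-1251")

as_russian = page_bytes.decode(default_encoding_for_locale("ru_RU.UTF-8"))
as_brazilian = page_bytes.decode(default_encoding_for_locale("pt_BR"))

print(as_russian)    # readable for a user with a Russian locale
print(as_brazilian)  # garbled for a user with a Brazilian locale
```

A bug report from one user would then be unreproducible for a developer in another locale unless the locale is part of the report.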