Summary: | UMBRELLA: text encoding oddities | ||
---|---|---|---|
Product: | WebKit | Reporter: | Maciej Stachowiak <mjs> |
Component: | Text | Assignee: | Nobody <webkit-unassigned> |
Status: | NEW --- | ||
Severity: | Normal | CC: | annevk, ap, darin, mail, mmaxfield, naruse |
Priority: | P2 | ||
Version: | Safari Technology Preview | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
See Also: | https://bugs.webkit.org/show_bug.cgi?id=178207 | ||
Bug Depends on: | 54444, 235307, 265261, 179305, 179307, 179309, 179312, 179412, 179416, 179435, 179460, 179582, 180356, 233612 | ||
Bug Blocks: | |||
Attachments: |
Description
Maciej Stachowiak
2017-11-05 15:20:19 PST
Created attachment 326080 [details]
Cross-platform WebKit encodings
Created attachment 326081 [details]
Mac-only WebKit encodings
Created attachment 326082 [details]
iOS-only WebKit encodings
Created attachment 326083 [details]
Script for checking encoding consistency
Created attachment 326084 [details]
Results of the consistency script, identifying numerous inconsistencies
Cc'd some people that I hope know about WebKit's text encodings and/or encodings in general. There are probably multiple separate bugs to be filed here.

I have noticed these anomalies, too. I was looking into this myself recently after the work I did on bug 178207. Many can be explained by some of this history:

- The TEC decoder was the first one in WebKit. We created that decoder before the rest of them, and at the time we created it our goal was to support all encodings that TEC knew how to decode rather than necessarily selecting an appropriate set for the web. Many strange things about encodings are due to the fact that we have treated that as the "main" decoder on Mac, rather than using it only for a few special purposes. It seems likely we would not need it at all, since ICU should have the support we need.
- Most of the character set names used by the TEC decoder were based on the IANA character set assignments <https://www.iana.org/assignments/character-sets/>. A snapshot of the IANA assignments (about 11 years old) still exists in the source tree at Source/WebCore/platform/text/mac/character-sets.txt and is used by the script that generates the character set name table used by the TEC decoder. This file is where most of the aliases came from.
- There is a separate list of additional encoding names at Source/WebCore/platform/text/mac/mac-encodings.txt that are also used for the TEC encoder on the Mac. Even back when this was last modified in 2009, the status of this file was "we would like to get rid of it".

I think we should eliminate as many encoding names as we can, and synchronize with the encoding specification. Any encoding name that we decide to continue to support but that is not mentioned in the specification needs a really good rationale; perhaps such encoding names can be limited to the context where they are required or, alternatively, added to the encoding specification and to other web browser engines.
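As a rough illustration of what "synchronize with the encoding specification" could mean in practice, here is a minimal Python sketch that compares an engine's label table against the specification's. This is not the attached consistency script. SPEC_LABELS is only a small excerpt of the labels listed in the Encoding Standard, and ENGINE_LABELS is a hypothetical table invented for the example; compare_labels is likewise an illustrative name.

```python
# Small excerpt of label -> canonical name pairs from the Encoding Standard.
SPEC_LABELS = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "unicode-1-1-utf-8": "UTF-8",
    "big5": "Big5",
    "big5-hkscs": "Big5",
    "cn-big5": "Big5",
    "csbig5": "Big5",
    "x-x-big5": "Big5",
}

# Hypothetical engine table, invented for illustration only.
ENGINE_LABELS = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "big5": "Big5",
    "big5-hkscs": "Big5-HKSCS",  # canonical name differs from the spec
    "x-made-up-label": "Big5",   # label the spec does not list
}


def compare_labels(engine, spec):
    """Report labels that are extra, missing, or mapped to a different name."""
    # Labels are matched case-insensitively, ignoring surrounding whitespace.
    engine = {label.strip().lower(): name for label, name in engine.items()}
    spec = {label.strip().lower(): name for label, name in spec.items()}
    extra = sorted(set(engine) - set(spec))
    missing = sorted(set(spec) - set(engine))
    mismatched = sorted(
        (label, engine[label], spec[label])
        for label in set(engine) & set(spec)
        if engine[label].lower() != spec[label].lower()
    )
    return {"extra": extra, "missing": missing, "mismatched": mismatched}


report = compare_labels(ENGINE_LABELS, SPEC_LABELS)
for kind, entries in report.items():
    print(kind, entries)
```

Each "extra" entry is a label that would need the rationale described above, each "missing" entry is a label to add, and each "mismatched" entry is a canonical-name difference of the kind that could show up in form submission.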
I don’t know how to best determine how eliminating support for certain encoding names or changing canonical names (which I think mainly affects encoding names in form submission?) will affect website and app compatibility.

I suspect that if we eliminate those unneeded encoding names, we will find that we can easily eliminate the TEC decoder entirely, and many of the anomalies above will simply disappear if we remove the encoding names.

When doing this work and removing the TEC decoder we should be aware that it’s possible for a decoder to add aliases that affect even encodings that are actually supported only by other decoders.

As background for why item (2) might currently be as it is, there are three Big5-encoding-related changes in 2003, all done by me and reviewed by you, Maciej; easy to find by searching for "Big5" in Source/WebCore/ChangeLog-2003-10-25.

If Chromium and WebKit still exchange code, it might be worthwhile to look at how Chromium invokes ICU and what patches they apply on top.

These issues are probably also of interest: https://github.com/whatwg/encoding/issues?q=is%3Aissue+is%3Aopen+label%3Atests (note that each of them links to a WebKit bug; those should probably become dependencies).

(In reply to Anne van Kesteren from comment #8)
> If Chromium and WebKit still exchange code, it might be worthwhile to look
> at how Chromium invokes ICU and what patches they apply on top.

You’re implying that since Chromium made a branch from WebKit they fixed some of these issues. If true I am sure we can take advantage of what’s done in that project. On the other hand, while researching bug 178207 I learned that one set of changes to codecs done in Chromium after branching from WebKit created a new bug. So while we can look, we need to be careful in assuming it’s all correct. It’s also possible that Chromium has made changes that would cause problems for Mac or iOS apps that use WebKit; a different domain than websites.
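The earlier point about a decoder adding aliases for encodings that only other decoders support can be pictured with a toy registry. This is a sketch under assumed names, not WebKit's actual registration code: EncodingRegistry, its methods, the codec tags "icu" and "tec", and the "x-hypothetical-*" labels are all made up for the example.

```python
class EncodingRegistry:
    """Toy registry where each codec registers encodings and aliases."""

    def __init__(self):
        self.decoders = {}  # canonical name -> codec that decodes it
        self.aliases = {}   # alias -> (canonical name, codec that added it)

    def register_encoding(self, codec, canonical):
        self.decoders[canonical.lower()] = codec

    def register_alias(self, codec, alias, canonical):
        self.aliases[alias.lower()] = (canonical.lower(), codec)

    def resolve(self, label):
        """Return the canonical name for a label, or None if unsupported."""
        label = label.strip().lower()
        canonical = self.aliases.get(label, (label, None))[0]
        return canonical if canonical in self.decoders else None

    def aliases_lost_by_removing(self, codec):
        """Aliases added by `codec` whose target is decoded by another codec."""
        return sorted(
            alias
            for alias, (canonical, owner) in self.aliases.items()
            if owner == codec
            and self.decoders.get(canonical) not in (None, codec)
        )


registry = EncodingRegistry()
# Hypothetical setup: one codec decodes Big5, another contributes an alias.
registry.register_encoding("icu", "Big5")
registry.register_alias("tec", "x-hypothetical-big5", "Big5")
registry.register_encoding("tec", "x-hypothetical-mac")
registry.register_alias("tec", "x-hypothetical-mac-alias", "x-hypothetical-mac")

print(registry.resolve("X-Hypothetical-Big5"))
print(registry.aliases_lost_by_removing("tec"))
```

Listing these cross-codec aliases before removing a decoder would show which labels silently stop resolving even though the encoding itself is still supported by another decoder.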
For the few cases where we don't support encodings from the standard, it's pretty clear what to do. (Well, not entirely for Big5... we might need a handwritten encoder to support the union of ICU's Big5 and Big5-HKSCS supported characters, if I'm interpreting the test results correctly.) We also have many miscellaneous bugs in the behavior of our encoders per WPT tests, but with this umbrella I'm only trying to fix the labels, not the encoders themselves.

For the cases where we have extra encodings or extra labels for them, it's a little less obvious. As Darin said, some of it might be for compatibility with native apps or local files.

I think that Chromium includes (used to include?) a custom build of ICU with modified codecs, but I don't know any details.

*** Bug 55441 has been marked as a duplicate of this bug. ***