Bug 179303

Summary: UMBRELLA: text encoding oddities
Product: WebKit Reporter: Maciej Stachowiak <mjs>
Component: TextAssignee: Nobody <webkit-unassigned>
Status: NEW ---    
Severity: Normal CC: annevk, ap, darin, mail, mmaxfield, naruse
Priority: P2    
Version: Safari Technology Preview   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=178207
Bug Depends on: 54444, 235307, 265261, 179305, 179307, 179309, 179312, 179412, 179416, 179435, 179460, 179582, 180356, 233612    
Bug Blocks:    
Attachments:
Description Flags
The Encoding Standard's set of encodings
none
Cross-platform WebKit encodings
none
Mac-only WebKit encodings
none
iOS-only WebKit encodings
none
Script for checking encoding consistency
none
Results of the consistency script, identifying numerous inconsistencies none

Description Maciej Stachowiak 2017-11-05 15:20:19 PST
Created attachment 326079 [details]
The Encoding Standard's set of encodings

WebKit has numerous oddities in its text encoding support, comparing to the definitions in the <https://encoding.spec.whatwg.org> standard and comparing between platforms.

I made a script to spot inconsistencies, and I'll attach both that and its output.

Here are some of the odd things I noticed:

(1) WebKit thinks the canonical name for EUC-KR/windows-949 is windows-949, but the standard thinks the canonical name is EUC-KR. Not sure if this matters. Probably WebKit should be consistent with the standard unless there is some reason not to.

(2) WebKit thinks big5-hkscs is Big5-HKSCS, while the standard thinks it is Big5. (WebKit has a bunch of other names for Big5-HKSCS which the standard doesn't seem to recognize. If Big5 and Big5-HKSCS are not the same, then this is almost surely an interop bug and must be fixed either in WebKit or in the Encoding standard.

(3) WebKit recognizes various extra names for encodings that are in the standard. For example, we recognize "windows-10007" and "maccyrillic" as "x-maccyrillic", but the Encoding Standard only recognizes "x-maccyrillic" and "x-mac-ukrainian" as names for that encoding. There are many more like this. It's not clear if these are WebKit bugs or spec bugs.

(4) WebKit on all platforms knows some encodings that the Encoding Standard doesn't recognize at all: x-mac-turkish, UTF-32BE, UTF-32LE, UTF-32, Big5-HKSCS, x-mac-centraleurroman and x-mac-greek. At least UTF-32 is likely to be an interop and security problem. Big5-HKCS is a problem for reasons noted in (2). I don't know if the others are WebKit bugs or spec bugs.

(5) macOS WebKit knows extra aliases for encodingings that are known to both cross-platform WebKit and the Encoding Standard. For example, macOS WebKit recognizes ['iso-8859-14', 'iso8859141998', 'isoceltic', 'isoir199', 'l8', 'latin8'] as extra names for "ISO-8859-14", but the standard only recognizes ["iso-8859-14", "iso8859-14", "iso885914"]. There are quite a few like this. It's not clear why these names are needed, and why only on macOS.

(6) macOS WebKit knows a number of extra encodings that aren't known to other WebKit ports and which are not in the Encoding standard. There are about 30 of these so I won't list them all here. These are all implemented via TEC rather than ICU, though it's possible some have ICU implementations available. Maybe some of these are required for legacy reasons but it's not at all clear which.

(7) macOS WebKit uses TEC for some encodings that use ICU on all other platforms (including iOS), for example iso-8859-16. It's not clear why. These should probably be switched to use the ICU implementation on macOS.

(8) iOS WebKit supports a number of extra encodings via ICU, with a comment claiming this is due to lack of TEC on the iOS platform. However, these seocndings aren't supported in any other WebKit port, not even macOS via TEC. It's not clear if these are required for anything.
Comment 1 Maciej Stachowiak 2017-11-05 15:20:49 PST
Created attachment 326080 [details]
Cross-platform WebKit encodings
Comment 2 Maciej Stachowiak 2017-11-05 15:21:14 PST
Created attachment 326081 [details]
Mac-only WebKit encodings
Comment 3 Maciej Stachowiak 2017-11-05 15:21:39 PST
Created attachment 326082 [details]
iOS-only WebKit encodings
Comment 4 Maciej Stachowiak 2017-11-05 15:22:05 PST
Created attachment 326083 [details]
Script for checking encoding consistency
Comment 5 Maciej Stachowiak 2017-11-05 15:22:39 PST
Created attachment 326084 [details]
Results of the consistency script, identifying numerous inconsistencies
Comment 6 Maciej Stachowiak 2017-11-05 15:25:35 PST
Cc'd some people that I hope know about WebKit's text encodings and/or encodings in general. There's probably multiple separate bugs to be filed here.
Comment 7 Darin Adler 2017-11-05 17:02:22 PST
I have noticed these anomalies, too. I was looking into this myself recently after the work I did on bug 178207. Many can be explained by some of this history:

- The TEC decoder was the first one in WebKit. We created that decoder before the rest of them, and at the time we created it our goal was to support all encodings that TEC knew how to decode rather than necessarily selecting an appropriate set for the web. Many strange things about encodings are due to the fact that we have treated that as the "main" decoder on Mac, rather than using it only for a few special purposes. Seems likely we would not need it at all since ICU should have the support we need.

- Most of the character set names used by the TEC decoder were based on the IANA character set assignments <https://www.iana.org/assignments/character-sets/>. A snapshot of the IANA assignments (about 11 years old) still exists in the source tree at Source/WebCore/platform/text/mac/character-sets.txt and is used by the script that generates the character set name table used by the TEC decoder. This file is where most of the aliases came from.

- There is a separate list of additional encoding names at Source/WebCore/platform/text/mac/mac-encodings.txt that are also used for the TEC encoder on the Mac. Even back when this was last modified in 2009, the status of this file was "we would like to get rid of it".

I think we should eliminate as many encoding names as we can, and synchronize with the encoding specification. Any encoding names that we decide to continue to support that is not mentioned in the specification needs a really good rationale; perhaps such encoding names can be limited to the context where they are required or, alternatively, added to the encoding specification and to other web browser engines.

I don’t know how to best determine how eliminating support for certain encoding names or changing canonical names (which I think mainly affects encoding names in form submission?) will affect website and app compatibility.

I suspect that if we eliminate those unneeded encoding names, we will find that we can easily eliminate the TEC decoder entirely, and many of the anomalies above will simply disappear if we remove the encoding names. When doing this work and removing the TEC decoder we should be aware that it’s possible for a decoder to add aliases that affect even encodings that are actually supported only by other decoders.

As background for why item (2) might currently be as it is, there are three Big5-encoding-related changes in 2003, all done by me and reviewed by you, Maciej; easy to find by searching for "Big5" in Source/WebCore/ChangeLog-2003-10-25.
Comment 8 Anne van Kesteren 2017-11-06 04:08:51 PST
If Chromium and WebKit still exchange code, it might be worthwhile to look at how Chromium invokes ICU and what patches they apply on top.

These issues are probably also of interest: https://github.com/whatwg/encoding/issues?q=is%3Aissue+is%3Aopen+label%3Atests (note that each of them links to a WebKit bug; those should probably become dependencies).
Comment 9 Darin Adler 2017-11-06 07:29:52 PST
(In reply to Anne van Kesteren from comment #8)
> If Chromium and WebKit still exchange code, it might be worthwhile to look
> at how Chromium invokes ICU and what patches they apply on top.

You’re implying that since Chromium made a branch from WebKit they fixed some of these issues. If true I am sure we can take advantage of what’s done in that project. On the other hand, while researching bug 178207 I learned that one set of changes to codecs done in Chromium after branching from WebKit created a new bug. So while we can look, we need to be careful in assuming it’s all correct.

It’s also possible that Chromium has made changes that would cause problems for Mac or iOS apps that use WebKit; a different domain than websites.
Comment 10 Maciej Stachowiak 2017-11-06 09:38:53 PST
For the fewcases where we don't support encodings from the standard, it's pretty clear what to do. (Well, not entirely for Big5... we might need a handwritten encoder to support the union of ICU's Big5 and Big5-HKSCS supported characters (if I'm interpreting the test results correctly).

We also have many miscellaneous bugs in the behavior of our encoders per WPT tests, but with this umbrella I'm only trying to fix the labels, not the encoders themselves.

For the cases where we have extra encodings or extra labels for them, it's a little less obvious. As Darin said, some of it might be for compatibility with native apps or local files.
Comment 11 Alexey Proskuryakov 2017-11-06 10:28:55 PST
I think that Chromium includes (used to include?) a custom build of ICU with modified codecs, but I don't know any details.
Comment 12 Anne van Kesteren 2022-09-27 06:28:48 PDT
*** Bug 55441 has been marked as a duplicate of this bug. ***