Bug 159891 - [encoding] Support for GB18030
Summary: [encoding] Support for GB18030
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text
Version: WebKit Nightly Build
Hardware: Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-07-18 12:04 PDT by Addison Phillips
Modified: 2017-03-15 00:56 PDT
CC: 8 users

See Also:


Description Addison Phillips 2016-07-18 12:04:56 PDT
The W3C I18N WG in cooperation with WhatWG has created tests for the Encoding specification. In testing the encoding:

   GB18030

Our tests found 948 total errors. In addition, 3 of the 5 decoder tests for this encoding failed.

Please see:

https://github.com/whatwg/encoding/issues/57

Please comment on the above github issue in addition to addressing this bug here. If there are errors in our tests or in the specification, we would very much like to know!

[filed for W3C I18N WG]
Comment 1 Myles C. Maxfield 2016-10-20 00:59:44 PDT
I started looking at this and I found that in TextEncoding::encode() we will normalize the strings to NFC before encoding them. Commenting out this normalization causes many more of the tests to pass.
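The effect of that pre-encode normalization can be modelled with a short Python sketch (Python's `unicodedata` here stands in for the normalization WebKit actually performs inside TextEncoding::encode(); the function name is hypothetical):

```python
import unicodedata

def encode_like_webkit(s: str, encoding: str) -> bytes:
    # Model of TextEncoding::encode(): NFC-normalize, then encode.
    return unicodedata.normalize("NFC", s).encode(encoding)

# Decomposed "e" + combining acute (U+0065 U+0301) is folded to the
# precomposed U+00E9 before encoding, so decomposed (Mac-style) and
# precomposed (Windows-style) input yield identical bytes on the wire.
assert encode_like_webkit("e\u0301", "utf-8") == "\u00e9".encode("utf-8")
```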
Comment 2 Myles C. Maxfield 2016-10-20 01:12:18 PDT
(In reply to comment #1)
> I started looking at this and I found that in TextEncoding::encode() we will
> normalize the strings to NFC before encoding them. Commenting out this
> normalization causes many more of the tests to pass.

This was done in Blink in https://chromiumcodereview.appspot.com/19845004. We should consider porting this to WebKit.
Comment 3 Myles C. Maxfield 2016-10-20 01:14:54 PDT
(In reply to comment #2)
> (In reply to comment #1)
> > I started looking at this and I found that in TextEncoding::encode() we will
> > normalize the strings to NFC before encoding them. Commenting out this
> > normalization causes many more of the tests to pass.
> 
> This was done in Blink in https://chromiumcodereview.appspot.com/19845004.
> We should consider porting this to WebKit.

The NFC normalization was eventually removed from Blink in https://codereview.chromium.org/1424303002
Comment 4 Alexey Proskuryakov 2016-10-20 12:16:21 PDT
This normalization is intentional, and is done to avoid hitting issues on sites that have not been extensively tested on Macs. It may or may not be reasonable to stop doing this, but passing tests is not a good reason to change the behavior here.
Comment 5 r12a 2016-10-20 12:26:49 PDT
Alexey, the issue is to do with interoperability and consistency across browsers, per the Encoding specification.  
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gb18030 
shows that WebKit exhibits different behaviour here than Gecko, Blink, and Edge.
Comment 6 Alexey Proskuryakov 2016-10-20 20:51:13 PDT
Interoperability means that users get identical results when they perform identical steps.

Currently, when a user types a character with an accent on Windows, and the browser sends it to the server, the data on the wire is different than the data sent when they do the same on Mac. That's a bug in Blink.
Comment 7 Alexey Proskuryakov 2016-10-20 20:52:59 PDT
...and in Firefox, obviously. Edge doesn't have the bug because it doesn't work on platforms where the system uses decomposed Unicode.
Comment 8 r12a 2016-10-21 05:14:15 PDT
For me, interoperability means that users get identical results when they perform identical steps regardless of the platform or browser they are using.

This particular issue is about what happens when a Unicode code point is converted to a GB18030 sequence of bytes. For example, U+FA5C 臭 is converted by Firefox, Chrome and Edge to the GB18030 byte sequence %84%30%A1%38. Safari, however, converts that Unicode character to the byte sequence %B3%F4 (which is the GB18030 sequence that corresponds to U+81ED 臭).
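The discrepancy in that example is reproducible with Python's built-in gb18030 codec and `unicodedata` (a sketch, with Python standing in for the browser encoders):

```python
import unicodedata

# U+FA5C is a CJK compatibility ideograph whose canonical (singleton)
# decomposition is U+81ED, so NFC replaces it outright.
assert unicodedata.normalize("NFC", "\uFA5C") == "\u81ED"

# Per the Encoding spec, the encoder maps U+FA5C to its own bytes:
assert "\uFA5C".encode("gb18030") == b"\x84\x30\xa1\x38"  # %84%30%A1%38

# NFC-normalizing first (WebKit's behaviour) yields the bytes for U+81ED:
assert unicodedata.normalize("NFC", "\uFA5C").encode("gb18030") == b"\xb3\xf4"  # %B3%F4
```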

In other words, suppose I'm writing the following:
"There are two similar han ideographs which represent the sound xiù, 臭 and 臭. In Unicode these are compatilibity equivalents."

If that document gets converted to the GB18030 encoding by any of the other major browser engines it will look the same, since GB18030 has code points for both forms. If, however, the conversion happens in WebKit, the sentence will be changed from the original and become rather confusing for the reader, since both ideographs shown are now the same. The text no longer says what the writer intended.

"There are two similar han ideographs which represent the sound xiù, 臭 and 臭. In Unicode these are compatilibity equivalents."
Comment 9 r12a 2016-10-21 05:16:15 PDT
I forgot to change the second example, which should have said:

"There are two similar han ideographs which represent the sound xiù, 臭 and 臭. In Unicode these are compatilibity equivalents."
Comment 10 Myles C. Maxfield 2016-10-21 10:24:38 PDT
Simply commenting out the normalization causes these tests to fail:

  fast/forms/form-data-encoding-2.html [ Failure ]
  fast/forms/form-data-encoding.html [ Failure ]
  http/tests/security/contentSecurityPolicy/1.1/scripthash-tests.html [ Failure ]
  http/tests/security/contentSecurityPolicy/1.1/scripthash-unicode-normalization.html [ Failure ]
  inspector/dom/csp-hash.html [ Failure ]
Comment 11 Alexey Proskuryakov 2016-10-21 15:57:20 PDT
> For me, interoperability means that users get identical results when they perform identical steps regardless of the platform or browser they are using.

I think that this is exactly what I said, and that's what WebKit behavior achieves. On the other hand, neither Chrome nor Firefox is interoperable when you compare how they behave on different platforms.
Comment 13 Alexey Proskuryakov 2016-10-22 09:13:35 PDT
Here is a live example of what happens when input isn't normalized to NFC: <http://bash.im/quote/441781>. I don't know what exactly happened there, but it seems very likely that Chrome or Firefox on Mac was at the start of the chain that ended up with "наи&#774;ти".
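For reference, that mojibake is consistent with decomposed Cyrillic: &#774; is the decimal HTML escape of U+0306 COMBINING BREVE, which together with и composes to й (a quick Python check):

```python
import unicodedata

# "й" stored decomposed is U+0438 (и) followed by U+0306 (combining breve);
# NFC composes the pair back into the single code point U+0439 (й).
assert unicodedata.normalize("NFC", "\u0438\u0306") == "\u0439"

# 774 is the decimal value of U+0306, i.e. the "&#774;" in the quoted page.
assert ord("\u0306") == 774
```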
Comment 14 Myles C. Maxfield 2016-10-25 10:52:06 PDT
It sounds like we are discussing two different things here:

Alexey says:
> when a user types a character with an accent

Richard says:
> Unicode code point is converted

As I understand it, the chain of events is:

1. User presses some keys on their keyboard
2. IME stuff happens
3. A sequence of Unicode code points exists somewhere in memory which has some relation with the keys the user pressed
4. Something somewhere triggers form submission
5. This sequence of code points gets converted to a sequence of bytes for the wire
6. Sockets are written to

The form encoding tests start this process at step number 3.

It's also relevant that the code points in memory in step 3 are visible to JavaScript, and are therefore important for interoperability.

It sounds to me that, because step #3 is visible to JavaScript, and step #6 is visible to JavaScript (by way of the GET URL), a conceptual function from one to the other should be interoperable between all browsers on all platforms.
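That step-3-to-step-6 function can be sketched in Python (a stand-in for the browser pipeline, not WebKit code; `submit` and `percent_escape` are hypothetical names):

```python
import unicodedata

def percent_escape(data: bytes) -> str:
    # The %XX notation the encoding tests use: every byte escaped.
    return "".join(f"%{b:02X}" for b in data)

def submit(value: str, encoding: str, normalize: bool) -> str:
    # Step 3 -> step 6: code points in memory to bytes on the wire.
    if normalize:  # models WebKit's current behaviour
        value = unicodedata.normalize("NFC", value)
    return percent_escape(value.encode(encoding))

# The two behaviours observed in the tests:
assert submit("\uFA5C", "gb18030", normalize=False) == "%84%30%A1%38"
assert submit("\uFA5C", "gb18030", normalize=True) == "%B3%F4"
```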

When the user types a character with an accent on Windows, perhaps the problem lies with the processing which converts the keystroke into a sequence of Unicode code points. Conceptually, this would seem to occur during step number 2.

Richard, Alexey: what are your thoughts?
Comment 15 Myles C. Maxfield 2016-10-25 10:55:19 PDT
Sorry, I could have stated this better.

When I said:
> When the user types a character with an accent on Windows, perhaps the
> problem lies

What I meant was:
When we try to match the behavior of a user typing a character with an accent on Windows, perhaps our problem lies
Comment 16 Martin Dürst 2016-10-25 16:51:26 PDT
Two comments:

1) WebKit may be more sensitive than Firefox/Blink/... to differences between Windows and Mac because it's mostly used for Safari on Mac.

2) If normalization (NFC) is necessary (or at least desirable) for GB 18030, then it should also be necessary (or at least desirable) for UTF-8. Is normalization actually used for UTF-8? If it is not, why not? If it is, that might create potentially much bigger interoperability problems.
Comment 17 Alexey Proskuryakov 2016-10-25 17:17:46 PDT
Myles: typing is not the only entry point for decomposed text on Mac - other examples include file names and the pasteboard.

Martin: Yes, we normalize regardless of target encoding. So this is not really the right bug to discuss it, but I couldn't quickly find one where we had this discussion in the past (it might be marked as WONTFIX or INVALID, or maybe that discussion was also in a tangentially related bug).
Comment 18 Myles C. Maxfield 2016-10-25 23:59:29 PDT
(In reply to comment #17)
> Myles: typing is not the only entry point for decomposed test on Mac - other
> examples include file names and pasteboard.

If I'm understanding you correctly, it sounds like these should be updated too.
Comment 19 r12a 2016-10-28 04:11:05 PDT
Myles, Alexey, Martin, this isn't about input at all. It's about the browser's encoder and decoder algorithms when it needs to convert between character encodings. There happen to be two handy ways to expose the behaviour of the encoder (in this case, going from Unicode to GB18030) so that it can be tested: writing characters to form output or to an href value where the expected encoding is GB18030. That's what these tests do (programmatically).

The example I gave above uses an actual character from the tests that doesn't go through the Safari encoder as expected (i.e. without change) per the Encoding spec.

Note btw that NFC transformations would never change the character in that example, since the character used is a compatibility equivalent for the Unicode character it is converted to. Such characters are not affected by NFC.

So in summary, the test is only checking the behaviour of the browser's encoder/decoder when converting between one character encoding and another, and in the case shown, where equivalents exist in both Unicode and GB 18030, the i18n WG and the WhatWG believe that normalization is not relevant.

Note, btw, that when *decoding* text, i.e. from GB 18030 to Unicode, Safari performs all the conversions as expected by the Encoding spec (including the character in the example). In other words, there is a discrepancy between the way the encoder and decoder work.
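The decode direction can likewise be checked with Python's gb18030 codec (a stand-in for the browser decoder):

```python
# Decoding is spec-conformant in both directions here: the four-byte
# sequence maps back to U+FA5C, and the two-byte sequence to U+81ED.
assert b"\x84\x30\xa1\x38".decode("gb18030") == "\uFA5C"
assert b"\xb3\xf4".decode("gb18030") == "\u81ED"
```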

Does that help make things clearer?
Comment 20 r12a 2016-10-28 04:35:13 PDT
BTW, it may be worth pointing directly at the tests themselves. See https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#zhhans and click on the link in the left column to run the test.
Comment 21 r12a 2016-10-28 04:42:08 PDT
> Note btw that NFC transformations would never change the character in that example, since the character used is a compatibility equivalent for the Unicode character it is converted to. Such characters are not affected by NFC.

Actually that's incorrect. This is an NFC-sensitive conversion.
Comment 22 Alexey Proskuryakov 2016-10-28 13:48:28 PDT
What we need to achieve is (roughly speaking) that data sent to servers by Safari is identical to data sent by Edge, for identical user actions.

Currently, we achieve that by normalizing strings when encoding them.

It may be possible to achieve that in other ways (that would make the Encoding API behave consistently). But simply removing normalization would break interoperability in the more important use case (typing text into a form and submitting it).
Comment 23 Anne van Kesteren 2016-11-18 10:17:45 PST
Rather than normalizing at the encoding layer, it might make more sense to normalize at the user-input layer. That way API usage is not affected. It seems that would actually be more cross-platform than what you do currently.
Comment 24 Alexey Proskuryakov 2016-11-18 10:34:30 PST
This may be worth trying. There are substantial risks though - if we normalize text coming from input methods, text offsets will change, and input methods will get confused in various ways.

Changing the spec and other browser engines to send predictably normalized data over the wire seems like a safer and more complete solution to me.
Comment 25 Anne van Kesteren 2016-11-18 10:41:50 PST
I suppose you could file an issue at https://github.com/whatwg/html/issues/new to get everyone to consider changing form submission, but thus far we've avoided a hard dependency on NFC in the platform (other than String.prototype.normalize()). I'm personally not opposed per se, but I think chances are slim it'll succeed.
Comment 26 Maciej Stachowiak 2017-03-15 00:56:33 PDT
(In reply to comment #24)
> This may be worth trying. There are substantial risks though - if we
> normalize text coming from input methods, text offsets will change, and
> input methods will get confused in various ways.

We could normalize only when input method input is accepted (including the possible premature accept when changing focus or submitting while an input method marked region is still active). That would probably avoid confusing input methods.

> Changing the spec and other browser engines to send predictably normalized
> data over the wire seems like a safer and more complete solution to me.

I'm guessing this is a hard sell since for Windows browsers this is more likely to create than resolve compat issues, since it would only make a difference in the programmatic entry case.