Bug 159891 - [encoding] Support for GB18030
Summary: [encoding] Support for GB18030
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text
Version: WebKit Nightly Build
Hardware: Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-07-18 12:04 PDT by Addison Phillips
Modified: 2017-03-15 00:56 PDT
CC: 8 users

See Also:


Description Addison Phillips 2016-07-18 12:04:56 PDT
The W3C I18N WG in cooperation with WhatWG has created tests for the Encoding specification. In testing the encoding:

   GB18030

Our tests found 948 total errors. In addition, 3 of the 5 decoder tests for this encoding failed.

Please see:

https://github.com/whatwg/encoding/issues/57

Please comment on the above github issue in addition to addressing this bug here. If there are errors in our tests or in the specification, we would very much like to know!

[filed for W3C I18N WG]
Comment 1 Myles C. Maxfield 2016-10-20 00:59:44 PDT
I started looking at this and I found that in TextEncoding::encode() we will normalize the strings to NFC before encoding them. Commenting out this normalization causes many more of the tests to pass.
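The effect of that pre-encode normalization can be modelled with a short Python sketch (Python's `unicodedata` here stands in for the normalization WebKit actually performs inside TextEncoding::encode(); the function name is hypothetical):

```python
import unicodedata

def encode_like_webkit(s: str, encoding: str) -> bytes:
    # Model of TextEncoding::encode(): NFC-normalize, then encode.
    return unicodedata.normalize("NFC", s).encode(encoding)

# Decomposed "e" + combining acute (U+0065 U+0301) is folded to the
# precomposed U+00E9 before encoding, so decomposed (Mac-style) and
# precomposed (Windows-style) input yield identical bytes on the wire.
assert encode_like_webkit("e\u0301", "utf-8") == "\u00e9".encode("utf-8")
```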
Comment 2 Myles C. Maxfield 2016-10-20 01:12:18 PDT
(In reply to comment #1)
> I started looking at this and I found that in TextEncoding::encode() we will
> normalize the strings to NFC before encoding them. Commenting out this
> normalization causes many more of the tests to pass.

This was done in Blink in https://chromiumcodereview.appspot.com/19845004. We should consider porting this to WebKit.
Comment 3 Myles C. Maxfield 2016-10-20 01:14:54 PDT
(In reply to comment #2)
> (In reply to comment #1)
> > I started looking at this and I found that in TextEncoding::encode() we will
> > normalize the strings to NFC before encoding them. Commenting out this
> > normalization causes many more of the tests to pass.
> 
> This was done in Blink in https://chromiumcodereview.appspot.com/19845004.
> We should consider porting this to WebKit.

The NFC normalization was eventually removed from Blink in https://codereview.chromium.org/1424303002
Comment 4 Alexey Proskuryakov 2016-10-20 12:16:21 PDT
This normalization is intentional, and is done to avoid hitting issues on sites that have not been extensively tested on Macs. It may or may not be reasonable to stop doing this, but passing tests is not a good reason to change the behavior here.
Comment 5 r12a 2016-10-20 12:26:49 PDT
Alexey, the issue is to do with interoperability and consistency across browsers, per the Encoding specification.  
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gb18030 
shows that WebKit exhibits different behaviour here than Gecko, Blink, and Edge.
Comment 6 Alexey Proskuryakov 2016-10-20 20:51:13 PDT
Interoperability means that users get identical results when they perform identical steps.

Currently, when a user types a character with an accent on Windows, and the browser sends it to the server, the data on the wire is different than the data sent when they do the same on Mac. That's a bug in Blink.
Comment 7 Alexey Proskuryakov 2016-10-20 20:52:59 PDT
...and in Firefox, obviously. Edge doesn't have the bug because it doesn't work on platforms where the system uses decomposed Unicode.
Comment 8 r12a 2016-10-21 05:14:15 PDT
For me, interoperability means that users get identical results when they perform identical steps regardless of the platform or browser they are using.

This particular issue is about what happens when a Unicode code point is converted to a GB18030 sequence of bytes. For example, U+FA5C 臭 is converted by Firefox, Chrome and Edge to the GB18030 byte sequence %84%30%A1%38. Safari, however, converts that Unicode character to the byte sequence %B3%F4 (which is the GB18030 sequence that corresponds to U+81ED 臭).
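The discrepancy in that example is reproducible with Python's built-in gb18030 codec and `unicodedata` (a sketch, with Python standing in for the browser encoders):

```python
import unicodedata

# U+FA5C is a CJK compatibility ideograph whose canonical (singleton)
# decomposition is U+81ED, so NFC replaces it outright.
assert unicodedata.normalize("NFC", "\uFA5C") == "\u81ED"

# Per the Encoding spec, the encoder maps U+FA5C to its own bytes:
assert "\uFA5C".encode("gb18030") == b"\x84\x30\xa1\x38"  # %84%30%A1%38

# NFC-normalizing first (WebKit's behaviour) yields the bytes for U+81ED:
assert unicodedata.normalize("NFC", "\uFA5C").encode("gb18030") == b"\xb3\xf4"  # %B3%F4
```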

In other words, suppose I'm writing the following:
"There are two similar han ideographs which represent the sound xiù, 臭 and 臭. In Unicode these are compatilibity equivalents."

If that document gets converted to the GB18030 encoding by any of the other major browser engines it will look the same, since GB18030 has code points for both forms. If, however, the conversion happens in WebKit, the sentence will be changed from the original and become rather confusing for the reader, since both ideographs shown are now the same. The text no longer says what the writer intended.

"There are two similar han ideographs which represent the sound xiù, 臭 and 臭. In Unicode these are compatilibity equivalents."
Comment 9 r12a 2016-10-21 05:16:15 PDT
I forgot to change the second example, which should have said:

"There are two similar han ideographs which represent the sound xiù, 臭 and 臭. In Unicode these are compatilibity equivalents."
Comment 10 Myles C. Maxfield 2016-10-21 10:24:38 PDT
Simply commenting out the normalization causes these tests to fail:

  fast/forms/form-data-encoding-2.html [ Failure ]
  fast/forms/form-data-encoding.html [ Failure ]
  http/tests/security/contentSecurityPolicy/1.1/scripthash-tests.html [ Failure ]
  http/tests/security/contentSecurityPolicy/1.1/scripthash-unicode-normalization.html [ Failure ]
  inspector/dom/csp-hash.html [ Failure ]
Comment 11 Alexey Proskuryakov 2016-10-21 15:57:20 PDT
> For me, interoperability means that users get identical results when they perform identical steps regardless of the platform or browser they are using.

I think that this is exactly what I said, and that's what WebKit behavior achieves. On the other hand, neither Chrome nor Firefox is interoperable when you compare how they behave on different platforms.
Comment 13 Alexey Proskuryakov 2016-10-22 09:13:35 PDT
Here is a live example of what happens when input isn't normalized to NFC: <http://bash.im/quote/441781>. I don't know what exactly happened there, but it seems very likely that Chrome or Firefox on Mac was at the start of the chain that ended up with "наи&#774;ти".
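For reference, that mojibake is consistent with decomposed Cyrillic: &#774; is the decimal HTML escape of U+0306 COMBINING BREVE, which together with и composes to й (a quick Python check):

```python
import unicodedata

# "й" stored decomposed is U+0438 (и) followed by U+0306 (combining breve);
# NFC composes the pair back into the single code point U+0439 (й).
assert unicodedata.normalize("NFC", "\u0438\u0306") == "\u0439"

# 774 is the decimal value of U+0306, i.e. the "&#774;" in the quoted page.
assert ord("\u0306") == 774
```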
Comment 14 Myles C. Maxfield 2016-10-25 10:52:06 PDT
It sounds like we are discussing two different things here:

Alexey says:
> when a user types a character with an accent

Richard says:
> Unicode code point is converted

As I understand it, the chain of events is:

1. User presses some keys on their keyboard
2. IME stuff happens
3. A sequence of Unicode code points exists somewhere in memory which has some relation with the keys the user pressed
4. Something somewhere triggers form submission
5. This sequence of code points gets converted to a sequence of bytes for the wire
6. Sockets are written to

The form encoding tests start this process at step number 3.

It's also relevant that the code points in memory in step 3 are visible to JavaScript, and are therefore important for interoperability.

It sounds to me that, because step #3 is visible to JavaScript, and step #6 is visible to JavaScript (by way of the GET URL), a conceptual function from one to the other should be interoperable between all browsers on all platforms.
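That step-3-to-step-6 function can be sketched in Python (a stand-in for the browser pipeline, not WebKit code; `submit` and `percent_escape` are hypothetical names):

```python
import unicodedata

def percent_escape(data: bytes) -> str:
    # The %XX notation the encoding tests use: every byte escaped.
    return "".join(f"%{b:02X}" for b in data)

def submit(value: str, encoding: str, normalize: bool) -> str:
    # Step 3 -> step 6: code points in memory to bytes on the wire.
    if normalize:  # models WebKit's current behaviour
        value = unicodedata.normalize("NFC", value)
    return percent_escape(value.encode(encoding))

# The two behaviours observed in the tests:
assert submit("\uFA5C", "gb18030", normalize=False) == "%84%30%A1%38"
assert submit("\uFA5C", "gb18030", normalize=True) == "%B3%F4"
```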

When the user types a character with an accent on Windows, perhaps the problem lies with the processing which converts the keystroke into a sequence of Unicode code points. Conceptually, this would seem to occur during step number 2.

Richard, Alexey: what are your thoughts?
Comment 15 Myles C. Maxfield 2016-10-25 10:55:19 PDT
Sorry, I could have stated this better.

When I said:
> When the user types a character with an accent on Windows, perhaps the
> problem lies

What I meant was:
When we try to match the behavior of a user typing a character with an accent on Windows, perhaps our problem lies
Comment 16 Martin Dürst 2016-10-25 16:51:26 PDT
Two comments:

1) WebKit may be more sensitive than Firefox/Blink/... to differences between Windows and Mac because it's mostly used for Safari on Mac.

2) If normalization (NFC) is necessary (or at least desirable) for GB 18030, then it should also be necessary (or at least desirable) for UTF-8. Is normalization actually used for UTF-8? If it is not, why not? If it is, that might create potentially much bigger interoperability problems.
Comment 17 Alexey Proskuryakov 2016-10-25 17:17:46 PDT
Myles: typing is not the only entry point for decomposed text on Mac - other examples include file names and the pasteboard.

Martin: Yes, we normalize regardless of target encoding. So this is not really the right bug to discuss it, but I couldn't quickly find one where we had this discussion in the past (it might be marked as WONTFIX or INVALID, or maybe that discussion was also in a tangentially related bug).
Comment 18 Myles C. Maxfield 2016-10-25 23:59:29 PDT
(In reply to comment #17)
> Myles: typing is not the only entry point for decomposed test on Mac - other
> examples include file names and pasteboard.

If I'm understanding you correctly, it sounds like these should be updated too.
Comment 19 r12a 2016-10-28 04:11:05 PDT
Myles, Alexey, Martin, this isn't about input at all. It's about the browser's encoder and decoder algorithms when it needs to convert between character encodings. There happen to be two handy ways to expose the behaviour of the encoder (in this case, going from Unicode to GB18030) so that it can be tested: writing characters to form output or to an href value where the expected encoding is GB18030. That's what these tests do (programmatically).

The example I gave above uses an actual character from the tests that doesn't go through the Safari encoder as expected (i.e. without change) per the Encoding spec.

Note btw that NFC transformations would never change the character in that example, since the character used is a compatibility equivalent for the Unicode character it is converted to. Such characters are not affected by NFC.

So in summary, the test is only checking the behaviour of the browser's encoder/decoder when converting between one character encoding and another, and in the case shown, where equivalents exist in both Unicode and GB 18030, the i18n WG and the WhatWG believe that normalization is not relevant.

Note, btw, that when *decoding* text, i.e. from GB 18030 to Unicode, Safari performs all the conversions as expected by the Encoding spec (including the character in the example). In other words, there is a discrepancy between the way the encoder and decoder work.
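The decode direction can likewise be checked with Python's gb18030 codec (a stand-in for the browser decoder):

```python
# Decoding is spec-conformant in both directions here: the four-byte
# sequence maps back to U+FA5C, and the two-byte sequence to U+81ED.
assert b"\x84\x30\xa1\x38".decode("gb18030") == "\uFA5C"
assert b"\xb3\xf4".decode("gb18030") == "\u81ED"
```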

Does that help make things clearer?
Comment 20 r12a 2016-10-28 04:35:13 PDT
BTW, it may be worth pointing directly at the tests themselves. See https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#zhhans and click on the link in the left column to run the test.
Comment 21 r12a 2016-10-28 04:42:08 PDT
> Note btw that NFC transformations would never change the character in that example, since the character used is a compatibility equivalent for the Unicode character it is converted to. Such characters are not affected by NFC.

Actually that's incorrect. This is an NFC-sensitive conversion.
Comment 22 Alexey Proskuryakov 2016-10-28 13:48:28 PDT
What we need to achieve is (roughly speaking) that data sent to servers by Safari is identical to data sent by Edge, for identical user actions.

Currently, we achieve that by normalizing strings when encoding them.

It may be possible to achieve that in other ways (that would make the Encoding API behave consistently). But simply removing normalization would break interoperability in the more important use case (typing text into a form and submitting it).
Comment 23 Anne van Kesteren 2016-11-18 10:17:45 PST
Rather than normalizing at the encoding layer, it might make more sense to normalize at the user-input layer. That way API usage is not affected. It seems that would actually be more cross-platform than what you do currently.
Comment 24 Alexey Proskuryakov 2016-11-18 10:34:30 PST
This may be worth trying. There are substantial risks though - if we normalize text coming from input methods, text offsets will change, and input methods will get confused in various ways.

Changing the spec and other browser engines to send predictably normalized data over the wire seems like a safer and more complete solution to me.
Comment 25 Anne van Kesteren 2016-11-18 10:41:50 PST
I suppose you could file an issue at https://github.com/whatwg/html/issues/new to get everyone to consider changing form submission, but thus far we've avoided a hard dependency on NFC in the platform (other than String.prototype.normalize()). I'm personally not opposed per se, but I think chances are slim it'll succeed.
Comment 26 Maciej Stachowiak 2017-03-15 00:56:33 PDT
(In reply to comment #24)
> This may be worth trying. There are substantial risks though - if we
> normalize text coming from input methods, text offsets will change, and
> input methods will get confused in various ways.

We could normalize only when input method input is accepted (including the possible premature accept when changing focus or submitting while an input method marked region is still active). That would probably avoid confusing input methods.

> Changing the spec and other browser engines to send predictably normalized
> data over the wire seems like a safer and more complete solution to me.

I'm guessing this is a hard sell since for Windows browsers this is more likely to create than resolve compat issues, since it would only make a difference in the programmatic entry case.