The W3C I18N WG, in cooperation with the WHATWG, has created tests for the Encoding specification. In testing the GB18030 encoding, our tests found 948 total errors. In addition, 3 of the 5 decoder tests for this encoding failed.

Please see: https://github.com/whatwg/encoding/issues/57

Please comment on the above GitHub issue in addition to addressing this bug here. If there are errors in our tests or in the specification, we would very much like to know!

[filed for W3C I18N WG]
I started looking at this and I found that in TextEncoding::encode() we will normalize the strings to NFC before encoding them. Commenting out this normalization causes many more of the tests to pass.
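For anyone skimming, here is a rough illustration of what that normalization step amounts to. This is only a conceptual sketch in script terms, not WebKit's actual TextEncoding::encode() code, and the function name is made up:

// Conceptual sketch only: the step under discussion is equivalent to folding
// the string to NFC before handing it to the per-encoding converter.
function normalizeBeforeEncode(input: string): string {
  return input.normalize("NFC");
}

// Decomposed "é" (U+0065 U+0301) becomes precomposed U+00E9 before encoding.
console.log(normalizeBeforeEncode("e\u0301") === "\u00E9"); // true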
(In reply to comment #1)
> I started looking at this and I found that in TextEncoding::encode() we will
> normalize the strings to NFC before encoding them. Commenting out this
> normalization causes many more of the tests to pass.

This was done in Blink in https://chromiumcodereview.appspot.com/19845004. We should consider porting this to WebKit.
(In reply to comment #2)
> (In reply to comment #1)
> > I started looking at this and I found that in TextEncoding::encode() we will
> > normalize the strings to NFC before encoding them. Commenting out this
> > normalization causes many more of the tests to pass.
>
> This was done in Blink in https://chromiumcodereview.appspot.com/19845004.
> We should consider porting this to WebKit.

The NFC normalization was eventually removed from Blink in https://codereview.chromium.org/1424303002
This normalization is intentional, and is done to avoid hitting issues on sites that have not been extensively tested on Macs. It may or may not be reasonable to stop doing this, but passing tests is not a good reason to change the behavior here.
Alexey, the issue has to do with interoperability and consistency across browsers, per the Encoding specification. https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gb18030 shows that WebKit exhibits different behaviour here than Gecko, Blink and Edge.
Interoperability means that users get identical results when they perform identical steps. Currently, when a user types a character with an accent on Windows, and the browser sends it to the server, the data on the wire is different than the data sent when they do the same on Mac. That's a bug in Blink.
...and in Firefox, obviously. Edge doesn't have the bug because it doesn't work on platforms where the system uses decomposed Unicode.
For me, interoperability means that users get identical results when they perform identical steps regardless of the platform or browser they are using.

This particular issue is about what happens when a Unicode code point is converted to a GB18030 sequence of bytes. For example, U+FA5C 臭 is converted by Firefox, Chrome and Edge to the GB18030 byte sequence %84%30%A1%38. Safari, however, converts that Unicode character to the byte sequence %B3%F4 (which is the GB18030 character that corresponds to U+81ED 臭).

In other words, suppose I'm writing the following: "There are two similar Han ideographs which represent the sound xiù, 臭 and 臭. In Unicode these are compatibility equivalents."

If that document gets converted to the GB18030 encoding by any of the other major browser engines it will look the same, since GB18030 has code points for both forms. If, however, the conversion happens in WebKit, the sentence will be changed from the original and become rather confusing for the reader, since both ideographs shown are now the same. The text no longer says what the writer intended: "There are two similar Han ideographs which represent the sound xiù, 臭 and 臭. In Unicode these are compatibility equivalents."
I forgot to change the second example, which should have said: "There are two similar Han ideographs which represent the sound xiù, 臭 and 臭. In Unicode these are compatibility equivalents."
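To make the byte sequences in the example above easier to check, here is a small decoding snippet. It assumes a runtime whose TextDecoder supports the "gb18030" label from the Encoding Standard (current browsers, or Node built with full ICU); it only exercises the decoder direction, since TextEncoder is UTF-8-only:

// Decode the two byte sequences from the example with the spec's gb18030 decoder.
const specBytes = new Uint8Array([0x84, 0x30, 0xA1, 0x38]); // what the spec expects for U+FA5C
const webkitBytes = new Uint8Array([0xB3, 0xF4]);           // what WebKit currently produces

const decoder = new TextDecoder("gb18030");
console.log(decoder.decode(specBytes));   // "臭" (U+FA5C)
console.log(decoder.decode(webkitBytes)); // "臭" (U+81ED)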
Simply commenting out the normalization causes these tests to fail:

fast/forms/form-data-encoding-2.html [ Failure ]
fast/forms/form-data-encoding.html [ Failure ]
http/tests/security/contentSecurityPolicy/1.1/scripthash-tests.html [ Failure ]
http/tests/security/contentSecurityPolicy/1.1/scripthash-unicode-normalization.html [ Failure ]
inspector/dom/csp-hash.html [ Failure ]
> For me, interoperability means that users get identical results when they perform identical steps regardless of the platform or browser they are using.

I think that this is exactly what I said, and that's what WebKit behavior achieves. On the other hand, neither Chrome nor Firefox is interoperable when you compare how they behave on different platforms.
Here is a live example of what happens when input isn't normalized to NFC: <http://bash.im/quote/441781>. I don't know what exactly happened there, but it seems very, very likely that Chrome or Firefox on Mac were at the start of the chain that ended up with "найти" (Russian for "to find").
It sounds like we are discussing two different things here.

Alexey says:
> when a user types a character with an accent

Richard says:
> Unicode code point is converted

As I understand it, the chain of events is:

1. User presses some keys on their keyboard
2. IME stuff happens
3. A sequence of Unicode code points exists somewhere in memory which has some relation to the keys the user pressed
4. Something somewhere triggers form submission
5. This sequence of code points gets converted to a sequence of bytes for the wire
6. Sockets are written to

The form encoding tests start this process at step 3. It's also relevant that the code points in memory at step 3 are visible to JavaScript, and are therefore important for interoperability.

It sounds to me that, because step 3 is visible to JavaScript, and step 6 is visible to JavaScript (by way of the GET URL), a conceptual function from one to the other should be interoperable between all browsers on all platforms.

When the user types a character with an accent on Windows, perhaps the problem lies with the processing which converts the keystroke into a sequence of Unicode code points. Conceptually, this would seem to occur during step 2.

Richard, Alexey: what are your thoughts?
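For concreteness, here is a minimal sketch of the step 3 -> step 6 path described above, in the form the W3C tests use: the code points are set from script, and the encoder's output becomes visible in the submitted GET URL. The action page and field name are made up for illustration, and this assumes a browser page context:

// Minimal sketch: step 3 (code points in memory, visible to script) feeding
// step 6 (bytes on the wire, visible in the GET URL after submission).
const form = document.createElement("form");
form.method = "GET";
form.acceptCharset = "gb18030";      // ask the encoder for GB18030 output
form.action = "results.html";        // hypothetical results page

const field = document.createElement("input");
field.name = "q";
field.value = "\uFA5C";              // step 3: the code point under test
form.appendChild(field);

document.body.appendChild(form);
form.submit();                       // step 6: URL shows e.g. ?q=%84%30%A1%38 (or %B3%F4 after NFC)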
Sorry, I could have stated this better. When I said:

> When the user types a character with an accent on Windows, perhaps the problem lies

What I meant was:

When we try to match the behavior of a user typing a character with an accent on Windows, perhaps our problem lies
Two comments:

1) WebKit may be more sensitive than Firefox/Blink/... to differences between Windows and Mac because it's mostly used for Safari on Mac.

2) If normalization (NFC) is necessary (or at least desirable) for GB 18030, then it should also be necessary (or at least desirable) for UTF-8. Is normalization actually used for UTF-8? If it is not, why not? If it is, that might create potentially much bigger interoperability problems.
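On the second point, a quick illustration of why the UTF-8 question matters: NFC changes the bytes that go over the wire for UTF-8 as well. This is just an illustrative check using the standard TextEncoder (which is UTF-8-only), not part of the W3C tests:

// Decomposed "é" (U+0065 U+0301) and precomposed "é" (U+00E9) produce
// different UTF-8 byte sequences, so normalizing before encoding changes
// what servers receive even for UTF-8 forms.
const encoder = new TextEncoder();
console.log(encoder.encode("e\u0301"));                  // Uint8Array [0x65, 0xCC, 0x81]
console.log(encoder.encode("e\u0301".normalize("NFC"))); // Uint8Array [0xC3, 0xA9]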
Myles: typing is not the only entry point for decomposed text on Mac - other examples include file names and pasteboard.

Martin: Yes, we normalize regardless of target encoding. So this is not really the right bug to discuss it, but I couldn't quickly find one where we had this discussion in the past (it might be marked as WONTFIX or INVALID, or maybe that discussion was also in a tangentially related bug).
(In reply to comment #17)
> Myles: typing is not the only entry point for decomposed text on Mac - other
> examples include file names and pasteboard.

If I'm understanding you correctly, it sounds like these should be updated too.
Myles, Alexey, Martin, this isn't about input at all. It's about the browser's encoder and decoder algorithms when it needs to convert between different character encodings.

There happen to be two handy ways to expose the behaviour of the encoder (in this case, going from Unicode to GB18030) so that it can be tested: by writing characters to form output or to an href value which expect the encoding to be GB18030. That's what these tests do (programmatically). The example I gave above uses an actual character from the tests that doesn't go through the Safari encoder as expected (i.e. without change) per the Encoding spec.

Note btw that NFC transformations would never change the character in that example, since the character used is a compatibility equivalent for the Unicode character it is converted to. Such characters are not affected by NFC.

So in summary, the test is only checking the behaviour of the browser's encoder/decoder when converting between one character encoding and another, and in the case shown, where equivalents exist in both Unicode and GB 18030, the i18n WG and the WHATWG believe that normalization is not relevant.

Note, btw, that when *decoding* text, i.e. from GB 18030 to Unicode, Safari performs all the conversions as expected by the Encoding spec (including the character in the example). In other words, there is a discrepancy between the way the encoder and decoder work.

Does that help make things clearer?
Btw, it may be worth pointing directly at the tests themselves. See https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#zhhans and click on the link in the left column to run the test.
> Note btw that NFC transformations would never change the character in that example, since the character used is a compatibility equivalent for the Unicode character it is converted to. Such characters are not affected by NFC.

Actually that's incorrect. This is an NFC-sensitive conversion.
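A quick way to confirm this in any runtime with String.prototype.normalize (an illustrative check only, not part of the tests):

// U+FA5C is a CJK compatibility ideograph with a singleton *canonical*
// decomposition to U+81ED, so NFC itself already maps it to U+81ED.
console.log("\uFA5C".normalize("NFC") === "\u81ED");                 // true
console.log("\uFA5C".normalize("NFC").codePointAt(0)!.toString(16)); // "81ed"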
What we need to achieve is (roughly speaking) that data sent to servers by Safari is identical to data sent by Edge, for identical user actions. Currently, we achieve that by normalizing strings when encoding them.

It may be possible to achieve that in other ways (that would make the Encoding API behave consistently). But simply removing normalization would break interoperability in the more important use case (typing text into a form and submitting it).
Rather than normalizing at the encoding layer, it might make more sense to normalize at the user-input layer. That way API usage is not affected. It seems that would actually be more cross-platform than what you do currently.
This may be worth trying. There are substantial risks though - if we normalize text coming from input methods, text offsets will change, and input methods will get confused in various ways. Changing the spec and other browser engines to send predictably normalized data over the wire seems like a safer and more complete solution to me.
I suppose you could file an issue at https://github.com/whatwg/html/issues/new to get everyone to consider changing form submission, but thus far we've avoided a hard dependency on NFC in the platform (other than String.prototype.normalize()). I'm personally not opposed per se, but I think chances are slim it'll succeed.
(In reply to comment #24)
> This may be worth trying. There are substantial risks though - if we
> normalize text coming from input methods, text offsets will change, and
> input methods will get confused in various ways.

We could normalize only when input method input is accepted (including the possible premature accept when changing focus or submitting while an input method marked region is still active). That would probably avoid confusing input methods.

> Changing the spec and other browser engines to send predictably normalized
> data over the wire seems like a safer and more complete solution to me.

I'm guessing this is a hard sell, since for Windows browsers this is more likely to create than resolve compat issues, since it would only make a difference in the programmatic entry case.
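To illustrate the general idea being discussed (at the page level rather than inside the engine, so purely a sketch of where the normalization would conceptually move, not a proposed WebKit change): fold the committed text to NFC when composition ends, instead of at encode time.

// Page-level sketch only: normalize once an input-method composition has been
// accepted, so offsets stay stable while a marked region is still active.
const textField = document.querySelector<HTMLInputElement>("input")!;
textField.addEventListener("compositionend", () => {
  textField.value = textField.value.normalize("NFC");
});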
> data sent to servers by Safari is identical to data sent by Edge

In light of recent events, this should probably no longer be a goal.
<rdar://problem/57245905>
> > data sent to servers by Safari is identical to data sent by Edge
> In light of recent events, this should probably no longer be a goal.

Sure. Replace it with "Chromium on Windows", because that's the same behavior.
This was fixed in bug 215970. I'm not entirely sure whether any callers that pass NFCNormalize::Yes are left, but that is still supported in theory.

*** This bug has been marked as a duplicate of bug 215970 ***