RESOLVED FIXED Bug 215970
Remove NFC normalization when submitting forms and encoding URL queries and fix EUC-JP encoding
https://bugs.webkit.org/show_bug.cgi?id=215970
Summary Remove NFC normalization when submitting forms and encoding URL queries and f...
Alex Christensen
Reported 2020-08-28 21:32:01 PDT
Remove NFC normalization when submitting forms and encoding URL queries and fix EUC-JP encoding
Attachments
Patch (2.34 MB, patch)
2020-08-28 21:54 PDT, Alex Christensen
no flags
Patch (2.35 MB, patch)
2020-08-28 22:50 PDT, Alex Christensen
darin: review+
patch (3.15 MB, patch)
2020-08-29 17:46 PDT, Alex Christensen
no flags
Alex Christensen
Comment 1 2020-08-28 21:54:59 PDT
Alex Christensen
Comment 2 2020-08-28 22:50:34 PDT
Darin Adler
Comment 3 2020-08-29 12:15:44 PDT
Comment on attachment 407530 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=407530&action=review Looks like there are more tests to rebase. > Source/WebCore/platform/text/ICUExtras.h:32 > +extern const UChar jis0208[15448]; It doesn’t make sense to me to call this "ICU" extras. This is our own encoder/decoder for a particular encoding, and not part of ICU. I suggest a different name for this file. > Source/WebCore/platform/text/TextCodecICU.cpp:436 > + auto pointerRange = std::equal_range(reinterpret_cast<const std::pair<UChar, UChar>*>(jis0208), reinterpret_cast<const std::pair<UChar, UChar>*>(jis0208 + WTF_ARRAY_LENGTH(jis0208)), std::pair<UChar, UChar>(c, 0), [](const auto& a, const auto& b) { Why are we casting an array of UChar into an array of pairs? Why not actually compile an array of pairs? That should work just as well, just with more braces in the source file, and without the reinterpret_cast. > Source/WebCore/platform/text/TextCodecICU.cpp:454 > + CString decimal = makeString(static_cast<unsigned>(c)).utf8(); > + for (size_t i = 0; i < decimal.length(); i++) > + result.uncheckedAppend(decimal.data()[i]); We have functions that will do this more efficiently, without allocating memory: uint8_t buffer[5]; writeIntegerToBuffer(character, buffer); result.append(buffer, lengthOfIntegerAsString(character)); Or some variation on this. > Source/WebCore/platform/text/TextCodecICU.cpp:504 > + if (!strcmp(m_canonicalConverterName, "euc-jp-2007")) > + return eucJPEncode(string, unencodableHandler(handling)); The way our codec classes are designed, you would normally create and register a separate class derived TextCodec rather than adding a non-ICU implementation to the ICU one. Did you consider that? Why did you reject it? > Source/WebCore/platform/text/TextEncoding.cpp:72 > +Vector<uint8_t> TextEncoding::encode(StringView string, UnencodableHandling handling, NFCNormalize nfcNormalize) const I suggest naming the local variable just "normalize". > Source/WebCore/platform/text/TextEncoding.cpp:82 > + auto normalizedString = normalizedNFC(string); > + return newTextCodec(*this)->encode(normalizedString.view, handling); I suggest making this just a one-liner.
Alex Christensen
Comment 4 2020-08-29 15:49:29 PDT
(In reply to Darin Adler from comment #3) > > Source/WebCore/platform/text/ICUExtras.h:32 > > +extern const UChar jis0208[15448]; > > It doesn’t make sense to me to call this "ICU" extras. This is our own > encoder/decoder for a particular encoding, and not part of ICU. I suggest a > different name for this file. I called it ICU extras because I think these should be in ICU, and the ICU that chromium uses has this in it. I could call it "EncodingTables.{h, cpp}" > > Source/WebCore/platform/text/TextCodecICU.cpp:436 > > + auto pointerRange = std::equal_range(reinterpret_cast<const std::pair<UChar, UChar>*>(jis0208), reinterpret_cast<const std::pair<UChar, UChar>*>(jis0208 + WTF_ARRAY_LENGTH(jis0208)), std::pair<UChar, UChar>(c, 0), [](const auto& a, const auto& b) { > > Why are we casting an array of UChar into an array of pairs? Why not > actually compile an array of pairs? That should work just as well, just with > more braces in the source file, and without the reinterpret_cast. I was thinking that a const array of UChar would definitely pack as tight as possible in the binary to only increase by 30k. I wasn't so sure about a const array of std::pair<UChar, UChar>, but now that I think about it it should pack it just as tight, especially in a release build.
Darin Adler
Comment 5 2020-08-29 16:27:44 PDT
(In reply to Alex Christensen from comment #4) > I could call it "EncodingTables.{h, cpp}" I think that’s better. > now that I think about it it > should pack it just as tight, especially in a release build. Think of it this way: if it wasn’t going to pack as tight, then the reinterpret_cast wouldn’t work.
Alex Christensen
Comment 6 2020-08-29 17:26:27 PDT
(In reply to Darin Adler from comment #3) > Looks like there are more tests to rebase. Ah, yes. Changing the NFC normalization changed the encoding of the code points that would have been normalized, which added more PASSes outside of euc-jp. I'll rebase and make them all PASS soon.
Alex Christensen
Comment 7 2020-08-29 17:46:52 PDT
Alex Christensen
Comment 8 2020-08-29 18:49:21 PDT
Comment on attachment 407558 [details] patch View in context: https://bugs.webkit.org/attachment.cgi?id=407558&action=review > Source/WebCore/platform/text/TextCodecICU.cpp:496 > + } > +} I added ASSERT_NOT_REACHED(); return entityUnencodableHandler; here to make MSVC happy. http://trac.webkit.org/r266330
Radar WebKit Bug Importer
Comment 9 2020-08-29 18:50:18 PDT
Anne van Kesteren
Comment 10 2022-09-27 08:01:50 PDT
*** Bug 159891 has been marked as a duplicate of this bug. ***
Note You need to log in before you can comment on or make changes to this bug.