215970 – Remove NFC normalization when submitting forms and encoding URL queries and fix EUC-JP encoding

RESOLVED FIXED 215970

Remove NFC normalization when submitting forms and encoding URL queries and fix EUC-JP encoding

https://bugs.webkit.org/show_bug.cgi?id=215970

Summary Remove NFC normalization when submitting forms and encoding URL queries and f...

Alex Christensen

Reported 2020-08-28 21:32:01 PDT

Remove NFC normalization when submitting forms and encoding URL queries and fix EUC-JP encoding

Attachments
Patch (2.34 MB, patch) 2020-08-28 21:54 PDT, Alex Christensen	no flags	Details Formatted Diff Diff
Patch (2.35 MB, patch) 2020-08-28 22:50 PDT, Alex Christensen	darin: review+	Details Formatted Diff Diff
patch (3.15 MB, patch) 2020-08-29 17:46 PDT, Alex Christensen	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Alex Christensen

Comment 1 2020-08-28 21:54:59 PDT

Created attachment 407524 [details] Patch

Alex Christensen

Comment 2 2020-08-28 22:50:34 PDT

Created attachment 407530 [details] Patch

Darin Adler

Comment 3 2020-08-29 12:15:44 PDT

Comment on attachment 407530 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=407530&action=review Looks like there are more tests to rebase. > Source/WebCore/platform/text/ICUExtras.h:32 > +extern const UChar jis0208[15448]; It doesn’t make sense to me to call this "ICU" extras. This is our own encoder/decoder for a particular encoding, and not part of ICU. I suggest a different name for this file. > Source/WebCore/platform/text/TextCodecICU.cpp:436 > + auto pointerRange = std::equal_range(reinterpret_cast<const std::pair<UChar, UChar>*>(jis0208), reinterpret_cast<const std::pair<UChar, UChar>*>(jis0208 + WTF_ARRAY_LENGTH(jis0208)), std::pair<UChar, UChar>(c, 0), [](const auto& a, const auto& b) { Why are we casting an array of UChar into an array of pairs? Why not actually compile an array of pairs? That should work just as well, just with more braces in the source file, and without the reinterpret_cast. > Source/WebCore/platform/text/TextCodecICU.cpp:454 > + CString decimal = makeString(static_cast<unsigned>(c)).utf8(); > + for (size_t i = 0; i < decimal.length(); i++) > + result.uncheckedAppend(decimal.data()[i]); We have functions that will do this more efficiently, without allocating memory: uint8_t buffer[5]; writeIntegerToBuffer(character, buffer); result.append(buffer, lengthOfIntegerAsString(character)); Or some variation on this. > Source/WebCore/platform/text/TextCodecICU.cpp:504 > + if (!strcmp(m_canonicalConverterName, "euc-jp-2007")) > + return eucJPEncode(string, unencodableHandler(handling)); The way our codec classes are designed, you would normally create and register a separate class derived TextCodec rather than adding a non-ICU implementation to the ICU one. Did you consider that? Why did you reject it? > Source/WebCore/platform/text/TextEncoding.cpp:72 > +Vector<uint8_t> TextEncoding::encode(StringView string, UnencodableHandling handling, NFCNormalize nfcNormalize) const I suggest naming the local variable just "normalize". > Source/WebCore/platform/text/TextEncoding.cpp:82 > + auto normalizedString = normalizedNFC(string); > + return newTextCodec(*this)->encode(normalizedString.view, handling); I suggest making this just a one-liner.

Alex Christensen

Comment 4 2020-08-29 15:49:29 PDT

(In reply to Darin Adler from comment #3) > > Source/WebCore/platform/text/ICUExtras.h:32 > > +extern const UChar jis0208[15448]; > > It doesn’t make sense to me to call this "ICU" extras. This is our own > encoder/decoder for a particular encoding, and not part of ICU. I suggest a > different name for this file. I called it ICU extras because I think these should be in ICU, and the ICU that chromium uses has this in it. I could call it "EncodingTables.{h, cpp}" > > Source/WebCore/platform/text/TextCodecICU.cpp:436 > > + auto pointerRange = std::equal_range(reinterpret_cast<const std::pair<UChar, UChar>*>(jis0208), reinterpret_cast<const std::pair<UChar, UChar>*>(jis0208 + WTF_ARRAY_LENGTH(jis0208)), std::pair<UChar, UChar>(c, 0), [](const auto& a, const auto& b) { > > Why are we casting an array of UChar into an array of pairs? Why not > actually compile an array of pairs? That should work just as well, just with > more braces in the source file, and without the reinterpret_cast. I was thinking that a const array of UChar would definitely pack as tight as possible in the binary to only increase by 30k. I wasn't so sure about a const array of std::pair<UChar, UChar>, but now that I think about it it should pack it just as tight, especially in a release build.

Darin Adler

Comment 5 2020-08-29 16:27:44 PDT

(In reply to Alex Christensen from comment #4) > I could call it "EncodingTables.{h, cpp}" I think that’s better. > now that I think about it it > should pack it just as tight, especially in a release build. Think of it this way: if it wasn’t going to pack as tight, then the reinterpret_cast wouldn’t work.

Alex Christensen

Comment 6 2020-08-29 17:26:27 PDT

(In reply to Darin Adler from comment #3) > Looks like there are more tests to rebase. Ah, yes. Changing the NFC normalization changed the encoding of the code points that would have been normalized, which added more PASSes outside of euc-jp. I'll rebase and make them all PASS soon.

Alex Christensen

Comment 7 2020-08-29 17:46:52 PDT

Created attachment 407558 [details] patch

Alex Christensen

Comment 8 2020-08-29 18:49:21 PDT

Comment on attachment 407558 [details] patch View in context: https://bugs.webkit.org/attachment.cgi?id=407558&action=review > Source/WebCore/platform/text/TextCodecICU.cpp:496 > + } > +} I added ASSERT_NOT_REACHED(); return entityUnencodableHandler; here to make MSVC happy. http://trac.webkit.org/r266330

Radar WebKit Bug Importer

Comment 9 2020-08-29 18:50:18 PDT

<rdar://problem/68011503>

Anne van Kesteren

Comment 10 2022-09-27 08:01:50 PDT

*** Bug 159891 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component New Bugs

Assignee

Alex Christensen

Reported

2020-08-28 21:32 PDT

Modified

2022-09-27 08:01 PDT History

CC List

4 users Show

URL

Keywords InRadar

Duplicates (1)

159891 View as bug list

Depends on

Blocks