Bug 215970 - Remove NFC normalization when submitting forms and encoding URL queries and fix EUC-JP encoding
Summary: Remove NFC normalization when submitting forms and encoding URL queries and f...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Alex Christensen
URL:
Keywords: InRadar
: 159891 (view as bug list)
Depends on:
Blocks:
 
Reported: 2020-08-28 21:32 PDT by Alex Christensen
Modified: 2022-09-27 08:01 PDT (History)
4 users (show)

See Also:


Attachments
Patch (2.34 MB, patch)
2020-08-28 21:54 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (2.35 MB, patch)
2020-08-28 22:50 PDT, Alex Christensen
darin: review+
Details | Formatted Diff | Diff
patch (3.15 MB, patch)
2020-08-29 17:46 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Christensen 2020-08-28 21:32:01 PDT
Remove NFC normalization when submitting forms and encoding URL queries and fix EUC-JP encoding
Comment 1 Alex Christensen 2020-08-28 21:54:59 PDT
Created attachment 407524 [details]
Patch
Comment 2 Alex Christensen 2020-08-28 22:50:34 PDT
Created attachment 407530 [details]
Patch
Comment 3 Darin Adler 2020-08-29 12:15:44 PDT
Comment on attachment 407530 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=407530&action=review

Looks like there are more tests to rebase.

> Source/WebCore/platform/text/ICUExtras.h:32
> +extern const UChar jis0208[15448];

It doesn’t make sense to me to call this "ICU" extras. This is our own encoder/decoder for a particular encoding, and not part of ICU. I suggest a different name for this file.

> Source/WebCore/platform/text/TextCodecICU.cpp:436
> +        auto pointerRange = std::equal_range(reinterpret_cast<const std::pair<UChar, UChar>*>(jis0208), reinterpret_cast<const std::pair<UChar, UChar>*>(jis0208 + WTF_ARRAY_LENGTH(jis0208)), std::pair<UChar, UChar>(c, 0), [](const auto& a, const auto& b) {

Why are we casting an array of UChar into an array of pairs? Why not actually compile an array of pairs? That should work just as well, just with more braces in the source file, and without the reinterpret_cast.

> Source/WebCore/platform/text/TextCodecICU.cpp:454
> +    CString decimal = makeString(static_cast<unsigned>(c)).utf8();
> +    for (size_t i = 0; i < decimal.length(); i++)
> +        result.uncheckedAppend(decimal.data()[i]);

We have functions that will do this more efficiently, without allocating memory:

    uint8_t buffer[5];
    writeIntegerToBuffer(character, buffer);
    result.append(buffer, lengthOfIntegerAsString(character));

Or some variation on this.

> Source/WebCore/platform/text/TextCodecICU.cpp:504
> +    if (!strcmp(m_canonicalConverterName, "euc-jp-2007"))
> +        return eucJPEncode(string, unencodableHandler(handling));

The way our codec classes are designed, you would normally create and register a separate class derived TextCodec rather than adding a non-ICU implementation to the ICU one. Did you consider that? Why did you reject it?

> Source/WebCore/platform/text/TextEncoding.cpp:72
> +Vector<uint8_t> TextEncoding::encode(StringView string, UnencodableHandling handling, NFCNormalize nfcNormalize) const

I suggest naming the local variable just "normalize".

> Source/WebCore/platform/text/TextEncoding.cpp:82
> +        auto normalizedString = normalizedNFC(string);
> +        return newTextCodec(*this)->encode(normalizedString.view, handling);

I suggest making this just a one-liner.
Comment 4 Alex Christensen 2020-08-29 15:49:29 PDT
(In reply to Darin Adler from comment #3)
> > Source/WebCore/platform/text/ICUExtras.h:32
> > +extern const UChar jis0208[15448];
> 
> It doesn’t make sense to me to call this "ICU" extras. This is our own
> encoder/decoder for a particular encoding, and not part of ICU. I suggest a
> different name for this file.
I called it ICU extras because I think these should be in ICU, and the ICU that chromium uses has this in it.  I could call it "EncodingTables.{h, cpp}"

> > Source/WebCore/platform/text/TextCodecICU.cpp:436
> > +        auto pointerRange = std::equal_range(reinterpret_cast<const std::pair<UChar, UChar>*>(jis0208), reinterpret_cast<const std::pair<UChar, UChar>*>(jis0208 + WTF_ARRAY_LENGTH(jis0208)), std::pair<UChar, UChar>(c, 0), [](const auto& a, const auto& b) {
> 
> Why are we casting an array of UChar into an array of pairs? Why not
> actually compile an array of pairs? That should work just as well, just with
> more braces in the source file, and without the reinterpret_cast.
I was thinking that a const array of UChar would definitely pack as tight as possible in the binary to only increase by 30k.  I wasn't so sure about a const array of std::pair<UChar, UChar>, but now that I think about it it should pack it just as tight, especially in a release build.
Comment 5 Darin Adler 2020-08-29 16:27:44 PDT
(In reply to Alex Christensen from comment #4)
> I could call it "EncodingTables.{h, cpp}"

I think that’s better.

> now that I think about it it
> should pack it just as tight, especially in a release build.

Think of it this way: if it wasn’t going to pack as tight, then the reinterpret_cast wouldn’t work.
Comment 6 Alex Christensen 2020-08-29 17:26:27 PDT
(In reply to Darin Adler from comment #3)
> Looks like there are more tests to rebase.
Ah, yes.  Changing the NFC normalization changed the encoding of the code points that would have been normalized, which added more PASSes outside of euc-jp.  I'll rebase and make them all PASS soon.
Comment 7 Alex Christensen 2020-08-29 17:46:52 PDT
Created attachment 407558 [details]
patch
Comment 8 Alex Christensen 2020-08-29 18:49:21 PDT
Comment on attachment 407558 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=407558&action=review

> Source/WebCore/platform/text/TextCodecICU.cpp:496
> +    }
> +}

I added ASSERT_NOT_REACHED(); return entityUnencodableHandler; here to make MSVC happy.

http://trac.webkit.org/r266330
Comment 9 Radar WebKit Bug Importer 2020-08-29 18:50:18 PDT
<rdar://problem/68011503>
Comment 10 Anne van Kesteren 2022-09-27 08:01:50 PDT
*** Bug 159891 has been marked as a duplicate of this bug. ***