Bug 216081 - Align EUC-KR encoding with Chrome, Firefox, and specification
Summary: Align EUC-KR encoding with Chrome, Firefox, and specification
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Alex Christensen
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-09-02 10:30 PDT by Alex Christensen
Modified: 2020-09-03 09:07 PDT (History)
4 users (show)

See Also:


Attachments
Patch (429.79 KB, patch)
2020-09-02 10:32 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (431.56 KB, patch)
2020-09-02 14:10 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (431.11 KB, patch)
2020-09-02 14:26 PDT, Alex Christensen
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Christensen 2020-09-02 10:30:35 PDT
Align EUC-KR encoding with Chrome, Firefox, and specification
Comment 1 Alex Christensen 2020-09-02 10:32:23 PDT
Created attachment 407776 [details]
Patch
Comment 2 Alex Christensen 2020-09-02 14:10:44 PDT
Created attachment 407807 [details]
Patch
Comment 3 Alex Christensen 2020-09-02 14:26:34 PDT
Created attachment 407815 [details]
Patch
Comment 4 Alexey Proskuryakov 2020-09-02 18:28:45 PDT
Comment on attachment 407815 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=407815&action=review

> Source/WebCore/ChangeLog:3
> +        Align EUC-KR encoding with Chrome, Firefox, and specification

Will we be using ICU codecs at all once you are done? :-O
Comment 5 Alex Christensen 2020-09-02 18:59:10 PDT
Yes.  Most of them are compliant with modern standards, and https://bugs.webkit.org/show_bug.cgi?id=216094 shows evidence that they are making a nonzero amount of changes to comply with modern standards (windows-1255 changed to be "correct" on Big Sur and iOS 14), but I'm not going to just wait for them to move at the speed of browsers.
Comment 6 Darin Adler 2020-09-02 21:38:29 PDT
I think we might want to create an automated way to try ICU in the future so we it’s easy enough to know when to turn these back off and rely on ICU again. Worth thinking about making this very easy to notice and fix.
Comment 7 Darin Adler 2020-09-02 21:40:57 PDT
Comment on attachment 407815 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=407815&action=review

> Source/WebCore/platform/text/TextCodecCJK.cpp:340
> +    static auto* table = [] {
> +        auto table = new EUCKREncodingIndex();
> +        for (size_t i = 0; i < WTF_ARRAY_LENGTH(eucKRDecodingIndex); i++)
> +            (*table)[i] = { eucKRDecodingIndex[i].second, eucKRDecodingIndex[i].first };
> +        std::sort(table->begin(), table->end(), [] (auto& a, auto& b) {
> +            return a.first < b.first;
> +        });
> +        return table;
> +    }();
> +    return *table;

Since we allocate these tables on the heap, can we figure out a way to make these smaller, some kind of simple compression scheme? Not about this particular one, but all the giant tables.

> Source/WebCore/platform/text/TextCodecCJK.cpp:372
> +        result.append(pointer / 190 + 0x81);
> +        result.append(pointer % 190 + 0x41);

Where do these magic numbers come from?

> Source/WebCore/platform/text/TextCodecCJK.cpp:385
> +                uint16_t pointer = (lead - 0x81) * 190 + byte - 0x41;

Where do these magic numbers come from?
Comment 8 Alex Christensen 2020-09-02 22:34:22 PDT
Comment on attachment 407815 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=407815&action=review

>> Source/WebCore/platform/text/TextCodecCJK.cpp:340
>> +    return *table;
> 
> Since we allocate these tables on the heap, can we figure out a way to make these smaller, some kind of simple compression scheme? Not about this particular one, but all the giant tables.

If we're using them then we want them in memory somewhere in this form.  I think we should compress them then decompress them if needed to reduce the size of the WebCore dylib

>> Source/WebCore/platform/text/TextCodecCJK.cpp:372
>> +        result.append(pointer % 190 + 0x41);
> 
> Where do these magic numbers come from?

https://encoding.spec.whatwg.org/#euc-kr-encoder

>> Source/WebCore/platform/text/TextCodecCJK.cpp:385
>> +                uint16_t pointer = (lead - 0x81) * 190 + byte - 0x41;
> 
> Where do these magic numbers come from?

https://encoding.spec.whatwg.org/#euc-kr-decoder
Comment 9 Darin Adler 2020-09-03 00:10:50 PDT
Comment on attachment 407815 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=407815&action=review

>>> Source/WebCore/platform/text/TextCodecCJK.cpp:340
>>> +    return *table;
>> 
>> Since we allocate these tables on the heap, can we figure out a way to make these smaller, some kind of simple compression scheme? Not about this particular one, but all the giant tables.
> 
> If we're using them then we want them in memory somewhere in this form.  I think we should compress them then decompress them if needed to reduce the size of the WebCore dylib

Yes, that’s what was talking about. The tables in the binary, the ones we use to as input to a function to allocate the tables we actually use. I was suggesting that we compress those.
Comment 10 Alex Christensen 2020-09-03 09:05:13 PDT
fast/block/margin-collapse/103.html did not change with this patch, but some of the iOS failures (also not from this patch) are things I need to rebase.
Comment 11 Alex Christensen 2020-09-03 09:06:35 PDT
http://trac.webkit.org/r266520
Comment 12 Radar WebKit Bug Importer 2020-09-03 09:07:18 PDT
<rdar://problem/68282300>