Bug 216168 - Align EUC-JP, ISO-2022-JP, and Shift_JIS decoding with Chrome, Firefox, and the specification
Summary: Align EUC-JP, ISO-2022-JP, and Shift_JIS decoding with Chrome, Firefox, and t...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Alex Christensen
URL:
Keywords:
Duplicates: 179881
Depends on:
Blocks:
 
Reported: 2020-09-03 22:26 PDT by Alex Christensen
Modified: 2020-09-07 10:52 PDT (History)
CC List: 6 users

See Also:


Attachments
Patch (221.76 KB, patch)
2020-09-03 22:29 PDT, Alex Christensen
no flags
Patch (223.25 KB, patch)
2020-09-04 10:03 PDT, Alex Christensen
ews-feeder: commit-queue-

Description Alex Christensen 2020-09-03 22:26:28 PDT
Align EUC-JP, ISO-2022-JP, and Shift_JIS decoding with Chrome, Firefox, and the specification
Comment 1 Alex Christensen 2020-09-03 22:29:24 PDT
Created attachment 407944 [details]
Patch
Comment 2 EWS Watchlist 2020-09-03 22:30:29 PDT
This patch modifies the imported WPT tests. Please ensure that any changes on the tests (not coming from a WPT import) are exported to WPT. Please see https://trac.webkit.org/wiki/WPTExportProcess
Comment 3 youenn fablet 2020-09-04 08:36:07 PDT
Comment on attachment 407944 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=407944&action=review

> Source/WebCore/platform/text/TextCodecCJK.cpp:127
> +    return *table;

Why not use NeverDestroyed instead?

> Source/WebCore/platform/text/TextCodecCJK.cpp:130
> +String TextCodecCJK::decodeCommon(const uint8_t* bytes, size_t length, bool flush, bool stopOnError, bool& sawError, Function<void(uint8_t, StringBuilder&, bool&)>&& byteParser)

Could be a const Function&.
Also it is not really great to have a bool&.
How about returning a pair or a struct instead?

> Source/WebCore/platform/text/TextCodecCJK.cpp:132
> +    StringBuilder result;

Is there a way to reserveCapacity?  Hopefully error decoding will be rare.

> Source/WebCore/platform/text/TextCodecCJK.cpp:289
> +    auto byteParser = [&] (uint8_t byte, StringBuilder& result, bool& sawError) {

Would be more natural for byteParser to return a bool instead of changing sawError.

> Source/WebCore/platform/text/TextCodecCJK.cpp:422
> +    StringBuilder result;

Ditto for preallocating.

> Source/WebCore/platform/text/TextCodecCJK.cpp:467
> +        }

I guess these two code paths are done for optimisation.
I would use a template with a boolean template parameter to just write it once and let the compiler optimise this pattern.
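
The suggestion above can be sketched in portable C++. This is only an illustration of the technique, not code from the patch: `decodeASCIIImpl`, `decodeASCII`, and the toy ASCII-only decoding are hypothetical, and `std::string` stands in for StringBuilder. The loop is written once, and the boolean template parameter lets the compiler drop the unused branch in each instantiation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical loop body written once. StopOnError is a compile-time
// constant, so each instantiation keeps only the branch it needs.
template<bool StopOnError>
std::string decodeASCIIImpl(const uint8_t* bytes, size_t length, bool& sawError)
{
    std::string result;
    for (size_t i = 0; i < length; ++i) {
        if (bytes[i] < 0x80) {
            result.push_back(static_cast<char>(bytes[i]));
            continue;
        }
        sawError = true;
        if (StopOnError)
            return result;
        result.push_back('?'); // toy replacement character
    }
    return result;
}

// The runtime flag dispatches once to the matching instantiation.
std::string decodeASCII(const uint8_t* bytes, size_t length, bool stopOnError, bool& sawError)
{
    return stopOnError
        ? decodeASCIIImpl<true>(bytes, length, sawError)
        : decodeASCIIImpl<false>(bytes, length, sawError);
}
```

The trade-off is one extra instantiation per flag value in exchange for a single copy of the loop in the source.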

> Source/WebCore/platform/text/TextCodecCJK.cpp:516
> +    parseCodePoint = [&] (UChar32 codePoint) {

One liner.

> Source/WebCore/platform/text/TextCodecCJK.h:59
> +    String big5Decode(const uint8_t*, size_t, bool, bool, bool&);

Now I see there is a pattern for bool& sawError.
Still not a fan of it.

> LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-japanese/iso-2022-jp/iso2022jp_chars-csiso2022jp.html:35
> +<span data-cp="3B1" data-bytes="1B 24 42 26 41 1B 28 42">$B&A(B</span>

You are modifying WPT files, so it is best to land the corresponding WPT PR along with this patch.
Comment 4 Alex Christensen 2020-09-04 08:43:24 PDT
(In reply to youenn fablet from comment #3)
> > LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-japanese/iso-2022-jp/iso2022jp_chars-csiso2022jp.html:35
> > +<span data-cp="3B1" data-bytes="1B 24 42 26 41 1B 28 42">$B&A(B</span>
> 
> You are modifying WPT files, so it is best to land the corresponding WPT PR along with this patch.

This updates our copy of those files to match what is currently in wpt.  I'll mention that in the ChangeLog.
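
For readers unfamiliar with the test line quoted above, the byte string can be worked through by hand. The sketch below assumes the pointer formula from the Encoding Standard's ISO-2022-JP decoder; `jis0208Pointer` is a hypothetical helper name and is not part of the patch.

```cpp
#include <cstdint>

// Pointer into the Encoding Standard's jis0208 index for a two-byte
// sequence seen while the ISO-2022-JP decoder is in its JIS X 0208 state.
// Both bytes are assumed to already be in the valid range 0x21..0x7E.
constexpr uint16_t jis0208Pointer(uint8_t lead, uint8_t trail)
{
    return static_cast<uint16_t>((lead - 0x21) * 94 + (trail - 0x21));
}

// For the test bytes 1B 24 42 | 26 41 | 1B 28 42:
//   1B 24 42 is ESC $ B (switch to JIS X 0208),
//   26 41    is the character itself,
//   1B 28 42 is ESC ( B (switch back to ASCII).
static_assert(jis0208Pointer(0x26, 0x41) == 502, "(0x26 - 0x21) * 94 + (0x41 - 0x21)");
```

Pointer 502 should be the jis0208 index entry for U+03B1, which is what `data-cp="3B1"` in the test expects. The bytes rendered as ASCII are `$B&A(B`, which is why the span content looks the way it does.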
Comment 5 Alex Christensen 2020-09-04 08:47:50 PDT
Comment on attachment 407944 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=407944&action=review

>> Source/WebCore/platform/text/TextCodecCJK.cpp:127
>> +    return *table;
> 
> Why not use NeverDestroyed instead?

NeverDestroyed is required for things that would otherwise need an exit-time destructor.  This is just a raw pointer, which the compiler is OK with.
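
The distinction can be sketched in portable C++. This is a minimal illustration, not the patch's code: `Table`, `lookupTable`, and the single placeholder entry are hypothetical, and standard containers stand in for WTF types.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical table type; the real patch builds its own lookup tables.
using Table = std::vector<std::pair<uint16_t, char16_t>>;

// The static is a raw pointer, which is trivially destructible, so the
// compiler emits no exit-time destructor for it. The heap allocation is
// made once, on first use, and is intentionally never freed.
const Table& lookupTable()
{
    static const Table* table = [] {
        auto* t = new Table;
        t->emplace_back(502, u'\u03B1'); // placeholder entry for illustration
        return t;
    }();
    return *table;
}
```

A static `Table` object, by contrast, would have a non-trivial destructor that runs at exit, which is the case a wrapper like NeverDestroyed exists to avoid.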

>> Source/WebCore/platform/text/TextCodecCJK.cpp:130
>> +String TextCodecCJK::decodeCommon(const uint8_t* bytes, size_t length, bool flush, bool stopOnError, bool& sawError, Function<void(uint8_t, StringBuilder&, bool&)>&& byteParser)
> 
> Could be a const Function&.
> Also it is not really great to have a bool&.
> How about returning a pair or a struct instead?

We should be trying to move away from using bool&.  I'll make my byte parsers return a 2-state enum, to start that transition from the bottom up.
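
A rough sketch of what that shape could look like, using only the standard library. All names here (`SawError`, `DecodeResult`, `decodeBytes`, `parseASCII`) are assumptions made for illustration and are not taken from the patch; `std::function` and `std::string` stand in for WTF's Function and StringBuilder.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

enum class SawError : bool { No, Yes };

// Hypothetical result type: the error state travels with the decoded text
// instead of through a bool& out-parameter.
struct DecodeResult {
    std::string text;
    SawError sawError { SawError::No };
};

// Hypothetical stand-in for a shared decode loop: feeds each byte to the
// parser and records whether any byte was rejected.
DecodeResult decodeBytes(const uint8_t* bytes, size_t length, bool stopOnError,
    const std::function<SawError(uint8_t, std::string&)>& byteParser)
{
    DecodeResult result;
    for (size_t i = 0; i < length; ++i) {
        if (byteParser(bytes[i], result.text) == SawError::Yes) {
            result.sawError = SawError::Yes;
            if (stopOnError)
                break;
        }
    }
    return result;
}

// Toy parser: accepts ASCII and substitutes '?' for anything else.
SawError parseASCII(uint8_t byte, std::string& out)
{
    if (byte < 0x80) {
        out.push_back(static_cast<char>(byte));
        return SawError::No;
    }
    out.push_back('?');
    return SawError::Yes;
}
```

Because the parser's return value is inspected on every byte, the stop-on-error check falls out of a branch the loop already has.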

>> Source/WebCore/platform/text/TextCodecCJK.cpp:516
>> +    parseCodePoint = [&] (UChar32 codePoint) {
> 
> One liner.

This has to be on two lines because otherwise a lambda can't call itself.
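
The two-line pattern can be illustrated with standard C++. In this sketch `digitSum` is a made-up example and `std::function` stands in for WTF's Function. A one-line `auto f = [&] { f(); };` does not compile, because `f`'s type is still being deduced inside its own initializer.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical example: sum the decimal digits of n recursively.
uint32_t digitSum(uint32_t n)
{
    // Line 1: declare the wrapper so it has a name and a complete type.
    std::function<uint32_t(uint32_t)> sum;
    // Line 2: assign the lambda, which captures `sum` by reference and can
    // therefore call itself through it.
    sum = [&] (uint32_t value) -> uint32_t {
        return value < 10 ? value : value % 10 + sum(value / 10);
    };
    return sum(n);
}
```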
Comment 6 Alex Christensen 2020-09-04 08:55:12 PDT
Comment on attachment 407944 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=407944&action=review

>> Source/WebCore/platform/text/TextCodecCJK.cpp:467
>> +        }
> 
> I guess these two code paths are done for optimisation.
> I would use a template with a boolean template parameter to just write it once and let the compiler optimise this pattern.

Returning a 2-state enum from the byte parser makes it so we already have a branch here, so there's no need for two separate code paths here.
Comment 7 Alex Christensen 2020-09-04 09:37:58 PDT
*** Bug 179881 has been marked as a duplicate of this bug. ***
Comment 8 Alex Christensen 2020-09-04 10:03:50 PDT
Created attachment 407987 [details]
Patch
Comment 9 EWS 2020-09-04 10:40:42 PDT
ChangeLog entry in LayoutTests/ChangeLog contains OOPS!.
Comment 10 Alex Christensen 2020-09-04 10:46:35 PDT
http://trac.webkit.org/r266620