216052 – Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the specification

RESOLVED FIXED 216052

Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the specification

https://bugs.webkit.org/show_bug.cgi?id=216052

Summary Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the speci...

Alex Christensen

Reported 2020-09-01 14:06:27 PDT

Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the specification

Attachments
Patch (478.89 KB, patch) 2020-09-01 14:08 PDT, Alex Christensen	no flags	Details Formatted Diff Diff
Patch (479.51 KB, patch) 2020-09-01 17:52 PDT, Alex Christensen	no flags	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Alex Christensen

Comment 1 2020-09-01 14:08:59 PDT

Created attachment 407712 [details] Patch

Darin Adler

Comment 2 2020-09-01 15:05:43 PDT

Comment on attachment 407712 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=407712&action=review > Source/WebCore/platform/text/TextCodecCJK.cpp:244 > + auto characters = string.upconvertedCharacters(); > + for (WTF::CodePointIterator<UChar> iterator(characters.get(), characters.get() + string.length()); !iterator.atEnd(); ++iterator) { I didn’t notice this on previous passes. Why not just do this? for (auto codePoint : string.codePoints()) I don’t understand why it is better to use StringView::upconvertedCharacters and WTF::CodePointIterator<UChar> instead. > Source/WebCore/platform/text/TextCodecCJK.cpp:262 > + auto characters = string.upconvertedCharacters(); > + for (WTF::CodePointIterator<UChar> iterator(characters.get(), characters.get() + string.length()); !iterator.atEnd(); ++iterator) { Ditto. > Source/WebCore/platform/text/TextCodecCJK.h:40 > enum class Encoding : uint8_t { > EUC_JP, > + ISO2022JP, > + Shift_JIS, > Big5 > }; I guess you don’t agree, but I find that "all on one line" nicer for simple enumerations like this one. > Source/WebCore/platform/text/TextCodecCJK.h:60 > + enum class ISO2022JPEncoderState : uint8_t { > + ASCII, > + Roman, > + jis0208 > + }; Ditto.

Alex Christensen

Comment 3 2020-09-01 15:20:02 PDT

Looks like I've got a few regressions to fix still. Will fix before landing. (In reply to Darin Adler from comment #2) > Comment on attachment 407712 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=407712&action=review > > > Source/WebCore/platform/text/TextCodecCJK.cpp:244 > > + auto characters = string.upconvertedCharacters(); > > + for (WTF::CodePointIterator<UChar> iterator(characters.get(), characters.get() + string.length()); !iterator.atEnd(); ++iterator) { > > I didn’t notice this on previous passes. Why not just do this? > > for (auto codePoint : string.codePoints()) > > I don’t understand why it is better to use StringView::upconvertedCharacters > and WTF::CodePointIterator<UChar> instead. StringView::CodePoints::Iterator has to check whether the StringView is 8-bit or 16-bit every time it increments to the next code unit. It also stores a pointer to the StringView which has a pointer to the code units, so there is lots of dereferencing. We should probably optimize it, but here CodePointIterator is faster. StringView::upconvertedCharacters is probably free if you are using CJK encodings because the string almost certainly contains at least one non-Latin1 code point. > > Source/WebCore/platform/text/TextCodecCJK.h:40 > > enum class Encoding : uint8_t { > > EUC_JP, > > + ISO2022JP, > > + Shift_JIS, > > Big5 > > }; > > I guess you don’t agree, but I find that "all on one line" nicer for simple > enumerations like this one. This one is being incrementally added to. This allows the version control history to be nicer. > > Source/WebCore/platform/text/TextCodecCJK.h:60 > > + enum class ISO2022JPEncoderState : uint8_t { > > + ASCII, > > + Roman, > > + jis0208 > > + }; > > Ditto. This one should probably be all on one line because it is simple and finished.

Alex Christensen

Comment 4 2020-09-01 17:49:20 PDT

Comment on attachment 407712 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=407712&action=review > Source/WebCore/platform/text/TextCodecCJK.cpp:164 > + auto statefulUnencodableHandler = [&] (UChar32 codePoint, Vector<uint8_t>& result) { > + changeStateToASCIIIfNeeded(); Here I need to not switch the state to ASCII if the state is ISO2022JPEncoderState::Roman. I'll make that change to not regress anything. > Source/WebCore/platform/text/TextCodecCJK.cpp:169 > + if ((m_iso2022JPEncoderState == ISO2022JPEncoderState::ASCII || m_iso2022JPEncoderState == ISO2022JPEncoderState::Roman) && (codePoint == 0x000E || codePoint == 0x000F || codePoint == 0x001B)) { It looks like this line of the specification does not match what Chrome, Firefox, and Safari all do with these code points. I'll remove this statement to match behavior of browsers.

Alex Christensen

Comment 5 2020-09-01 17:52:38 PDT

Created attachment 407720 [details] Patch

EWS

Comment 6 2020-09-01 19:54:37 PDT

Committed r266448: <https://trac.webkit.org/changeset/266448> All reviewed patches have been landed. Closing bug and clearing flags on attachment 407720 [details].

Radar WebKit Bug Importer

Comment 7 2020-09-01 19:55:16 PDT

<rdar://problem/68182742>

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component New Bugs

Assignee

Alex Christensen

Reported

2020-09-01 14:06 PDT

Modified

2020-09-01 19:55 PDT History

CC List

2 users Show

URL

Keywords InRadar

Depends on

Blocks