RESOLVED FIXED 216052
Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the specification
https://bugs.webkit.org/show_bug.cgi?id=216052
Summary Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the speci...
Alex Christensen
Reported 2020-09-01 14:06:27 PDT
Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the specification
Attachments
Patch (478.89 KB, patch)
2020-09-01 14:08 PDT, Alex Christensen
no flags
Patch (479.51 KB, patch)
2020-09-01 17:52 PDT, Alex Christensen
no flags
Alex Christensen
Comment 1 2020-09-01 14:08:59 PDT
Darin Adler
Comment 2 2020-09-01 15:05:43 PDT
Comment on attachment 407712 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=407712&action=review > Source/WebCore/platform/text/TextCodecCJK.cpp:244 > + auto characters = string.upconvertedCharacters(); > + for (WTF::CodePointIterator<UChar> iterator(characters.get(), characters.get() + string.length()); !iterator.atEnd(); ++iterator) { I didn’t notice this on previous passes. Why not just do this? for (auto codePoint : string.codePoints()) I don’t understand why it is better to use StringView::upconvertedCharacters and WTF::CodePointIterator<UChar> instead. > Source/WebCore/platform/text/TextCodecCJK.cpp:262 > + auto characters = string.upconvertedCharacters(); > + for (WTF::CodePointIterator<UChar> iterator(characters.get(), characters.get() + string.length()); !iterator.atEnd(); ++iterator) { Ditto. > Source/WebCore/platform/text/TextCodecCJK.h:40 > enum class Encoding : uint8_t { > EUC_JP, > + ISO2022JP, > + Shift_JIS, > Big5 > }; I guess you don’t agree, but I find that "all on one line" nicer for simple enumerations like this one. > Source/WebCore/platform/text/TextCodecCJK.h:60 > + enum class ISO2022JPEncoderState : uint8_t { > + ASCII, > + Roman, > + jis0208 > + }; Ditto.
Alex Christensen
Comment 3 2020-09-01 15:20:02 PDT
Looks like I've got a few regressions to fix still. Will fix before landing. (In reply to Darin Adler from comment #2) > Comment on attachment 407712 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=407712&action=review > > > Source/WebCore/platform/text/TextCodecCJK.cpp:244 > > + auto characters = string.upconvertedCharacters(); > > + for (WTF::CodePointIterator<UChar> iterator(characters.get(), characters.get() + string.length()); !iterator.atEnd(); ++iterator) { > > I didn’t notice this on previous passes. Why not just do this? > > for (auto codePoint : string.codePoints()) > > I don’t understand why it is better to use StringView::upconvertedCharacters > and WTF::CodePointIterator<UChar> instead. StringView::CodePoints::Iterator has to check whether the StringView is 8-bit or 16-bit every time it increments to the next code unit. It also stores a pointer to the StringView which has a pointer to the code units, so there is lots of dereferencing. We should probably optimize it, but here CodePointIterator is faster. StringView::upconvertedCharacters is probably free if you are using CJK encodings because the string almost certainly contains at least one non-Latin1 code point. > > Source/WebCore/platform/text/TextCodecCJK.h:40 > > enum class Encoding : uint8_t { > > EUC_JP, > > + ISO2022JP, > > + Shift_JIS, > > Big5 > > }; > > I guess you don’t agree, but I find that "all on one line" nicer for simple > enumerations like this one. This one is being incrementally added to. This allows the version control history to be nicer. > > Source/WebCore/platform/text/TextCodecCJK.h:60 > > + enum class ISO2022JPEncoderState : uint8_t { > > + ASCII, > > + Roman, > > + jis0208 > > + }; > > Ditto. This one should probably be all on one line because it is simple and finished.
Alex Christensen
Comment 4 2020-09-01 17:49:20 PDT
Comment on attachment 407712 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=407712&action=review > Source/WebCore/platform/text/TextCodecCJK.cpp:164 > + auto statefulUnencodableHandler = [&] (UChar32 codePoint, Vector<uint8_t>& result) { > + changeStateToASCIIIfNeeded(); Here I need to not switch the state to ASCII if the state is ISO2022JPEncoderState::Roman. I'll make that change to not regress anything. > Source/WebCore/platform/text/TextCodecCJK.cpp:169 > + if ((m_iso2022JPEncoderState == ISO2022JPEncoderState::ASCII || m_iso2022JPEncoderState == ISO2022JPEncoderState::Roman) && (codePoint == 0x000E || codePoint == 0x000F || codePoint == 0x001B)) { It looks like this line of the specification does not match what Chrome, Firefox, and Safari all do with these code points. I'll remove this statement to match behavior of browsers.
Alex Christensen
Comment 5 2020-09-01 17:52:38 PDT
EWS
Comment 6 2020-09-01 19:54:37 PDT
Committed r266448: <https://trac.webkit.org/changeset/266448> All reviewed patches have been landed. Closing bug and clearing flags on attachment 407720 [details].
Radar WebKit Bug Importer
Comment 7 2020-09-01 19:55:16 PDT
Note You need to log in before you can comment on or make changes to this bug.