WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
216052
Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the specification
https://bugs.webkit.org/show_bug.cgi?id=216052
Summary
Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the speci...
Alex Christensen
Reported
2020-09-01 14:06:27 PDT
Align ISO-2022-JP and Shift_JIS encodings with Chrome, Firefox, and the specification
Attachments
Patch
(478.89 KB, patch)
2020-09-01 14:08 PDT
,
Alex Christensen
no flags
Details
Formatted Diff
Diff
Patch
(479.51 KB, patch)
2020-09-01 17:52 PDT
,
Alex Christensen
no flags
Details
Formatted Diff
Diff
Show Obsolete
(1)
View All
Add attachment
proposed patch, testcase, etc.
Alex Christensen
Comment 1
2020-09-01 14:08:59 PDT
Created
attachment 407712
[details]
Patch
Darin Adler
Comment 2
2020-09-01 15:05:43 PDT
Comment on
attachment 407712
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=407712&action=review
> Source/WebCore/platform/text/TextCodecCJK.cpp:244 > + auto characters = string.upconvertedCharacters(); > + for (WTF::CodePointIterator<UChar> iterator(characters.get(), characters.get() + string.length()); !iterator.atEnd(); ++iterator) {
I didn’t notice this on previous passes. Why not just do this? for (auto codePoint : string.codePoints()) I don’t understand why it is better to use StringView::upconvertedCharacters and WTF::CodePointIterator<UChar> instead.
> Source/WebCore/platform/text/TextCodecCJK.cpp:262 > + auto characters = string.upconvertedCharacters(); > + for (WTF::CodePointIterator<UChar> iterator(characters.get(), characters.get() + string.length()); !iterator.atEnd(); ++iterator) {
Ditto.
> Source/WebCore/platform/text/TextCodecCJK.h:40 > enum class Encoding : uint8_t { > EUC_JP, > + ISO2022JP, > + Shift_JIS, > Big5 > };
I guess you don’t agree, but I find that "all on one line" nicer for simple enumerations like this one.
> Source/WebCore/platform/text/TextCodecCJK.h:60 > + enum class ISO2022JPEncoderState : uint8_t { > + ASCII, > + Roman, > + jis0208 > + };
Ditto.
Alex Christensen
Comment 3
2020-09-01 15:20:02 PDT
Looks like I've got a few regressions to fix still. Will fix before landing. (In reply to Darin Adler from
comment #2
)
> Comment on
attachment 407712
[details]
> Patch > > View in context: >
https://bugs.webkit.org/attachment.cgi?id=407712&action=review
> > > Source/WebCore/platform/text/TextCodecCJK.cpp:244 > > + auto characters = string.upconvertedCharacters(); > > + for (WTF::CodePointIterator<UChar> iterator(characters.get(), characters.get() + string.length()); !iterator.atEnd(); ++iterator) { > > I didn’t notice this on previous passes. Why not just do this? > > for (auto codePoint : string.codePoints()) > > I don’t understand why it is better to use StringView::upconvertedCharacters > and WTF::CodePointIterator<UChar> instead.
StringView::CodePoints::Iterator has to check whether the StringView is 8-bit or 16-bit every time it increments to the next code unit. It also stores a pointer to the StringView which has a pointer to the code units, so there is lots of dereferencing. We should probably optimize it, but here CodePointIterator is faster. StringView::upconvertedCharacters is probably free if you are using CJK encodings because the string almost certainly contains at least one non-Latin1 code point.
> > Source/WebCore/platform/text/TextCodecCJK.h:40 > > enum class Encoding : uint8_t { > > EUC_JP, > > + ISO2022JP, > > + Shift_JIS, > > Big5 > > }; > > I guess you don’t agree, but I find that "all on one line" nicer for simple > enumerations like this one.
This one is being incrementally added to. This allows the version control history to be nicer.
> > Source/WebCore/platform/text/TextCodecCJK.h:60 > > + enum class ISO2022JPEncoderState : uint8_t { > > + ASCII, > > + Roman, > > + jis0208 > > + }; > > Ditto.
This one should probably be all on one line because it is simple and finished.
Alex Christensen
Comment 4
2020-09-01 17:49:20 PDT
Comment on
attachment 407712
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=407712&action=review
> Source/WebCore/platform/text/TextCodecCJK.cpp:164 > + auto statefulUnencodableHandler = [&] (UChar32 codePoint, Vector<uint8_t>& result) { > + changeStateToASCIIIfNeeded();
Here I need to not switch the state to ASCII if the state is ISO2022JPEncoderState::Roman. I'll make that change to not regress anything.
> Source/WebCore/platform/text/TextCodecCJK.cpp:169 > + if ((m_iso2022JPEncoderState == ISO2022JPEncoderState::ASCII || m_iso2022JPEncoderState == ISO2022JPEncoderState::Roman) && (codePoint == 0x000E || codePoint == 0x000F || codePoint == 0x001B)) {
It looks like this line of the specification does not match what Chrome, Firefox, and Safari all do with these code points. I'll remove this statement to match behavior of browsers.
Alex Christensen
Comment 5
2020-09-01 17:52:38 PDT
Created
attachment 407720
[details]
Patch
EWS
Comment 6
2020-09-01 19:54:37 PDT
Committed
r266448
: <
https://trac.webkit.org/changeset/266448
> All reviewed patches have been landed. Closing bug and clearing flags on
attachment 407720
[details]
.
Radar WebKit Bug Importer
Comment 7
2020-09-01 19:55:16 PDT
<
rdar://problem/68182742
>
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug