Bug 36419

Summary:

A backslash in EUC-JP becomes to a yen sign when it is copied

Product:

WebKit

Reporter:

Shinichiro Hamaji <hamaji>

Component:

WebCore Misc.

Assignee:

Shinichiro Hamaji <hamaji>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

abarth, ap, darin, eric, jshin, mitz, webkit.review.bot, yuzo

Priority:

Version:

528+ (Nightly build)

Hardware:

All

OS:

All

Bug Depends on:

Bug Blocks:

24906

Attachments:

Description	Flags
Patch v1	none
Patch v2	none
Patch v3	none
Patch v4	none
Patch v5	darin: review+

Shinichiro Hamaji

Reported 2010-03-20 14:39:29 PDT

Discussion about this issue can be found in Bug 24906. This happens because WebKit is calling display(String|Buffer)ModifiedByEncoding even when the strings are not for displaying.

Attachments
Patch v1 (25.80 KB, patch) 2010-03-20 14:46 PDT, Shinichiro Hamaji	no flags	Details Formatted Diff Diff
Patch v2 (26.86 KB, patch) 2010-04-20 06:08 PDT, Shinichiro Hamaji	no flags	Details Formatted Diff Diff
Patch v3 (26.64 KB, patch) 2010-04-20 06:27 PDT, Shinichiro Hamaji	no flags	Details Formatted Diff Diff
Patch v4 (26.63 KB, patch) 2010-04-20 14:13 PDT, Shinichiro Hamaji	no flags	Details Formatted Diff Diff
Patch v5 (26.51 KB, patch) 2010-04-20 15:02 PDT, Shinichiro Hamaji	darin: review+	Details Formatted Diff Diff
Show Obsolete (4) View All Add attachment proposed patch, testcase, etc.

Shinichiro Hamaji

Comment 1 2010-03-20 14:46:25 PDT

Created attachment 51234 [details] Patch v1

Shinichiro Hamaji

Comment 2 2010-03-20 15:01:35 PDT

As I wrote in Bug 24906, I'd like to fix this issue first. This patch adds a parameter isForCopying into TextIterator and plainText uses this version. I examined how plainText is used and if I understand correctly, it is used by copying, spell checker, and transpose. I think our transcoding doesn't affect the spell checker. Copying and transpose are covered by the layout tests in my patch. Note that this patch fixes the issue of transpose as well as copying. Before this patch, transposing "\\" yields "<yen><yen>". I'll reply comments in Bug 24906 related to this issue. > I don't understand the meaning here. What is a "non-display string"? A display > string is a string intended for display to the end user. But "non-display" > doesn't tell me anything. I'm using textForCopying in this patch. It might not be the best name as this will be used by a few other ways such as spell checker and transpose. I hope this naming makes more sense because the main purpose of having this function is copying. > Does searching work correctly with this patch, so that a yen sign as visible on > a web page can always be found by searching for a yen sign? I think it should > work, but nothing beats actual testing. As I wrote in Bug 24906, my patch in Bug 24906 broke searching, but this patch won't. Thanks again for this catch. > Do these tests pass in IE? Yes, actually they are doing nothing, but the backslash in their font is the yen sign. > 2. In the text-transform + transcoding case, you just fall back on using the (transcoded) text() as the nonDisplayString(). Perhaps in this case you nonDisplayString() could apply the transform on the fly to the original string (so it will return a transformed, yet not transcoded, string). Nice suggestion! Now my patch is doing on the fly transformation.

Shinichiro Hamaji

Comment 3 2010-04-07 06:03:04 PDT

Ping?

Alexey Proskuryakov

Comment 4 2010-04-07 10:33:51 PDT

Comment on attachment 51234 [details] Patch v1 This looks good enough for me to say r+. But it would be even better if Dan and/or Darin could take a look, too. Some comments below. > +(Note that this test makes sense only when you are using DumpRenderTree and dumpAsText. You will see yen signs when you are using Safari because of transcoding) It seems that this test can be made to work in Safari with String.charCodeAt(). > + for (TextIterator it(r, false, false, !isDisplayString); !it.atEnd(); it.advance()) { This is really hard to read. It's a common problem in WebCore, but we are trying to improve argument readability (usually by adding new overloaded methods, or using named enums for argument values). > + bool m_isForCopying; As you said, this name can be confusing. I don't have a better suggestion, but there should be at least a comment. > +void RenderText::updateNeedsTranscode() > +{ > + const TextEncoding* encoding = document()->decoder() ? &document()->decoder()->encoding() : 0; > + m_needsTranscode = (encoding && encoding->backslashAsCurrencySymbol() != '\\'); > +} TextResourceDecoder always has an encoding, so you could save a null check by writing this in a different way. Thanks for the good test coverage!

Eric Seidel (no email)

Comment 5 2010-04-18 16:29:30 PDT

Attachment 51234 [details] was posted by a committer and has review+, assigning to Shinichiro Hamaji for commit.

Shinichiro Hamaji

Comment 6 2010-04-20 06:08:56 PDT

Created attachment 53805 [details] Patch v2

Shinichiro Hamaji

Comment 7 2010-04-20 06:27:25 PDT

Created attachment 53807 [details] Patch v3

Shinichiro Hamaji

Comment 8 2010-04-20 06:28:45 PDT

Thanks for your review! > > +(Note that this test makes sense only when you are using DumpRenderTree and dumpAsText. You will see yen signs when you are using Safari because of transcoding) > > It seems that this test can be made to work in Safari with String.charCodeAt(). True. I modified the test. > This is really hard to read. It's a common problem in WebCore, but we are > trying to improve argument readability (usually by adding new overloaded > methods, or using named enums for argument values). Added an enum TextIteratorBehavior. I didn't remove the older TextIterator because 1. I'm not sure this change improves the readability a lot and 2. the change becomes bigger. If someone thinks we should remove the boolean version of TextIterator, I'll do this, but in a separated change. > > + bool m_isForCopying; > > As you said, this name can be confusing. I don't have a better suggestion, but > there should be at least a comment. Added a comment. Also, I guess m_emitsTextsWithoutTranscoding is easier to understand. I changed the name of this member. > TextResourceDecoder always has an encoding, so you could save a null check by > writing this in a different way. Done. I've updated copy-backslash-with-euc.html between Patch v2 and v3 because I uploaded the wrong version of this in v2.

Darin Adler

Comment 9 2010-04-20 10:17:27 PDT

(In reply to comment #8) > Added an enum TextIteratorBehavior. I didn't remove the older TextIterator > because 1. I'm not sure this change improves the readability a lot and 2. the > change becomes bigger. If someone thinks we should remove the boolean version > of TextIterator, I'll do this, but in a separated change. I do think it does improve readability at call sites like the one in visible_units.cpp and I would like to do that in a separate change. The functions in TextIterator.h that use booleans in a confusing way are: - plainTextToMallocAllocatedBuffer - findPlainText - TextIterator constructor - TextIterator::rangeLength - TextIterator::rangeFromLocationAndLength - CharacterIterator constructor

Alexey Proskuryakov

Comment 10 2010-04-20 10:18:52 PDT

Comment on attachment 53807 [details] Patch v3 +<script charset="UTF-8"> I don't think this charset attribute has any effect, other than adding confusion. Script charset attribute is only used for external scripts. > If someone thinks we should remove the boolean version > of TextIterator, I'll do this, but in a separated change. That sounds like a good change to make. r=me

Darin Adler

Comment 11 2010-04-20 12:12:47 PDT

Comment on attachment 53807 [details] Patch v3 > + , m_emitsTextsWithoutTranscoding(false) Should be m_emitsTextWithoutTranscoding. The word "texts" is not appropriate here, because it doesn't mean "more than one piece of text" in normal English usage. That plural normally only used with the word "text" in the sense of "an entire book". And the name works fine without the plural. The word "text" can mean "the text from inside multiple text nodes". > - String str = renderer->text(); > + String str; > + if (m_emitsTextsWithoutTranscoding) > + str = renderer->textWithoutTranscoding(); > + else > + str = renderer->text(); I think this would be more efficient (no empty construction before assignment) as well as slightly clearer with a trinary operator instead of an if statement. Or it might end up making the line too long and so be less clear. Not sure. > +enum TextIteratorBehavior { > + TextIteratorBehaviorDefault = 0, > + TextIteratorBehaviorEmitCharactersBetweenAllVisiblePositions = 1 << 0, > + TextIteratorBehaviorEnterTextControls = 1 << 1, > + TextIteratorBehaviorEmitsTextsWithoutTranscoding = 1 << 2, > +}; A problem with using an enum for a bit mask is that if you use the "|" operator then the result is an integer, not a TextIteratorBehavior. I could not find a good example of a better solution elsewhere in WebCore, so I suppose this is OK. > + explicit TextIterator(const Range*, TextIteratorBehavior); The explicit keyword has no meaning on a constructor with more than one argument so you should omit it. > + // FIXME: check fonts when we implement font-based transcoding. This comment is about future plans, but I'm not sure it's helpful. For example, it says "FIXME" but there is nothing to fix unless we add font-based transcoding. > +void RenderText::transformText(String& text) const > +{ > + ASSERT(style()); > + switch (style()->textTransform()) { > + case TTNONE: > + break; > + case CAPITALIZE: > + makeCapitalized(&text, previousCharacter()); > + break; > + case UPPERCASE: > + text.makeUpper(); > + break; > + case LOWERCASE: > + text.makeLower(); > + break; > + } > +} I think this function should be moved up higher in the file and marked inline since there are only two callers, both inside this source file. My guess is that would make the code smaller and reduce overhead in the common case of no transform. Another possibility would be to make it illegal to call this function when the transform is TTNONE and add another check of the transform. One call site has it already.

Shinichiro Hamaji

Comment 12 2010-04-20 14:13:06 PDT

Created attachment 53878 [details] Patch v4

Shinichiro Hamaji

Comment 13 2010-04-20 14:20:12 PDT

Thanks for your reviews! I think I addressed all comments. I'll create another patch for the readability of TextIterator's callers. > +<script charset="UTF-8"> > > I don't think this charset attribute has any effect, other than adding > confusion. Script charset attribute is only used for external scripts. Thanks for this comment. I didn't know this. I wanted to get 92 by "\\".charCodeAt(0) even if the backslash in <script> tag is transcoded into a currency sign due to future changes. I modified the code so we just use 92.

Darin Adler

Comment 14 2010-04-20 14:58:05 PDT

Comment on attachment 53878 [details] Patch v4 > - explicit TextIterator(const Range*, bool emitCharactersBetweenAllVisiblePositions = false, bool enterTextControls = false); > - > + TextIterator(const Range*, bool emitCharactersBetweenAllVisiblePositions = false, bool enterTextControls = false); My comment about "explicit" does not apply to this one. This does need "explicit" because all the arguments except the first are optional. Otherwise looks good to me.

Shinichiro Hamaji

Comment 15 2010-04-20 15:02:23 PDT

Created attachment 53885 [details] Patch v5

Shinichiro Hamaji

Comment 16 2010-04-20 15:04:40 PDT

> My comment about "explicit" does not apply to this one. This does need > "explicit" because all the arguments except the first are optional. Oops! I might be sleeping when I changed this... Thanks for your comment!

WebKit Review Bot

Comment 17 2010-04-20 20:54:00 PDT

Yuzo Fujishima

Comment 18 2010-04-20 22:25:33 PDT

Committed r57952: <http://trac.webkit.org/changeset/57952>

Note You need to log in before you can comment on or make changes to this bug.