Bug 36419

Summary: A backslash in EUC-JP becomes to a yen sign when it is copied
Product: WebKit Reporter: Shinichiro Hamaji <hamaji>
Component: WebCore Misc.Assignee: Shinichiro Hamaji <hamaji>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, ap, darin, eric, jshin, mitz, webkit.review.bot, yuzo
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 24906    
Attachments:
Description Flags
Patch v1
none
Patch v2
none
Patch v3
none
Patch v4
none
Patch v5 darin: review+

Description Shinichiro Hamaji 2010-03-20 14:39:29 PDT
Discussion about this issue can be found in Bug 24906. This happens because WebKit is calling display(String|Buffer)ModifiedByEncoding even when the strings are not for displaying.
Comment 1 Shinichiro Hamaji 2010-03-20 14:46:25 PDT
Created attachment 51234 [details]
Patch v1
Comment 2 Shinichiro Hamaji 2010-03-20 15:01:35 PDT
As I wrote in Bug 24906, I'd like to fix this issue first.

This patch adds a parameter isForCopying into TextIterator and plainText uses this version. I examined how plainText is used and if I understand correctly, it is used by copying, spell checker, and transpose. I think our transcoding doesn't affect the spell checker. Copying and transpose are covered by the layout tests in my patch. Note that this patch fixes the issue of transpose as well as copying. Before this patch, transposing "\\" yields "<yen><yen>".

I'll reply comments in Bug 24906 related to this issue.

> I don't understand the meaning here. What is a "non-display string"? A display
> string is a string intended for display to the end user. But "non-display"
> doesn't tell me anything.

I'm using textForCopying in this patch. It might not be the best name as this will be used by a few other ways such as spell checker and transpose. I hope this naming makes more sense because the main purpose of having this function is copying.

> Does searching work correctly with this patch, so that a yen sign as visible on
> a web page can always be found by searching for a yen sign? I think it should
> work, but nothing beats actual testing.

As I wrote in Bug 24906, my patch in Bug 24906 broke searching, but this patch won't. Thanks again for this catch.

> Do these tests pass in IE?

Yes, actually they are doing nothing, but the backslash in their font is the yen sign.

> 2. In the text-transform + transcoding case, you just fall back on using the (transcoded) text() as the nonDisplayString(). Perhaps in this case you nonDisplayString() could apply the transform on the fly to the original string (so it will return a transformed, yet not transcoded, string).

Nice suggestion! Now my patch is doing on the fly transformation.
Comment 3 Shinichiro Hamaji 2010-04-07 06:03:04 PDT
Ping?
Comment 4 Alexey Proskuryakov 2010-04-07 10:33:51 PDT
Comment on attachment 51234 [details]
Patch v1

This looks good enough for me to say r+. But it would be even better if Dan and/or Darin could take a look, too. Some comments below.

> +(Note that this test makes sense only when you are using DumpRenderTree and dumpAsText. You will see yen signs when you are using Safari because of transcoding)

It seems that this test can be made to work in Safari with String.charCodeAt().

> +    for (TextIterator it(r, false, false, !isDisplayString); !it.atEnd(); it.advance()) {

This is really hard to read. It's a common problem in WebCore, but we are trying to improve argument readability (usually by adding new overloaded methods, or using named enums for argument values).

> +    bool m_isForCopying;

As you said, this name can be  confusing. I don't have a better suggestion, but there should be at least a comment.

> +void RenderText::updateNeedsTranscode()
> +{
> +    const TextEncoding* encoding = document()->decoder() ? &document()->decoder()->encoding() : 0;
> +    m_needsTranscode = (encoding && encoding->backslashAsCurrencySymbol() != '\\');
> +}

TextResourceDecoder always has an encoding, so you could save a null check by writing this in a different way.

Thanks for the good test coverage!
Comment 5 Eric Seidel (no email) 2010-04-18 16:29:30 PDT
Attachment 51234 [details] was posted by a committer and has review+, assigning to Shinichiro Hamaji for commit.
Comment 6 Shinichiro Hamaji 2010-04-20 06:08:56 PDT
Created attachment 53805 [details]
Patch v2
Comment 7 Shinichiro Hamaji 2010-04-20 06:27:25 PDT
Created attachment 53807 [details]
Patch v3
Comment 8 Shinichiro Hamaji 2010-04-20 06:28:45 PDT
Thanks for your review!

> > +(Note that this test makes sense only when you are using DumpRenderTree and dumpAsText. You will see yen signs when you are using Safari because of transcoding)
> 
> It seems that this test can be made to work in Safari with String.charCodeAt().

True. I modified the test.

> This is really hard to read. It's a common problem in WebCore, but we are
> trying to improve argument readability (usually by adding new overloaded
> methods, or using named enums for argument values).

Added an enum TextIteratorBehavior. I didn't remove the older TextIterator because 1. I'm not sure this change improves the readability a lot and 2. the change becomes bigger. If someone thinks we should remove the boolean version of TextIterator, I'll do this, but in a separated change.

> > +    bool m_isForCopying;
> 
> As you said, this name can be  confusing. I don't have a better suggestion, but
> there should be at least a comment.

Added a comment. Also, I guess m_emitsTextsWithoutTranscoding is easier to understand. I changed the name of this member.

> TextResourceDecoder always has an encoding, so you could save a null check by
> writing this in a different way.

Done.

I've updated copy-backslash-with-euc.html between Patch v2 and v3 because I uploaded the wrong version of this in v2.
Comment 9 Darin Adler 2010-04-20 10:17:27 PDT
(In reply to comment #8)
> Added an enum TextIteratorBehavior. I didn't remove the older TextIterator
> because 1. I'm not sure this change improves the readability a lot and 2. the
> change becomes bigger. If someone thinks we should remove the boolean version
> of TextIterator, I'll do this, but in a separated change.

I do think it does improve readability at call sites like the one in visible_units.cpp and I would like to do that in a separate change.

The functions in TextIterator.h that use booleans in a confusing way are:

    - plainTextToMallocAllocatedBuffer
    - findPlainText
    - TextIterator constructor
    - TextIterator::rangeLength
    - TextIterator::rangeFromLocationAndLength
    - CharacterIterator constructor
Comment 10 Alexey Proskuryakov 2010-04-20 10:18:52 PDT
Comment on attachment 53807 [details]
Patch v3

+<script charset="UTF-8">

I don't think this charset attribute has any effect, other than adding confusion. Script charset attribute is only used for external scripts.

> If someone thinks we should remove the boolean version
> of TextIterator, I'll do this, but in a separated change.

That sounds like a good change to make.

r=me
Comment 11 Darin Adler 2010-04-20 12:12:47 PDT
Comment on attachment 53807 [details]
Patch v3

> +    , m_emitsTextsWithoutTranscoding(false)

Should be m_emitsTextWithoutTranscoding. The word "texts" is not appropriate here, because it doesn't mean "more than one piece of text" in normal English usage. That plural normally only used with the word "text" in the sense of "an entire book". And the name works fine without the plural. The word "text" can mean "the text from inside multiple text nodes".

> -    String str = renderer->text();
> +    String str;
> +    if (m_emitsTextsWithoutTranscoding)
> +        str = renderer->textWithoutTranscoding();
> +    else
> +        str = renderer->text();

I think this would be more efficient (no empty construction before assignment) as well as slightly clearer with a trinary operator instead of an if statement. Or it might end up making the line too long and so be less clear. Not sure.

> +enum TextIteratorBehavior {
> +    TextIteratorBehaviorDefault = 0,
> +    TextIteratorBehaviorEmitCharactersBetweenAllVisiblePositions = 1 << 0,
> +    TextIteratorBehaviorEnterTextControls = 1 << 1,
> +    TextIteratorBehaviorEmitsTextsWithoutTranscoding = 1 << 2,
> +};

A problem with using an enum for a bit mask is that if you use the "|" operator then the result is an integer, not a TextIteratorBehavior. I could not find a good example of a better solution elsewhere in WebCore, so I suppose this is OK.

> +    explicit TextIterator(const Range*, TextIteratorBehavior);

The explicit keyword has no meaning on a constructor with more than one argument so you should omit it.

> +    // FIXME: check fonts when we implement font-based transcoding.

This comment is about future plans, but I'm not sure it's helpful. For example, it says "FIXME" but there is nothing to fix unless we add font-based transcoding.

> +void RenderText::transformText(String& text) const
> +{
> +    ASSERT(style());
> +    switch (style()->textTransform()) {
> +    case TTNONE:
> +        break;
> +    case CAPITALIZE:
> +        makeCapitalized(&text, previousCharacter());
> +        break;
> +    case UPPERCASE:
> +        text.makeUpper();
> +        break;
> +    case LOWERCASE:
> +        text.makeLower();
> +        break;
> +    }
> +}

I think this function should be moved up higher in the file and marked inline since there are only two callers, both inside this source file. My guess is that would make the code smaller and reduce overhead in the common case of no transform.

Another possibility would be to make it illegal to call this function when the transform is TTNONE and add another check of the transform. One call site has it already.
Comment 12 Shinichiro Hamaji 2010-04-20 14:13:06 PDT
Created attachment 53878 [details]
Patch v4
Comment 13 Shinichiro Hamaji 2010-04-20 14:20:12 PDT
Thanks for your reviews! I think I addressed all comments. I'll create another patch for the readability of TextIterator's callers.

> +<script charset="UTF-8">
> 
> I don't think this charset attribute has any effect, other than adding
> confusion. Script charset attribute is only used for external scripts.

Thanks for this comment. I didn't know this. I wanted to get 92 by "\\".charCodeAt(0) even if the backslash in <script> tag is transcoded into a currency sign due to future changes. I modified the code so we just use 92.
Comment 14 Darin Adler 2010-04-20 14:58:05 PDT
Comment on attachment 53878 [details]
Patch v4

> -    explicit TextIterator(const Range*, bool emitCharactersBetweenAllVisiblePositions = false, bool enterTextControls = false);
> -    
> +    TextIterator(const Range*, bool emitCharactersBetweenAllVisiblePositions = false, bool enterTextControls = false);

My comment about "explicit" does not apply to this one. This does need "explicit" because all the arguments except the first are optional.

Otherwise looks good to me.
Comment 15 Shinichiro Hamaji 2010-04-20 15:02:23 PDT
Created attachment 53885 [details]
Patch v5
Comment 16 Shinichiro Hamaji 2010-04-20 15:04:40 PDT
> My comment about "explicit" does not apply to this one. This does need
> "explicit" because all the arguments except the first are optional.

Oops! I might be sleeping when I changed this... Thanks for your comment!
Comment 18 Yuzo Fujishima 2010-04-20 22:25:33 PDT
Committed r57952: <http://trac.webkit.org/changeset/57952>