Bug 24906 - 0x5C of EUC-JP is not Yen Sign but U+005C
Summary: 0x5C of EUC-JP is not Yen Sign but U+005C
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text (show other bugs)
Version: 525.x (Safari 3.2)
Hardware: All All
: P3 Normal
Assignee: Shinichiro Hamaji
URL:
Keywords: InRadar
Depends on: 22339 36010 36011 36419
Blocks:
  Show dependency treegraph
 
Reported: 2009-03-27 19:09 PDT by NARUSE, Yui
Modified: 2010-05-24 11:23 PDT (History)
9 users (show)

See Also:


Attachments
Patch for removing EUC-JP from TextEncoding::backslashAsCurrencyS. (1.54 KB, patch)
2009-03-28 21:25 PDT, NARUSE, Yui
no flags Details | Formatted Diff | Diff
other browsers' results (131.70 KB, application/zip)
2009-03-29 07:23 PDT, Alexey Proskuryakov
no flags Details
Screenshot of IE8 (63.54 KB, image/jpeg)
2009-03-31 14:53 PDT, NARUSE, Yui
no flags Details
Webkit's result (59.83 KB, image/jpeg)
2009-03-31 14:59 PDT, NARUSE, Yui
no flags Details
Search page by 0x5C on IE (119.03 KB, image/jpeg)
2009-04-07 01:04 PDT, NARUSE, Yui
no flags Details
Search by 0x5C on Firefox (308.31 KB, image/png)
2009-04-07 01:04 PDT, NARUSE, Yui
no flags Details
Search by 0x5C on Safari (259.69 KB, image/png)
2009-04-07 01:05 PDT, NARUSE, Yui
no flags Details
Patch v1 (265.24 KB, patch)
2010-03-05 00:05 PST, Shinichiro Hamaji
no flags Details | Formatted Diff | Diff
Patch v2 (242.39 KB, patch)
2010-05-13 05:23 PDT, Shinichiro Hamaji
ap: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description NARUSE, Yui 2009-03-27 19:09:28 PDT
Current webkit converts 0x5C of EUC-JP to U+00A5.

But by IANA Charsets definition, 0x00-0x7F of EUC-JP is US-ASCII.
http://www.iana.org/assignments/character-sets

And by CP51932 (de fact real encoding table of EUC-JP;
Internet Explorer and Firefox use this) also converts
0x00-0x7F of EUC-JP to U+0000-U+007F.
http://legacy-encoding.sourceforge.jp/wiki/index.php?cp51932 (written in Japanese)
http://code.google.com/p/chromium/issues/detail?id=3094 (wriitten in English)

So this webkit's behavior is against both de fact and de jure standard.

The patch for this is following:
Index: WebCore/platform/text/TextEncoding.cpp
===================================================================
--- WebCore/platform/text/TextEncoding.cpp      (revision 42061)
+++ WebCore/platform/text/TextEncoding.cpp      (working copy)
@@ -157,9 +157,13 @@ UChar TextEncoding::backslashAsCurrencySymbol() co

     // The text encodings below treat backslash as a currency symbol.
     // See http://blogs.msdn.com/michkap/archive/2005/09/17/469941.aspx for more information.
+    // But on CP932, 0x5C is mapped as U+005C.
+    // (And showed as Yen Sign glyph on Japanese Font set)
+    // And Shift_JIS should follow CP932.
+    // 0x00-x7F of EUC-JP is US-ASCII.
+    // So 0x5C is U+00A5 only on Shift_JIS_X0213-2000.
     static const char* const a = atomicCanonicalTextEncodingName("Shift_JIS_X0213-2000");
-    static const char* const b = atomicCanonicalTextEncodingName("EUC-JP");
-    return (m_name == a || m_name == b) ? 0x00A5 : '\\';
+    return (m_name == a) ? 0x00A5 : '\\';
 }

 bool TextEncoding::isNonByteBasedEncoding() const
Comment 1 Alexey Proskuryakov 2009-03-28 09:24:16 PDT
Would you be willing to submit the patch as described in <http://webkit.org/coding/contributing.html>?

Also, are there any known pages that are affected by this issue? Could you please provide some examples?
Comment 2 NARUSE, Yui 2009-03-28 21:25:03 PDT
Created attachment 29042 [details]
Patch for removing EUC-JP from TextEncoding::backslashAsCurrencyS.
Comment 3 NARUSE, Yui 2009-03-28 21:26:21 PDT
Created attachment 29043 [details]
Short example of Bug 24906
Comment 4 NARUSE, Yui 2009-03-28 21:28:43 PDT
live example is http://doc.okkez.net/187/view/spec/regexp
This page explains Regexp but all escapes are showed as U+00A5.
Of course you can't execute examples in this page because of this problem.

simple example is attached file.
Comment 5 Alexey Proskuryakov 2009-03-29 07:23:29 PDT
Created attachment 29044 [details]
other browsers' results

I made screenshots of the attachments using browsershots.org, and all IE versions display yen sign, see an archive with results in attachment.

So, it seems that we actually do match IE here.
Comment 6 Alexey Proskuryakov 2009-03-29 08:04:19 PDT
Firefox 3.0.5 / Windows XP also displayed a yen sign on browsershots.org - although that's not what I see with Firefox/Mac, mysteriously.
Comment 7 NARUSE, Yui 2009-03-29 08:09:39 PDT
> I made screenshots of the attachments using browsershots.org, and all IE
> versions display yen sign, see an archive with results in attachment.
>
> So, it seems that we actually do match IE here.

No.
If you see this carefully, you can find the difference of
the glyphs between (0x5C and U+005C) and U+00A5.
Some fonts on Windows shows backslash as Yen Sign.
It is logically backslash but has Yen Sign glyph.
Comment 8 NARUSE, Yui 2009-03-29 08:18:32 PDT
(In reply to comment #6)
> Firefox 3.0.5 / Windows XP also displayed a yen sign on browsershots.org -
> although that's not what I see with Firefox/Mac, mysteriously.

In other words this should be font specific problem.
If you use fonts which shows U+005C as Yen Sign, you can see Yen Sign.
If you not, you can see backslash.
MS Gothic is one of those hacked fonts.

Or I should explicitly specify the font in my exapmle.
Comment 9 Alexey Proskuryakov 2009-03-29 09:03:10 PDT
I see, interesting. So, this change will fix some pages, but break others (those that use e.g. MS Gothic), correct?

It is a tough choice to make - perhaps we'll need to apply font-specific transcoding to text using those fonts.
Comment 10 NARUSE, Yui 2009-03-29 09:26:49 PDT
(In reply to comment #9)
> I see, interesting. So, this change will fix some pages, but break others
> (those that use e.g. MS Gothic), correct?

The glyph of 0x5c of EUC-JP can be both backslash and yen sign.
So they are still correct, even if the glyph may differ on various environments.
# this is specified CP51932 and original eucJP definition.

But 0x5C must be logically backslash: U+005C.
Current implementation breaks this and this affect copy from the page and paste to something.

> It is a tough choice to make - perhaps we'll need to apply font-specific
> transcoding to text using those fonts.

Web Browser is not only for seeing, but for machine.
Hacks like this --- replace characters logically --- affect to machine.
For example, copy texts from the page, scripting on the page, and so on.

We must consider both rendered glyphs and logically displayed characters.


For your information in English document, CP932 (base encoding of CP51932) definition.
http://msdn.microsoft.com/ja-jp/goglobal/cc305152(en-us).aspx
5C = U+005C : REVERSE SOLIDUS (YEN SIGN)
This shows 0x5C of CP932 (Shift-JIS variant) is logically U+005C : REVERSE SOLIDUS,
but uses YEN SIGN glyph.
Comment 11 NARUSE, Yui 2009-03-29 10:21:17 PDT
I clarify that I don't want on-the-fly transcoding.

If the author want that visitors see 0x5C as backslash, they can specify fonts.
If the author want that visitors see 0x5C as yen sign, they can specify another fonts.
If the author didn't mind this, they ignore this.
0x5C of EUC-JP is double meaning and can't decide backslash or yen sign.

If browser hacks their glyph by on-the-fly transcoding, it is a barrier for them.
Comment 12 Alexey Proskuryakov 2009-03-29 10:40:50 PDT
One problem is that on non-Windows platforms, the fonts are not hacked, so a lot of pages that display money amounts using the 0x5c character in MS fonts would presumably suddenly look wrong if we took this patch as is.

Another issue is that if the user searches for "Y5" (I'm using Y instead of yen sign here to please bugzilla), the browser wouldn't find "Y5" on the page - and that's clearly not desired behavior. Similarly, a money amount would suddenly look broken when copied from a Web page to another application that uses a different font.

Technically, it is possible to make the currency glyph substitution happen only for rendering purposes, but then, search and copy will be broken. This is what makes me think that per-font transcoding is the only practical way to improve the current behavior.
Comment 13 NARUSE, Yui 2009-03-29 11:42:06 PDT
(In reply to comment #12)
> One problem is that on non-Windows platforms, the fonts are not hacked, so a
> lot of pages that display money amounts using the 0x5c character in MS fonts
> would presumably suddenly look wrong if we took this patch as is.

Non-japanese user's environment, it may YES.
But japanese users use fonts for japanese.

For example, IPA font is 'hacked' font.
http://ossipedia.ipa.go.jp/ipafont/
And VL Gothic is normal font.
http://dicey.org/vlgothic/

We can choose suitable font.

This is because, this problem is derived from JIS X 0201.
JIS X 0201 is a variant of ISO 646 and in JIS X 0201, 0x5C is Yen Sign.
So this is long and wide problem and we have both backslash fonts and yen sign fonts.

> Another issue is that if the user searches for "Y5" (I'm using Y instead of yen
> sign here to please bugzilla), the browser wouldn't find "Y5" on the page - and
> that's clearly not desired behavior. Similarly, a money amount would suddenly
> look broken when copied from a Web page to another application that uses a
> different font.

Japanese wide spread keyboard, JP106 has keys which have yen sign image
and backslash image.
http://www.stanford.edu/class/cs140/projects/pintos/specs/kbd/jp106.jpg
But both of them are assigned to 0x5C of CP932 (this is mapped to U+005C).

So japanese usually searchs the character (backslash or yen sign) by U+005C.
In other words the problem ``the browser wouldn't find "Y5" on the page''
is *current* problem, not another issue.

# on Japanese Windows Internet Explorer,
# both U+005C and U+00A5 are fallback to 0x5C of CP932.
# So 0x5C matches both U+005C and U+00A5.

> Technically, it is possible to make the currency glyph substitution happen only
> for rendering purposes, but then, search and copy will be broken. This is what
> makes me think that per-font transcoding is the only practical way to improve
> the current behavior.

I want no substitution and no per-font transcoding.
We know that 0x5C of EUC-JP may be backslash (US-ASCII) or yen sign (JIS X 0201).
Moreover in japan, U+005C is also backslash and yen sign.
This is not only the problem of EUC-JP but Shift_JIS and Unicode in Japan have the same problem.
But we have workarrounds.

Such substitutions and per-font transcodings only make the problem more difficult.
0x5C of EUC-JP is not yen sign, not backslash.
It unified both.
So DON'T separate them by the browser.
Comment 14 NARUSE, Yui 2009-03-29 12:08:18 PDT
In short:

For historical reason, in Japan 0x5C and U+005C is ambiguous.
They may be backslash and yen sign.

But we have workarounds for this problem.

However webkit make 0x5C of EUC-JP as U+00A5.
U+00A5 is absolutely YEN SIGN.

This is separete backslash from 0x5C
and against both de fact and de jure standards,
and breaks copy and search.

Desirable behavior is adjust to IE and Firefox:
maps 0x5C to U+005C and don't hack moreover.
Comment 15 NARUSE, Yui 2009-03-31 14:53:57 PDT
Created attachment 29133 [details]
Screenshot of IE8
Comment 16 NARUSE, Yui 2009-03-31 14:54:58 PDT
Created attachment 29134 [details]
short example v2

add both MS Gothic and Arial.
Comment 17 NARUSE, Yui 2009-03-31 14:59:02 PDT
Created attachment 29135 [details]
Webkit's result

On webkit both MS Gothic and Arial are showed as yen sign.
Comment 18 NARUSE, Yui 2009-03-31 15:05:10 PDT
I added the new example which specify MS Gothic and Arial.
If you have both fonts, you can see the glyph of 0x5C and U+005C is surprisingly
differ between these fonts. (Screenshot of IE8)

But on webkit all of them are yen sign.


Moreover on webkit the character which is created from character reference
&#x005C; is also replaced as U+00A5.
This seems also a bug.
Comment 19 NARUSE, Yui 2009-04-01 07:39:47 PDT
FYI
3.1.1 Details of Problems (4) The Yen Sign problem
http://home.m05.itscom.net/numa/cde/ucs-conv-e/ucs-conv-e.html#ch3_1_1
# defined above is eucJP-ms; it is different from CP51932
Comment 20 Jungshik Shin 2009-04-06 18:00:04 PDT
I more or less agree with the reporter. Actually, I'd rather get rid of the 'hack' for 0x5C even for SJIS. Anyway, at least for EUC-JP, we'd better remove that code. 

Alexey, 0x5C in EUC-JP (and EUC-KR) is always ambiguous (Linux/Unix people like to see 0x5C to be treated as U+005C) and Japanese and Korean are accustomed to see 0x5C rendered as backslash or Yen-sign depending on the font being used. So, we don't have to worry about 'breaking' EUC-JP pages. It's perfectly ok to show backslash for 0x5C in EUC-JP (and EUC-KR). I'd even say it's ok for Shift_JIS. 

Comment 21 Alexey Proskuryakov 2009-04-06 23:37:37 PDT
Internally, I'm getting a different input - it would not be acceptable to break the display of yen sign. Really, if it were ok, Microsoft would have removed this quirk from its fonts years ago, I'm sure it's a pain for them to have it.

Currently, this bug doesn't have any examples of pages that work in IE properly, but do not work in Safari (http://doc.okkez.net/187/view/spec/regexp looks in IE exactly as it does in Safari, even though the rendering can be considered wrong). Without any known affected sites, the priority of this issue is very low, see <http://webkit.org/quality/bugpriorities.html>.
Comment 22 NARUSE, Yui 2009-04-07 01:04:12 PDT
Created attachment 29301 [details]
Search page by 0x5C on IE
Comment 23 NARUSE, Yui 2009-04-07 01:04:49 PDT
Created attachment 29302 [details]
Search by 0x5C on Firefox
Comment 24 NARUSE, Yui 2009-04-07 01:05:42 PDT
Created attachment 29303 [details]
Search by 0x5C on Safari
Comment 25 NARUSE, Yui 2009-04-07 02:01:36 PDT
(In reply to comment #21)
> Internally, I'm getting a different input - it would not be acceptable to break
> the display of yen sign.

EUC-JP is defined on Unix context.
EUC is the framework for multi-national language system.
EUC has 4 codesets, 0 is always US-ASCII and others are language specific.
(desribed in UNIX System V リリース 4 国際化機能 (MNLS) 機能説明書)
http://www.amazon.co.jp/gp/switch-language/product/4320097165/ref=dp_change_lang?ie=UTF8&language=en%5FJP
(Historically, implementations are allowed to show 0x5C as yen sign.
 But specs says that 0x5C is US-ASCII)

If you are saying about JIS version of EUC:
EUC-JISX0213 (and EUC-JP-2004) is defined by JIS.
But they are also defined 7bit range as ISO/IEC 646 IRV.


What specs they read who says that 0x5C's glyph must be yen sign.
And if there is a certain spec, is it more authorized than IANA Character Set and UNIX System V Release V MNLS and JIS?

> Really, if it were ok, Microsoft would have removed
> this quirk from its fonts years ago, I'm sure it's a pain for them to have it.

I know you love yen sign glpyh, I Japanese thank you for it.
But do you imagine data which expects that 0x5C is backslash glpyh?
I explained above, many standards defines 7bit range is US-ASCII,
and many data expects it.
(backslash is usually used as escape in EUC-JP text,
 so glyph is less important than Yen sign.
 But logically, this is as important as it.)
For example following page, you'll think "\n" should be backslash-n.
http://cms.phys.s.u-tokyo.ac.jp/~naoki/CIPINTRO/CBEG/cbeg6.html

If Microsoft removes this quirk form,
it affect not only 0x5C of EUC-JP but U+005C (see my example).
And it has come if you use Arial Unicode MS.
http://en.wikipedia.org/wiki/Arial_Unicode_MS
Moreover on Ubuntu (uses VL Gothic) or Mac (uses Hiragino) it is backslash glyph.

> Currently, this bug doesn't have any examples of pages that work in IE
> properly, but do not work in Safari (http://doc.okkez.net/187/view/spec/regexp
> looks in IE exactly as it does in Safari, even though the rendering can be
> considered wrong).

I add more example on IE, on Firefox of Mac, on Safari of Mac.
Do you need more explanation or examples?
Comment 26 Alexey Proskuryakov 2009-04-07 02:12:01 PDT
Yes, you still haven't provided any examples of pages that Safari displays worse than IE does.
Comment 27 NARUSE, Yui 2009-04-07 02:54:27 PDT
If you want to show 0x5C as yen sign,
why are you particular about Shift_JISX0213 and EUC-JP?

Any other encoding: Shift_JIS, UTF-8, ISO-2022-JP and so on have same problem.
I can't understand you specialize EUC-JP.
Comment 28 Alexey Proskuryakov 2009-04-07 03:52:27 PDT
Yes, it does feel strange that the other Japanese encodings aren't included in the test.

I think that you have already demonstrated that we should check font name, not encoding name. Whether this should be a decoding or rendering quirk is a separate question, and both solutions have their downsides. Some of the downsides are purely theoretical (such as a script finding an unexpected character code in element.innerHTML value). Others are practical, but familiar to many users (such as being unable to use search with these characters reliably).

But without any known pages (ideally, from major sites) that we render worse than Internet Explorer, it would be quite hard to justify making changes in either direction, even as we agree that the current solution is imperfect.
Comment 29 Alexey Proskuryakov 2009-04-07 04:02:49 PDT
> (backslash is usually used as escape in EUC-JP text,
> so glyph is less important than Yen sign.

That's a very important observation, in my opinion! Indeed, erring on the side of yen sign is usually OK, because non-programmers are unlikely to encounter texts that use the backslash often. Programmers are less numerous than non-programmers, and more likely to be able to cope with such a glitch.
Comment 30 Jungshik Shin 2009-04-07 14:09:13 PDT
(In reply to comment #21)
> Internally, I'm getting a different input - it would not be acceptable to break
> the display of yen sign. Really, if it were ok, Microsoft would have removed
> this quirk from its fonts years ago, I'm sure it's a pain for them to have it.

MS could have fixed the problem a long time ago at least when they released Windows XP (i.e. when their main line OS became fully Unicode-aware/capable. Win 2k and NT4 were also Unicode-capable/aware, but Win 9x/ME were a lot more widespread than Win 2k/NT4). See my argument on the issue at http://unicode.org/mail-arch/unicode-ml/y2002-m10/0340.html

JA/KO fonts shipped with Windows are not compliant to the Unicode standard. I don't think even MS people would say that the identity of U+005C depends on a font being used to render it. However, that's exactly what they do with fonts like MS (P) Gothic/MS (P) Mincho and Gulim/Batang(che). 

<span style="font-family: MS P Gothic">\</span>
<span style="font-family: Gulim">\</span>

The above snippet (in any encoding, UTF-8,16, EUC-JP, EUC-KR) rendered on Windows with two fonts installed will have 'Yen' and 'Won' glyphs in place of '\'. 

> Currently, this bug doesn't have any examples of pages that work in IE
> properly, but do not work in Safari (http://doc.okkez.net/187/view/spec/regexp
> looks in IE exactly as it does in Safari, even though the rendering can be
> considered wrong). Without any known affected sites, the priority of this issue
> is very low, see <http://webkit.org/quality/bugpriorities.html>.
> 

Comment 31 Jungshik Shin 2009-04-07 14:31:32 PDT
(In reply to comment #29)
> > (backslash is usually used as escape in EUC-JP text,
> > so glyph is less important than Yen sign.
> 
> That's a very important observation, in my opinion! Indeed, erring on the side
> of yen sign is usually OK, because non-programmers are unlikely to encounter
> texts that use the backslash often. Programmers are less numerous than
> non-programmers, and more likely to be able to cope with such a glitch.


How about copy'n'paste of a Windows file path? The pasted result will have Yen sign (U+00A5) instead of back slash (U+005C) and will not be a valid Windows file path any more. Ok, it's not so common to copy'n'paste a file path, but it'd be rather annoying when it happens to you.

LaTeX users are not programmers (I wouldn't say there are many, but their number is likely to be larger than programmers :-) ), but they'd be surprised to find that an example TeX snippet copied from an EUC-JP page to a their TeX file results in numerous errors.  

IE does not have this problem (neither does it have 'Find in Page' issue as pointed out earlier) as you already figured out. 



Comment 32 Jungshik Shin 2009-04-08 15:02:54 PDT
Created attachment 29347 [details]
another example demonstrating what IE and Firefox do

IE does not do anything fancy, but it just "relies on" (perhaps, 'leave it up to" is better than 'relies on') the hack 'built-into' Windows Japanese fonts. 

In addition, when no font is specified for EUC-JP (Shift_JIS, ISO-2022-JP) documents, it uses the default Japanese font (MS PGothic). That's what my preliminary patch for bug 18085 does and Firefox does. Chrome 1.x/2.0beta have that patch but does not benefit from it because U+005C is replaced with U+00A5 by webkit before the actual text drawing code path is hit.

The sample html file I'm uploading illustrates that point. When no font is specified, IE uses MS P Gothic (for EUC-JP documents) as shown by the identical rendering results of the first two sections. Because MS P Gothic has the Yen sign glyph for U+005C, we see the Yen sign glyph although the underlying character remains U+005C. 

When arial is specified, the glyph for U+005C comes from Arial and its glyph is that of the back slash (not that of Yen sign).
Comment 33 Alexey Proskuryakov 2009-07-21 14:18:40 PDT
<rdar://problem/4174448>
Comment 34 Jungshik Shin 2009-07-21 15:44:12 PDT
Hmm, the code in question is still there (platform/text/TextEncoding.{cpp,h}) as well as the callers of displayString* ), but somehow I don't see the problem any more in Safari 4.0.2 and the Webkit trunk + Safari on Mac. 

Very strange.. Am I hallucinating? 

Anyway, if we can agree on not treating 0x5C in Japanese encodings as U+00A5, we can remove displayString* and its callers as well. 

BTW, it's http://crbug.com/9696
Comment 35 Shinichiro Hamaji 2010-01-12 02:47:12 PST
Jungshik, I think this bug is still live. I'm new in this bug and the detail may have changed though. I confirmed this bug by

1. Access http://shinh.skr.jp/tmp/backslash_euc.html
2. Copy all text and paste it into the Windows' command prompt
3. The command doesn't work :(

I agree with Jungshik that this bug is terrible when we copy texts. For example,  fairly large number of non-programmers post a blog about Windows' BAT file as a tip to improve their life. Bookmarklets can be another example. So, I think we should remove this hack at least for copying (why TextEncoding::*display*(String|Buffer) are used for copying? :-).


As for displaying, I still dislike this hack, but I can agree some websites may expect 0x5c becomes a yen sign. However, I think this hack isn't working at all for websites where charset=Shift_JIS maybe because we are testing the encoding is "Shift_JIS_X0213-2000", not "Shift_JIS". I created two HTMLs

http://shinh.skr.jp/tmp/backslash_sjis.html (charset=Shift_JIS)
http://shinh.skr.jp/tmp/backslash_sjis0213.html (charset=Shift_JIS_X0213-2000)

On Mac, I don't see yen signs in the former HTML and I see yen signs in the latter. On Windows and Linux, no conversions are done in both HTML, though I see yen signs in both HTMLs on Windows because of the hacked font. So, it seems this hack in question isn't working at all for websites whose charset is Shift_JIS. I guess most websites which expect 0x5c is a yen sign would use Shift_JIS...


By the way, Alexey, it's very difficult to find a major website where this bug is a big issue because major websites may

- prefer UTF8
- know the issues around 0x5c and use U+FF3C and U+FFE5 instead

but I find a bunch of complaints about this behavior in blogs and this bug actually bit me more than ten times... Please note that some major SNS and blog services in Japanese are using EUC-JP and the users of them cannot workaround this issue as they cannot change the charset in the service.


Summary: I believe the best fix is just removing this hack, but if we want to keep this hack, we may want to change the code so that

1. Backslashes aren't changed when a user copies a text.
2. Add "Shift_JIS" into the list of hacked encodings.
3. (optional) Only work on Mac (I guess Windows people have hacked fonts and they don't need this hack, and Linux people can convert backslashes into yen signs in their brain :).

By the way, is it easy to do the step 1 above?
Comment 36 Alexey Proskuryakov 2010-01-12 08:40:22 PST
> 1. Backslashes aren't changed when a user copies a text.

Since the behavior on Windows depends on which font is in use, I doubt we can make any generic change that wouldn't break more than it would fix. Bug 22339 would give us a way to perform font-specific adjustments.
Comment 37 Shinichiro Hamaji 2010-01-14 00:58:22 PST
> Since the behavior on Windows depends on which font is in use, I doubt we can
> make any generic change that wouldn't break more than it would fix. Bug 22339
> would give us a way to perform font-specific adjustments.

Thanks for the comment!

Let me clarify the requirement of font-specific
transcoding. Sorry, but I'm not sure if I understand your suggestion
correctly...

1. if the font known hacked (e.g., MS P Gothic), do backslash=>yen
   transcoding.
2. we do the transcoding described in 1 only on Mac. (I don't see any
   reason we should do this hack on Windows, please correct me if I'm
   wrong)
3. stop encoding-based transcoding completely.


I've noticed the 2 may be objected by this reason:

> Similarly, a money amount would suddenly look broken when copied
> from a Web page to another application that uses a different font.

I think this isn't an issue of a browser and even IE doesn't support
this behavior. Please see the following URL with IE, copy the filepath
from the EUC iframe, and paste it into the textarea in UTF8 iframe. I
think you will see yen signs become backslashes.

http://shinh.skr.jp/tmp/backslash_iframe.html
Comment 38 Alexey Proskuryakov 2010-01-14 08:44:03 PST
Yes, this correctly re-phrases my suggestion. This is indeed not as important to do on Windows, but I think that would still be nice for consistency.

Per comment 32, we should still document encoding into consideration, but only because it affects default font selection in IE.
Comment 39 Shinichiro Hamaji 2010-03-05 00:05:05 PST
Created attachment 50090 [details]
Patch v1
Comment 40 Shinichiro Hamaji 2010-03-05 00:06:17 PST
I think I followed the discussion on this bug. Some description about this patch:

- If a reviewer feels this patch is big, please let me know and I'll split this.
- I referred the patch in Bug 22339 to write transcoder/FontTranscoder. I hope this can be reused by EOT.
- When a font explicitly specified by an HTML author is known non-hacked, we may want not to transcode even if the encoding is Japanese encodings. I couldn't figure out how we can know a Font object is a system fallback font or explicitly specified. I put a FIXME in FontTranscoder.cpp:71 .
- I added "Shift_JIS", "ISO-2022-JP", and "ISO-2022-JP-2" as hacked-backslash encodings as it seems Firefox is doing the same transcoding.
- Whenever users copies U+005C, they will get non-transcoded characters.
- I think we cannot test this patch using dumpAsText with current test infrastructure. The dumpAsText() dumps texts in DOM objects, but we need texts in RenderText to test this patch.

Any suggestions will be welcomed!
Comment 41 Alexey Proskuryakov 2010-03-08 17:30:53 PST
I'm not qualified to review this, but the above description sounds good to me. Not sure how bad the problem with fallback/explicit fonts is.

Does searching work correctly with this patch, so that a yen sign as visible on a web page can always be found by searching for a yen sign? I think it should work, but nothing beats actual testing.
Comment 42 Darin Adler 2010-03-08 17:33:04 PST
Comment on attachment 50090 [details]
Patch v1

The early warning system was unable to apply the patch because the Xcode project part no longer applies. It would be best to have a new patch so we can see what EWS has to say.
Comment 43 Darin Adler 2010-03-08 17:33:56 PST
(In reply to comment #41)
> I'm not qualified to review this

What expertise do you think it will take to review?
Comment 44 Darin Adler 2010-03-08 17:53:13 PST
Comment on attachment 50090 [details]
Patch v1

Looks good.

> -    String str = renderer->text();
> +    String str = renderer->nonDisplayString();

I don't understand the meaning here. What is a "non-display string"? A display string is a string intended for display to the end user. But "non-display" doesn't tell me anything.

Why would this code path in particular use nonDisplayString?

> +    , m_needTranscode(false)

For booleans data members we normally use words that work in a sentence like this "font <xxx>". But "font need transcode" is not good grammar. I suggest naming this m_needsTranscoding.

> +    bool needTranscode() const { return m_needTranscode; }

And this should be needsTranscoding() for the same reason.

> -    static const char* const a = atomicCanonicalTextEncodingName("Shift_JIS_X0213-2000");
> -    static const char* const b = atomicCanonicalTextEncodingName("EUC-JP");
> -    return (m_name == a || m_name == b) ? 0x00A5 : '\\';
> +    DEFINE_STATIC_LOCAL(HashSet<const char*>, set, ());
> +    if (set.isEmpty()) {
> +        addEncodingName(set, "Shift_JIS");
> +        addEncodingName(set, "Shift_JIS_X0213-2000");
> +        addEncodingName(set, "EUC-JP");
> +        addEncodingName(set, "ISO-2022-JP");
> +        addEncodingName(set, "ISO-2022-JP-2");
> +    }
> +    return m_name && set.contains(m_name) ? 0x00A5 : '\\';

This fix seems separate from the rest of the patch and should be landed separately if possible. And needs a test case.

> +        bool useBackslashAsCurrencySymbol() const
> +        {
> +            return m_backslashAsCurrencySymbol != '\\';
> +        }

Again, boolean functions need to finish a sentence "text encoding <xxx>" and "text encoding use backslash as currency symbol" does not make sense. It should be usesBackslashAsCurrencySymbol().

>          PassRefPtr<StringImpl> displayString(PassRefPtr<StringImpl> str) const
>          {
> -            if (m_backslashAsCurrencySymbol == '\\' || !str)
> +            if (!useBackslashAsCurrencySymbol() || !str)
>                  return str;
>              return str->replace('\\', m_backslashAsCurrencySymbol);
>          }
>          void displayBuffer(UChar* characters, unsigned len) const
>          {
> -            if (m_backslashAsCurrencySymbol == '\\')
> +            if (!useBackslashAsCurrencySymbol())

I don't see how these changes are related to the major change here. The code is the same as before, but uses a helper function. This should be landed separately so the patch isn't so huge. And doesn't seem important.

> +#include "config.h"
> +
> +#include "FontTranscoder.h"

There should not be a space here.

> +    // IE's default font for Japanese encodings is known hacked.

This comment is unclear. What specifically do you mean by "known hacked"? I think you should say specifically in a way that is specific about what it does rather than making a value judgement "hacked".

> +    return NoConvertion;

There is no such word as "convertion". Instead it should be "conversion".

> +bool FontTranscoder::needTranscode(const AtomicString& fontFamily, const TextEncoding* encoding, bool* outUseOriginalTextAsNonDisplayString) const

Again, functions that return booleans should ideally be "font transcoder <xxx>" and "font transcoder need transcode" does not make sense. I don't have a specific name to suggest.

> +FontTranscoder* fontTranscoder()
> +{
> +    static FontTranscoder* transcoder = new FontTranscoder();
> +    return transcoder;
> +}

Normally we would return a reference from a function like this instead of a pointer. Also, it's typically "new FontTranscoder" without parentheses.

> +#include "AtomicString.h"
> +#include "AtomicStringHash.h"
> +#include "PlatformString.h"
> +#include <wtf/HashMap.h>

There is no need to include "PlatformString.h" if you are also including "AtomicString.h".

> +class FontTranscoder {

This should derive from Noncopyable.

> +    bool needTranscode(const AtomicString& fontFamily, const TextEncoding* encoding = 0, bool* outUseOriginalTextAsNonDisplayString = 0) const;

The argument name "encoding" should be omitted here.

> +    ConverterType converterType(const AtomicString& fontFamily, const TextEncoding* encoding) const;

The argument name "encoding" should be omitted here.

> +     , m_needTranscode(false)

"render text need transcode" is not good. But m_needsTranscoding might be OK.

> +     , m_useOriginalTextAsNonDisplayString(false)

I still don't understand what a non-display string is, so I can't suggest a better name here.

> +        String transcoded(m_text);
> +        fontTranscoder()->convert(transcoded, fontFamily, encoding);
> +        m_text = transcoded.impl();

We really should change m_text to be a String instead of a RefPtr<StringImpl>. The two are almost the same thing.

We need test cases covering *searching* for the symbols in these various encodings and fonts. In general, I would like to see a test case that covers each call site. The search code is one example that uses the TextIterator call site.

review- because I'm sure you'll want to change at least one of the things I mentioned above.
Comment 45 Alexey Proskuryakov 2010-03-09 14:09:19 PST
+++ b/LayoutTests/editing/pasteboard/copy-backslash-from-euc.html

Can this test in particular be dumpAsText()?

It also seems useful to test copying unstyled text (i.e. copying from a form input).

+Copied text must be a backslash: <input id="result">

I think that this could use a better worded instruction for running the test manually. One doesn't know that the input contains copied text, so the text could be just "Should be a backslash: ". Just like you have in other tests.

Do these tests pass in IE?

     // The text encodings below treat backslash as a currency symbol.
     // See http://blogs.msdn.com/michkap/archive/2005/09/17/469941.aspx for more information.
-    static const char* const a = atomicCanonicalTextEncodingName("Shift_JIS_X0213-2000");
-    static const char* const b = atomicCanonicalTextEncodingName("EUC-JP");
-    return (m_name == a || m_name == b) ? 0x00A5 : '\\';
+    DEFINE_STATIC_LOCAL(HashSet<const char*>, set, ());

The comment above the changed code should say "yen" specifically. I think there is (was?) a similar problem with Korean encodings, and it's not handled here.

+    return m_name && set.contains(m_name) ? 0x00A5 : '\\';

I'd have added parentheses around the condition. I never remember precedence rules, and that makes me write unambiguous code.

+    UChar unicodeNameMSPGothic[] = {0xFF2D, 0xFF33, 0x0020, 0xFF30, 0x30B4, 0x30B7, 0x30C3, 0x30AF, 0};
+    m_converterTypes.add(AtomicString(unicodeNameMSPGothic), BackslashToYenSign);

It doesn't matter much in practice, but you could as easily avoid using null-terminated strings:

    UChar unicodeNameMSPGothic[] = {0xFF2D, 0xFF33, 0x0020, 0xFF30, 0x30B4, 0x30B7, 0x30C3, 0x30AF, 0};
    m_converterTypes.add(AtomicString(unicodeNameMSPGothic, sizeof(unicodeNameMSPGothic) / sizeof(UChar)), BackslashToYenSign);

+    // FIXME: We should not do this when a user explicitly specifies a
+    //        non-hacked font.

No need to wrap the FIXME. But even in cases where comments are long enough to wrap, we don't try to indent.

How bad is this issue in practice? I'm not sure if I can make a test case just from this description.

+        static const UChar yenSign = 0x00A5;

This should go to CharacterNames.h.

+    switch (converterType(fontFamily, encoding)) {
+    case BackslashToYenSign: {
<...>
+    case NoConvertion:
+    default:
+        ASSERT_NOT_REACHED();
+    }

Using a switch for two cases, on of which can not be taken, is somewhat unusual. Is there a reason you did that, and not just ASSERT+if()?

+    static FontTranscoder* transcoder = new FontTranscoder();

You should probably use DEFINE_LOCAL_STATIC here.

+#include "AtomicString.h"
+#include "AtomicStringHash.h"
+#include "PlatformString.h"

Not only there is no need to include PlatformString.h, but there is no need to include AtomicString.h.

+    friend FontTranscoder* fontTranscoder();

We usually put friend declarations in private section, for what it's worth.

It's great to see transcoding finally move forward!
Comment 46 mitz 2010-03-10 10:56:54 PST
1. Do you also need to use “nonDisplayString()” in SimplifiedBackwardsTextIterator::handleTextNode() ?
2. In the text-transform + transcoding case, you just fall back on using the (transcoded) text() as the nonDisplayString(). Perhaps in this case you nonDisplayString() could apply the transform on the fly to the original string (so it will return a transformed, yet not transcoded, string).
3. Is there harm in transcoding when running on Windows and the font being used is actually one of the fonts that have backslash mapped to the Yen sign?
Comment 47 Shinichiro Hamaji 2010-03-11 05:16:54 PST
Thanks folks for reviews!

I think the most serious issue in my previous patch was searching, which Alexey
mentioned. Actually, with my previous patch, yen signs weren't found by
searching for a yen sign. Thanks for this catch.

As for searching, I think the current behavior (displayed yen signs were found
by searching a yen sign, but not found by a backslash) isn't the best as the
attachments in this bug ("Search by 0x5C on XXX") mention. I guess the best
behavior would be displayed yen signs, which are originally backslashes, can be
found by either a yen sign or a backslash. I guess we can normalize all yen
signs into backslashes to implement this behavior easily.

Anyway, I think I should avoid changing all behaviors at once. I won't change
the behavior of searching before I finish the copying issue and the font based
transcoding. I'll work on the copying issue first. I created a few rafactoring
patches to make the main patches smaller (Bug 36010 and Bug 36011).

I think I can post another patch with replies for other comments tomorrow.
Sorry for the latency.
Comment 48 Shinichiro Hamaji 2010-05-13 05:23:09 PDT
Created attachment 55964 [details]
Patch v2
Comment 49 Shinichiro Hamaji 2010-05-13 05:26:21 PDT
(In reply to comment #48)
> Created an attachment (id=55964) [details]
> Patch v2

Sorry for the latency... but finally :) I think I've addressed all comments.

> Does searching work correctly with this patch, so that a yen sign as visible on a web page can always be found by searching for a yen sign? I think it should work, but nothing beats actual testing.

Thanks for this comment. I've added a test case for searching.

> > -    static const char* const a = atomicCanonicalTextEncodingName("Shift_JIS_X0213-2000");
> > -    static const char* const b = atomicCanonicalTextEncodingName("EUC-JP");
> > -    return (m_name == a || m_name == b) ? 0x00A5 : '\\';
> > +    DEFINE_STATIC_LOCAL(HashSet<const char*>, set, ());
> > +    if (set.isEmpty()) {
> > +        addEncodingName(set, "Shift_JIS");
> > +        addEncodingName(set, "Shift_JIS_X0213-2000");
> > +        addEncodingName(set, "EUC-JP");
> > +        addEncodingName(set, "ISO-2022-JP");
> > +        addEncodingName(set, "ISO-2022-JP-2");
> > +    }
> > +    return m_name && set.contains(m_name) ? 0x00A5 : '\\';
> 
> This fix seems separate from the rest of the patch and should be landed separately if possible. And needs a test case.

Yeah, I'll do in a separated patch.

> +    // FIXME: We should not do this when a user explicitly specifies a
> +    //        non-hacked font.
> 
> No need to wrap the FIXME. But even in cases where comments are long enough to wrap, we don't try to indent.
> 
> How bad is this issue in practice? I'm not sure if I can make a test case just from this description.

I updated the comment. I think it's a kind of rare case, but if an author of a web page explicitly uses english fonts to show backslashes, this change can break the author's intention. I'm not sure if it's easy to resolve this issue. The ideal behavior would be

1. If a document explicitly specifies a japanese font, do the backslash => yen sign transcoding
2. If a document explicitly specifies a non-japanese font, don't the transcoding
3. If IE's default font for the document encoding is japanese, do the transcoding
4. Otherwise, don't the transcoding

I didn't implement the step 2 because I wasn't sure if it's easy to implement. I was not sure if it's easy to check a Font is specified explicitly by CSS or not.

> +    switch (converterType(fontFamily, encoding)) {
> +    case BackslashToYenSign: {
> <...>
> +    case NoConvertion:
> +    default:
> +        ASSERT_NOT_REACHED();
> +    }
> 
> Using a switch for two cases, on of which can not be taken, is somewhat unusual. Is there a reason you did that, and not just ASSERT+if()?

I guessed we might be able to re-use this FontTranscoder class in Bug 22339 (EOT font issue) by adding enum values. That's also the reason why I added an enum which has only two values. If you want to keep code simple for now, please let me know and I'm happy to change to ASSERT+if() code.

> Is there harm in transcoding when running on Windows and the font being used is actually one of the fonts that have backslash mapped to the Yen sign?

I think such font has yen sign glyphs for both U+005C and U+0x00A5, so there should be no particular issues for such fonts.
Comment 50 Alexey Proskuryakov 2010-05-14 14:20:19 PDT
Comment on attachment 55964 [details]
Patch v2

+    // FIXME: We don't need transcoding when the document explicitly
+    // specifies a font which doesn't change backslashes into yen signs.

This does seem like something that could cause trouble indeed.

+FontTranscoder& fontTranscoder()
+{
+    static FontTranscoder* transcoder = new FontTranscoder;
+    return *transcoder;
+}

This should use DEFINE_STATIC_LOCAL. I'm not really sure why we're still doing this, but that's what we are doing.

r=me. You may want to give Darin and Dan some time to comment, if they want to.
Comment 51 Adam Barth 2010-05-14 23:32:04 PDT
Attachment 55964 [details] was posted by a committer and has review+, assigning to Shinichiro Hamaji for commit.
Comment 52 Shinichiro Hamaji 2010-05-16 22:33:52 PDT
Thanks Alexey for your review! Yaeh, I'll wait for a few days.

> This should use DEFINE_STATIC_LOCAL. I'm not really sure why we're still doing this, but that's what we are doing.

Thanks I'll do this change before I land.
Comment 53 Shinichiro Hamaji 2010-05-23 23:22:36 PDT
Committed r60063: <http://trac.webkit.org/changeset/60063>
Comment 54 Shinichiro Hamaji 2010-05-24 00:12:52 PDT
Committed r60064: <http://trac.webkit.org/changeset/60064>
Comment 55 Eric Seidel (no email) 2010-05-24 11:00:23 PDT
Looks like the Qt linux bot is failing after this checkin:
http://build.webkit.org/results/Qt%20Linux%20Release/r60073%20(12283)/results.html
Comment 56 Shinichiro Hamaji 2010-05-24 11:11:08 PDT
Committed r60074: <http://trac.webkit.org/changeset/60074>
Comment 57 Shinichiro Hamaji 2010-05-24 11:12:50 PDT
(In reply to comment #55)
> Looks like the Qt linux bot is failing after this checkin:
> http://build.webkit.org/results/Qt%20Linux%20Release/r60073%20(12283)/results.html

Oops. Sorry about this... Hmm I wish reftests.
Comment 58 Shinichiro Hamaji 2010-05-24 11:23:49 PDT
> Oops. Sorry about this... Hmm I wish reftests.

It seems Qt bot is still red, but maybe due to flakiness. I think my fix is working as the two failures have gone.