Bug 17014 - REGRESSION: EUC-CN code A3A0 is mapped to U+E5E5 instead of U+3000
Summary: REGRESSION: EUC-CN code A3A0 is mapped to U+E5E5 instead of U+3000
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit Misc.
Version: 528+ (Nightly build)
Hardware: PC Windows XP
Priority/Severity: P2 Major
Assignee: Alexey Proskuryakov
URL: http://www.wo99.com
Keywords: Regression
Depends on:
Reported: 2008-01-25 22:08 PST by Anantha Keesara
Modified: 2008-01-27 21:04 PST

screenshot (39.52 KB, image/png)
2008-01-25 22:10 PST, Anantha Keesara
Reduction (347 bytes, text/html)
2008-01-25 22:10 PST, Anantha Keesara
proposed fix (4.71 KB, patch)
2008-01-27 09:18 PST, Alexey Proskuryakov
darin: review+

Description Anantha Keesara 2008-01-25 22:08:23 PST
Reproduction steps:
1. Go to www.wo99.com

The password label ("密 码") is displayed with a rectangle in the middle.

A rectangular box should not be displayed in between the characters.

Other browsers:
IE, FF, Opera: work fine.

This looks like a GB2312/GBK converter issue. Firefox's converter maps the two bytes between 密 and 码 in the original document to U+3000, but the ICU converter used by Safari apparently maps them to U+E5E5.

Nightly tested: WebKit r29785 

Attached are a screenshot and a reduction.
Comment 1 Anantha Keesara 2008-01-25 22:10:23 PST
Created attachment 18694 [details]
Comment 2 Anantha Keesara 2008-01-25 22:10:48 PST
Created attachment 18695 [details]
Comment 3 Alexey Proskuryakov 2008-01-27 08:30:20 PST
I do not think that this is a general problem with PUAs, so I am renaming the bug to match its scope as I understand it. Please correct me if I'm wrong.

Some history: A3A0 (or 0300 in unencoded form) was undefined in original GB2312, GB2312-80, GBK or Microsoft's version of the latter. Nevertheless, due to what looks like a bug, browsers mapped it to Unicode U+3000. WebKit also used to have a workaround for this, added for <rdar://problem/3225472> "www.sina.com.cn uses A3A0 for full-width space". This workaround was lost when switching to ICU.
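
For readers unfamiliar with the "unencoded form" mentioned above, here is a minimal Python sketch of the arithmetic. It assumes the usual EUC-CN convention that each byte is the GB2312 row or cell number plus 0xA0, and that GB2312 rows and cells run from 1 to 94. The function names are hypothetical and are not taken from WebKit or ICU.

```python
def euc_cn_to_row_cell(lead, trail):
    """Strip the 0xA0 offset that EUC-CN adds to each GB2312 row/cell byte."""
    return lead - 0xA0, trail - 0xA0


def is_valid_gb2312_position(row, cell):
    # GB2312 is a 94 x 94 grid; rows and cells are both numbered 1 through 94.
    return 1 <= row <= 94 and 1 <= cell <= 94


row, cell = euc_cn_to_row_cell(0xA3, 0xA0)
# Prints "0300 False": cell 0 falls outside the 94 x 94 grid.
print(f"{row:02X}{cell:02X}", is_valid_gb2312_position(row, cell))
```

By contrast, A1A1 (row 1, cell 1) is a valid position, and it is the code GB2312 itself assigns to the ideographic space U+3000.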

GB18030, which is the next iteration of GBK, maps it to a private use character U+E5E5, but browsers do not follow the spec in this.
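
One way to restore the browser-compatible behavior described here is to post-process the decoder's output. This is only a sketch, assuming the underlying converter has already produced U+E5E5 for A3A0. The function name and the set of encoding labels are hypothetical, and this is not necessarily how the attached patch does it.

```python
PRIVATE_USE_E5E5 = "\uE5E5"    # what a GB18030-conformant converter yields for A3A0
IDEOGRAPHIC_SPACE = "\u3000"   # what pages using A3A0 as a full-width space expect

# Hypothetical label set: encodings for which the remapping is applied.
GBK_FAMILY = {"gbk", "gb2312", "gb18030", "euc-cn"}


def remap_fullwidth_space(decoded, encoding_label):
    """Replace U+E5E5 with U+3000 in text decoded from a GBK-family encoding."""
    if encoding_label.strip().lower() in GBK_FAMILY:
        return decoded.replace(PRIVATE_USE_E5E5, IDEOGRAPHIC_SPACE)
    # Other encodings are left untouched, so a real U+E5E5 survives there.
    return decoded
```

The trade-off with this approach is that a page in one of these encodings that really intends the private-use character U+E5E5 would also be remapped.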
Comment 4 Alexey Proskuryakov 2008-01-27 09:18:14 PST
Created attachment 18721 [details]
proposed fix

I am not sure if this code is actually needed in TextCodecMac, but I do not see any compelling reason to remove it, either.
Comment 5 Darin Adler 2008-01-27 19:11:23 PST
Comment on attachment 18721 [details]
proposed fix

Comment 6 Alexey Proskuryakov 2008-01-27 21:04:18 PST
Committed revision 29826.