Bug 25487 - windows-949 returned by document.{charset,characterset} is not recognized by Korean advertising servers
Summary: windows-949 returned by document.{charset,characterset} is not recognized by...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Jungshik Shin
URL: http://adx.qubi.com/openx/www/deliver...
Keywords:
: 25488 (view as bug list)
Depends on:
Blocks:
 
Reported: 2009-04-30 12:00 PDT by Jungshik Shin
Modified: 2009-05-05 22:38 PDT (History)
1 user (show)

See Also:


Attachments
patch (3.27 KB, patch)
2009-05-01 10:54 PDT, Jungshik Shin
no flags Details | Formatted Diff | Diff
updated patch (added displayName() to TextEncoding class) (5.11 KB, patch)
2009-05-01 17:13 PDT, Jungshik Shin
ap: review-
Details | Formatted Diff | Diff
updated patch addressing ap's concerns (5.48 KB, patch)
2009-05-02 09:46 PDT, Jungshik Shin
ap: review+
Details | Formatted Diff | Diff
patch same as before except for the one line fix (ready for check-in) (5.47 KB, patch)
2009-05-02 15:29 PDT, Jungshik Shin
jshin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jungshik Shin 2009-04-30 12:00:46 PDT
1. Go to http://www.qubi.com (Korean Railroad) 
2. The ad frame in the middle of the page has garbled characters (UTF-8 interpreted as EUC-KR)

That frame is at http://file.qubi.com/sg_framework/sg_framework_top/season2/adserver_www_010.html
and uses 'document.write()' for <script>.  It has a meta charset declaration ('charset=euc-kr' at the top). 

When constructing a URL for an ad to show, it uses document.charset to pass to the server as a cgi param (charset). 

Because Webkit (that uses ICU) maps euc-kr to windows-949 (its superset), document.charset returns 'windows-949'. An example URL constructed as a result is like this:

http://adx.qubi.com/openx/www/delivery/ajs.php?zoneid=35&cb=1000&charset=windows-949

Unfortunately, it's not recognized by most Korean web servers. 

Firefox (although it treats EUC-KR as windows-949 for converting to Unicode) still uses euc-kr and the following url is constructed and the web server at qubi.com emits EUC-KR strings back.

http://adx.qubi.com/openx/www/delivery/ajs.php?zoneid=35&cb=1000&charset=EUC-KR

http://adx.qubi.com/openx/www/delivery/ajs.php?zoneid=35&cb=1000&charset=UTF-8 also works. 

I was mildly worried about this issue, but bit the bullet (of unforking Chrome's copy of TextCodecICU.cpp to match Webkit trunk) because I thought there'd not be many web servers relying on docuement.charset value. It appears that some ad serving web servers (in Korea) use this technique to show ads in pages in both UTF-8 and EUC-KR

In the past (before Chrome unforked its copy of TextCodecICU.cpp), it modified ICU's charset alias table to treat EUC-KR the same as windows-949 but left alone TextCodecICU.cpp (as a result, document.charset returns 'EUC-KR' in Chrome in the past). Because Safari can't touch the charset alias table on ICU, it's not applicable to Webkit in general. 

A quick (and perhaps dirty fix) would be to add an exception to Document::encoding() to make it return 'EUC-KR' when encoding name is 'windows-949'.  Perhaps, there's a better way to deal with this in TextCodecICU.cpp. I haven't yet given much thought to that possibility. 

It's a Chromium bug 11242 :
http://code.google.com/p/chromium/issues/detail?id=11242
Comment 1 Jungshik Shin 2009-04-30 12:02:34 PDT
*** Bug 25488 has been marked as a duplicate of this bug. ***
Comment 2 Jungshik Shin 2009-04-30 12:08:31 PDT
I forgot to mention that 'windows-949' is not yet an IANA-registered name (unlike windows-12xx and windows-874). So, we can't blame web servers for not recognizing the name. 
Comment 3 Jungshik Shin 2009-05-01 10:54:01 PDT
Created attachment 29943 [details]
patch
Comment 4 Eric Seidel (no email) 2009-05-01 11:04:43 PDT
If this patch is intended for review, you should mark it r=?.  No target reviewer is necessary you can leave that field blank.
Comment 5 Eric Seidel (no email) 2009-05-01 11:06:25 PDT
This seems like a pretty heavy way to make this change.  I don't know how expensive it is to create a TextEncoding object, but it seems there would be a more surgical way to make this change.  Like a TextEncoding::displayName() method or something.
Comment 6 Jungshik Shin 2009-05-01 14:07:15 PDT
Thank you for your suggestion. 

(In reply to comment #5)
> This seems like a pretty heavy way to make this change.  I don't know how
> expensive it is to create a TextEncoding object, but it seems there would be a
> more surgical way to make this change.  Like a TextEncoding::displayName()
> method or something.

Yeah, I've been looking for a better way to solve this issue and that's why I haven't asked for review, yet. displayName() is one of ways I've been considering. 

Constructing TextEncoding is not very expensive (I believe. It takes canonicalizing, which is done by looking up the alias table, and slashAsCurrencyCheck). Anyway, it seems that adding displayName() is a cleaner way than what I have now. (I don't like changing Document.cpp in my patch).

Comment 7 Jungshik Shin 2009-05-01 17:13:37 PDT
Created attachment 29952 [details]
updated patch (added displayName() to TextEncoding class)

added displayName() to TextEncoding class.
Comment 8 Alexey Proskuryakov 2009-05-02 00:44:46 PDT
Comment on attachment 29952 [details]
updated patch (added displayName() to TextEncoding class)

+        * dom/Document.cpp:
+        (WebCore::Document::encoding):
+        * platform/text/TextEncoding.cpp:
+        (WebCore::TextEncoding::displayName):
+        * platform/text/TextEncoding.h:

It is best to describe individual changes to functions in ChangeLog - this is why prepare-ChangeLog generates this list.

+const char* TextEncoding::displayName() const {

The brace should go to next line.

+    // We treat EUC-KR as windows-949 (its superset), but need to expose 
+    // 'EUC-KR' because 'windows-949' is not recognized by most Korean
+    // web servers. 

This comment is somewhat misleading, because most Korean Web servers do not care about how the engine calls this encoding internally, and have no way of knowing that. It is only a certain (quite weird) technique of serving JavaScript ads that is affected.

+    static const char* const a = atomicCanonicalTextEncodingName("windows-949");

The TextEncoding class is used from multiple threads, so using static data without any protection or explicit initialization is dangerous. This method is not currently called from secondary threads, and this problem is already present in other methods of this class, so I'm not requesting that you fix it now - but we should do that some day.

+    return m_name != a ? m_name : "EUC-KR";

A "==" check or an if with early return would be slightly easier to read.

+        const char* displayName() const;

This function name is a bit misleading, it sounds like a name used for rendering. Maybe domName() would be better? If you prefer displayName(), please add a short comment in the header.

+2009-05-01  Jungshik Shin  <jshin@chromium.org>
+
+        Reviewed by NOBODY (OOPS!).
+
+        http://bugs.webkit.org/show_bug.cgi?id=25487
+
+        For euc-kr and other 8bit Korean encodings 
+        (similar to euc-kr/windows-949), make document.charset return
+        EUC-KR instead of windows-949. The latter is not recognized by
+        Korean web servers.

No need to copy a description of WebCore changes to LayoutTests ChangeLog.

+  var charset = document.characterSet;
+  if (!charset)
+    charset = document.charset;
+  if (!charset)
+    charset = document.inputEncoding;
+  document.write("Encoding: " + charset + " (should be EUC-KR)");

Perhaps we should check all of these properties for having a correct value, not just the first one that happens to be non-undefined?

All of the above are nitpicks, but I have one comment that may be substantial. This patch makes WebKit behavior weird and incompatible in the following case:

<meta charset="x-windows-949">
<script>alert(document.characterSet)</script>

Firefox alerts "x-windows-949", WebKit ToT alerts "windows-949", but with this patch, it will be "EUC-KR", which I think is a regression.

There are three options that I can see:
1) Ignore this problem for now, as it's unlikely that there are any servers using x-windows-949 encoding name.
2) Make the document remember the original name for its encoding.
3) Make TextEncoding do that.

If the servers in question supported "w-windows-949", we could also consider making that the canonical name, but unfortunately, adx.qubi.com does not recognize it.

r- for now, as even if you prefer option 1, there is enough nitpicks to justify another quick review round.
Comment 9 Jungshik Shin 2009-05-02 09:46:10 PDT
Created attachment 29959 [details]
updated patch addressing ap's concerns

Thank you for the review. Can you take another look? 

I addressed all the comments except for the last one. I took #1 because x-windows-949 is very rarely - if ever - used by a web server. It's basically my 'invention' (used internally in Gecko not meant to be exposed although it's when 'x-windows-949' or 'ks_c_5601-1987' - a wrong name to use to begin with - is specified by a web server).
Comment 10 Alexey Proskuryakov 2009-05-02 11:12:38 PDT
Comment on attachment 29959 [details]
updated patch addressing ap's concerns

-        return d->encoding().name();
+        return d->encoding().displayName();

Please make sure that WebKit builds before landing - this should be domName().

r=me
Comment 11 Jungshik Shin 2009-05-02 15:29:12 PDT
Created attachment 29961 [details]
patch same as before except for the one line fix (ready for check-in)

Thank you for catching that. I ran the layout test without rebuilding Webkit. Stupid me. I rebuilt and reran the test. 

Transferring r+ because that's just s/displayName/domName/ in Document.cpp
Comment 12 Eric Seidel (no email) 2009-05-05 22:38:16 PDT
Committing to http://svn.webkit.org/repository/webkit/trunk ...
	M	LayoutTests/ChangeLog
	A	LayoutTests/fast/encoding/euckr-name-expected.txt
	A	LayoutTests/fast/encoding/euckr-name.html
	M	WebCore/ChangeLog
	M	WebCore/dom/Document.cpp
	M	WebCore/platform/text/TextEncoding.cpp
	M	WebCore/platform/text/TextEncoding.h
Committed r43279