RESOLVED FIXED 25487
windows-949 returned by document.{charset,characterset} is not recognized by Korean advertising servers
https://bugs.webkit.org/show_bug.cgi?id=25487
Summary windows-949 returned by document.{charset,characterset} is not recognized by...
Jungshik Shin
Reported 2009-04-30 12:00:46 PDT
1. Go to http://www.qubi.com (Korean Railroad) 2. The ad frame in the middle of the page has garbled characters (UTF-8 interpreted as EUC-KR) That frame is at http://file.qubi.com/sg_framework/sg_framework_top/season2/adserver_www_010.html and uses 'document.write()' for <script>. It has a meta charset declaration ('charset=euc-kr' at the top). When constructing a URL for an ad to show, it uses document.charset to pass to the server as a cgi param (charset). Because Webkit (that uses ICU) maps euc-kr to windows-949 (its superset), document.charset returns 'windows-949'. An example URL constructed as a result is like this: http://adx.qubi.com/openx/www/delivery/ajs.php?zoneid=35&cb=1000&charset=windows-949 Unfortunately, it's not recognized by most Korean web servers. Firefox (although it treats EUC-KR as windows-949 for converting to Unicode) still uses euc-kr and the following url is constructed and the web server at qubi.com emits EUC-KR strings back. http://adx.qubi.com/openx/www/delivery/ajs.php?zoneid=35&cb=1000&charset=EUC-KR http://adx.qubi.com/openx/www/delivery/ajs.php?zoneid=35&cb=1000&charset=UTF-8 also works. I was mildly worried about this issue, but bit the bullet (of unforking Chrome's copy of TextCodecICU.cpp to match Webkit trunk) because I thought there'd not be many web servers relying on docuement.charset value. It appears that some ad serving web servers (in Korea) use this technique to show ads in pages in both UTF-8 and EUC-KR In the past (before Chrome unforked its copy of TextCodecICU.cpp), it modified ICU's charset alias table to treat EUC-KR the same as windows-949 but left alone TextCodecICU.cpp (as a result, document.charset returns 'EUC-KR' in Chrome in the past). Because Safari can't touch the charset alias table on ICU, it's not applicable to Webkit in general. A quick (and perhaps dirty fix) would be to add an exception to Document::encoding() to make it return 'EUC-KR' when encoding name is 'windows-949'. Perhaps, there's a better way to deal with this in TextCodecICU.cpp. I haven't yet given much thought to that possibility. It's a Chromium bug 11242 : http://code.google.com/p/chromium/issues/detail?id=11242
Attachments
patch (3.27 KB, patch)
2009-05-01 10:54 PDT, Jungshik Shin
no flags
updated patch (added displayName() to TextEncoding class) (5.11 KB, patch)
2009-05-01 17:13 PDT, Jungshik Shin
ap: review-
updated patch addressing ap's concerns (5.48 KB, patch)
2009-05-02 09:46 PDT, Jungshik Shin
ap: review+
patch same as before except for the one line fix (ready for check-in) (5.47 KB, patch)
2009-05-02 15:29 PDT, Jungshik Shin
jshin: review+
Jungshik Shin
Comment 1 2009-04-30 12:02:34 PDT
*** Bug 25488 has been marked as a duplicate of this bug. ***
Jungshik Shin
Comment 2 2009-04-30 12:08:31 PDT
I forgot to mention that 'windows-949' is not yet an IANA-registered name (unlike windows-12xx and windows-874). So, we can't blame web servers for not recognizing the name.
Jungshik Shin
Comment 3 2009-05-01 10:54:01 PDT
Eric Seidel (no email)
Comment 4 2009-05-01 11:04:43 PDT
If this patch is intended for review, you should mark it r=?. No target reviewer is necessary you can leave that field blank.
Eric Seidel (no email)
Comment 5 2009-05-01 11:06:25 PDT
This seems like a pretty heavy way to make this change. I don't know how expensive it is to create a TextEncoding object, but it seems there would be a more surgical way to make this change. Like a TextEncoding::displayName() method or something.
Jungshik Shin
Comment 6 2009-05-01 14:07:15 PDT
Thank you for your suggestion. (In reply to comment #5) > This seems like a pretty heavy way to make this change. I don't know how > expensive it is to create a TextEncoding object, but it seems there would be a > more surgical way to make this change. Like a TextEncoding::displayName() > method or something. Yeah, I've been looking for a better way to solve this issue and that's why I haven't asked for review, yet. displayName() is one of ways I've been considering. Constructing TextEncoding is not very expensive (I believe. It takes canonicalizing, which is done by looking up the alias table, and slashAsCurrencyCheck). Anyway, it seems that adding displayName() is a cleaner way than what I have now. (I don't like changing Document.cpp in my patch).
Jungshik Shin
Comment 7 2009-05-01 17:13:37 PDT
Created attachment 29952 [details] updated patch (added displayName() to TextEncoding class) added displayName() to TextEncoding class.
Alexey Proskuryakov
Comment 8 2009-05-02 00:44:46 PDT
Comment on attachment 29952 [details] updated patch (added displayName() to TextEncoding class) + * dom/Document.cpp: + (WebCore::Document::encoding): + * platform/text/TextEncoding.cpp: + (WebCore::TextEncoding::displayName): + * platform/text/TextEncoding.h: It is best to describe individual changes to functions in ChangeLog - this is why prepare-ChangeLog generates this list. +const char* TextEncoding::displayName() const { The brace should go to next line. + // We treat EUC-KR as windows-949 (its superset), but need to expose + // 'EUC-KR' because 'windows-949' is not recognized by most Korean + // web servers. This comment is somewhat misleading, because most Korean Web servers do not care about how the engine calls this encoding internally, and have no way of knowing that. It is only a certain (quite weird) technique of serving JavaScript ads that is affected. + static const char* const a = atomicCanonicalTextEncodingName("windows-949"); The TextEncoding class is used from multiple threads, so using static data without any protection or explicit initialization is dangerous. This method is not currently called from secondary threads, and this problem is already present in other methods of this class, so I'm not requesting that you fix it now - but we should do that some day. + return m_name != a ? m_name : "EUC-KR"; A "==" check or an if with early return would be slightly easier to read. + const char* displayName() const; This function name is a bit misleading, it sounds like a name used for rendering. Maybe domName() would be better? If you prefer displayName(), please add a short comment in the header. +2009-05-01 Jungshik Shin <jshin@chromium.org> + + Reviewed by NOBODY (OOPS!). + + http://bugs.webkit.org/show_bug.cgi?id=25487 + + For euc-kr and other 8bit Korean encodings + (similar to euc-kr/windows-949), make document.charset return + EUC-KR instead of windows-949. The latter is not recognized by + Korean web servers. No need to copy a description of WebCore changes to LayoutTests ChangeLog. + var charset = document.characterSet; + if (!charset) + charset = document.charset; + if (!charset) + charset = document.inputEncoding; + document.write("Encoding: " + charset + " (should be EUC-KR)"); Perhaps we should check all of these properties for having a correct value, not just the first one that happens to be non-undefined? All of the above are nitpicks, but I have one comment that may be substantial. This patch makes WebKit behavior weird and incompatible in the following case: <meta charset="x-windows-949"> <script>alert(document.characterSet)</script> Firefox alerts "x-windows-949", WebKit ToT alerts "windows-949", but with this patch, it will be "EUC-KR", which I think is a regression. There are three options that I can see: 1) Ignore this problem for now, as it's unlikely that there are any servers using x-windows-949 encoding name. 2) Make the document remember the original name for its encoding. 3) Make TextEncoding do that. If the servers in question supported "w-windows-949", we could also consider making that the canonical name, but unfortunately, adx.qubi.com does not recognize it. r- for now, as even if you prefer option 1, there is enough nitpicks to justify another quick review round.
Jungshik Shin
Comment 9 2009-05-02 09:46:10 PDT
Created attachment 29959 [details] updated patch addressing ap's concerns Thank you for the review. Can you take another look? I addressed all the comments except for the last one. I took #1 because x-windows-949 is very rarely - if ever - used by a web server. It's basically my 'invention' (used internally in Gecko not meant to be exposed although it's when 'x-windows-949' or 'ks_c_5601-1987' - a wrong name to use to begin with - is specified by a web server).
Alexey Proskuryakov
Comment 10 2009-05-02 11:12:38 PDT
Comment on attachment 29959 [details] updated patch addressing ap's concerns - return d->encoding().name(); + return d->encoding().displayName(); Please make sure that WebKit builds before landing - this should be domName(). r=me
Jungshik Shin
Comment 11 2009-05-02 15:29:12 PDT
Created attachment 29961 [details] patch same as before except for the one line fix (ready for check-in) Thank you for catching that. I ran the layout test without rebuilding Webkit. Stupid me. I rebuilt and reran the test. Transferring r+ because that's just s/displayName/domName/ in Document.cpp
Eric Seidel (no email)
Comment 12 2009-05-05 22:38:16 PDT
Committing to http://svn.webkit.org/repository/webkit/trunk ... M LayoutTests/ChangeLog A LayoutTests/fast/encoding/euckr-name-expected.txt A LayoutTests/fast/encoding/euckr-name.html M WebCore/ChangeLog M WebCore/dom/Document.cpp M WebCore/platform/text/TextEncoding.cpp M WebCore/platform/text/TextEncoding.h Committed r43279
Note You need to log in before you can comment on or make changes to this bug.