1. Go to http://www.qubi.com (Korean Railroad) 2. The ad frame in the middle of the page has garbled characters (UTF-8 interpreted as EUC-KR) That frame is at http://file.qubi.com/sg_framework/sg_framework_top/season2/adserver_www_010.html and uses 'document.write()' for <script>. It has a meta charset declaration ('charset=euc-kr' at the top). When constructing a URL for an ad to show, it uses document.charset to pass to the server as a cgi param (charset). Because Webkit (that uses ICU) maps euc-kr to windows-949 (its superset), document.charset returns 'windows-949'. An example URL constructed as a result is like this: http://adx.qubi.com/openx/www/delivery/ajs.php?zoneid=35&cb=1000&charset=windows-949 Unfortunately, it's not recognized by most Korean web servers. Firefox (although it treats EUC-KR as windows-949 for converting to Unicode) still uses euc-kr and the following url is constructed and the web server at qubi.com emits EUC-KR strings back. http://adx.qubi.com/openx/www/delivery/ajs.php?zoneid=35&cb=1000&charset=EUC-KR http://adx.qubi.com/openx/www/delivery/ajs.php?zoneid=35&cb=1000&charset=UTF-8 also works. I was mildly worried about this issue, but bit the bullet (of unforking Chrome's copy of TextCodecICU.cpp to match Webkit trunk) because I thought there'd not be many web servers relying on docuement.charset value. It appears that some ad serving web servers (in Korea) use this technique to show ads in pages in both UTF-8 and EUC-KR In the past (before Chrome unforked its copy of TextCodecICU.cpp), it modified ICU's charset alias table to treat EUC-KR the same as windows-949 but left alone TextCodecICU.cpp (as a result, document.charset returns 'EUC-KR' in Chrome in the past). Because Safari can't touch the charset alias table on ICU, it's not applicable to Webkit in general. A quick (and perhaps dirty fix) would be to add an exception to Document::encoding() to make it return 'EUC-KR' when encoding name is 'windows-949'. Perhaps, there's a better way to deal with this in TextCodecICU.cpp. I haven't yet given much thought to that possibility. It's a Chromium bug 11242 : http://code.google.com/p/chromium/issues/detail?id=11242
*** Bug 25488 has been marked as a duplicate of this bug. ***
I forgot to mention that 'windows-949' is not yet an IANA-registered name (unlike windows-12xx and windows-874). So, we can't blame web servers for not recognizing the name.
Created attachment 29943 [details] patch
If this patch is intended for review, you should mark it r=?. No target reviewer is necessary you can leave that field blank.
This seems like a pretty heavy way to make this change. I don't know how expensive it is to create a TextEncoding object, but it seems there would be a more surgical way to make this change. Like a TextEncoding::displayName() method or something.
Thank you for your suggestion. (In reply to comment #5) > This seems like a pretty heavy way to make this change. I don't know how > expensive it is to create a TextEncoding object, but it seems there would be a > more surgical way to make this change. Like a TextEncoding::displayName() > method or something. Yeah, I've been looking for a better way to solve this issue and that's why I haven't asked for review, yet. displayName() is one of ways I've been considering. Constructing TextEncoding is not very expensive (I believe. It takes canonicalizing, which is done by looking up the alias table, and slashAsCurrencyCheck). Anyway, it seems that adding displayName() is a cleaner way than what I have now. (I don't like changing Document.cpp in my patch).
Created attachment 29952 [details] updated patch (added displayName() to TextEncoding class) added displayName() to TextEncoding class.
Comment on attachment 29952 [details] updated patch (added displayName() to TextEncoding class) + * dom/Document.cpp: + (WebCore::Document::encoding): + * platform/text/TextEncoding.cpp: + (WebCore::TextEncoding::displayName): + * platform/text/TextEncoding.h: It is best to describe individual changes to functions in ChangeLog - this is why prepare-ChangeLog generates this list. +const char* TextEncoding::displayName() const { The brace should go to next line. + // We treat EUC-KR as windows-949 (its superset), but need to expose + // 'EUC-KR' because 'windows-949' is not recognized by most Korean + // web servers. This comment is somewhat misleading, because most Korean Web servers do not care about how the engine calls this encoding internally, and have no way of knowing that. It is only a certain (quite weird) technique of serving JavaScript ads that is affected. + static const char* const a = atomicCanonicalTextEncodingName("windows-949"); The TextEncoding class is used from multiple threads, so using static data without any protection or explicit initialization is dangerous. This method is not currently called from secondary threads, and this problem is already present in other methods of this class, so I'm not requesting that you fix it now - but we should do that some day. + return m_name != a ? m_name : "EUC-KR"; A "==" check or an if with early return would be slightly easier to read. + const char* displayName() const; This function name is a bit misleading, it sounds like a name used for rendering. Maybe domName() would be better? If you prefer displayName(), please add a short comment in the header. +2009-05-01 Jungshik Shin <jshin@chromium.org> + + Reviewed by NOBODY (OOPS!). + + http://bugs.webkit.org/show_bug.cgi?id=25487 + + For euc-kr and other 8bit Korean encodings + (similar to euc-kr/windows-949), make document.charset return + EUC-KR instead of windows-949. The latter is not recognized by + Korean web servers. No need to copy a description of WebCore changes to LayoutTests ChangeLog. + var charset = document.characterSet; + if (!charset) + charset = document.charset; + if (!charset) + charset = document.inputEncoding; + document.write("Encoding: " + charset + " (should be EUC-KR)"); Perhaps we should check all of these properties for having a correct value, not just the first one that happens to be non-undefined? All of the above are nitpicks, but I have one comment that may be substantial. This patch makes WebKit behavior weird and incompatible in the following case: <meta charset="x-windows-949"> <script>alert(document.characterSet)</script> Firefox alerts "x-windows-949", WebKit ToT alerts "windows-949", but with this patch, it will be "EUC-KR", which I think is a regression. There are three options that I can see: 1) Ignore this problem for now, as it's unlikely that there are any servers using x-windows-949 encoding name. 2) Make the document remember the original name for its encoding. 3) Make TextEncoding do that. If the servers in question supported "w-windows-949", we could also consider making that the canonical name, but unfortunately, adx.qubi.com does not recognize it. r- for now, as even if you prefer option 1, there is enough nitpicks to justify another quick review round.
Created attachment 29959 [details] updated patch addressing ap's concerns Thank you for the review. Can you take another look? I addressed all the comments except for the last one. I took #1 because x-windows-949 is very rarely - if ever - used by a web server. It's basically my 'invention' (used internally in Gecko not meant to be exposed although it's when 'x-windows-949' or 'ks_c_5601-1987' - a wrong name to use to begin with - is specified by a web server).
Comment on attachment 29959 [details] updated patch addressing ap's concerns - return d->encoding().name(); + return d->encoding().displayName(); Please make sure that WebKit builds before landing - this should be domName(). r=me
Created attachment 29961 [details] patch same as before except for the one line fix (ready for check-in) Thank you for catching that. I ran the layout test without rebuilding Webkit. Stupid me. I rebuilt and reran the test. Transferring r+ because that's just s/displayName/domName/ in Document.cpp
Committing to http://svn.webkit.org/repository/webkit/trunk ... M LayoutTests/ChangeLog A LayoutTests/fast/encoding/euckr-name-expected.txt A LayoutTests/fast/encoding/euckr-name.html M WebCore/ChangeLog M WebCore/dom/Document.cpp M WebCore/platform/text/TextEncoding.cpp M WebCore/platform/text/TextEncoding.h Committed r43279