Bug 22962 - Web page encoded as "Big 5 HKSCS" is not decoded properly
Status: RESOLVED INVALID
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading
Version: 528+ (Nightly build)
Hardware: Mac OS X 10.5
Importance: P2 Normal
Assignee: Nobody
URL: http://www.mingpaonews.com/20081222/g...
Keywords: HasReduction, InRadar
Depends on:
Blocks:
 
Reported: 2008-12-22 08:11 PST by David Kilzer (:ddkilzer)
Modified: 2022-09-16 15:20 PDT

Description David Kilzer (:ddkilzer) 2008-12-22 08:11:32 PST
* SUMMARY
Web page with "Big5" encoding specified in <meta> tag (and Content-Type sent as "text/html") is not detected as having "Big 5 HKSCS" encoding and is thus not decoded properly.  The same page loaded in Firefox 3 is detected and decoded properly.

* STEPS TO REPRODUCE
1. Launch Safari/WebKit.
2. Open URL:  http://www.mingpaonews.com/20081222/gaa1h.htm

* RESULTS
Note square boxes in the text of the story, and how the text differs after switching to "Big 5 HKSCS" encoding via the "Text Encoding" item in the View menu.

* REGRESSION
Unknown.  Tested Safari 3.2.1 on Mac OS X 10.5.6 and a local debug build of WebKit r39423.  Both showed the same behavior.

* NOTES
Firefox 3 gets it right, so WebKit should be using a similar heuristic.
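
One rough way to compare candidate encodings for a page like this is to decode the raw bytes under each one and count the replacement characters (U+FFFD) that result. The sketch below is illustrative only: it assumes the Python standard-library codecs `big5`, `cp950` and `big5hkscs` as stand-ins for a browser's decoders (they are not WebKit's or Firefox's actual tables), and `count_replacements` and the sample bytes are hypothetical.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacements(raw: bytes, codecs=("big5", "cp950", "big5hkscs")) -> dict:
    """Map each codec name to the number of U+FFFD characters it produces."""
    return {
        name: raw.decode(name, errors="replace").count(REPLACEMENT)
        for name in codecs
    }


if __name__ == "__main__":
    # Hypothetical sample: two Big5 characters followed by a stray 0xFF byte.
    raw = "中文".encode("big5") + b"\xff"
    for name, n in count_replacements(raw).items():
        print(f"{name}: {n} replacement character(s)")
```

A low count only shows that the bytes are valid in that codec, not that the decoded text is what the author intended.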
Comment 1 David Kilzer (:ddkilzer) 2008-12-22 08:19:56 PST
<rdar://problem/6462924>
Comment 2 Alexey Proskuryakov 2008-12-22 08:50:50 PST
This page uses an encoding that is different from either Big5 variant supported by Safari - note the replacement characters that appear after forcing the encoding to Big 5 HKSCS.
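
To illustrate the kind of check described here, the following sketch lists the byte sequences that a given codec rejects, which is where a browser forced to that encoding would show replacement characters. It is a minimal example under stated assumptions: `undecodable_spans` is a hypothetical helper, the input bytes are made up, and Python's `big5hkscs` codec is not necessarily identical to Safari's "Big 5 HKSCS" table.

```python
def undecodable_spans(raw: bytes, codec: str) -> list:
    """Return (offset, bytes) pairs for each sequence the codec rejects."""
    spans = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode(codec)  # strict decoding of the remainder
            break  # everything from pos onward decoded cleanly
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we passed in.
            start, end = pos + err.start, pos + err.end
            spans.append((start, raw[start:end]))
            pos = end  # resume just past the rejected bytes
    return spans


if __name__ == "__main__":
    sample = b"ab\xffcd"  # hypothetical input containing one invalid byte
    for offset, chunk in undecodable_spans(sample, "big5hkscs"):
        print(f"offset {offset}: {chunk.hex()}")
```

Running this against the saved page bytes with each Big5 variant would show whether the failing sequences overlap or differ between variants.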
Comment 3 Alexey Proskuryakov 2008-12-22 08:54:45 PST
Dave, do you know for a fact that Firefox decodes the text 100% correctly? Or just that it has no square boxes, question marks and other obvious brokenness?
Comment 4 David Kilzer (:ddkilzer) 2008-12-22 09:04:34 PST
(In reply to comment #3)
> Dave, do you know for a fact that Firefox decodes the text 100% correctly? Or
> just that it has no square boxes, question marks and other obvious brokenness?

Scrolling down the page, I see replacement characters in Firefox 3 as well.  They're "?" characters without black diamonds around them.
Comment 5 David Kilzer (:ddkilzer) 2008-12-22 09:04:57 PST
I wonder if MSIE 6/7/8 handle this page any better?

Comment 6 Alexey Proskuryakov 2008-12-22 09:11:26 PST
(In reply to comment #4)
> Scrolling down the page, I see replacement characters in Firefox 3 as well. 
> They're "?" characters without black diamonds around them.

Are you sure about that? These looked like normal question marks to me.
Comment 7 David Kilzer (:ddkilzer) 2008-12-22 09:19:09 PST
(In reply to comment #6)
> (In reply to comment #4)
> > Scrolling down the page, I see replacement characters in Firefox 3 as well. 
> > They're "?" characters without black diamonds around them.
> 
> Are you sure about that? These looked like normal question marks to me.

No, I am not sure.  I do not read Chinese.  :)

I don't see any "square boxes" or question-marks-in-black-diamonds on the page in Firefox 3.  I *do* see a character that looks like "No" with the "o" superscript and underlined (№, &#8470;) in the Firefox page that doesn't appear in the Safari page with "Big 5 HKSCS" encoding.

Also note that the black diamonds in Desktop Safari when switching text encoding to "Big 5 HKSCS" are simply colons on the Firefox 3 page.  Could this be a missing glyph or a decoding bug?
Comment 8 David Kilzer (:ddkilzer) 2008-12-22 09:21:01 PST
The equivalent character from Desktop Safari (to the "No" character in Firefox 3):  嘢 (&#22050;)
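
The two numeric character references quoted in comments 7 and 8 can be resolved to code points and Unicode names with a few lines of Python. The helpers `describe_ncr` and `encoded_as` below are hypothetical, and the codec names are Python's, not a browser's. Comparing the bytes each character encodes to under a given codec is one way to check whether the two browsers map the same bytes through different tables; that is an assumption about the cause, not something this report confirms.

```python
import html
import unicodedata


def describe_ncr(ncr: str) -> tuple:
    """Resolve a numeric character reference to (char, 'U+XXXX', Unicode name)."""
    ch = html.unescape(ncr)
    if len(ch) != 1:
        raise ValueError(f"not a single character reference: {ncr!r}")
    return ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")


def encoded_as(ch: str, codec: str):
    """Return the hex bytes for ch in codec, or None if the codec cannot encode it."""
    try:
        return ch.encode(codec).hex()
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    for ncr in ("&#8470;", "&#22050;"):
        ch, code_point, name = describe_ncr(ncr)
        print(ncr, code_point, name)
        for codec in ("big5", "cp950", "big5hkscs"):
            print("   ", codec, encoded_as(ch, codec))
```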
Comment 9 Eric Seidel (no email) 2012-10-24 12:44:03 PDT
It's unclear to me if this is still an issue.
Comment 10 Sam Sneddon [:gsnedders] 2022-09-16 15:20:48 PDT
Archive.org doesn't seem to have archived this either, so it's not meaningfully actionable as far as I can tell.