Bug 8972

Summary: REGRESSION: invalid UTF-8 sequences are not displayed
Product: WebKit
Reporter: tim bates <timothy.c.bates>
Component: Page Loading
Assignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED
Severity: Minor
CC: adele, ap, darin, nickshanks
Priority: P1
Keywords: InRadar, Regression
Version: 420+   
Hardware: Mac   
OS: OS X 10.4   
URL: http://www.decheung.com/2006/05/jungle_disk_sto.html
Attachments (Description: Flags):
Broken UTF-8: none
windows-1252: none
ShiftJIS: none
proposed patch: none
proposed patch: none
proposed patch: darin: review+

tim bates
Reported 2006-05-18 03:26:07 PDT
If you visit the URL in the Tiger release of Safari, you see a black diamond question mark character in the sentence "Store an unlimited amount of data for only 15".

This is also shown in the source:

Store an unlimited amount of data for only 15&#65533; per gigabyte

Under 420+, the cents character is simply missing from both the view and the source. Not sure what the bug, if any, is here.
Attachments
Broken UTF-8 (121 bytes, text/html)
2006-05-21 03:39 PDT, Alexey Proskuryakov
no flags
windows-1252 (118 bytes, text/html)
2006-05-21 03:40 PDT, Alexey Proskuryakov
no flags
ShiftJIS (114 bytes, text/html)
2006-05-21 03:40 PDT, Alexey Proskuryakov
no flags
proposed patch (25.08 KB, patch)
2006-06-17 08:17 PDT, Alexey Proskuryakov
no flags
proposed patch (26.25 KB, patch)
2006-06-17 23:42 PDT, Alexey Proskuryakov
no flags
proposed patch (26.82 KB, patch)
2006-06-18 01:01 PDT, Alexey Proskuryakov
darin: review+
Alexey Proskuryakov
Comment 1 2006-05-18 04:40:54 PDT
This was an intentional change, see bug 3556. However, I cannot confirm the comment that other browsers ignore invalid UTF-8 sequences - WinIE and Mac Firefox do display either question marks or empty boxes at both bug URLs for me.
Darin Adler
Comment 2 2006-05-18 09:18:27 PDT
I'd like to match other browsers. I don't know how I could have gotten it wrong originally, though. I'm quite sure there were sites with black question marks in Safari only and nothing there in other browsers. Maybe there are different categories of illegal UTF-8 sequences that are handled differently? Someone should do some research on this and find out what I got wrong originally.
Darin Adler
Comment 3 2006-05-18 09:21:43 PDT
Strange, I tested the http://www.cheap-hotel-rooms.com/Reno/Peppermill-Hotel.htm page mentioned in the original bug. It shows plain old "?" characters in Firefox 1.5.0.3 on Macintosh where we used to use our black diamond question mark. (Around the text "including a 120-screen cube".) But I could have sworn I tested this back when I fixed the bug. Is there a chance Firefox changed its behavior? I probably never tested Windows Internet Explorer behavior.
Alexey Proskuryakov
Comment 4 2006-05-18 12:59:59 PDT
(In reply to comment #3)
> Is there a chance Firefox changed its behavior?

Firefox 1.0.3 and 1.0.5 also display question marks for me. I don't have other versions archived.
Alexey Proskuryakov
Comment 5 2006-05-21 03:39:40 PDT
Created attachment 8442
Broken UTF-8

Looks like various kinds of UTF-8 brokenness all give question marks in Firefox 1.5. Invalid WinLatin bytes get discarded (as they are in shipping Safari, too); recovery from broken ShiftJIS is very different in Firefox, shipping Safari and ToT WebKit.
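For illustration only, the two recovery policies discussed in this bug can be sketched with Python's standard codecs. This is not WebKit code. The sample bytes are an assumption (a cent sign stored as the single byte 0xA2 in a page labelled UTF-8; the actual bytes on the reported page are not shown in this bug), and the names `raw`, `visible`, `silent` and `winlatin_dropped` are made up for the example.

```python
# Illustrative sketch, not WebKit's decoder.
# Assumed sample: a cent sign saved as the single byte 0xA2 (its Latin-1 /
# windows-1252 encoding) inside text that is decoded as UTF-8. A lone 0xA2
# is a continuation byte with no lead byte, so it is invalid UTF-8.
raw = b"only 15\xa2 per gigabyte"

REPLACEMENT = "\ufffd"  # U+FFFD, the "black diamond question mark"

# Policy 1: keep a visible marker for every invalid sequence.
visible = raw.decode("utf-8", errors="replace")

# Policy 2: silently drop invalid sequences, so the character just vanishes.
silent = raw.decode("utf-8", errors="ignore")

# The windows-1252 case: byte 0x81 is unassigned in Python's cp1252 table,
# so dropping it leaves no trace in the decoded text.
winlatin_raw = b"caf\x81"
winlatin_dropped = winlatin_raw.decode("cp1252", errors="ignore")

print(repr(visible))           # 'only 15\ufffd per gigabyte'
print(repr(silent))            # 'only 15 per gigabyte'
print(repr(winlatin_dropped))  # 'caf'
```

With the first policy the reader can tell that something was lost and try another encoding; with the second the text reads "only 15 per gigabyte" and nothing signals the missing character.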
Alexey Proskuryakov
Comment 6 2006-05-21 03:40:04 PDT
Created attachment 8443
windows-1252
Alexey Proskuryakov
Comment 7 2006-05-21 03:40:54 PDT
Created attachment 8444
ShiftJIS
Alice Liu
Comment 8 2006-06-06 09:37:57 PDT
Nicholas Shanks
Comment 9 2006-06-10 10:57:27 PDT
I would like to see this fixed. I believe the "fix" to bug 3556 should never have been authorised, as I don't believe it was a bug. It is very important to know that the page is not being displayed in the correct encoding so that I can try alternates manually. Not displaying the black diamonds disguises this and means the user is not aware that data is missing, which could potentially be very bad! Firefox's question marks are not very conspicuous, but at least they are there.
Alexey Proskuryakov
Comment 10 2006-06-17 08:17:03 PDT
Created attachment 8885
proposed patch
Darin Adler
Comment 11 2006-06-17 17:18:08 PDT
Comment on attachment 8885
proposed patch

appendOmittingUnwanted should be renamed to appendOmittingBOM -- that was its original name way back in the mists of time before we added null (now gone) and replacement character (now gone) to the list of characters to strip.
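As a rough sketch of what a function with that name would do once only the BOM is left on the strip list: the Python below is hypothetical and is not the patch's code. The name `append_omitting_bom`, the list-based buffer, and the choice to drop every U+FEFF in the chunk (rather than only a leading one) are all assumptions made for the example.

```python
from typing import List

BOM = "\ufeff"  # U+FEFF, byte order mark / zero width no-break space


def append_omitting_bom(buffer: List[str], text: str) -> None:
    """Append decoded text to buffer, dropping BOM characters only.

    Hypothetical helper: NUL and U+FFFD are deliberately left alone here,
    mirroring the comment that they are no longer on the strip list.
    """
    buffer.append(text.replace(BOM, ""))


chunks: List[str] = []
append_omitting_bom(chunks, "\ufeffonly 15\ufffd per gigabyte")
result = "".join(chunks)
print(repr(result))  # 'only 15\ufffd per gigabyte'
```

In this sketch the replacement character survives, so the black diamond still reaches the display, while the BOM is removed.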
Alexey Proskuryakov
Comment 12 2006-06-17 23:42:39 PDT
Created attachment 8895
proposed patch

Renamed appendOmittingUnwanted().
Alexey Proskuryakov
Comment 13 2006-06-18 00:33:22 PDT
Now that the layout tests work again, I've found that this change uncovers an apparent bug in XML entity handling, looking into it...
Alexey Proskuryakov
Comment 14 2006-06-18 01:01:07 PDT
Created attachment 8897
proposed patch

Now with a getXHTMLEntity() fix.
Darin Adler
Comment 15 2006-06-18 16:41:48 PDT
Comment on attachment 8897
proposed patch

r=me
Alexey Proskuryakov
Comment 16 2006-06-19 09:10:55 PDT
Committed revision 14911.