Bug 3556

Summary: black diamond question mark shown for invalid UTF-8 sequences
Product: WebKit
Reporter: Darin Adler <darin>
Component: DOM
Assignee: Darin Adler <darin>
Status: VERIFIED FIXED    
Severity: Normal
CC: ap, cdumez, nickshanks
Priority: P2    
Version: 412   
Hardware: Mac   
OS: OS X 10.4   
URL: http://www.cheap-hotel-rooms.com/Reno/Peppermill-Hotel.htm
Attachments:
Description: Patch to ignore U+FFFD characters coming out of the decoder
Flags: sullivan: review+

Darin Adler
Reported 2005-06-15 21:03:09 PDT
The link above is one site that has invalid UTF-8 sequences. There are many others. Also seen on news.google.com. Other browsers just seem to ignore these sequences. So we should too.
Attachments
Patch to ignore U+FFFD characters coming out of the decoder (3.07 KB, patch)
2005-06-15 21:07 PDT, Darin Adler
sullivan: review+
Darin Adler
Comment 1 2005-06-15 21:05:46 PDT
The bad sequences are partway down the page, where it says "including a 120-screen cube". I imagine they are em dashes, probably in Windows Latin-1 encoding.
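A minimal sketch of why such bytes would show up as bad characters, assuming the page really does contain windows-1252 em dashes as guessed above. The byte string `raw` is a hypothetical stand-in, not the actual page content.

```python
# Hypothetical bytes standing in for the page text; 0x97 is the em dash in windows-1252.
raw = b"including a 120-screen cube \x97 and more"

# Decoded as windows-1252, 0x97 becomes U+2014 (em dash).
as_cp1252 = raw.decode("cp1252")

# In UTF-8, 0x97 is a continuation byte with no lead byte, so strict decoding fails.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD, which renders as the black diamond question mark.
as_utf8 = raw.decode("utf-8", errors="replace")
```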
Darin Adler
Comment 2 2005-06-15 21:07:13 PDT
Created attachment 2379: Patch to ignore U+FFFD characters coming out of the decoder
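The contents of the attachment are not shown in this report. As a rough illustration of the approach its title describes (dropping U+FFFD after decoding), here is a sketch in Python; the function name `decode_ignoring_invalid` is hypothetical and this is not the WebKit code.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def decode_ignoring_invalid(data: bytes, encoding: str = "utf-8") -> str:
    """Decode leniently, then drop every U+FFFD the decoder produced."""
    # Invalid sequences become U+FFFD instead of raising an error.
    text = data.decode(encoding, errors="replace")
    # Removing U+FFFD hides the invalid bytes from the reader.
    return text.replace(REPLACEMENT, "")
```

One caveat of this design: a U+FFFD that was validly encoded in the input is removed as well, since after decoding it cannot be told apart from one produced by the decoder.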
Nicholas Shanks
Comment 3 2005-06-16 07:10:07 PDT
I see these everywhere. Just hiding them is not really optimal though:
1) Go to Safari preferences
2) Set default encoding to UTF-8
3) Browse the internet for a bit
You will see that many sites aren't sending encoding information, Safari is ignoring the Content-Encoding HTTP header override <meta> tag, or it's ignoring the XML charset information for xhtml served as text/html (or all of the above, I can't really tell). Whatever the cause, it would make websites harder to read if the user was not aware that a character was missing/mis-encoded. Words would just appear with letters missing, and their meanings might change!
One solution I can think of would be to note all the invalid characters encountered and try to match up a likely encoding based on document language perhaps, then suggest a document re-interpretation to the user. This is something that should be reported as an error when in web developer mode too.
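A very rough sketch of the re-interpretation idea, assuming a fixed list of candidate encodings rather than anything derived from document language. The function name `decode_with_fallback` and the candidate list are hypothetical, and real encoding detection is considerably more involved.

```python
def decode_with_fallback(data: bytes, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes without errors."""
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort never fails.
    return data.decode("latin-1"), "latin-1"
```

A caller could compare the returned encoding with the one it first assumed and offer the user a re-interpretation when they differ.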
Darin Adler
Comment 4 2005-06-16 10:07:27 PDT
Yes, automatically determining the correct encoding for web pages would be pretty neat. But that's not what this bug is about. This bug is about matching other browsers' behavior on various sites. All the other browsers, and older versions of Safari, simply ignore those bytes. We stopped ignoring them and started putting in black diamond question marks because of a change in the underlying OS. Please file a new bug report with specific suggestions about your enhancement idea. I don't think that idea and the concept that "skipping these characters is not good enough" should prevent us from fixing this regression and once-again matching the behavior of other browsers. Let's not continue that discussion here unless there's a really good reason to do so.
John Sullivan
Comment 5 2005-06-16 10:49:46 PDT
Comment on attachment 2379: Patch to ignore U+FFFD characters coming out of the decoder
r=me, excellent comment
Nicholas Shanks
Comment 6 2005-06-16 11:39:19 PDT
(In reply to comment #4)
> I don't think that idea and the concept that "skipping these characters is not good enough"
> should prevent us from fixing this regression and once-again matching the behavior of
> other browsers.
Oh, I agree. I was just saying it was not optimal, and that further work could be done to improve the situation. I was definitely not suggesting that the patch shouldn't be applied! Apologies if I gave that impression. I shall open a bug about automatic encoding detection.
Joost de Valk (AlthA)
Comment 7 2005-07-03 08:10:28 PDT
Darin, please mark this as verified if you think it is ;).
Darin Adler
Comment 8 2005-08-04 18:16:10 PDT
In Radar as <rdar://problem/4206050> 8A345: Bad (question mark in black diamond) characters in news.google.com
Alexey Proskuryakov
Comment 9 2006-06-19 09:11:42 PDT
This change was reverted in bug 8972.
Lucas Forschler
Comment 10 2019-02-06 09:04:03 PST
Mass moving XML DOM bugs to the "DOM" Component.