Bug 3556

Summary: black diamond question mark shown for invalid UTF-8 sequences
Product: WebKit Reporter: Darin Adler <darin>
Component: DOM Assignee: Darin Adler <darin>
Status: VERIFIED FIXED    
Severity: Normal CC: ap, cdumez, nickshanks
Priority: P2    
Version: 412   
Hardware: Mac   
OS: OS X 10.4   
URL: http://www.cheap-hotel-rooms.com/Reno/Peppermill-Hotel.htm
Attachments:
Description Flags
Patch to ignore U+FFFD characters coming out of the decoder sullivan: review+

Description Darin Adler 2005-06-15 21:03:09 PDT
The link above is one site that has invalid UTF-8 sequences. There are many others. Also seen on 
news.google.com.

Other browsers just seem to ignore these sequences. So we should too.
Comment 1 Darin Adler 2005-06-15 21:05:46 PDT
The bad sequences are partway down the page, where it says "including a 120-screen cube". I imagine 
they are em dashes, probably in Windows Latin-1 encoding.
Comment 2 Darin Adler 2005-06-15 21:07:13 PDT
Created attachment 2379 [details]
Patch to ignore U+FFFD characters coming out of the decoder
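
For illustration only, the idea named in the patch title can be sketched as follows. This is not the attached WebKit patch; it is a minimal Python sketch, and the function name decode_ignoring_invalid is hypothetical. It assumes the decoder marks each invalid sequence with U+FFFD, which is then dropped from the output.

```python
REPLACEMENT_CHARACTER = "\ufffd"  # shown as a question mark in a black diamond


def decode_ignoring_invalid(data: bytes) -> str:
    """Decode UTF-8 and drop whatever the decoder could not interpret.

    Hypothetical helper for illustration; not the code in the attachment.
    """
    # errors="replace" makes the decoder emit U+FFFD for invalid sequences.
    text = data.decode("utf-8", errors="replace")
    # Removing U+FFFD afterwards makes the bad bytes invisible to the reader.
    return text.replace(REPLACEMENT_CHARACTER, "")


# 0x97 is an em dash in Windows-1252 but is not valid on its own in UTF-8.
page_bytes = b"including a 120\x97screen cube"
print(decode_ignoring_invalid(page_bytes))
```

One trade-off of this sketch is that a U+FFFD correctly encoded in the page itself is removed as well, because after decoding it cannot be told apart from one produced by the decoder.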
Comment 3 Nicholas Shanks 2005-06-16 07:10:07 PDT
I see these everywhere. Just hiding them is not really optimal though:

1) Go to Safari preferences
2) Set default encoding to UTF-8
3) Browse the internet for a bit

You will see that many sites aren't sending encoding information, Safari is ignoring the Content-
Encoding HTTP header override <meta> tag, or it's ignoring the XML charset information for xhtml 
served as text/html, (or all of the above, I can't really tell). Whatever the cause, it would make websites 
harder to read if the user was not aware that a character was missing/mis-encoded. Words would just 
appear with letters missing, and their meanings might change!

One solution I can think of would be to note all the invalid characters encountered and try to match up 
a likely encoding based on document language perhaps, then suggest a document re-interpretation to 
the user.
This is something that should be reported as an error when in web developer mode too.
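
A rough sketch of that idea, for illustration only: count how many invalid characters each candidate encoding produces and suggest the one with the fewest. The helper names and the candidate list are assumptions, not anything Safari or WebKit does, and a real detector would also need to weigh document language as described above.

```python
def count_invalid(data: bytes, encoding: str) -> int:
    """Count the U+FFFD characters produced when decoding with this encoding."""
    return data.decode(encoding, errors="replace").count("\ufffd")


def suggest_encoding(data: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Return the candidate that yields the fewest invalid characters.

    Hypothetical helper. Ties go to the earliest candidate, so valid
    UTF-8 input is still reported as UTF-8.
    """
    return min(candidates, key=lambda encoding: count_invalid(data, encoding))


# 0x97 is invalid as UTF-8 here but decodes as an em dash in Windows-1252.
print(suggest_encoding(b"including a 120\x97screen cube"))
```

Counting alone cannot separate encodings that accept every byte, so this only narrows the choice; it does not settle it.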
Comment 4 Darin Adler 2005-06-16 10:07:27 PDT
Yes, automatically determining the correct encoding for web pages would be pretty neat.

But that's not what this bug is about. This bug is about matching other browsers' behavior on various 
sites. All the other browsers, and older versions of Safari, simply ignore those bytes. We stopped ignoring 
them and started putting in black diamond question marks because of a change in the underlying OS.

Please file a new bug report with specific suggestions about your enhancement idea. I don't think that idea 
and the concept that "skipping these characters is not good enough" should prevent us from fixing this 
regression and once-again matching the behavior of other browsers. Let's not continue that discussion 
here unless there's a really good reason to do so.
Comment 5 John Sullivan 2005-06-16 10:49:46 PDT
Comment on attachment 2379 [details]
Patch to ignore U+FFFD characters coming out of the decoder

r=me, excellent comment
Comment 6 Nicholas Shanks 2005-06-16 11:39:19 PDT
(In reply to comment #4)
> I don't think that idea and the concept that "skipping these characters is not good enough"
> should prevent us from fixing this regression and once-again matching the behavior of
> other browsers.

Oh, I agree. I was just saying it was not optimal, and that further work could be done to improve the 
situation. I was definitely not suggesting that the patch shouldn't be applied! Apologies if I gave that 
impression.
I shall open a bug about automatic encoding detection.
Comment 7 Joost de Valk (AlthA) 2005-07-03 08:10:28 PDT
Darin, please mark this as verified if you think it is ;).
Comment 8 Darin Adler 2005-08-04 18:16:10 PDT
In Radar as <rdar://problem/4206050> 8A345: Bad (question mark in black diamond) characters in 
news.google.com
Comment 9 Alexey Proskuryakov 2006-06-19 09:11:42 PDT
This change was reverted in bug 8972.
Comment 10 Lucas Forschler 2019-02-06 09:04:03 PST
Mass moving XML DOM bugs to the "DOM" Component.