Summary: | black diamond question mark shown for invalid UTF-8 sequences | ||||||
---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Darin Adler <darin> | ||||
Component: | DOM | Assignee: | Darin Adler <darin> | ||||
Status: | VERIFIED FIXED | ||||||
Severity: | Normal | CC: | ap, cdumez, nickshanks | ||||
Priority: | P2 | ||||||
Version: | 412 | ||||||
Hardware: | Mac | ||||||
OS: | OS X 10.4 | ||||||
URL: | http://www.cheap-hotel-rooms.com/Reno/Peppermill-Hotel.htm | ||||||
Attachments: |
|
Description
Darin Adler
2005-06-15 21:03:09 PDT
The bad sequences are partway down the page, where it says "including a 120-screen cube". I imagine they are em dashes, probably in Windows Latin-1 encoding.

Created attachment 2379 [details]
Patch to ignore U+FFFD characters coming out of the decoder
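
For readers unfamiliar with the behavior being patched, here is a minimal Python sketch of the idea, not the WebKit patch itself. It assumes the decoder substitutes U+FFFD for each invalid sequence and that the caller then drops those characters. The function name `decode_skipping_invalid` and the sample bytes are hypothetical; the byte 0x97 is used only because it is the em dash in Windows-1252, matching the guess in the description.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, rendered as a question mark in a black diamond


def decode_skipping_invalid(data: bytes) -> str:
    """Decode as UTF-8 and drop any U+FFFD the decoder emitted (illustrative only)."""
    # errors="replace" makes the decoder emit U+FFFD for invalid byte sequences.
    decoded = data.decode("utf-8", errors="replace")
    # Caveat of this sketch: a U+FFFD that was validly encoded in the
    # input would be removed as well.
    return decoded.replace(REPLACEMENT, "")


# Hypothetical sample: a lone 0x97 byte is not valid UTF-8.
sample = b"including a 120-screen cube \x97 and more"
print(repr(sample.decode("utf-8", errors="replace")))  # contains U+FFFD
print(repr(decode_skipping_invalid(sample)))           # U+FFFD removed
```

Note that the invalid byte disappears silently, which is the trade-off discussed in the comments that follow.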
I see these everywhere. Just hiding them is not really optimal, though:

1) Go to Safari preferences
2) Set default encoding to UTF-8
3) Browse the internet for a bit

You will see that many sites aren't sending encoding information, Safari is ignoring the Content-Encoding HTTP header override <meta> tag, or it's ignoring the XML charset information for XHTML served as text/html (or all of the above, I can't really tell). Whatever the cause, it would make websites harder to read if the user was not aware that a character was missing or mis-encoded. Words would just appear with letters missing, and their meanings might change!

One solution I can think of would be to note all the invalid characters encountered and try to match up a likely encoding, based on document language perhaps, then suggest a document re-interpretation to the user. This is something that should be reported as an error when in web developer mode too.

Yes, automatically determining the correct encoding for web pages would be pretty neat. But that's not what this bug is about. This bug is about matching other browsers' behavior on various sites. All the other browsers, and older versions of Safari, simply ignore those bytes. We stopped ignoring them and started putting in black diamond question marks because of a change in the underlying OS.

Please file a new bug report with specific suggestions about your enhancement idea. I don't think that idea and the concept that "skipping these characters is not good enough" should prevent us from fixing this regression and once-again matching the behavior of other browsers. Let's not continue that discussion here unless there's a really good reason to do so.

Comment on attachment 2379 [details]
Patch to ignore U+FFFD characters coming out of the decoder
r=me, excellent comment
(In reply to comment #4)
> I don't think that idea and the concept that "skipping these characters is not good enough"
> should prevent us from fixing this regression and once-again matching the behavior of
> other browsers.

Oh, I agree. I was just saying it was not optimal, and that further work could be done to improve the situation. I was definitely not suggesting that the patch shouldn't be applied! Apologies if I gave that impression. I shall open a bug about automatic encoding detection.

Darin, please mark this as verified if you think it is ;).

In Radar as <rdar://problem/4206050> 8A345: Bad (question mark in black diamond) characters in news.google.com

Mass moving XML DOM bugs to the "DOM" Component.
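
As a footnote to the encoding-detection idea raised earlier in this thread (and explicitly out of scope for this bug), a very rough Python sketch of one possible fallback strategy follows. It is an assumption for illustration, not anything WebKit does: try strict UTF-8 first and, only if that fails, reinterpret the bytes as Windows-1252. The function name `decode_with_fallback` and the choice of `cp1252` as the fallback are hypothetical.

```python
from typing import Tuple


def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used); strict UTF-8 first, then a legacy fallback."""
    try:
        # Strict mode raises UnicodeDecodeError on any invalid sequence.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: the page was really authored in a Windows code page.
        # Bytes undefined in the fallback encoding still become U+FFFD.
        return data.decode(fallback, errors="replace"), fallback


text, used = decode_with_fallback(b"including a 120-screen cube \x97 and more")
print(used, repr(text))
```

A real detector would need far more evidence than a single failed decode (document language, byte frequency, declared charset), which is why this belongs in a separate enhancement report.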