Bug 3556

Summary:

black diamond question mark shown for invalid UTF-8 sequences

Product:

WebKit

Reporter:

Darin Adler <darin>

Component:

DOM

Assignee:

Darin Adler <darin>

Status:

VERIFIED FIXED

Severity:

Normal

CC:

ap, cdumez, nickshanks

Priority:

Version:

412

Hardware:

Mac

OS:

OS X 10.4

URL:

http://www.cheap-hotel-rooms.com/Reno/Peppermill-Hotel.htm

Attachments:

Description	Flags
Patch to ignore U+FFFD characters coming out of the decoder	sullivan: review+

Darin Adler

Reported 2005-06-15 21:03:09 PDT

The link above is one site that has invalid UTF-8 sequences. There are many others. Also seen on news.google.com. Other browsers just seem to ignore these sequences. So we should too.

Attachments
Patch to ignore U+FFFD characters coming out of the decoder (3.07 KB, patch) 2005-06-15 21:07 PDT, Darin Adler	sullivan: review+

Darin Adler

Comment 1 2005-06-15 21:05:46 PDT

The bad sequences are partway down the page, where it says "including a 120-screen cube". I imagine they are em dashes, probably in Windows Latin-1 encoding.
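To illustrate the guess above, here is a minimal sketch of why a Windows-1252 em dash shows up as a black diamond question mark when the page is decoded as UTF-8. The sample bytes are hypothetical (not copied from the page), and Python's codecs stand in for the real decoder:

```python
# Hypothetical sample bytes: an em dash stored as the single
# Windows-1252 byte 0x97, which is not a valid UTF-8 sequence on its own.
raw = b"rooms \x97 including a 120-screen cube"

# Decoding as UTF-8 with replacement turns the stray byte into U+FFFD,
# the character drawn as a question mark in a black diamond.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding the same bytes as Windows-1252 yields the intended em dash (U+2014).
as_cp1252 = raw.decode("cp1252")

print(repr(as_utf8))
print(repr(as_cp1252))
```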

Darin Adler

Comment 2 2005-06-15 21:07:13 PDT

Created attachment 2379: Patch to ignore U+FFFD characters coming out of the decoder
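The attached patch itself is not reproduced in this report. As a rough sketch of the idea in its title only, and not the actual WebKit change, the following drops every U+FFFD that comes out of a decoder. The function name `decode_dropping_replacements` is hypothetical:

```python
REPLACEMENT_CHARACTER = "\ufffd"


def decode_dropping_replacements(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, then discard any U+FFFD produced for invalid sequences.

    Hypothetical helper for illustration; this is not the WebKit patch.
    """
    # The decoder substitutes U+FFFD for each invalid sequence it meets.
    text = data.decode(encoding, errors="replace")
    # Removing those characters makes the invalid bytes silently disappear.
    return text.replace(REPLACEMENT_CHARACTER, "")


print(repr(decode_dropping_replacements(b"cube \x97 and more")))
```

One consequence of filtering after decoding is that a U+FFFD legitimately encoded in the input is dropped as well, since it can no longer be told apart from a substituted one.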

Nicholas Shanks

Comment 3 2005-06-16 07:10:07 PDT

I see these everywhere. Just hiding them is not really optimal though:

1) Go to Safari preferences
2) Set default encoding to UTF-8
3) Browse the internet for a bit

You will see that many sites aren't sending encoding information, Safari is ignoring the Content-Encoding HTTP header override <meta> tag, or it's ignoring the XML charset information for XHTML served as text/html (or all of the above, I can't really tell).

Whatever the cause, it would make websites harder to read if the user was not aware that a character was missing or mis-encoded. Words would just appear with letters missing, and their meanings might change!

One solution I can think of would be to note all the invalid characters encountered and try to match up a likely encoding, based on document language perhaps, then suggest a document re-interpretation to the user. This is something that should be reported as an error when in web developer mode too.
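The suggestion to match up a likely encoding could be sketched very roughly as below. This is only an illustration under stated assumptions: the function name `guess_encoding` and the candidate list are hypothetical, no document-language heuristic is included, and nothing here reflects what Safari or WebKit does.

```python
from typing import Optional, Sequence


def guess_encoding(
    data: bytes, candidates: Sequence[str] = ("utf-8", "cp1252")
) -> Optional[str]:
    """Return the first candidate encoding that decodes the data without errors.

    Hypothetical helper: candidates are tried in order, so stricter
    encodings such as UTF-8 should come first.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid sequences
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding(b"cube \x97 and more"))
```

A browser could use a result like this to suggest a re-interpretation to the user rather than switching encodings silently.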

Darin Adler

Comment 4 2005-06-16 10:07:27 PDT

Yes, automatically determining the correct encoding for web pages would be pretty neat. But that's not what this bug is about.

This bug is about matching other browsers' behavior on various sites. All the other browsers, and older versions of Safari, simply ignore those bytes. We stopped ignoring them and started putting in black diamond question marks because of a change in the underlying OS.

Please file a new bug report with specific suggestions about your enhancement idea. I don't think that idea and the concept that "skipping these characters is not good enough" should prevent us from fixing this regression and once again matching the behavior of other browsers. Let's not continue that discussion here unless there's a really good reason to do so.

John Sullivan

Comment 5 2005-06-16 10:49:46 PDT

Comment on attachment 2379: Patch to ignore U+FFFD characters coming out of the decoder

r=me, excellent comment

Nicholas Shanks

Comment 6 2005-06-16 11:39:19 PDT

(In reply to comment #4)
> I don't think that idea and the concept that "skipping these characters is not good enough"
> should prevent us from fixing this regression and once-again matching the behavior of
> other browsers.

Oh, I agree. I was just saying it was not optimal, and that further work could be done to improve the situation. I was definitely not suggesting that the patch shouldn't be applied! Apologies if I gave that impression. I shall open a bug about automatic encoding detection.

Joost de Valk (AlthA)

Comment 7 2005-07-03 08:10:28 PDT

Darin, please mark this as verified if you think it is ;).

Darin Adler

Comment 8 2005-08-04 18:16:10 PDT

In Radar as <rdar://problem/4206050> 8A345: Bad (question mark in black diamond) characters in news.google.com

Alexey Proskuryakov

Comment 9 2006-06-19 09:11:42 PDT

This change was reverted in bug 8972.

Lucas Forschler

Comment 10 2019-02-06 09:04:03 PST

Mass moving XML DOM bugs to the "DOM" Component.
