Bug 3590

Summary: should allow <meta> tags for encoding even after </head>
Product: WebKit Reporter: Nicholas Shanks <nickshanks>
Component: Layout and Rendering Assignee: Darin Adler <darin>
Status: VERIFIED FIXED    
Severity: Normal CC: ap
Priority: P2    
Version: 412   
Hardware: Mac   
OS: OS X 10.4   
URL: http://caracol.com.co/noticias/180350.asp
Attachments:
Description                          Flags
testcase                             none
META outside HEAD                    none
META outside HEAD patch              darin: review+
A better test for META outside HEAD  none

Description Nicholas Shanks 2005-06-17 06:37:54 PDT
Adapted from my comments to bug 3556:

1) Go to Safari preferences
2) Set default encoding to UTF-8
3) Browse the internet for a bit,
    for example http://caracol.com.co/noticias/180350.asp

You will see that either many sites aren't sending encoding information, Safari is ignoring the
Content-Encoding HTTP header override <meta> tag, or it's ignoring the XML charset information for
XHTML served as text/html (or all of the above, I can't really tell). Whatever the cause, now that
bug 3556 has been fixed it makes websites a little harder to read, as the user is not aware that the
wrong encoding was being used. Words appear with letters missing, and this might change their meanings!

One solution I can think of would be to note all the invalid characters encountered and try to match up 
a likely encoding, based on document language perhaps, then suggest a document re-interpretation to 
the user. You might implement a list of trigger words where, for example, a certain sequence of bytes is 
almost certainly the word добрый in KOI8-R. 
The lack of encoding information is something that should be reported as an error when in web 
developer mode too. (p.s. Safari needs a web developer mode :-)
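
The trigger-word idea above could be sketched roughly as follows. This is only an illustration in Python, not anything Safari or WebKit does: the function name guess_encoding, the word list, and the candidate encodings are all assumptions made up for the example.

```python
# Hypothetical sketch of the "trigger word" heuristic; names and lists are illustrative only.
TRIGGER_WORDS = ["добрый", "привет"]
CANDIDATE_ENCODINGS = ["koi8_r", "cp1251", "iso8859_5"]


def guess_encoding(raw, words=TRIGGER_WORDS, candidates=CANDIDATE_ENCODINGS):
    """Return the candidate encoding whose trigger words occur most often in raw, or None."""
    best, best_hits = None, 0
    for encoding in candidates:
        # Count how often each trigger word, encoded in this candidate, appears in the bytes.
        hits = sum(raw.count(word.encode(encoding)) for word in words)
        if hits > best_hits:
            best, best_hits = encoding, hits
    return best


page = "добрый день".encode("koi8_r")
print(guess_encoding(page))
```

A real detector would need much larger word lists and some handling of ties, so a guess like this would be a suggestion to the user at most, not a silent override.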
Comment 1 Joost de Valk (AlthA) 2005-06-17 11:36:24 PDT
I don't see any charset in the header:

HTTP/1.1 200 OK
Date: Fri, 17 Jun 2005 18:34:25 GMT
Server: Microsoft-IIS/6.0
pragma: no-cache
cache-control: max-age=0,no-cache,private,must-revalidate
Content-Length: 37364
Content-Type: text/html
Set-Cookie: ASPSESSIONIDSCDSADQR=LJNNDOHBOMKOOPPOMMIHCCKF; path=/
Cache-control: private  

nor in the head:

<html>
<head>
<title>Noticias - Caracol Radio</title>
<base href="http://www.caracol.com.co/">
<META NAME="DESCRIPTION" CONTENT="Caracol Radio :: Diez muertos y decenas de heridos por el 
terremoto de 7.9 grados en el norte de Chile">
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
<meta http-equiv=refresh content=300>
<link rel="stylesheet" href="/stylos.css" type="text/css">
<script src="/scripts/base.js" type="text/javascript"></script>
</head>

In the testcase I will attach in a few secs, I have added a content-type meta in the head, which makes 
Safari render the page just fine...
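
For anyone wanting to repeat the header check, the charset parameter can be pulled out of a Content-Type value like this. A minimal Python sketch using the standard email.message module; the charset_of helper and the ISO-8859-1 value are assumptions for illustration, since the charset used in the testcase is not stated in this comment.

```python
from email.message import Message


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(charset_of("text/html"))                      # value sent by the server above
print(charset_of("text/html; charset=ISO-8859-1"))  # assumed example value
```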

Comment 2 Joost de Valk (AlthA) 2005-06-17 11:37:04 PDT
Created attachment 2445 [details]
testcase
Comment 3 Alexey Proskuryakov 2005-06-17 11:54:17 PDT
I have researched automatic charset detection for a while; here are my findings:

1. Safari does handle the encoding specified in Content-Type "meta http-equiv" header. However, see 
rdar://4127219 - Safari only looks for META inside HEAD. This may be formally correct, but some sites 
put the META after HEAD, and other browsers support this.

2. WebCore seems to have encoding auto-detection for Japanese.

3. I could not find where WebCore considers the HTTP Content-Type header (which should override the 
HTML META, not the other way around).

The question I have is when to attempt automatic guessing. In my experience, incorrect encoding 
doesn't usually lead to invalid characters. But trying to always detect the language and encoding may 
cause an unpleasant user experience, especially with multi-language texts.

As an aside, it seems that ICU is considering support for automatic charset detection (see the recent 
entries at http://icu.sourceforge.net/meetings/).
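
The precedence described in point 3 (HTTP header first, then META) could look something like the sketch below. This is Python for illustration only; choose_encoding is a hypothetical name, it does not reflect how WebCore is structured, and falling back to the user's default encoding is an assumption based on the steps in the description.

```python
def choose_encoding(http_charset, meta_charset, user_default):
    """Pick the first available source: HTTP header, then META, then the user's default."""
    for candidate in (http_charset, meta_charset, user_default):
        if candidate:
            return candidate
    return None


# No charset in the HTTP header, so the META value wins over the default.
print(choose_encoding(None, "koi8-r", "utf-8"))
# A charset in the HTTP header overrides the META value.
print(choose_encoding("windows-1251", "koi8-r", "utf-8"))
```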
Comment 4 Joost de Valk (AlthA) 2005-06-23 10:10:16 PDT
sounds good to me :)
Comment 5 Alexey Proskuryakov 2005-08-06 04:20:23 PDT
Created attachment 3239 [details]
META outside HEAD

As for #1 (looking for "meta http-equiv" only within HEAD), here's a relevant
Mozilla issue: <https://bugzilla.mozilla.org/show_bug.cgi?id=98700>. 

The attached testcase renders fine in Firefox, but not in Safari (to make it
render in Safari, manually choose KOI8-R text encoding). The real life site is
<http://www.oper.ru>.
Comment 6 Alexey Proskuryakov 2005-08-09 12:15:53 PDT
Created attachment 3294 [details]
META outside HEAD patch

With this patch, Decoder::decode() doesn't stop looking for a charset after a
</head>.

Also, source fixed to compile with DECODE_DEBUG.
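
To illustrate the behaviour change (this is not the patch itself, just a rough Python sketch with made-up names), compare a scan that gives up at </head> with one that keeps going. The find_meta_charset function, its stop_at_head_end flag, and the simplified regular expressions are all assumptions for the example.

```python
import re

# Simplified pattern: finds charset=... inside a <meta ...> tag. Real parsers handle more cases.
META_CHARSET = re.compile(rb"<meta\b[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)
HEAD_END = re.compile(rb"</head\s*>", re.IGNORECASE)


def find_meta_charset(raw, stop_at_head_end):
    """Return the first META charset in raw, optionally ignoring anything after </head>."""
    if stop_at_head_end:
        end = HEAD_END.search(raw)
        if end:
            raw = raw[:end.start()]
    match = META_CHARSET.search(raw)
    return match.group(1).decode("ascii").lower() if match else None


# A page that puts its META after </head>, as in the testcase described above.
page = (b"<html><head><title>t</title></head>"
        b"<meta http-equiv=\"Content-Type\" content=\"text/html; charset=koi8-r\">"
        b"<body></body></html>")
print(find_meta_charset(page, stop_at_head_end=True))   # scan stops at </head>
print(find_meta_charset(page, stop_at_head_end=False))  # scan continues past </head>
```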
Comment 7 Darin Adler 2005-08-10 10:38:14 PDT
Comment on attachment 3294 [details]
META outside HEAD patch

This change looks great, but we need to pair each change with a new layout
test. As far as I can tell the test attached to this bug is for a different
issue.

I think we're going to have to make more separate bug reports for these issues,
and I'll set this patch to review+ once I see it alongside a suitable layout
test.
Comment 8 Alexey Proskuryakov 2005-08-10 12:49:54 PDT
Created attachment 3324 [details]
A better test for META outside HEAD

Includes English comments for easy manual testing and expected DumpRenderTree
output.
Comment 9 Alexey Proskuryakov 2005-08-10 13:08:36 PDT
Comment on attachment 3294 [details]
META outside HEAD patch

So far, I cannot confirm any of the mentioned possible issues with the current
implementation, except for the META outside HEAD one. If any are found, I'll
file them separately, to keep this issue focused on auto-detection.
Comment 10 Darin Adler 2005-08-10 13:25:36 PDT
Comment on attachment 3294 [details]
META outside HEAD patch

Looks good. r=me
Comment 11 Darin Adler 2005-08-14 01:22:55 PDT
Retitling to reflect what's actually being fixed here. The bigger "cosmic issue" will have to be covered by 
other bug reports.