ASSIGNED Bug 26694
should we scan beyond 1kB for meta charset?
https://bugs.webkit.org/show_bug.cgi?id=26694
Jungshik Shin
Reported 2009-06-24 14:57:03 PDT
Some web pages have their meta charset declaration deeper than 1024 bytes into the document. Some examples were reported to Chromium recently: http://crbug.com/15163 (1.5 kB) and http://crbug.com/15173 (almost 6 kB).
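For context, here is a minimal sketch (in C++, assuming a plain byte buffer; this is not WebKit's actual HTMLMetaCharsetParser or its real signatures) of the kind of limited prescan at issue: only the first 1024 bytes are ever examined, so a declaration at 1.5 kB or 6 kB is never seen.

```cpp
#include <cstddef>
#include <optional>
#include <regex>
#include <string>

// Simplified prescan: look for a meta charset only in the first 1024 bytes.
std::optional<std::string> prescanForCharset(const std::string& bytes)
{
    constexpr size_t kPrescanLimit = 1024; // the limit this bug is about
    const std::string head = bytes.substr(0, kPrescanLimit);

    // Grossly simplified: match <meta charset="..."> or the http-equiv form.
    static const std::regex metaCharset(
        R"(<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9._:-]+))",
        std::regex::icase);

    std::smatch match;
    if (std::regex_search(head, match, metaCharset))
        return match[1].str();
    return std::nullopt; // a declaration beyond 1024 bytes falls through to defaults
}
```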
johnnyding
Comment 1 2009-06-25 03:25:45 PDT
More pages have the same problem:
http://www.ty.sx.cn/ (charset declaration at about 2.5 kB)
http://www.zoo.gov.tw/ (charset declaration at about 2.9 kB)
http://www.78195.com/ (charset declaration at about 5.6 kB)
http://jzyyczs.mywowo.com/article.asp?id=661 (charset declaration at about 1.2 kB)
http://www.china-cba.net/ccbp/ (charset declaration at about 2.1 kB)
http://www.ce.cn/life/rwsx/mxss/200812/29/t20081229_17821966.shtml (charset declaration at about 1.4 kB)
Alexey Proskuryakov
Comment 2 2009-06-25 06:51:06 PDT
It would be interesting to measure how much of a return increasing the buffer size would give. All the research I've seen before measured how far the charset declaration is from the beginning of the file, without taking into account whether it is inside <head> or not.
Alexey Proskuryakov
Comment 3 2009-09-16 17:00:12 PDT
*** Bug 29283 has been marked as a duplicate of this bug. ***
Jeremy Moskovich
Comment 4 2010-10-28 08:03:59 PDT
*** Bug 34835 has been marked as a duplicate of this bug. ***
Jungshik Shin
Comment 6 2011-08-05 11:14:10 PDT
Another page with this problem: http://home.pchome.com.tw/mysite/francine21/about_music/music.htm It has charset=Big5 beyond the first 1 kB. Perhaps I really need to try running a MapReduce over Google's corpus to see what percentage of pages suffer from this problem.
yosin
Comment 7 2012-02-07 20:32:56 PST
I ran a MapReduce and got the following result:

Posn       %Docs    %Coverage
<=1,024    94.80%   94.80%
<=2,048     3.71%   98.50%
<=3,072     0.72%   99.22%
<=4,096     0.26%   99.48%
<=5,120     0.13%   99.61%
<=6,144     0.05%   99.66%
<=7,168     0.03%   99.69%
<=8,192     0.03%   99.72%
<=9,216     0.02%   99.75%
<=10,240    0.07%   99.82%

As you can see, if we increase the limit to 3072 bytes, we can cover 99.22% of the documents on the Internet.
Alexey Proskuryakov
Comment 8 2012-02-08 00:53:38 PST
Are you saying that 5.2% of pages are broken in WebKit-based browsers because of this? :)
yosin
Comment 9 2012-02-08 01:16:15 PST
No. This value is simply a statistic of the byte position of the end of the meta element that contains the charset declaration, i.e. <meta charset=".."> or <meta http-equiv="Content-Type" content="...">. WebKit does well. (^_^) WebKit has other sources:
1. the HTTP response header
2. the browser's default encoding

Note: 56.62% of URLs have the "right" charset in the "Content-Type" HTTP response header.

BTW, "charset" declarations are sometimes wrong. In HTTP, 1.2% of charset values are invalid charset names, e.g. "utf8", "foobar", etc., and 12.45% specify a different charset than the actual one, e.g. "shift_jis" for "utf-8". In HTML, 1.25% are invalid charsets and 17.83% specify a different charset. We're working on fixing the invalid-charset/missing-charset cases in https://bugs.webkit.org/show_bug.cgi?id=75594
Jungshik Shin
Comment 10 2012-02-08 10:13:55 PST
I think it'd be easier to interpret the statistics if you ran the tally ONLY over documents that do NOT have a 'charset' declaration in the Content-Type HTTP response header. Because the charset value in the HTTP response header has higher priority than the meta charset declaration, the position of the meta charset does not matter at all if a charset is present in the HTTP response. Alternatively, you could treat any document with a charset in the HTTP header as if its meta charset were declared at position 0 when computing the statistics. That way, the benefit of going beyond 1024 bytes would show up clearly as a relative frequency among all web documents, so I prefer this second approach. BTW, I don't think the fact that the 'default charset' (the one used when no other information is available) sometimes matches the actual document encoding is relevant to this bug, because the default charset is user-dependent, and what works for one user does not work for another user with a different default.
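A rough sketch of the counting rule proposed here, assuming the per-document data has already been extracted by the crawl (the Document record below, with its HTTP-charset flag and meta charset byte offset, is hypothetical): documents with an HTTP charset count at position 0, and the denominator is all documents.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <optional>
#include <vector>

struct Document {
    bool hasHttpCharset = false;          // charset present in Content-Type header
    std::optional<size_t> metaCharsetPos; // byte offset of the meta charset, if any
};

void printCoverage(const std::vector<Document>& docs)
{
    std::map<size_t, size_t> buckets; // 1 kB bucket upper bound -> document count
    for (const Document& doc : docs) {
        // HTTP charset counts as position 0, per the second proposal above.
        size_t position = doc.hasHttpCharset ? 0 : doc.metaCharsetPos.value_or(SIZE_MAX);
        if (position == SIZE_MAX)
            continue; // no charset anywhere; still part of the denominator
        buckets[(position / 1024 + 1) * 1024]++;
    }
    double cumulative = 0;
    for (const auto& [limit, count] : buckets) {
        cumulative += 100.0 * count / docs.size();
        std::printf("<=%zu\t%.2f%%\n", limit, cumulative);
    }
}
```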
Jungshik Shin
Comment 11 2012-02-08 10:15:41 PST
> 1.2% of charset is invalid charset name, e.g. "utf8", "foobar", blah.

BTW, 'utf8' is treated as 'UTF-8' in WebKit (unless there's a change not to treat it as 'UTF-8' that I missed).
yosin
Comment 12 2012-02-08 20:49:29 PST
I ran the MapReduce again, this time checking the meta charset position only for documents whose HTTP response header has no charset information.

<=1,024    95.71%   95.71%
<=2,048     2.97%   98.67%
<=3,072     0.67%   99.35%
<=4,096     0.25%   99.60%
<=5,120     0.10%   99.70%
<=6,144     0.06%   99.76%
<=7,168     0.04%   99.80%
<=8,192     0.02%   99.82%
<=9,216     0.02%   99.84%
<=10,240    0.02%   99.86%

How do we decide whether to use an increased value (e.g. 2048 or 3072) or keep the current value (1024)? Do we want coverage at a finer granularity (e.g. 100 bytes)? I vote to increase the limit to 3072.
Alexey Proskuryakov
Comment 13 2012-02-08 23:08:09 PST
The numbers are still surprising. We can't have 4.29% of web pages broken because of this. As mentioned in comment 2, what matters is how many of these are outside <head>.
Jungshik Shin
Comment 14 2012-02-09 11:02:34 PST
(In reply to comment #12)
> I ran the MapReduce again, this time checking the meta charset position only for documents whose HTTP response header has no charset information.

It's a bit hard to interpret. What's the denominator here? Is it the number of documents without an HTTP charset header? If so, it can be misleading. Can you run your MapReduce again with my second proposal in comment #10? That is, count all documents with an HTTP charset as having a meta charset at position 0 (i.e. position < 1024), and use the total number of documents as the denominator, regardless of whether they have an HTTP charset header or not. Thank you.

> <=1,024    95.71%   95.71%
> <=2,048     2.97%   98.67%
> <=3,072     0.67%   99.35%
> <=4,096     0.25%   99.60%
> <=5,120     0.10%   99.70%
> <=6,144     0.06%   99.76%
> <=7,168     0.04%   99.80%
> <=8,192     0.02%   99.82%
> <=9,216     0.02%   99.84%
> <=10,240    0.02%   99.86%
>
> How do we decide whether to use an increased value (e.g. 2048 or 3072) or keep the current value (1024)? Do we want coverage at a finer granularity (e.g. 100 bytes)?
>
> I vote to increase the limit to 3072.
yosin
Comment 15 2012-02-10 02:21:12 PST
Sorry for the confusion, and thanks, Jungshik, for the suggestion. Note: please ignore the previous results; they counted documents with no charset in either HTTP or HTML as position 0.

Denominator = all HTML documents

HTTP        70.45%   70.45%
<=1,024     25.86%   96.31%
<=2,048      0.88%   97.19%
<=3,072      0.20%   97.39%
<=4,096      0.07%   97.46%
<=5,120      0.03%   97.49%
<=6,144      0.02%   97.51%
<=7,168      0.01%   97.52%
<=8,192      0.01%   97.53%
<=9,216      0.01%   97.53%
<=10,240     0.01%   97.54%
<=11,264     0.01%   97.55%
<=12,288     0.01%   97.55%
<=13,312     0.01%   97.57%
<=14,336     0.00%   97.57%
<=15,360     0.00%   97.57%
<=16,384     0.01%   97.58%
None         2.42%  100.00%
Alexey Proskuryakov
Comment 16 2012-02-10 08:59:04 PST
So, it's 1.27%, but we still don't know how many of these are inside <head>.
yosin
Comment 17 2012-02-12 17:58:27 PST
These statistics count a "charset" occurrence only if it appears before "</head>". Sorry, it is hard to implement HTMLMetaCharsetParser in a MapReduce. It seems the simple encoding sniffing algorithm described in the HTML5 specification may be enough (http://www.w3.org/TR/html5/parsing.html#determining-the-character-encoding).
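A minimal sketch of the simplified statistic described here, assuming it suffices to look for the literal strings "charset" and "</head>" (a crude approximation; it is neither the HTML5 prescan nor HTMLMetaCharsetParser, and case-insensitive matching and attribute parsing are omitted for brevity):

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Report the byte position of "charset" only if it occurs before "</head>".
std::optional<size_t> charsetPositionInHead(const std::string& bytes)
{
    size_t headEnd = bytes.find("</head>");
    size_t charsetPos = bytes.find("charset");
    if (charsetPos == std::string::npos)
        return std::nullopt;
    if (headEnd != std::string::npos && charsetPos > headEnd)
        return std::nullopt; // declaration after </head>; the parser would ignore it anyway
    return charsetPos;
}
```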
Jungshik Shin
Comment 18 2012-03-07 16:09:34 PST
Based on the statistics, I propose increasing the scanning range to 3072 bytes. I'll make a patch and upload it here. As for what's proposed in comment #17 (switching to simple scanning regardless of whether there's a 'content' tag such as body, p, or div), I'll file a new bug if one hasn't been filed yet.
Alexey Proskuryakov
Comment 19 2012-03-07 16:15:59 PST
I disagree. We still don't have useful statistical data (because most of these are inside <head>, and are thus already checked). There is extremely little evidence of practical breakage, so such a change would only make page loading slower.
Kenji Baheux
Comment 20 2012-03-07 16:58:59 PST
Would it be practical to do the following for a few specific locales/countries:
* extract reasonably sized random samples from the different buckets
* run them through a script that counts how many would fail the current implementation (charset outside of <head>, or illegal tags that prematurely close <head>)?

Note: the breakdown by locale or country is to reduce the size of the initial sample to something manageable yet meaningful.

Alexey: assuming we can make this work, do you have any concerns or suggestions about the approach? Thanks.
Henri Sivonen
Comment 21 2012-03-29 04:08:52 PDT
(In reply to comment #12)
> I vote to increase the limit to 3072.

How about doing the spec-compliant thing: prescan only the first 1024 bytes, and reload the page if the tree builder finds a charset meta later? That's what Firefox does.
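A pseudocode-level sketch of the flow described here, with hypothetical names (onMetaCharsetSeenByTreeBuilder, reloadWithEncoding); this is neither Gecko's nor WebKit's actual loader API:

```cpp
#include <string>

struct Loader {
    std::string currentEncoding = "windows-1252"; // locale default, assumed here
    bool alreadyReloadedForEncoding = false;

    // Called when the tree builder encounters a meta charset after the 1024-byte prescan.
    void onMetaCharsetSeenByTreeBuilder(const std::string& declaredEncoding)
    {
        if (declaredEncoding == currentEncoding)
            return; // the prescan or HTTP header already got it right
        if (alreadyReloadedForEncoding)
            return; // guard against reload loops
        alreadyReloadedForEncoding = true;
        reloadWithEncoding(declaredEncoding); // hypothetical hook into the loader
    }

    void reloadWithEncoding(const std::string& encoding)
    {
        currentEncoding = encoding;
        // ... re-request or re-decode the document from cache with `encoding` ...
    }
};
```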
Alexey Proskuryakov
Comment 22 2012-03-29 09:02:08 PDT
I strongly oppose implementing this behavior in WebKit. It's quite unnecessary, and very complicated.
Ms2ger (he/him; ⌚ UTC+1/+2)
Comment 23 2012-03-29 10:16:29 PDT
(In reply to comment #22)
> I strongly oppose implementing the spec.

Oh, we all knew you would.
Kenji Baheux
Comment 24 2012-03-29 13:56:34 PDT
Alexey: we need to find out how "quite unnecessary" the whole point of this bug actually is. So, could you comment back on my proposal? And if you don't think this is the right approach, please propose an alternative (a bare-minimum approach) to back up the argument with actual data. Thanks.
Alexey Proskuryakov
Comment 25 2012-03-29 14:05:59 PDT
Anecdotally, there is no evidence that WebKit users are suffering because the wrong charset is being used for HTML. Also, the data on meta position within the document posted here does not help much, because it does not say anything about how many pages we could potentially fix. So I don't see a need for any action unless there is better data justifying why we need to do something.
Kenji Baheux
Comment 26 2012-03-29 14:12:26 PDT
Alexey: we also have anecdotal evidence that a significant number of users are being burned by the current pitfalls (plural) of WebKit's handling of charset detection, but that's not convincing anyone on either side. So:
=> Do you agree that data from an approach such as the one in comment #20 would be unequivocal? Yes or no?
=> If no, please propose something that would be. Thanks.
Alexey Proskuryakov
Comment 27 2012-03-29 14:34:36 PDT
Your question in comment 20 is whether doing the experiment is practical. I don't know what is practical for you, so I cannot answer that.

If your real question is whether this kind of research would be convincing, then I don't know. We get huge pressure from the standards community to "make the web better" by removing features lots of sites rely upon, to deprecate every encoding except UTF-8, etc. Us caring for sites with invalid markup that at the same time use non-UTF-8 encodings would likely make some of these people very unhappy.

Anyway, solid data would be welcome. Another welcome thing would be to cross-reference the results with Alexa top site ratings (are there any broken sites in the top 100 for the country? top 10000?). And if there are important sites broken, are they also broken for other reasons, such as using Mozilla or IE non-standard extensions?