78584 – [Encoding] We should run text encoding detector for iframes and child frame.

NEW 78584

https://bugs.webkit.org/show_bug.cgi?id=78584

yosin

Reported 2012-02-14 00:44:15 PST

Text in an iframe or child frame can be garbled even when automatic text encoding detection is enabled.

Sample URIs for reproducing:
 o http://www.tku.ac.jp/~z-jinnai/
   Main document. Has a charset declaration: ISO-2022-JP.
 o http://www.tku.ac.jp/~z-jinnai/06.09.13.htm
   IFrame document. No charset declaration. Encoding is Shift_JIS.

Observations:
 o When loading the iframe document, the TextResourceDecoder state is
   - m_source = EncodingFromParentFrame
   - m_hintEncoding = NULL
 o m_hintEncoding stays NULL because TextResourceDecoder::setHintEncoding is called with
   - hintDecoder.m_source = EncodingFromMetaTag

A comment in setHintEncoding says the hint encoding should only come from auto detection. I'm not sure why. If we set the hint encoding regardless of the encoding source, this page won't have garbled text.
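The observation above can be modeled with a small sketch. It is a simplified illustration under stated assumptions, not WebKit's actual code: `DecoderModel`, `setHintGated`, `setHintUngated`, the helper functions, and the `AutoDetectedEncoding` enumerator are hypothetical names chosen for this example. It contrasts a hint setter that only accepts auto-detected parent encodings with one that accepts the parent's encoding regardless of source.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-ins for the decoder state described in the report.
// AutoDetectedEncoding is an assumed name; the others appear in the report.
enum class EncodingSource {
    DefaultEncoding,
    AutoDetectedEncoding,
    EncodingFromMetaTag,
    EncodingFromParentFrame
};

struct DecoderModel {
    EncodingSource source = EncodingSource::DefaultEncoding;  // models m_source
    std::string encodingName;                                 // encoding in use
    std::optional<std::string> hintEncoding;                  // empty models m_hintEncoding == NULL
};

// Gated variant: take the parent's encoding as a hint only when the parent
// arrived at it by auto detection.
void setHintGated(DecoderModel& child, const DecoderModel* parent)
{
    if (parent && parent->source == EncodingSource::AutoDetectedEncoding)
        child.hintEncoding = parent->encodingName;
}

// Ungated variant: take the parent's encoding as a hint regardless of source.
void setHintUngated(DecoderModel& child, const DecoderModel* parent)
{
    if (parent)
        child.hintEncoding = parent->encodingName;
}

// Main document of the sample page: charset declared as ISO-2022-JP.
DecoderModel makeSampleParent()
{
    DecoderModel parent;
    parent.source = EncodingSource::EncodingFromMetaTag;
    parent.encodingName = "ISO-2022-JP";
    return parent;
}

// IFrame document of the sample page: no declaration, encoding inherited.
DecoderModel makeSampleChild(const DecoderModel& parent)
{
    DecoderModel child;
    child.source = EncodingSource::EncodingFromParentFrame;
    child.encodingName = parent.encodingName;
    return child;
}

// Usage: the gated setter leaves the hint empty, the ungated one fills it.
void checkSampleCase()
{
    const DecoderModel parent = makeSampleParent();

    DecoderModel gated = makeSampleChild(parent);
    setHintGated(gated, &parent);
    assert(!gated.hintEncoding.has_value());

    DecoderModel ungated = makeSampleChild(parent);
    setHintUngated(ungated, &parent);
    assert(ungated.hintEncoding.has_value());
}
```

In this model the sample page's iframe ends up with no hint under the gated setter, which matches the reported state (m_hintEncoding = NULL).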


Alexey Proskuryakov

Comment 1 2012-02-14 11:07:06 PST

Charset is normally inherited from main frame (if same origin). When a site has pages in subframes that don't match explicitly specified main frame encoding, it's just an authoring error. I don't think that any browser should go as far as "fix" such cases.
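The inheritance rule described in this comment can be sketched as follows. This is a minimal illustration, not WebKit's implementation: `initialSubframeEncoding`, its parameters, and the `fallback` argument used for the cross-origin case are all hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: picks the encoding a subframe document starts with.
// An empty declaredCharset means the document carries no charset declaration.
std::string initialSubframeEncoding(const std::string& declaredCharset,
                                    const std::string& parentEncoding,
                                    bool sameOriginAsParent,
                                    const std::string& fallback)
{
    if (!declaredCharset.empty())
        return declaredCharset;   // the document's own declaration wins
    if (sameOriginAsParent)
        return parentEncoding;    // inherit from the parent frame
    return fallback;              // assumed behavior when nothing is inherited
}

// Usage: the sample iframe has no declaration and shares its parent's origin,
// so it inherits ISO-2022-JP even though its bytes are Shift_JIS.
void checkInheritance()
{
    assert(initialSubframeEncoding("", "ISO-2022-JP", true, "windows-1252") == "ISO-2022-JP");
}
```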

yosin

Comment 2 2012-02-14 17:36:14 PST

In this case the authors are different people, e.g. a teacher and students. Defaulting to the parent frame's charset is meaningful. However, it should not prevent the auto detector from running if the user has enabled it.

My proposal is to use the parent's charset as a hint for the auto detector, and to run the auto detector if the document has no charset declaration.

There are two ways to fix this issue:

(1) Change ShouldAutoDetect:

    bool TextResourceDecoder::ShouldAutoDetect() {
        return m_usesEncodingDetector
            && (m_source == DefaultEncoding || m_source == EncodingFromParentFrame);
    }

(2) Change setHintEncoding:

    // We use the parent document's encoding information as a hint for the child document's encoding.
    void TextResourceDecoder::setHintEncoding(const TextResourceDecoder* hintDecoder) {
        if (hintDecoder) {
            m_hintEncoding = hintDecoder->encoding().name();
        }
    }
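To make the effect of proposal (1) concrete, here is a self-contained sketch of the proposed predicate. It is an illustration only: `DetectorState` and `shouldAutoDetectProposed` are hypothetical names, and the struct models just the two members the predicate reads, not the real TextResourceDecoder.

```cpp
#include <cassert>

// Only the encoding sources named in this thread are modeled.
enum class EncodingSource {
    DefaultEncoding,
    EncodingFromMetaTag,
    EncodingFromParentFrame
};

// Hypothetical reduced state: the two members the predicate depends on.
struct DetectorState {
    bool usesEncodingDetector;  // models m_usesEncodingDetector (user setting)
    EncodingSource source;      // models m_source
};

// Proposal (1): also run the detector when the encoding was merely
// inherited from the parent frame.
bool shouldAutoDetectProposed(const DetectorState& state)
{
    return state.usesEncodingDetector
        && (state.source == EncodingSource::DefaultEncoding
            || state.source == EncodingSource::EncodingFromParentFrame);
}

// Usage: three representative states.
void checkProposal()
{
    const DetectorState inherited{true, EncodingSource::EncodingFromParentFrame};
    const DetectorState declared{true, EncodingSource::EncodingFromMetaTag};
    const DetectorState detectorOff{false, EncodingSource::EncodingFromParentFrame};

    assert(shouldAutoDetectProposed(inherited));     // iframe without a declaration
    assert(!shouldAutoDetectProposed(declared));     // the document's own declaration wins
    assert(!shouldAutoDetectProposed(detectorOff));  // user has not enabled detection
}
```

Under this predicate a document with its own charset declaration is never sniffed, and nothing changes for users who leave the detector disabled.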

Alexey Proskuryakov

Comment 3 2012-02-14 21:18:05 PST

> However, it should not prevent to run auto detector, if users enable auto detector.

This is something I'll take issue with. Proliferation of encoding detection in one browser essentially randomizes what users and authors see. It's barely acceptable to sniff when there is no encoding indication at all, but not when there is an established behavior already.

More encoding detection is bad for the Open Web, not good.

yosin

Comment 4 2012-02-14 22:55:29 PST

I agree that we should not implement a smarter encoding sniffer in WebKit.

In this case, my experiment shows:
  FF10: Same as WebKit
  IE9: Display correctly
  OP11: Display correctly

It seems that using the parent frame's charset as the default charset is not an established behavior. I can't say for sure that IE9 and OP11 are sniffing the encoding instead of using the parent's charset. However, once we do sniffing, WK gets the same results as IE9/OP11.

yosin

Comment 5 2012-02-14 23:02:31 PST

Correction. (Sorry, I was just upgraded to FF10 by the automatic update.) FF10 w/AutoDect Display correctly. So WK behaves differently.

Alexey Proskuryakov

Comment 6 2012-02-14 23:14:10 PST

> OP11: Display correctly

This doesn't match what I'm seeing in Opera 11 (unless you meant that it correctly displays garbage in subframe).

> FF10 w/AutoDect Display correctly.

Autodetect is a non-default setting in Firefox.

I don't have IE here to verify what it does.

Racing for "best" encoding detection is harmful. It's non-standard, unpredictable, and not how the Web should (and can!) work. Pages where it's needed are a rare exception.

yosin

Comment 7 2012-02-14 23:25:06 PST

I'm not trying to create the best encoding sniffer. Rather, I would like to have clearly defined behavior. It seems TextResourceDecoder (and the associated HTMLMetaCharsetParser) has some ad-hoc parts. It seems we should propose to WHATWG how a user agent should handle the parent's charset when loading a child resource. What do you think?

Alexey Proskuryakov

Comment 8 2012-02-14 23:54:07 PST

HTML5 has an uncharacteristically vague algorithm (see <http://www.whatwg.org/specs/web-apps/current-work/#determining-the-character-encoding>). It lets UA use arbitrary "other algorithms" for encoding detection, and it also mandates an extremely error-prone and unnecessary algorithm for changing encoding on the fly <http://www.whatwg.org/specs/web-apps/current-work/#change-the-encoding>. I think that we should strive for simplifying this, but so far, even Safari implementation experience with a drastically simpler approach hasn't convinced the spec editor.

Ian 'Hixie' Hickson

Comment 9 2012-02-15 11:30:25 PST

It hasn't convinced me because other vendors have said they need it to get more compat than you have. :-)

Status: NEW
Priority: P2
Severity: Normal
Classification: Unclassified
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Product: WebKit
Component: Page Loading
Assignee: Nobody
Reported: 2012-02-14 00:44 PST
Modified: 2023-01-22 20:14 PST
CC List: 4 users
URL: http://www.tku.ac.jp/~z-jinnai/
Depends on: 245305