We check for BOM both in TextResourceDecoder and in TextDecoder, which makes no sense. Patch forthcoming.
Created attachment 28161 [details] proposed patch
Created attachment 28162 [details] proposed patch Added a check for null m_codec in TextResourceDecoder::flush().
Created attachment 28163 [details] proposed patch This time for sure!
Comment on attachment 28163 [details] proposed patch > + * loader/CachedScript.cpp: > + (WebCore::CachedScript::CachedScript): > + (WebCore::CachedScript::setEncoding): > + (WebCore::CachedScript::encoding): > + (WebCore::CachedScript::script): > + * loader/CachedScript.h: > + * loader/appcache/ManifestParser.cpp: > + (WebCore::parseManifest): > + Use TextResourceDecoder, as TextEncoding::decode() no longer checks for BOM. Isn't this a bit more expensive? Are there really no differences in behavior other than the BOM check by using the TextResourceDecoder? > if (m_source != UserChosenEncoding && m_source != AutoDetectedEncoding && encoding().isJapanese()) > detectJapaneseEncoding(data, len); > > - ASSERT(encoding().isValid()); > + ASSERT(m_encoding.isValid()); There's a mix here of m_encoding and encoding(). We should consistently use m_encoding, I think. I'll say r=me for now. I'm thinking that maybe we should remove the decode functions from TextEncoding, since that's just a convenience for newTextCodec. What callers really need that and are they better served by having that helper rather than calling newTextCodec directly?
(In reply to comment #4) > Isn't this a bit more expensive? Are there really no differences in behavior > other than the BOM check by using the TextResourceDecoder? For file types that don't have charset declarations internally, the only difference I can think of is enabling of Japanese encoding auto-detection. I do not expect to see any measurable difference in performance, although it's possible that I underestimate some book-keeping work that is done by TextResourceDecoder. > I'm thinking that maybe we should remove the decode > functions from TextEncoding, since that's just a convenience for newTextCodec. > What callers really need that and are they better served by having that helper > rather than calling newTextCodec directly? The primary reason I didn't look into removing TextEncoding::decode() was that I wasn't sure what to do with TextEncoding:: encode().
(In reply to comment #5) > For file types that don't have charset declarations internally, the only > difference I can think of is enabling of Japanese encoding auto-detection. We don't want that auto-detection for scripts, do we? That's the behavior change from this patch, right? > The primary reason I didn't look into removing TextEncoding::decode() was that > I wasn't sure what to do with TextEncoding:: encode(). I think we can leave encode there even if we do remove decode. There was already asymmetry between the two due to the presence of the TextDecoder object. And some day we might find some great thing to do with encode.
(In reply to comment #6) > We don't want that auto-detection for scripts, do we? That's the behavior > change from this patch, right? Yes, it's a somewhat hidden change indeed. I can't say for sure whether we want it or not - there doesn't seem to be a reason to handle JS and other text subresources differently in this regard. There are certainly security concerns related to script decoding, but they apply to inline scripts as well as to external ones. We don't have any tests for Japanese encoding auto-detection, and I'd be reluctant to add any tests that aren't based on actual Web content. > I think we can leave encode there even if we do remove decode. OK. I think that can can be done in a separate patch though (which I don't promise to make right away).
(In reply to comment #7) > > I think we can leave encode there even if we do remove decode. > > OK. I think that can can be done in a separate patch though (which I don't > promise to make right away). Absolutely. No hurry on this.
Committed revision 41551. I added a note about text encoding detection to the ChangeLog.