Bug 24285 - Text resource loading checks for BOM twice
Summary: Text resource loading checks for BOM twice
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Alexey Proskuryakov
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-03-02 02:28 PST by Alexey Proskuryakov
Modified: 2009-03-10 07:11 PDT (History)
1 user (show)

See Also:


Attachments
proposed patch (33.28 KB, patch)
2009-03-02 02:54 PST, Alexey Proskuryakov
no flags Details | Formatted Diff | Diff
proposed patch (33.41 KB, patch)
2009-03-02 03:02 PST, Alexey Proskuryakov
no flags Details | Formatted Diff | Diff
proposed patch (33.36 KB, patch)
2009-03-02 03:42 PST, Alexey Proskuryakov
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alexey Proskuryakov 2009-03-02 02:28:14 PST
We check for BOM both in TextResourceDecoder and in TextDecoder, which makes no sense. Patch forthcoming.
Comment 1 Alexey Proskuryakov 2009-03-02 02:54:15 PST
Created attachment 28161 [details]
proposed patch
Comment 2 Alexey Proskuryakov 2009-03-02 03:02:47 PST
Created attachment 28162 [details]
proposed patch

Added a check for null m_codec in TextResourceDecoder::flush().
Comment 3 Alexey Proskuryakov 2009-03-02 03:42:22 PST
Created attachment 28163 [details]
proposed patch

This time for sure!
Comment 4 Darin Adler 2009-03-10 05:38:57 PDT
Comment on attachment 28163 [details]
proposed patch

> +        * loader/CachedScript.cpp:
> +        (WebCore::CachedScript::CachedScript):
> +        (WebCore::CachedScript::setEncoding):
> +        (WebCore::CachedScript::encoding):
> +        (WebCore::CachedScript::script):
> +        * loader/CachedScript.h:
> +        * loader/appcache/ManifestParser.cpp:
> +        (WebCore::parseManifest):
> +        Use TextResourceDecoder, as TextEncoding::decode() no longer checks for BOM.

Isn't this a bit more expensive? Are there really no differences in behavior other than the BOM check by using the TextResourceDecoder?

>      if (m_source != UserChosenEncoding && m_source != AutoDetectedEncoding && encoding().isJapanese())
>          detectJapaneseEncoding(data, len);
>  
> -    ASSERT(encoding().isValid());
> +    ASSERT(m_encoding.isValid());

There's a mix here of m_encoding and encoding(). We should consistently use m_encoding, I think.

I'll say r=me for now. I'm thinking that maybe we should remove the decode functions from TextEncoding, since that's just a convenience for newTextCodec. What callers really need that and are they better served by having that helper rather than calling newTextCodec directly?
Comment 5 Alexey Proskuryakov 2009-03-10 05:57:14 PDT
(In reply to comment #4)
> Isn't this a bit more expensive? Are there really no differences in behavior
> other than the BOM check by using the TextResourceDecoder?

For file types that don't have charset declarations internally, the only difference I can think of is enabling of Japanese encoding auto-detection.

I do not expect to see any measurable difference in performance, although it's possible that I underestimate some book-keeping work that is done by TextResourceDecoder.

> I'm thinking that maybe we should remove the decode
> functions from TextEncoding, since that's just a convenience for newTextCodec.
> What callers really need that and are they better served by having that helper
> rather than calling newTextCodec directly?

The primary reason I didn't look into removing TextEncoding::decode() was that I wasn't sure what to do with TextEncoding:: encode().
Comment 6 Darin Adler 2009-03-10 06:00:18 PDT
(In reply to comment #5)
> For file types that don't have charset declarations internally, the only
> difference I can think of is enabling of Japanese encoding auto-detection.

We don't want that auto-detection for scripts, do we? That's the behavior change from this patch, right?

> The primary reason I didn't look into removing TextEncoding::decode() was that
> I wasn't sure what to do with TextEncoding:: encode().

I think we can leave encode there even if we do remove decode. There was already asymmetry between the two due to the presence of the TextDecoder object. And some day we might find some great thing to do with encode.
Comment 7 Alexey Proskuryakov 2009-03-10 06:19:02 PDT
(In reply to comment #6)
> We don't want that auto-detection for scripts, do we? That's the behavior
> change from this patch, right?

Yes, it's a somewhat hidden change indeed. I can't say for sure whether we want it or not - there doesn't seem to be a reason to handle JS and other text subresources differently in this regard. There are certainly security concerns related to script decoding, but they apply to inline scripts as well as to external ones.

We don't have any tests for Japanese encoding auto-detection, and I'd be reluctant to add any tests that aren't based on actual Web content.

> I think we can leave encode there even if we do remove decode.

OK. I think that can can be done in a separate patch though (which I don't promise to make right away).
Comment 8 Darin Adler 2009-03-10 06:36:29 PDT
(In reply to comment #7)
> > I think we can leave encode there even if we do remove decode.
> 
> OK. I think that can can be done in a separate patch though (which I don't
> promise to make right away).

Absolutely. No hurry on this.
Comment 9 Alexey Proskuryakov 2009-03-10 07:11:38 PDT
Committed revision 41551.

I added a note about text encoding detection to the ChangeLog.