Bug 24420 - Consider doing encoding detection before decoding starts
Summary: Consider doing encoding detection before decoding starts
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading
Version: 528+ (Nightly build)
Hardware: Macintosh All
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-03-06 01:30 PST by Jungshik Shin
Modified: 2010-05-25 13:39 PDT
CC List: 1 user

See Also:


Attachments

Description Jungshik Shin 2009-03-06 01:30:03 PST
Spun off bug 16482.

Character encoding auto-detection is done after a part of an input file is decoded. We have to find a way to do the encoding detection before any decoding starts.
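To make the proposal concrete, here is a minimal, hypothetical sketch (in Python rather than WebKit's C++, and with invented names like `BufferingDecoder` and `max_prescan`) of the idea: buffer the raw bytes and run detection on the buffered prefix, and only start decoding once an encoding has been chosen or the pre-scan budget is exhausted.

```python
# Hypothetical sketch, not WebKit code: hold raw bytes until an encoding
# has been chosen, then decode everything, so no bytes are decoded with a
# possibly wrong encoding before detection has run.

class BufferingDecoder:
    """Buffers input until the detector commits to an encoding."""

    def __init__(self, detect, max_prescan=1024):
        self._detect = detect          # callable: bytes -> encoding name or None
        self._buffer = bytearray()
        self._encoding = None
        self._max_prescan = max_prescan

    def feed(self, chunk):
        """Returns decoded text once the encoding is known, else ''."""
        if self._encoding is None:
            self._buffer.extend(chunk)
            self._encoding = self._detect(bytes(self._buffer))
            if self._encoding is None and len(self._buffer) < self._max_prescan:
                return ""              # keep buffering; nothing decoded yet
            self._encoding = self._encoding or "utf-8"   # fallback guess
            pending, self._buffer = bytes(self._buffer), bytearray()
            return pending.decode(self._encoding)
        return chunk.decode(self._encoding)


def bom_detect(data):
    """Trivial detector used for illustration: BOM sniffing only."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    return None
```

The key property is that `feed` returns an empty string while detection is undecided, so no caller ever sees text decoded with a guess that later turns out wrong.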
Comment 1 Jungshik Shin 2009-03-12 10:20:24 PDT
bug 16482 comment #30 

How about doing a similar "pre-scanning"(?) with auto-detection? We can ask the
autodetector to return an "undetermined/unknown" indicator until it
reaches a confidence level higher than a certain value (that threshold should be
internal to each implementation of the autodetector). Then, we can set a flag
(similar to m_checkedForHeadCharset). Would that be a good plan for bug 24420?
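The confidence-threshold scheme described above could look roughly like the following toy sketch (Python, with hypothetical names; only the flag's role is borrowed from WebKit's m_checkedForHeadCharset). The detector answers "unknown" until its internal threshold is cleared; once it commits, the caller sets a done-flag and stops re-running detection.

```python
# Illustrative sketch of the pre-scan idea: the detector reports "unknown"
# (None) until its confidence clears an internal threshold; the caller then
# sets a done-flag (analogous in spirit to m_checkedForHeadCharset) and
# never re-runs detection. The detector itself is a deliberately crude toy.

CONFIDENCE_THRESHOLD = 0.5   # internal to this toy detector

def detect_with_confidence(data):
    """Toy detector: returns (encoding, confidence), with None while unsure."""
    if not data:
        return None, 0.0
    ascii_bytes = sum(b < 0x80 for b in data)
    confidence = ascii_bytes / len(data)
    if confidence < CONFIDENCE_THRESHOLD:
        return None, confidence       # still "undetermined/unknown"
    return "us-ascii", confidence

class PreScanner:
    def __init__(self):
        self.checked_for_charset = False   # cf. m_checkedForHeadCharset
        self.encoding = None

    def scan(self, data):
        if self.checked_for_charset:
            return self.encoding           # decision is final; don't re-detect
        enc, _conf = detect_with_confidence(data)
        if enc is not None:
            self.encoding = enc
            self.checked_for_charset = True
        return enc
```

The point of keeping the threshold inside the detector, as the comment suggests, is that callers only ever see "unknown" or a committed answer, never a raw confidence they would have to interpret themselves.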
 
bug 16482 comment #31

Yes, it may well be. Are you familiar with how the Gecko encoding detector
works? How much data does it normally need to choose an encoding with
confidence (and how much data would you expect our detector to need)?



Comment 2 Jungshik Shin 2009-03-12 10:38:25 PDT
(In reply to comment #1)

> bug 16482 comment #31
> 
> Yes, it may well be. Are you familiar with how the Gecko encoding detector
> works? How much data does it normally need to choose an encoding with
> confidence (and how much data would you expect our detector to need)?

I have an overall understanding of Gecko's algorithms (byte unigram and byte bigram), but I don't know the criteria you asked about. However, I know two (or three) of the people who developed them and can ask them as a shortcut instead of reading the source code.

As for ICU's encoding detector, it uses byte trigrams, and my not-so-scientific experiment indicates that it can be pretty reliable after 1 kB or so. Needless to say, if the first 1 kB is (almost) entirely made up of ASCII bytes, it wouldn't work as well. ICU gives back a confidence level. The way it's calculated is not very elaborate, though, so we need to employ some heuristics when using it.
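For readers unfamiliar with trigram-based detection, here is a toy illustration in the spirit of what the comment describes. This is NOT ICU's actual algorithm or data (ICU's ucsdet scoring is more involved); it only shows why an (almost) all-ASCII prefix gives the detector little to work with.

```python
# Toy byte-trigram scorer (illustrative only, not ICU's algorithm):
# count how many trigrams containing non-ASCII bytes appear in a
# per-encoding frequency set, and report the hit ratio as a crude
# confidence. A pure-ASCII prefix yields no informative trigrams,
# so confidence stays at zero, matching the caveat above.

def trigram_confidence(data, known_trigrams):
    """Fraction of non-ASCII-bearing trigrams found in known_trigrams."""
    trigrams = [data[i:i + 3] for i in range(len(data) - 2)]
    informative = [t for t in trigrams if any(b >= 0x80 for b in t)]
    if not informative:
        return 0.0                     # pure ASCII: nothing to go on
    hits = sum(t in known_trigrams for t in informative)
    return hits / len(informative)
```

A real table would hold thousands of trigrams with frequencies per candidate encoding; the heuristics mentioned above would then sit on top of the returned confidence (e.g. requiring a minimum amount of non-ASCII input before trusting it).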

Comment 3 Alexey Proskuryakov 2009-03-12 10:51:07 PDT
Looks like ICU also has an upper limit: <http://sourceforge.net/mailarchive/message.php?msg_name=49382ACE.2080804%40icu-project.org>.