Spun off bug 16482.
Character encoding auto-detection currently runs after part of the input file has already been decoded. We need to find a way to do the encoding detection before any decoding starts.
bug 16482 comment #30
How about doing a similar "pre-scanning"(?) with auto-detection? We can ask the
autodetector to return an indicator of "undeterministic/unknown" until it
reaches a confidence level higher than a certain value (that threshold should be
internal to each implementation of the autodetector). Then, we can set a flag
(similar to m_checkedForHeadCharset). Would that be a good plan for bug 24420?
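A minimal sketch of that plan, in case it helps the discussion. All names here are hypothetical (nothing below exists in WebKit), and the confidence rule is a toy stand-in for the real detector statistics; the point is only the shape: feed undecoded bytes, get "unknown" back until an internal threshold is crossed, then latch a flag like m_checkedForHeadCharset.

```cpp
#include <cstddef>

// Hypothetical detector answer: Unknown until the implementation-internal
// confidence threshold is reached.
enum class DetectedEncoding { Unknown, UTF8 /* , ... */ };

class HypotheticalAutoDetector {
public:
    // Feed raw, still-undecoded bytes; answers Unknown until confident.
    DetectedEncoding feed(const char* data, std::size_t length)
    {
        for (std::size_t i = 0; i < length; ++i) {
            if (static_cast<unsigned char>(data[i]) >= 0x80)
                ++m_nonAsciiBytes;
        }
        // Toy confidence rule: commit once enough non-ASCII evidence
        // has accumulated. A real detector would run its n-gram model here.
        if (m_nonAsciiBytes >= kThreshold)
            return DetectedEncoding::UTF8; // placeholder decision
        return DetectedEncoding::Unknown;
    }

private:
    static const std::size_t kThreshold = 4; // internal to this detector
    std::size_t m_nonAsciiBytes = 0;
};

// Caller side: a flag similar to m_checkedForHeadCharset, set once the
// detector stops answering Unknown, so the pre-scan runs only until then.
struct DecoderState {
    bool m_checkedForAutoDetectedCharset = false; // hypothetical name
    HypotheticalAutoDetector m_detector;

    void prescan(const char* data, std::size_t length)
    {
        if (m_checkedForAutoDetectedCharset)
            return;
        if (m_detector.feed(data, length) != DetectedEncoding::Unknown)
            m_checkedForAutoDetectedCharset = true;
    }
};
```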
bug 16482 comment #31
Yes, it may well be. Are you familiar with how the Gecko encoding detector
works? How much data does it normally need to choose an encoding with
confidence (and how much data would you expect our detector to need)?
(In reply to comment #1)
> bug 16482 comment #31
> Yes, it may well be. Are you familiar with how the Gecko encoding detector
> works? How much data does it normally need to choose an encoding with
> confidence (and how much data would you expect our detector to need)?
I have an overall understanding of Gecko's algorithms (byte unigram and byte bigram), but I don't know the criteria you asked about. However, I know two (or three) of the people who developed them and can ask them as a shortcut instead of reading the source code.
As for ICU's encoding detector, it uses byte trigrams, and my not-so-scientific experiment indicates that it can be pretty reliable after 1kB or so. Needless to say, if the first 1kB is (almost) entirely made up of ASCII bytes, it won't work as well. ICU does report a confidence level, but the way it's calculated is not very elaborate, so we need to employ some heuristics of our own when using it.
Looks like ICU also has an upper limit: <http://sourceforge.net/mailarchive/message.php?msg_name=49382ACE.2080804%40icu-project.org>.
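One way the "employ some heuristics" idea could look, as a sketch only: before trusting the confidence a byte-n-gram detector reports, check how much of the first ~1kB is actually non-ASCII, and demand a higher confidence when the sample is almost entirely ASCII. The function name, the percentage cutoff, and both confidence thresholds below are invented for illustration, not measured values.

```cpp
#include <cstddef>

// Hypothetical heuristic: a statistical detector's confidence means
// little when the sample is almost all ASCII, so discount it then.
// reportedConfidence is assumed to be on a 0-100 scale (as ICU uses).
bool shouldTrustDetector(const unsigned char* data, std::size_t length,
                         int reportedConfidence)
{
    const std::size_t kSampleLimit = 1024; // ~1kB was enough in the experiment
    std::size_t sample = length < kSampleLimit ? length : kSampleLimit;
    std::size_t nonAscii = 0;
    for (std::size_t i = 0; i < sample; ++i) {
        if (data[i] >= 0x80)
            ++nonAscii;
    }
    // Under 1% non-ASCII bytes: require a much higher confidence.
    // (Both cutoffs are illustrative, not tuned.)
    if (nonAscii * 100 < sample)
        return reportedConfidence >= 90;
    return reportedConfidence >= 50;
}
```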