NEW 24420
Consider doing encoding detection before decoding starts
https://bugs.webkit.org/show_bug.cgi?id=24420
Jungshik Shin
Reported 2009-03-06 01:30:03 PST
Spun off from bug 16482. Character encoding auto-detection is currently done after part of the input file has already been decoded. We have to find a way to do the encoding detection before any decoding starts.
Jungshik Shin
Comment 1 2009-03-12 10:20:24 PDT
bug 16482 comment #30:

> How about doing a similar "pre-scanning"(?) with auto-detection? We can ask the autodetector to return an "undetermined/unknown" indicator until it reaches a confidence level higher than a certain value (that threshold should be internal to each autodetector implementation). Then we can set a flag (similar to m_checkedForHeadCharset). Would it be a good plan for bug 24420?

bug 16482 comment #31:

> Yes, it may well be. Are you well familiar with how Gecko encoding detector works? How much data does it normally need to choose an encoding with confidence (and how much data would you expect our detector to need)?
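The pre-scanning idea quoted above can be sketched roughly as follows. This is only an illustration in Python, not WebKit code: `PrescanningDecoder`, `toy_detector`, the 4096-byte cap, and the windows-1252 fallback are all assumptions made for the example. The detector returns `None` while it is undecided, and nothing is decoded until it commits to an answer or the cap is reached.

```python
import codecs


def toy_detector(data):
    """Hypothetical stand-in for a real autodetector.

    Returns None ("undetermined") until it has seen a few non-ASCII bytes,
    then guesses UTF-8 if the bytes are valid UTF-8, else windows-1252.
    """
    if sum(b >= 0x80 for b in data) < 4:
        return None
    try:
        # An incremental decoder tolerates a truncated trailing sequence.
        codecs.getincrementaldecoder("utf-8")().decode(data)
    except UnicodeDecodeError:
        return "windows-1252"
    return "utf-8"


class PrescanningDecoder:
    """Buffers raw bytes and decodes nothing until an encoding is chosen."""

    def __init__(self, detect, fallback="windows-1252", max_prescan=4096):
        self._detect = detect
        self._fallback = fallback        # assumed default encoding
        self._max_prescan = max_prescan  # assumed cap on buffered bytes
        self._pending = bytearray()
        self._decoder = None             # set once; acts as the "checked" flag

    def _commit(self, encoding):
        self._decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
        data = bytes(self._pending)
        self._pending.clear()
        return self._decoder.decode(data)

    def feed(self, chunk):
        if self._decoder is not None:
            return self._decoder.decode(chunk)
        self._pending += chunk
        guess = self._detect(bytes(self._pending))
        if guess is None and len(self._pending) < self._max_prescan:
            return ""  # still undetermined: keep buffering
        return self._commit(guess or self._fallback)

    def finish(self):
        # If the input ended before the detector decided, use the fallback.
        text = self._commit(self._fallback) if self._decoder is None else ""
        return text + self._decoder.decode(b"", final=True)
```

The trade-off in such a design is latency: the longer the detector stays undecided, the longer the first decoded text is delayed, which is why the sketch caps the buffered amount.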
Jungshik Shin
Comment 2 2009-03-12 10:38:25 PDT
(In reply to comment #1)
> bug 16482 comment #31
>
> Yes, it may well be. Are you well familiar with how Gecko encoding detector
> works? How much data does it normally need to choose an encoding with
> confidence (and how much data would you expect our detector to need)?

I have an overall understanding of Gecko's algorithms (byte unigrams and byte bigrams), but I don't know the criteria you asked about. However, I know two (or three) of the people who developed them and can ask them as a shortcut instead of reading the source code.

As for ICU's encoding detector, it uses byte trigrams, and my not-so-scientific experiment indicates that it can be pretty reliable after 1 kB or so. Needless to say, if the first 1 kB is (almost) entirely made up of ASCII bytes, it would not work as well. ICU gives back a confidence level, but the way it is calculated is not very elaborate, so we need to employ some heuristics when using it.
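One possible shape for such heuristics, sketched in Python purely for illustration: the function name and every threshold below are assumptions, not values taken from ICU or WebKit. The only ICU behavior relied on is that the detector reports a confidence between 0 and 100.

```python
def should_trust_detection(sample, confidence,
                           min_bytes=1024,
                           min_non_ascii=16,
                           min_confidence=50):
    """Decide whether a detector's guess is worth acting on.

    sample      -- the raw bytes the detector was given
    confidence  -- the detector's reported confidence, 0..100
    All thresholds are illustrative assumptions.
    """
    # Too little input: the comment suggests roughly 1 kB is needed.
    if len(sample) < min_bytes:
        return False
    # Mostly-ASCII input carries little evidence about the encoding.
    non_ascii = sum(b >= 0x80 for b in sample)
    if non_ascii < min_non_ascii:
        return False
    return confidence >= min_confidence
```

A gate like this would reject a guess made on 2 kB of pure ASCII markup even if the detector reported high confidence, and would keep waiting for more data instead.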
Alexey Proskuryakov
Comment 3 2009-03-12 10:51:07 PDT