WebKit Bugzilla
NEW
24420
Consider doing encoding detection before decoding starts
https://bugs.webkit.org/show_bug.cgi?id=24420
Jungshik Shin
Reported
2009-03-06 01:30:03 PST
Spun off from bug 16482. Character encoding auto-detection currently runs after part of the input file has already been decoded. We have to find a way to do the encoding detection before any decoding starts.
Jungshik Shin
Comment 1
2009-03-12 10:20:24 PDT
bug 16482 comment #30
How about doing a similar "pre-scanning" (?) with auto-detection? We can ask the autodetector to return an "undetermined/unknown" indicator until it reaches a confidence level higher than a certain value (that threshold should be internal to each autodetector implementation). Then we can set a flag (similar to m_checkedForHeadCharset). Would that be a good plan for bug 24420?
bug 16482 comment #31
Yes, it may well be. Are you familiar with how the Gecko encoding detector works? How much data does it normally need to choose an encoding with confidence (and how much data would you expect our detector to need)?
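The pre-scanning idea quoted above can be illustrated with a minimal Python sketch. This is not WebKit code: `PreScanningDecoder`, `toy_detector`, `checked_for_autodetect`, the `latin-1` fallback, and the 4096-byte cap are all assumptions made for illustration. The detector returns `None` while it is undetermined, and no byte is decoded until it commits to an answer or the cap is reached.

```python
import codecs
from typing import Callable, Optional

# A detector returns an encoding name, or None while it is still undetermined.
Detector = Callable[[bytes], Optional[str]]


def toy_detector(data: bytes) -> Optional[str]:
    """Hypothetical stand-in for a real detector (not Gecko's or ICU's)."""
    if all(b < 0x80 for b in data):
        return None  # pure ASCII gives no evidence yet
    try:
        # final=False tolerates a multi-byte sequence cut off at the chunk end.
        codecs.getincrementaldecoder("utf-8")().decode(data, final=False)
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


class PreScanningDecoder:
    """Buffers raw bytes until the detector answers, then starts decoding."""

    def __init__(self, detector: Detector, default: str = "latin-1",
                 max_prescan: int = 4096) -> None:
        self.detector = detector
        self.default = default          # assumed fallback encoding
        self.max_prescan = max_prescan  # assumed upper bound on buffering
        self.checked_for_autodetect = False  # plays the role of the flag
        self.encoding: Optional[str] = None
        self._pending = bytearray()
        self._decoder = None

    def _commit(self, encoding: str) -> str:
        # Fix the encoding, then decode everything buffered so far.
        self.encoding = encoding
        self.checked_for_autodetect = True
        self._decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
        text = self._decoder.decode(bytes(self._pending))
        self._pending.clear()
        return text

    def feed(self, chunk: bytes) -> str:
        if self.checked_for_autodetect:
            return self._decoder.decode(chunk)
        self._pending += chunk
        guess = self.detector(bytes(self._pending))
        if guess is not None:
            return self._commit(guess)
        if len(self._pending) >= self.max_prescan:
            return self._commit(self.default)
        return ""  # still pre-scanning; nothing has been decoded

    def flush(self) -> str:
        text = ""
        if not self.checked_for_autodetect:
            text = self._commit(self.default)
        return text + self._decoder.decode(b"", final=True)
```

The cost of this design is latency: text is withheld from the parser until detection settles, which is why the amount of data a detector needs matters.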
Jungshik Shin
Comment 2
2009-03-12 10:38:25 PDT
(In reply to comment #1)
> bug 16482 comment #31
> > Yes, it may well be. Are you familiar with how the Gecko encoding detector works? How much data does it normally need to choose an encoding with confidence (and how much data would you expect our detector to need)?
I have an overall understanding of Gecko's algorithms (byte unigram and byte bigram), but I don't know the criteria you asked about. However, I know two (or three) of the people who developed them and can ask them as a shortcut instead of reading the source code. As for ICU's encoding detector, it uses byte trigrams, and my not-so-scientific experiment indicates that it can be pretty reliable after 1 kB or so. Needless to say, if the first 1 kB is (almost) entirely made up of ASCII bytes, it will not work as well. ICU gives back a confidence level, but the way it is calculated is not very elaborate, so we need to employ some heuristics when using it.
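One possible heuristic of the kind mentioned above, sketched in Python: accept a guess only when the reported confidence is high enough and the sample contains enough non-ASCII bytes to mean something. The function name `accept_guess` and both thresholds are made-up assumptions, not values from ICU or WebKit. The confidence is assumed to be an integer from 0 to 100, which is the range ICU's detector reports.

```python
def accept_guess(sample: bytes, confidence: int,
                 min_confidence: int = 50, min_non_ascii: int = 16) -> bool:
    """Return True if a detector's guess for `sample` looks trustworthy.

    Both thresholds are illustrative assumptions and would need tuning
    against real pages.
    """
    # Count bytes outside the ASCII range; an almost-ASCII sample gives
    # the detector very little to work with.
    non_ascii = sum(1 for b in sample if b >= 0x80)
    return confidence >= min_confidence and non_ascii >= min_non_ascii
```

With these defaults, a 1 kB sample of pure ASCII is rejected regardless of the reported confidence, so the caller would keep buffering or fall back to a default encoding.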
Alexey Proskuryakov
Comment 3
2009-03-12 10:51:07 PDT
Looks like ICU also has an upper limit: <http://sourceforge.net/mailarchive/message.php?msg_name=49382ACE.2080804%40icu-project.org>.