Bug 24420 - Consider doing encoding detection before decoding starts
Summary: Consider doing encoding detection before decoding starts
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading
Version: 528+ (Nightly build)
Hardware: Macintosh All
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-03-06 01:30 PST by Jungshik Shin
Modified: 2010-05-25 13:39 PDT
CC List: 1 user

See Also:


Attachments

Description Jungshik Shin 2009-03-06 01:30:03 PST
Spun off bug 16482.

Character encoding auto-detection is done after a part of an input file is decoded. We have to find a way to do the encoding detection before any decoding starts.
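To make the proposal concrete, here is a minimal, hypothetical sketch (in Python rather than WebKit's C++, and with invented names like `BufferingDecoder` and `max_prescan`) of the idea: buffer the raw bytes and run detection on the buffered prefix, and only start decoding once an encoding has been chosen or the pre-scan budget is exhausted.

```python
# Hypothetical sketch, not WebKit code: hold raw bytes until an encoding
# has been chosen, then decode everything, so no bytes are decoded with a
# possibly wrong encoding before detection has run.

class BufferingDecoder:
    """Buffers input until the detector commits to an encoding."""

    def __init__(self, detect, max_prescan=1024):
        self._detect = detect          # callable: bytes -> encoding name or None
        self._buffer = bytearray()
        self._encoding = None
        self._max_prescan = max_prescan

    def feed(self, chunk):
        """Returns decoded text once the encoding is known, else ''."""
        if self._encoding is None:
            self._buffer.extend(chunk)
            self._encoding = self._detect(bytes(self._buffer))
            if self._encoding is None and len(self._buffer) < self._max_prescan:
                return ""              # keep buffering; nothing decoded yet
            self._encoding = self._encoding or "utf-8"   # fallback guess
            pending, self._buffer = bytes(self._buffer), bytearray()
            return pending.decode(self._encoding)
        return chunk.decode(self._encoding)


def bom_detect(data):
    """Trivial detector used for illustration: BOM sniffing only."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    return None
```

The key property is that `feed` returns an empty string while detection is undecided, so no caller ever sees text decoded with a guess that later turns out wrong.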
Comment 1 Jungshik Shin 2009-03-12 10:20:24 PDT
bug 16482 comment #30 

How about doing a similar "pre-scanning"(?) with auto-detection? We can ask the
autodetector to return an "undetermined/unknown" indicator until it
reaches a confidence level higher than a certain value (that threshold should be
internal to each implementation of the autodetector). Then, we can set a flag
(similar to m_checkedForHeadCharset). Would that be a good plan for bug 24420?
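The confidence-threshold scheme described above could look roughly like the following toy sketch (Python, with hypothetical names; only the flag's role is borrowed from WebKit's m_checkedForHeadCharset). The detector answers "unknown" until its internal threshold is cleared; once it commits, the caller sets a done-flag and stops re-running detection.

```python
# Illustrative sketch of the pre-scan idea: the detector reports "unknown"
# (None) until its confidence clears an internal threshold; the caller then
# sets a done-flag (analogous in spirit to m_checkedForHeadCharset) and
# never re-runs detection. The detector itself is a deliberately crude toy.

CONFIDENCE_THRESHOLD = 0.5   # internal to this toy detector

def detect_with_confidence(data):
    """Toy detector: returns (encoding, confidence), with None while unsure."""
    if not data:
        return None, 0.0
    ascii_bytes = sum(b < 0x80 for b in data)
    confidence = ascii_bytes / len(data)
    if confidence < CONFIDENCE_THRESHOLD:
        return None, confidence       # still "undetermined/unknown"
    return "us-ascii", confidence

class PreScanner:
    def __init__(self):
        self.checked_for_charset = False   # cf. m_checkedForHeadCharset
        self.encoding = None

    def scan(self, data):
        if self.checked_for_charset:
            return self.encoding           # decision is final; don't re-detect
        enc, _conf = detect_with_confidence(data)
        if enc is not None:
            self.encoding = enc
            self.checked_for_charset = True
        return enc
```

The point of keeping the threshold inside the detector, as the comment suggests, is that callers only ever see "unknown" or a committed answer, never a raw confidence they would have to interpret themselves.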
 
bug 16482 comment #31

Yes, it may well be. Are you familiar with how the Gecko encoding detector
works? How much data does it normally need to choose an encoding with
confidence (and how much data would you expect our detector to need)?



Comment 2 Jungshik Shin 2009-03-12 10:38:25 PDT
(In reply to comment #1)

> bug 16482 comment #31
> 
> Yes, it may well be. Are you familiar with how the Gecko encoding detector
> works? How much data does it normally need to choose an encoding with
> confidence (and how much data would you expect our detector to need)?

I have an overall understanding of Gecko's algorithms (byte unigram and byte bigram), but I don't know the criteria you asked about. However, I know two (or three) of the people who developed them and can ask them as a shortcut instead of reading the source code.

As for ICU's encoding detector, it uses byte trigrams, and my not-so-scientific experiment indicates that it can be pretty reliable after 1 kB or so. Needless to say, if the first 1 kB is (almost) entirely made up of ASCII bytes, it wouldn't work as well. ICU gives back a confidence level. The way it's calculated is not very elaborate, though, so we need to employ some heuristics when using it.
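For readers unfamiliar with trigram-based detection, here is a toy illustration in the spirit of what the comment describes. This is NOT ICU's actual algorithm or data (ICU's ucsdet scoring is more involved); it only shows why an (almost) all-ASCII prefix gives the detector little to work with.

```python
# Toy byte-trigram scorer (illustrative only, not ICU's algorithm):
# count how many trigrams containing non-ASCII bytes appear in a
# per-encoding frequency set, and report the hit ratio as a crude
# confidence. A pure-ASCII prefix yields no informative trigrams,
# so confidence stays at zero, matching the caveat above.

def trigram_confidence(data, known_trigrams):
    """Fraction of non-ASCII-bearing trigrams found in known_trigrams."""
    trigrams = [data[i:i + 3] for i in range(len(data) - 2)]
    informative = [t for t in trigrams if any(b >= 0x80 for b in t)]
    if not informative:
        return 0.0                     # pure ASCII: nothing to go on
    hits = sum(t in known_trigrams for t in informative)
    return hits / len(informative)
```

A real table would hold thousands of trigrams with frequencies per candidate encoding; the heuristics mentioned above would then sit on top of the returned confidence (e.g. requiring a minimum amount of non-ASCII input before trusting it).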

Comment 3 Alexey Proskuryakov 2009-03-12 10:51:07 PDT
Looks like ICU also has an upper limit: <http://sourceforge.net/mailarchive/message.php?msg_name=49382ACE.2080804%40icu-project.org>.