RESOLVED INVALID 27366
TextEncodingDetector that uses Universal Charset Detector
https://bugs.webkit.org/show_bug.cgi?id=27366
Summary TextEncodingDetector that uses Universal Charset Detector
Kwang Yul Seo
Reported 2009-07-17 03:44:16 PDT
Add a TextEncodingDetector implementation that uses the Universal Charset Detector from Mozilla. The source code is taken from mozilla-central/extensions/universalchardet/src/base/.

Universal Charset Detector is not usually available as a shared C/C++ library, so I included all of the code. The original code consists of many files, but I merged them into a single file so that it can be added easily to the build systems of the many WebKit ports.

I changed the coding style to follow the WebKit Style Guidelines and ran cpplint.py to ensure that there are no style errors. However, I have not changed the class and method names of the Mozilla code, because I think it is better to preserve the original names. Please tell me if there is a policy on this issue.

Currently, there is only one implementation of TextEncodingDetector, TextEncodingDetectorICU. Ports without ICU can use TextEncodingDetectorUniversal by default because it imposes no external dependency.
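To illustrate the shape of such an adapter, here is a minimal sketch and not the code from the patch. It assumes a streaming detector that is fed bytes, told when the data ends, and reports a charset name through a callback. The names StubCharsetDetector, ReportingDetector, and detectTextEncodingSketch are hypothetical. The stub only recognizes Unicode byte-order marks, whereas the real detector relies on much more elaborate probing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a streaming charset detector: feed bytes,
// signal the end of the data, and receive a charset name via report().
// ASSUMPTION: this stub only checks Unicode byte-order marks. It does not
// reproduce the statistical probers of the real Universal Charset Detector.
class StubCharsetDetector {
public:
    virtual ~StubCharsetDetector() = default;

    void handleData(const char* data, size_t length)
    {
        assert(data || !length);
        if (m_done || length < 2)
            return;
        const unsigned char* bytes = reinterpret_cast<const unsigned char*>(data);
        if (length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            finish("UTF-8");
        else if (bytes[0] == 0xFE && bytes[1] == 0xFF)
            finish("UTF-16BE");
        else if (bytes[0] == 0xFF && bytes[1] == 0xFE)
            finish("UTF-16LE"); // Simplification: ignores the UTF-32LE BOM.
    }

    void dataEnd() { m_done = true; }

protected:
    // Called at most once, when the detector has settled on a charset.
    virtual void report(const char* charsetName) = 0;

private:
    void finish(const char* name)
    {
        m_done = true;
        report(name);
    }

    bool m_done = false;
};

// Adapter that stores whatever charset name the detector reports.
class ReportingDetector final : public StubCharsetDetector {
public:
    const std::string& detectedCharset() const { return m_charset; }

private:
    void report(const char* charsetName) override { m_charset = charsetName; }

    std::string m_charset;
};

// Hypothetical entry point: returns true and fills in the encoding name
// only if the detector reported something.
bool detectTextEncodingSketch(const char* data, size_t length, std::string& detectedEncodingName)
{
    ReportingDetector detector;
    detector.handleData(data, length);
    detector.dataEnd();
    if (detector.detectedCharset().empty())
        return false;
    detectedEncodingName = detector.detectedCharset();
    return true;
}
```

A caller would pass the raw bytes of a document to detectTextEncodingSketch and fall back to a default encoding when it returns false. Keeping the detector behind one function like this is what lets a port swap the ICU-based implementation for a dependency-free one.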
Attachments
TextEncodingDetectorUniversal (521.17 KB, patch)
2009-07-17 03:48 PDT, Kwang Yul Seo
mjs: review-
Kwang Yul Seo
Comment 1 2009-07-17 03:48:07 PDT
Created attachment 32926: TextEncodingDetectorUniversal

No change to the build.
Maciej Stachowiak
Comment 2 2009-07-21 00:49:45 PDT
Comment on attachment 32926: TextEncodingDetectorUniversal

What's our plan for maintaining this code? Will we sync with upstream periodically or maintain it ourselves? Either way, putting all the code in one file seems like a bad idea, because it will make maintenance harder. Please resubmit with the files split properly.

I have no comment on whether including this is a good idea; international text experts should probably chime in. Is there any information available on Universal Charset Detector, what it does, and how it works?
Kwang Yul Seo
Comment 3 2009-07-21 03:57:10 PDT
Okay. I will resubmit the files.

Universal Charset Detector is a language/encoding detector. There is a good paper on it: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

Universal Charset Detector is just another encoding detector, but it performs much better than the current ICU encoding detector. There was a discussion on the merits of an encoding detector: https://bugs.webkit.org/show_bug.cgi?id=16482

I think syncing with upstream periodically is a good strategy here because the code is quite stable.