Bug 39131 - Refactor text encoding detection logic in FileReader
Summary: Refactor text encoding detection logic in FileReader
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore JavaScript (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Jian Li
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-14 13:09 PDT by Jian Li
Modified: 2010-05-19 17:25 PDT (History)
4 users (show)

See Also:


Attachments
Proposed Patch (3.68 KB, patch)
2010-05-14 13:11 PDT, Jian Li
ap: review-
jianli: commit-queue-
Details | Formatted Diff | Diff
Proposed Patch (10.01 KB, patch)
2010-05-19 15:51 PDT, Jian Li
jianli: commit-queue-
Details | Formatted Diff | Diff
Proposed Patch (9.37 KB, patch)
2010-05-19 16:40 PDT, Jian Li
ap: review+
jianli: commit-queue-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jian Li 2010-05-14 13:09:58 PDT
This is the cleanup for the FileReader patch (https://bugs.webkit.org/show_bug.cgi?id=38157) per the comment from Adam Barth:

WebCore/html/FileReader.cpp:256
 +          if (m_rawData.size() >= 2 && m_rawData[0] == '\xFE' && m_rawData[1] == '\xFF') {
Yuck.  Is this in the spec?  We should move this to a more generic location.
Comment 1 Jian Li 2010-05-14 13:11:21 PDT
Created attachment 56101 [details]
Proposed Patch
Comment 2 Alexey Proskuryakov 2010-05-14 16:49:06 PDT
Comment on attachment 56101 [details]
Proposed Patch

Encoding sniffing is definitely not something the TextEncoding class should responsible for. And I think that we already have all the code for this - can't you just feed the data to TextResourceDecoder as text/plain?
Comment 3 Jian Li 2010-05-19 15:51:26 PDT
Created attachment 56525 [details]
Proposed Patch

Changed to call TextResourceDecoder::decode.
Comment 4 Alexey Proskuryakov 2010-05-19 16:26:23 PDT
+            // Per the File API spec, we SHOULD use the supplied encoding and do not check for BOM if the encoding is provided and valid.

Since this is SHOULD, and not a MUST, we could also ignore this requirement, like we do for Web content (it's also debatable whether it's right to always have the BOM override every other indication of character set, but that's what we do).

I don't have a strong opinion, just wanted to point out that it might be good to be consistent within WebKit.
Comment 5 Jian Li 2010-05-19 16:40:34 PDT
Created attachment 56531 [details]
Proposed Patch

Your suggestion makes sense. I changed to ignore this requirement in order to be consistent with other part of WebKit.
Comment 6 Alexey Proskuryakov 2010-05-19 16:54:29 PDT
Comment on attachment 56531 [details]
Proposed Patch

+    m_result = m_decoder->decode(&m_rawData.at(0), m_rawData.size());

One also needs to call flush() to make sure that e.g. incomplete UTF-8 sequences at the end get converted to U+FFFD REPLACEMENT CHARACTER, and to be generally compatible with TextResourceDecoder design.

r=me if you call flush() and append the result.
Comment 7 Jian Li 2010-05-19 17:25:15 PDT
Fixed as suggested and committed as http://trac.webkit.org/changeset/59797.