5187 – UTF-8 in long text files breaks at some point

VERIFIED FIXED 5187

UTF-8 in long text files breaks at some point

https://bugs.webkit.org/show_bug.cgi?id=5187

Summary UTF-8 in long text files breaks at some point

Alexey Proskuryakov

Reported 2005-09-29 11:34:54 PDT

(also in Radar as rdar://4104720). Steps to reproduce: 1. Open http://www.opensource.apple.com/darwinsource/10.4/ICU-6.2.4/icuSources/data/locales/ru.txt 2. Scroll down At some point, the text gets completely garbled. This file has a BOM, so the browser default encoding shouldn't matter.

Attachments
Proposed patch (WebCore part) (2.70 KB, patch) 2005-10-01 08:58 PDT, Alexey Proskuryakov	no flags	Details Formatted Diff Diff
Proposed patch (WebKit part) (3.93 KB, patch) 2005-10-01 08:59 PDT, Alexey Proskuryakov	no flags	Details Formatted Diff Diff
Proposed patch (WebCore part) (2.45 KB, patch) 2005-10-02 00:58 PDT, Alexey Proskuryakov	mjs: review-	Details Formatted Diff Diff
proposed patch (14.36 KB, patch) 2005-10-03 00:50 PDT, Alexey Proskuryakov	darin: review-	Details Formatted Diff Diff
revised patch (14.10 KB, patch) 2005-10-06 12:22 PDT, Alexey Proskuryakov	darin: review+	Details Formatted Diff Diff
Show Obsolete (4) View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2005-09-30 14:03:46 PDT

The problem seems to be in [WebTextView appendReceivedData:fromDataSource:]: [self replaceCharactersInRange:NSMakeRange([[self string] length], 0) withString:[dataSource _stringWithData:data]]; Here, _stringWithData creates a temporary QTextCodec for each data chunk, which is wrong for several reasons. First, the chunk border may split a character encoded as UTF-8 into meaningless parts. Second, QTextCodec may guess the encoding from a BOM, which is only present in the first chunk, obviously. So far, I do not see how to cross the WebBridge and get incremental QTextCodec decoding.

Alexey Proskuryakov

Comment 2 2005-10-01 06:12:55 PDT

It may be appropriate to solve a related issue together with this one: WebTextRepresentation doesn't seem to respect user-default encoding. To reproduce: 1. Set the default encoding to Cyrillic (Windows) 2. Open http://nypop.com/webkit/sbrf.txt Expected results: "TECT TECT" (first word in Cyrillic letters, second in Latin ones)

Alexey Proskuryakov

Comment 3 2005-10-01 08:58:34 PDT

Created attachment 4131 [details] Proposed patch (WebCore part)

Alexey Proskuryakov

Comment 4 2005-10-01 08:59:11 PDT

Created attachment 4132 [details] Proposed patch (WebKit part)

Eric Seidel (no email)

Comment 5 2005-10-01 14:13:23 PDT

Comment on attachment 4131 [details] Proposed patch (WebCore part) Just a couple thoughts. 1. I don't think there is any need to have a private class/private pointer in the original. WebCoreDecoder is never going to be API, no need to have a layer of indirection. 2. This was also uncessary: - khtml::Decoder *decoder = new khtml::Decoder(); + Decoder *decoder = new Decoder(); I'm not sure what the policy is on "using namespace" calls in Obj-C++ files, if there is one.

Alexey Proskuryakov

Comment 6 2005-10-02 00:58:50 PDT

Created attachment 4142 [details] Proposed patch (WebCore part) 1. Removed the unnesessary indirection level. 2. Looks like most Obj-C++ files have "using namespace" directives. My guess is that WebCoreEncodings.mm didn't use it because it was so tiny.

Maciej Stachowiak

Comment 7 2005-10-02 20:30:29 PDT

> +- (void)flushReceivedDataFromDataSource:(WebDataSource *)dataSource; This function doesn't appear to use dataSource - why pass it? I think it would be better to put the WebCoreDecoder class in its own header called WebCoreDecoder.h. Also perhaps decodeData: could become a class method on WebCoreDecoder instead of sitting around on the WebCoreEncodings class. These are only style issues, the substance of the patch all looks great to me.

Alexey Proskuryakov

Comment 8 2005-10-03 00:50:30 PDT

Created attachment 4173 [details] proposed patch Addressed the comments, merged into a single patch. Didn't move [WebCoreEncodings decodeData:] to WebCoreDecoder, because it has substantially different semantics (uses khtml::Decoder rather than QTextDecoder).

Darin Adler

Comment 9 2005-10-06 10:25:09 PDT

Comment on attachment 4173 [details] proposed patch The long term fix for this is to implement text viewing using the HTML view -- we've planned to make this change for a while, but in the mean time this is a great fix to have. Since the WebCoreDecoder class is new code, the headers should say 2005, not 2003. No need to initialize variables to 0 in Objective-C, they are all set to 0 automatically. I think the code in appendReceivedData would read better if textEncodingName was set inside the if instead of having two separate branches that both create a decoder. WebCoreDecoder needs a finalize method that deletes the QTextCodec for when we're used in GC mode. For files in the KWQ directory, we should use the real filenames with "" instead of the Qt names with <>. So #import "KWQTextCodec.h". initWithEncodingName should use QTextCodec::codecForName. I don't think that it's worth it to use KGlobal and KCharsets just to get the Latin-1 default. It's just two lines of code. Or we could consider having WebCoreDecoder return nil when passed an encoding name it doesn't know, and have the "fall back on Latin-1" rule out in WebTextView.

Alexey Proskuryakov

Comment 10 2005-10-06 12:22:56 PDT

Created attachment 4239 [details] revised patch

Darin Adler

Comment 11 2005-10-06 13:12:42 PDT

Comment on attachment 4239 [details] revised patch r=me

Note You need to log in before you can comment on or make changes to this bug.