|Summary:||UTF-8 in long text files breaks at some point|
|Product:||WebKit||Reporter:||Alexey Proskuryakov <ap>|
|Component:||WebKit Misc.||Assignee:||Dave Hyatt <hyatt>|
|OS:||OS X 10.4|
Description Alexey Proskuryakov 2005-09-29 11:34:54 PDT
(also in Radar as rdar://4104720). Steps to reproduce: 1. Open http://www.opensource.apple.com/darwinsource/10.4/ICU-6.2.4/icuSources/data/locales/ru.txt 2. Scroll down At some point, the text gets completely garbled. This file has a BOM, so the browser default encoding shouldn't matter.
Comment 1 Alexey Proskuryakov 2005-09-30 14:03:46 PDT
The problem seems to be in [WebTextView appendReceivedData:fromDataSource:]: [self replaceCharactersInRange:NSMakeRange([[self string] length], 0) withString:[dataSource _stringWithData:data]]; Here, _stringWithData creates a temporary QTextCodec for each data chunk, which is wrong for several reasons. First, the chunk border may split a character encoded as UTF-8 into meaningless parts. Second, QTextCodec may guess the encoding from a BOM, which is only present in the first chunk, obviously. So far, I do not see how to cross the WebBridge and get incremental QTextCodec decoding.
Comment 2 Alexey Proskuryakov 2005-10-01 06:12:55 PDT
It may be appropriate to solve a related issue together with this one: WebTextRepresentation doesn't seem to respect user-default encoding. To reproduce: 1. Set the default encoding to Cyrillic (Windows) 2. Open http://nypop.com/webkit/sbrf.txt Expected results: "TECT TECT" (first word in Cyrillic letters, second in Latin ones)
Comment 3 Alexey Proskuryakov 2005-10-01 08:58:34 PDT
Created attachment 4131 [details] Proposed patch (WebCore part)
Comment 4 Alexey Proskuryakov 2005-10-01 08:59:11 PDT
Created attachment 4132 [details] Proposed patch (WebKit part)
Comment 5 Eric Seidel (no email) 2005-10-01 14:13:23 PDT
Comment on attachment 4131 [details] Proposed patch (WebCore part) Just a couple thoughts. 1. I don't think there is any need to have a private class/private pointer in the original. WebCoreDecoder is never going to be API, no need to have a layer of indirection. 2. This was also uncessary: - khtml::Decoder *decoder = new khtml::Decoder(); + Decoder *decoder = new Decoder(); I'm not sure what the policy is on "using namespace" calls in Obj-C++ files, if there is one.
Comment 6 Alexey Proskuryakov 2005-10-02 00:58:50 PDT
Created attachment 4142 [details] Proposed patch (WebCore part) 1. Removed the unnesessary indirection level. 2. Looks like most Obj-C++ files have "using namespace" directives. My guess is that WebCoreEncodings.mm didn't use it because it was so tiny.
Comment 7 Maciej Stachowiak 2005-10-02 20:30:29 PDT
> +- (void)flushReceivedDataFromDataSource:(WebDataSource *)dataSource; This function doesn't appear to use dataSource - why pass it? I think it would be better to put the WebCoreDecoder class in its own header called WebCoreDecoder.h. Also perhaps decodeData: could become a class method on WebCoreDecoder instead of sitting around on the WebCoreEncodings class. These are only style issues, the substance of the patch all looks great to me.
Comment 8 Alexey Proskuryakov 2005-10-03 00:50:30 PDT
Created attachment 4173 [details] proposed patch Addressed the comments, merged into a single patch. Didn't move [WebCoreEncodings decodeData:] to WebCoreDecoder, because it has substantially different semantics (uses khtml::Decoder rather than QTextDecoder).
Comment 9 Darin Adler 2005-10-06 10:25:09 PDT
Comment on attachment 4173 [details] proposed patch The long term fix for this is to implement text viewing using the HTML view -- we've planned to make this change for a while, but in the mean time this is a great fix to have. Since the WebCoreDecoder class is new code, the headers should say 2005, not 2003. No need to initialize variables to 0 in Objective-C, they are all set to 0 automatically. I think the code in appendReceivedData would read better if textEncodingName was set inside the if instead of having two separate branches that both create a decoder. WebCoreDecoder needs a finalize method that deletes the QTextCodec for when we're used in GC mode. For files in the KWQ directory, we should use the real filenames with "" instead of the Qt names with <>. So #import "KWQTextCodec.h". initWithEncodingName should use QTextCodec::codecForName. I don't think that it's worth it to use KGlobal and KCharsets just to get the Latin-1 default. It's just two lines of code. Or we could consider having WebCoreDecoder return nil when passed an encoding name it doesn't know, and have the "fall back on Latin-1" rule out in WebTextView.
Comment 10 Alexey Proskuryakov 2005-10-06 12:22:56 PDT
Created attachment 4239 [details] revised patch