Bug 5187

Summary: UTF-8 in long text files breaks at some point
Product: WebKit Reporter: Alexey Proskuryakov <ap>
Component: WebKit Misc.Assignee: Dave Hyatt <hyatt>
Severity: Major    
Priority: P2    
Version: 420+   
Hardware: Mac   
OS: OS X 10.4   
URL: http://www.opensource.apple.com/darwinsource/10.4/ICU-6.2.4/icuSources/data/locales/ru.txt
Description Flags
Proposed patch (WebCore part)
Proposed patch (WebKit part)
Proposed patch (WebCore part)
mjs: review-
proposed patch
darin: review-
revised patch darin: review+

Description Alexey Proskuryakov 2005-09-29 11:34:54 PDT
(also in Radar as rdar://4104720).

Steps to reproduce:
1. Open http://www.opensource.apple.com/darwinsource/10.4/ICU-6.2.4/icuSources/data/locales/ru.txt
2. Scroll down

At some point, the text gets completely garbled.

This file has a BOM, so the browser default encoding shouldn't matter.
Comment 1 Alexey Proskuryakov 2005-09-30 14:03:46 PDT
The problem seems to be in [WebTextView appendReceivedData:fromDataSource:]:

        [self replaceCharactersInRange:NSMakeRange([[self string] length], 0)
                            withString:[dataSource _stringWithData:data]];

Here, _stringWithData creates a temporary QTextCodec for each data chunk, which is wrong for several 
reasons. First, the chunk border may split a character encoded as UTF-8 into meaningless parts. Second, 
QTextCodec may guess the encoding from a BOM, which is only present in the first chunk, obviously.

So far, I do not see how to cross the WebBridge and get incremental QTextCodec decoding.
Comment 2 Alexey Proskuryakov 2005-10-01 06:12:55 PDT
It may be appropriate to solve a related issue together with this one: WebTextRepresentation doesn't seem 
to respect user-default encoding. To reproduce:
1. Set the default encoding to Cyrillic (Windows)
2. Open http://nypop.com/webkit/sbrf.txt

Expected results: "TECT TECT" (first word in Cyrillic letters, second in Latin ones)
Comment 3 Alexey Proskuryakov 2005-10-01 08:58:34 PDT
Created attachment 4131 [details]
Proposed patch (WebCore part)
Comment 4 Alexey Proskuryakov 2005-10-01 08:59:11 PDT
Created attachment 4132 [details]
Proposed patch (WebKit part)
Comment 5 Eric Seidel (no email) 2005-10-01 14:13:23 PDT
Comment on attachment 4131 [details]
Proposed patch (WebCore part)

Just a couple thoughts.

1.  I don't think there is any need to have a private class/private pointer in
the original.  WebCoreDecoder is never going to be API, no need to have a layer
of indirection.

2.  This was also uncessary:
-    khtml::Decoder *decoder = new khtml::Decoder();
+    Decoder *decoder = new Decoder();

I'm not sure what the policy is on "using namespace" calls in Obj-C++ files, if
there is one.
Comment 6 Alexey Proskuryakov 2005-10-02 00:58:50 PDT
Created attachment 4142 [details]
Proposed patch (WebCore part)

1. Removed the unnesessary indirection level.
2. Looks like most Obj-C++ files have "using namespace" directives. My guess is
that WebCoreEncodings.mm didn't use it because it was so tiny.
Comment 7 Maciej Stachowiak 2005-10-02 20:30:29 PDT
> +- (void)flushReceivedDataFromDataSource:(WebDataSource *)dataSource;

This function doesn't appear to use dataSource - why pass it?

I think it would be better to put the WebCoreDecoder class in its own header called WebCoreDecoder.h. 
Also perhaps decodeData: could become a class method on WebCoreDecoder instead of sitting around on 
the WebCoreEncodings class.

These are only style issues, the substance of the patch all looks great to me.
Comment 8 Alexey Proskuryakov 2005-10-03 00:50:30 PDT
Created attachment 4173 [details]
proposed patch

Addressed the comments, merged into a single patch.

Didn't move [WebCoreEncodings decodeData:] to WebCoreDecoder, because it has
substantially different semantics (uses khtml::Decoder rather than
Comment 9 Darin Adler 2005-10-06 10:25:09 PDT
Comment on attachment 4173 [details]
proposed patch

The long term fix for this is to implement text viewing using the HTML view --
we've planned to make this change for a while, but in the mean time this is a
great fix to have.

Since the WebCoreDecoder class is new code, the headers should say 2005, not

No need to initialize variables to 0 in Objective-C, they are all set to 0

I think the code in appendReceivedData would read better if textEncodingName
was set inside the if instead of having two separate branches that both create
a decoder.

WebCoreDecoder needs a finalize method that deletes the QTextCodec for when
we're used in GC mode.

For files in the KWQ directory, we should use the real filenames with ""
instead of the Qt names with <>. So #import "KWQTextCodec.h".

initWithEncodingName should use QTextCodec::codecForName. I don't think that
it's worth it to use KGlobal and KCharsets just to get the Latin-1 default.
It's just two lines of code. Or we could consider having WebCoreDecoder return
nil when passed an encoding name it doesn't know, and have the "fall back on
Latin-1" rule out in WebTextView.
Comment 10 Alexey Proskuryakov 2005-10-06 12:22:56 PDT
Created attachment 4239 [details]
revised patch
Comment 11 Darin Adler 2005-10-06 13:12:42 PDT
Comment on attachment 4239 [details]
revised patch