Bug 5187 - UTF-8 in long text files breaks at some point
Summary: UTF-8 in long text files breaks at some point
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit Misc.
Version: 420+
Hardware: Mac OS X 10.4
Importance: P2 Major
Assignee: Dave Hyatt
URL: http://www.opensource.apple.com/darwi...
Reported: 2005-09-29 11:34 PDT by Alexey Proskuryakov
Modified: 2005-10-10 09:33 PDT
Proposed patch (WebCore part) (2.70 KB, patch)
2005-10-01 08:58 PDT, Alexey Proskuryakov
no flags
Proposed patch (WebKit part) (3.93 KB, patch)
2005-10-01 08:59 PDT, Alexey Proskuryakov
no flags
Proposed patch (WebCore part) (2.45 KB, patch)
2005-10-02 00:58 PDT, Alexey Proskuryakov
mjs: review-
proposed patch (14.36 KB, patch)
2005-10-03 00:50 PDT, Alexey Proskuryakov
darin: review-
revised patch (14.10 KB, patch)
2005-10-06 12:22 PDT, Alexey Proskuryakov
darin: review+

Description Alexey Proskuryakov 2005-09-29 11:34:54 PDT
(also in Radar as rdar://4104720).

Steps to reproduce:
1. Open http://www.opensource.apple.com/darwinsource/10.4/ICU-6.2.4/icuSources/data/locales/ru.txt
2. Scroll down

At some point, the text gets completely garbled.

This file has a BOM, so the browser default encoding shouldn't matter.
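
To illustrate why a BOM should take precedence over the browser default, here is a minimal C++ sketch. The helper names (encodingFromBOM, chooseEncoding) are hypothetical and are not WebKit or KWQ functions; UTF-32 BOMs are deliberately ignored to keep it short.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not a WebKit API: returns the encoding implied by a
// leading byte order mark, or an empty string when no BOM is present.
// UTF-32 BOMs are not handled in this sketch.
std::string encodingFromBOM(const std::string& bytes) {
    const auto b = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return "UTF-8";
    if (bytes.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF)
        return "UTF-16BE";
    if (bytes.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE)
        return "UTF-16LE";
    return "";
}

// The BOM wins; the user's default encoding is only a fallback.
std::string chooseEncoding(const std::string& firstBytes,
                           const std::string& defaultEncoding) {
    const std::string fromBOM = encodingFromBOM(firstBytes);
    return fromBOM.empty() ? defaultEncoding : fromBOM;
}
```

With this rule, a file that starts with EF BB BF is treated as UTF-8 whatever the default encoding is, as long as the check is made on the start of the stream and not on each chunk.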
Comment 1 Alexey Proskuryakov 2005-09-30 14:03:46 PDT
The problem seems to be in [WebTextView appendReceivedData:fromDataSource:]:

        [self replaceCharactersInRange:NSMakeRange([[self string] length], 0)
                            withString:[dataSource _stringWithData:data]];

Here, _stringWithData creates a temporary QTextCodec for each data chunk, which is wrong for several 
reasons. First, the chunk border may split a character encoded as UTF-8 into meaningless parts. Second, 
QTextCodec may guess the encoding from a BOM, which is only present in the first chunk, obviously.

So far, I do not see how to cross the WebBridge and get incremental QTextCodec decoding.
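
The chunk-border problem can be shown with a small C++ sketch. StreamingUTF8Decoder and decodeChunkStateless are hypothetical names, not QTextCodec or the WebCoreDecoder from the patches; the sketch assumes UTF-8 input and does not reject overlong forms or surrogates. The point is only that decoder state must survive from one chunk to the next.

```cpp
#include <string>

constexpr char32_t kReplacement = 0xFFFD;

// Hypothetical illustration: keeps an unfinished character between chunks.
class StreamingUTF8Decoder {
public:
    // Decodes one chunk; a character cut off at the end of the chunk is
    // remembered and completed by the next call.
    std::u32string decode(const std::string& chunk) {
        std::u32string out;
        for (unsigned char byte : chunk) {
            if (m_remaining > 0 && (byte & 0xC0) == 0x80) {
                // Continuation byte of the character in progress.
                m_codePoint = (m_codePoint << 6) | (byte & 0x3F);
                if (--m_remaining == 0)
                    out.push_back(m_codePoint);
                continue;
            }
            if (m_remaining > 0) {
                // The sequence in progress was cut short by this byte.
                out.push_back(kReplacement);
                m_remaining = 0;
            }
            if (byte < 0x80) {
                out.push_back(byte);
            } else if ((byte & 0xE0) == 0xC0) {
                m_codePoint = byte & 0x1F;
                m_remaining = 1;
            } else if ((byte & 0xF0) == 0xE0) {
                m_codePoint = byte & 0x0F;
                m_remaining = 2;
            } else if ((byte & 0xF8) == 0xF0) {
                m_codePoint = byte & 0x07;
                m_remaining = 3;
            } else {
                // Stray continuation byte or invalid lead byte.
                out.push_back(kReplacement);
            }
        }
        return out;
    }

    // Call once at end of stream to report a truncated final character.
    std::u32string flush() {
        std::u32string out;
        if (m_remaining > 0) {
            out.push_back(kReplacement);
            m_remaining = 0;
        }
        return out;
    }

private:
    char32_t m_codePoint = 0;
    int m_remaining = 0;
};

// Mimics the buggy pattern: a fresh decoder for every chunk, so a character
// split across two chunks is lost on both sides of the border.
std::u32string decodeChunkStateless(const std::string& chunk) {
    StreamingUTF8Decoder fresh;
    std::u32string out = fresh.decode(chunk);
    out += fresh.flush();
    return out;
}
```

Feeding the bytes D0 B4 D0 and then B0 through one StreamingUTF8Decoder yields U+0434 U+0430, while decoding the same two chunks with decodeChunkStateless yields U+0434 followed by two replacement characters.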
Comment 2 Alexey Proskuryakov 2005-10-01 06:12:55 PDT
It may be appropriate to solve a related issue together with this one: WebTextRepresentation doesn't seem 
to respect user-default encoding. To reproduce:
1. Set the default encoding to Cyrillic (Windows)
2. Open http://nypop.com/webkit/sbrf.txt

Expected results: "TECT TECT" (first word in Cyrillic letters, second in Latin ones)
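
A rough C++ sketch of why the default encoding matters for such a file. The byte values used in the usage note are an assumption for illustration, not the verified contents of sbrf.txt, and both functions are hypothetical. The windows-1251 mapping covers only ASCII and the basic Russian letters (bytes 0xC0 to 0xFF map to U+0410 to U+044F); other high bytes become U+FFFD in this sketch.

```cpp
#include <string>

constexpr char32_t kReplacementCharacter = 0xFFFD;

// Latin-1: every byte maps to the code point with the same value.
std::u32string decodeLatin1(const std::string& bytes) {
    std::u32string out;
    for (unsigned char byte : bytes)
        out.push_back(byte);
    return out;
}

// Partial windows-1251 decoder: ASCII plus the letters at 0xC0..0xFF.
// Bytes 0x80..0xBF map irregularly and are not covered here.
std::u32string decodeWindows1251Letters(const std::string& bytes) {
    std::u32string out;
    for (unsigned char byte : bytes) {
        if (byte < 0x80)
            out.push_back(byte);
        else if (byte >= 0xC0)
            out.push_back(static_cast<char32_t>(0x0410 + (byte - 0xC0)));
        else
            out.push_back(kReplacementCharacter);
    }
    return out;
}
```

Under these assumptions the bytes D2 C5 D1 D2 decode to the Cyrillic letters U+0422 U+0415 U+0421 U+0422 as windows-1251, but to U+00D2 U+00C5 U+00D1 U+00D2 as Latin-1, so ignoring the user's default changes the visible text.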
Comment 3 Alexey Proskuryakov 2005-10-01 08:58:34 PDT
Created attachment 4131
Proposed patch (WebCore part)
Comment 4 Alexey Proskuryakov 2005-10-01 08:59:11 PDT
Created attachment 4132
Proposed patch (WebKit part)
Comment 5 Eric Seidel (no email) 2005-10-01 14:13:23 PDT
Comment on attachment 4131
Proposed patch (WebCore part)

Just a couple thoughts.

1.  I don't think there is any need to have a private class/private pointer in
the original.  WebCoreDecoder is never going to be API, no need to have a layer
of indirection.

2.  This was also unnecessary:
-    khtml::Decoder *decoder = new khtml::Decoder();
+    Decoder *decoder = new Decoder();

I'm not sure what the policy is on "using namespace" calls in Obj-C++ files, if
there is one.
Comment 6 Alexey Proskuryakov 2005-10-02 00:58:50 PDT
Created attachment 4142
Proposed patch (WebCore part)

1. Removed the unnecessary indirection level.
2. Looks like most Obj-C++ files have "using namespace" directives. My guess is
that WebCoreEncodings.mm didn't use it because it was so tiny.
Comment 7 Maciej Stachowiak 2005-10-02 20:30:29 PDT
> +- (void)flushReceivedDataFromDataSource:(WebDataSource *)dataSource;

This function doesn't appear to use dataSource - why pass it?

I think it would be better to put the WebCoreDecoder class in its own header called WebCoreDecoder.h. 
Also perhaps decodeData: could become a class method on WebCoreDecoder instead of sitting around on 
the WebCoreEncodings class.

These are only style issues; the substance of the patch all looks great to me.
Comment 8 Alexey Proskuryakov 2005-10-03 00:50:30 PDT
Created attachment 4173
proposed patch

Addressed the comments, merged into a single patch.

Didn't move [WebCoreEncodings decodeData:] to WebCoreDecoder, because it has
substantially different semantics (uses khtml::Decoder rather than
Comment 9 Darin Adler 2005-10-06 10:25:09 PDT
Comment on attachment 4173
proposed patch

The long term fix for this is to implement text viewing using the HTML view --
we've planned to make this change for a while, but in the meantime this is a
great fix to have.

Since the WebCoreDecoder class is new code, the headers should say 2005, not

No need to initialize variables to 0 in Objective-C, they are all set to 0

I think the code in appendReceivedData would read better if textEncodingName
was set inside the if instead of having two separate branches that both create
a decoder.

WebCoreDecoder needs a finalize method that deletes the QTextCodec for when
we're used in GC mode.

For files in the KWQ directory, we should use the real filenames with ""
instead of the Qt names with <>. So #import "KWQTextCodec.h".

initWithEncodingName should use QTextCodec::codecForName. I don't think that
it's worth it to use KGlobal and KCharsets just to get the Latin-1 default.
It's just two lines of code. Or we could consider having WebCoreDecoder return
nil when passed an encoding name it doesn't know, and have the "fall back on
Latin-1" rule out in WebTextView.
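
The second option (the lookup reports an unknown name and the caller applies the Latin-1 rule) could look roughly like this C++ sketch. Both functions and the short list of names are hypothetical stand-ins; this is not QTextCodec::codecForName or the code in the patch, and an empty string plays the role of nil.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical stand-in for a codec lookup. Knows only a few sample names
// and returns an empty string for anything else.
std::string canonicalEncodingName(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    if (name == "utf-8" || name == "utf8")
        return "UTF-8";
    if (name == "windows-1251" || name == "cp1251")
        return "windows-1251";
    if (name == "iso-8859-1" || name == "latin1")
        return "ISO-8859-1";
    return "";
}

// Caller-side policy: the "fall back on Latin-1" rule lives here, not in
// the lookup itself.
std::string encodingForTextView(const std::string& requested) {
    const std::string known = canonicalEncodingName(requested);
    return known.empty() ? "ISO-8859-1" : known;
}
```

Keeping the fallback in the caller leaves the lookup honest about what it does not know, so other callers can choose a different policy.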
Comment 10 Alexey Proskuryakov 2005-10-06 12:22:56 PDT
Created attachment 4239
revised patch
Comment 11 Darin Adler 2005-10-06 13:12:42 PDT
Comment on attachment 4239
revised patch