Bug 5744 - XMLHttpRequest does not apply page encoding after assigning via innerHtml
Summary: XMLHttpRequest does not apply page encoding after assigning via innerHtml
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit Misc. (show other bugs)
Version: 420+
Hardware: Mac OS X 10.4
: P2 Major
Assignee: Alexey Proskuryakov
URL: http://idptest.bl.uk
Keywords:
Depends on:
Blocks:
 
Reported: 2005-11-14 17:37 PST by Michael Kaye
Modified: 2006-07-06 04:10 PDT (History)
1 user (show)

See Also:


Attachments
proposed patch (5.82 KB, patch)
2005-11-30 13:50 PST, Alexey Proskuryakov
no flags Details | Formatted Diff | Diff
proposed patch (5.80 KB, patch)
2005-12-02 13:40 PST, Alexey Proskuryakov
darin: review-
Details | Formatted Diff | Diff
updated patch (7.17 KB, patch)
2005-12-03 13:17 PST, Alexey Proskuryakov
eric: review-
Details | Formatted Diff | Diff
updated patch (7.19 KB, patch)
2005-12-04 01:03 PST, Alexey Proskuryakov
darin: review-
Details | Formatted Diff | Diff
updated patch (7.26 KB, patch)
2005-12-04 07:06 PST, Alexey Proskuryakov
darin: review-
Details | Formatted Diff | Diff
updated patch (7.20 KB, patch)
2005-12-04 13:39 PST, Alexey Proskuryakov
no flags Details | Formatted Diff | Diff
updated patch (7.27 KB, patch)
2005-12-16 12:28 PST, Alexey Proskuryakov
darin: review+
Details | Formatted Diff | Diff
ChangeLog entry (968 bytes, text/plain)
2005-12-19 11:05 PST, Alexey Proskuryakov
no flags Details
archived test suite (1.71 KB, application/x-gzip)
2005-12-19 11:14 PST, Alexey Proskuryakov
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Kaye 2005-11-14 17:37:12 PST
goto to the following site :

1) http:idptest.bl.uk
2) Choose Show Catalogues from left menu.
3) Select "Enoki_1962" from catalogue list and click submit button.

4) The page will load and make note of the text in the bottom right area entitled Catalogue Entry. The 
text here is unicode text and onload is rendered correctly. Take note of the "e" character with the 
circumflex above it on the third line.

5) Click "C1" from the list on the left (Catalogue List). Now check the text on the right again and you will 
see it is nolonger encoded properly (check the same "e" character.

The data is extracted via a normal XMLHttpRequest and assigned via an div.innerHtml assignment so 
maybe the issue is with the assignment and not the load!

nb if site is not available try again later as it is a test server...
Comment 1 Michael Kaye 2005-11-14 17:38:09 PST
that should be http://idptest.bl.uk 
Comment 2 Alexey Proskuryakov 2005-11-20 08:27:29 PST
Does Firefox actually use the document encoding as a default for XMLHttpRequest encoding, or maybe it 
just defaults to UTF-8?
Comment 3 Alexey Proskuryakov 2005-11-28 14:15:22 PST
Yes, it looks like Firefox defaults to UTF-8 for responses, while Safari defaults to us-ascii.

I'm going to investigate this, as well as overriding encoding with overrideMimeType(), and also how 
req.responseXML is decoded.
Comment 4 Alexey Proskuryakov 2005-11-29 14:01:52 PST
Test case: <http://nypop.com/~ap/webkit/xmlhttprequestenc/xmlhttprequestenc.html>.

Works in Firefox; both Safari and Opera fail pretty spectacularly.
Comment 5 Alexey Proskuryakov 2005-11-30 13:50:07 PST
Created attachment 4885 [details]
proposed patch

Fixes a bunch of issues related to applying the correct encoding to
XMLHTTPRequest responses, including a complete inability to parse a response if
overrideMimeType('text/xml; charset=<some_charset>') was used.

There are still some FIXMEs concerning request encoding; I'm going to
investigate these separately.
Comment 6 Eric Seidel (no email) 2005-11-30 16:57:57 PST
Comment on attachment 4885 [details]
proposed patch

One style violation:
+    if (encoding.isNull() && job) {
+      encoding = job->queryMetaData("charset");
+    }

(no {} around one line ifs)


Are you sure you want
mimeType.isNull()
instead of:
mimeType.isEmpty()
?

pos += 7;
A comment such as // strlen("charset") might be helpful here.

Curious, did you adapt getCharset from somewhere?  It seems like functionality
which has to already exist somewhere else.
Comment 7 Alexey Proskuryakov 2005-11-30 21:37:21 PST
(In reply to comment #6)
> One style violation:
> (no {} around one line ifs)

  There's a sequence of ifs, some single-line, some not... To me, it's easier to read if they all folow a 
single style, even if it violates the guideline. Are you sure it's not a legitimate exception?

> Are you sure you want
> mimeType.isNull()
> instead of:
> mimeType.isEmpty()
> ?

  No, I'm not sure... The existing code uses both, and they appear equivalent here, but I may 
misunderstand something.

> A comment such as // strlen("charset") might be helpful here.

  Agreed. I'm not updating the patch just for this, but will add the comment after the more substantial 
concerns are resolved.

> Curious, did you adapt getCharset from somewhere?  It seems like functionality
> which has to already exist somewhere else.

  Yes, this worries me, too. I started from code in Decoder::decode(), but it doesn't look like it can be 
easily factored out and directly reused here. Another implementation is in Foundation ([NSURLResponse 
textEncodingName]), but I couldn't find a good way to reuse that, as well.
Comment 8 Alexey Proskuryakov 2005-12-02 13:40:52 PST
Created attachment 4911 [details]
proposed patch

- Updated to cleanly apply to ToT.
- Replaced isNull() with isEmpty(), for better or worse :).
- Removed an unrelated change that sneaked into the previous version.
- Addressed stylistic comments.
Comment 9 Darin Adler 2005-12-03 11:21:56 PST
Comment on attachment 4911 [details]
proposed patch

I suggest you should change getResponseHeader to return a QString or DOMString,
get rid of the JSLocks and change the getResponseHeader callers that need a
JavaScript string to call jsString explicitly.

Otherwise, this patch looks great.
Comment 10 Alexey Proskuryakov 2005-12-03 13:17:40 PST
Created attachment 4920 [details]
updated patch

Changed getResponseHeader to:

bool getResponseHeader(const QString& name, QString& value) const;
Comment 11 Eric Seidel (no email) 2005-12-03 15:34:59 PST
Comment on attachment 4920 [details]
updated patch

alexey and I talked about this extensively on IRC.  Mostly about ways to clean
up the getCharset() function and make it simpler.  I'm going to r- this patch
for now, and he'll post a better one tomorrow.
Comment 12 Alexey Proskuryakov 2005-12-04 01:03:29 PST
Created attachment 4925 [details]
updated patch

Slightly streamlined getCharset().
Comment 13 Alexey Proskuryakov 2005-12-04 01:13:49 PST
Comment on attachment 4925 [details]
updated patch

We have discussed code reuse in getCharset() with MacDome on IRC. Similar code
is available in khtml::Decoder and as an SPI in Foundation.

However, the Foundation version is much less permissive (doesn't even seem to
handle quoted values), and the Decoder one has somewhat different logic that
makes code reuse quite problematic IMO.

So, I'm proposing this version - while a longer-term solution may be to fix and
leverage that Foundation SPI. And/or to split out the Decoder logic to be
reusable.
Comment 14 Darin Adler 2005-12-04 04:29:12 PST
Comment on attachment 4925 [details]
updated patch

The algorithm in getCharset isn't quite right. It's not correct to look for
"charset" and then look for a "=" after it -- there might be the letter "x"
just before the string "charset", and this code would allow that. You should
fix that aspect of the getCharset function.

For getResponseHeader, I do think a separate bool is nice and clear, but there
is the distinct "null" value in QString that you could have used instead. I
think I'd prefer that because it makes the JavaScript version more consistent
with the C++ version.

Future direction: To fit in with the rest of the DOM bindings, the way this
class should work some day is that there is an XMLHttpRequest class that is not
part of the JavaScript bindings, but rather part of the "DOM" implementation.
Functions like getResponseHeader which are exported should both take and return
DOM strings, just like the functions in classes like DOM::NodeImpl and
DOM::DocumentImpl.
Comment 15 Alexey Proskuryakov 2005-12-04 07:06:16 PST
Created attachment 4937 [details]
updated patch

Addressed the comments (except for future direction).
Comment 16 Darin Adler 2005-12-04 12:39:17 PST
Comment on attachment 4937 [details]
updated patch

Still looks wrong to me. This looks for charset, and if it's not at the
beginning of a word, just returns from the entire function. That doesn't make
sense to me. I think maybe you want to do "continue" in that case to look for
the next occurence of the string "charset".

I also think the test for word boundary seems strange. Why include all control
characters, semicolon, and quote marks? This would parse this:

    ignoreString="charset=UTF-8"

as a charset option, and there's no reason to parse in a way that allows that.

The skip whitespace code actually skips all characters <= ' ' which includes
things other than whitespace.

There's no reason to make this function strange in these respects. It could be
coded to match RFC 2616 instead.
Comment 17 Alexey Proskuryakov 2005-12-04 13:39:12 PST
Created attachment 4948 [details]
updated patch

Sorry, stupid bugs (thought about writing "continue", but then forgot).

As for testing for <= ' ' - well, that's what khtml::Decoder does, and it
sounds like a good idea to make the logic similar. Actually, Decoder also
permits "xcharset"...
Comment 18 Darin Adler 2005-12-05 06:51:00 PST
Comment on attachment 4948 [details]
updated patch

If this function is going to share quirks with the decoder, then I think it
should be moved to be a static member function of Decoder so we can fix both at
the same time. I know you were discussing where it should go, and I think that
would be the right thing to do.

In general I'm not happy with these getMIMEType and getCharset functions.
getMIMEType splits the entire string into a list and then just uses the first
piece -- that's a bit inefficient since you really just want to search for the
first ";" and get the whitespace-stripped sequence up until that.

But then getCharset goes the other way and instead of parsing the
semicolon-separated string into pieces and looking at each one for charset, it
scans the entire string for the substring "charset" and then tries to figure
things out from there.

In both cases we could write something that was both correct and relatively
efficient.

But for this patch, I think we're OK. This is good enough ot fix the bug. Lets
land it. r=me
Comment 19 Alexey Proskuryakov 2005-12-05 13:07:24 PST
Filed Content-Type parsing cleanup as bug 5954.
Comment 20 Alexey Proskuryakov 2005-12-16 12:28:05 PST
Created attachment 5118 [details]
updated patch

Same as 4948, but updated for the recent ValueImp->JSValue renaming.
Comment 21 Darin Adler 2005-12-18 14:12:49 PST
Is there some kind of automated test we can do? Perhaps using XMLHttpRequest to load a local file?
Comment 22 Alexey Proskuryakov 2005-12-19 03:52:21 PST
(In reply to comment #21)
Obviously, only some aspects of the behavior could be tested with local files... And I'm not sure if 
expecting XMLHttpRequest to work with local files is reasonable (my quick test worked in ToT WebKit, but 
not in Firefox or Opera).
Comment 23 Michael Kaye 2005-12-19 04:33:16 PST
re testing, you can test against the now live site that originally had the problem.

goto to the following site :

1) http://idp.bl.uk
2) Choose Show Catalogues from left menu.
3) Select "Enoki_1962" from catalogue list and click submit button.

4) The page will load and make note of the text in the bottom right area entitled Catalogue Entry.

5) Click "C1..." from the list on the left (Catalogue List) and check whether the data is still encoded 
correctly
Comment 24 Alexey Proskuryakov 2005-12-19 11:05:37 PST
Created attachment 5155 [details]
ChangeLog entry
Comment 25 Alexey Proskuryakov 2005-12-19 11:14:09 PST
Created attachment 5156 [details]
archived test suite

To be served by Apache; don't overlook the .htaccess file.
Comment 26 Michael Kaye 2006-07-05 15:57:45 PDT
Hi I'm the original reporter of this bug. Although it was marked as fixed a while ago, I note the problem still remains with the latest version of Safari (2.0.4 (419.3)), Webkit and Mac OS X (10.4.7).

Should this be the case of have the fixes not been rolled into the latest builds.
Comment 27 Alexey Proskuryakov 2006-07-05 21:43:46 PDT
Yes, the fix is not in latest released MacOS yet. If I were to guess (Apple doesn't give details on such decisions), it might be partially because a workaround exists - as seen at <http://nypop.com/~ap/webkit/xmlhttprequestenc/xmlhttprequestenc.html>, some cases do work. You may try asking for a fix on Apple's release branch via <http://bugreport.apple.com>.

The fix should be present in OmniWeb 5.5 beta (again, no idea about its release date).
Comment 28 Michael Kaye 2006-07-06 04:10:37 PDT
Alexey,

Thanks for you comments. I have registered the bug with apple (hopefully they will action it).

Regards, Michael.