Bug 5744

Summary:

XMLHttpRequest does not apply page encoding after assigning via innerHtml

Product:

WebKit

Reporter:

Michael Kaye <mk>

Component:

WebKit Misc.

Assignee:

Alexey Proskuryakov <ap>

Status:

RESOLVED FIXED

Severity:

Major

CC:

mitz

Priority:

Version:

420+

Hardware:

Mac

OS:

OS X 10.4

URL:

http://idptest.bl.uk

Attachments:

Description	Flags
proposed patch	none
proposed patch	darin: review-
updated patch	eric: review-
updated patch	darin: review-
updated patch	darin: review-
updated patch	none
updated patch	darin: review+
ChangeLog entry	none
archived test suite	none

Michael Kaye

Reported 2005-11-14 17:37:12 PST

goto to the following site : 1) http:idptest.bl.uk 2) Choose Show Catalogues from left menu. 3) Select "Enoki_1962" from catalogue list and click submit button. 4) The page will load and make note of the text in the bottom right area entitled Catalogue Entry. The text here is unicode text and onload is rendered correctly. Take note of the "e" character with the circumflex above it on the third line. 5) Click "C1" from the list on the left (Catalogue List). Now check the text on the right again and you will see it is nolonger encoded properly (check the same "e" character. The data is extracted via a normal XMLHttpRequest and assigned via an div.innerHtml assignment so maybe the issue is with the assignment and not the load! nb if site is not available try again later as it is a test server...

Attachments
proposed patch (5.82 KB, patch) 2005-11-30 13:50 PST, Alexey Proskuryakov	no flags	Details Formatted Diff Diff
proposed patch (5.80 KB, patch) 2005-12-02 13:40 PST, Alexey Proskuryakov	darin: review-	Details Formatted Diff Diff
updated patch (7.17 KB, patch) 2005-12-03 13:17 PST, Alexey Proskuryakov	eric: review-	Details Formatted Diff Diff
updated patch (7.19 KB, patch) 2005-12-04 01:03 PST, Alexey Proskuryakov	darin: review-	Details Formatted Diff Diff
updated patch (7.26 KB, patch) 2005-12-04 07:06 PST, Alexey Proskuryakov	darin: review-	Details Formatted Diff Diff
updated patch (7.20 KB, patch) 2005-12-04 13:39 PST, Alexey Proskuryakov	no flags	Details Formatted Diff Diff
updated patch (7.27 KB, patch) 2005-12-16 12:28 PST, Alexey Proskuryakov	darin: review+	Details Formatted Diff Diff
ChangeLog entry (968 bytes, text/plain) 2005-12-19 11:05 PST, Alexey Proskuryakov	no flags	Details
archived test suite (1.71 KB, application/x-gzip) 2005-12-19 11:14 PST, Alexey Proskuryakov	no flags	Details
Show Obsolete (6) View All Add attachment proposed patch, testcase, etc.

Michael Kaye

Comment 1 2005-11-14 17:38:09 PST

that should be http://idptest.bl.uk

Alexey Proskuryakov

Comment 2 2005-11-20 08:27:29 PST

Does Firefox actually use the document encoding as a default for XMLHttpRequest encoding, or maybe it just defaults to UTF-8?

Alexey Proskuryakov

Comment 3 2005-11-28 14:15:22 PST

Yes, it looks like Firefox defaults to UTF-8 for responses, while Safari defaults to us-ascii. I'm going to investigate this, as well as overriding encoding with overrideMimeType(), and also how req.responseXML is decoded.

Alexey Proskuryakov

Comment 4 2005-11-29 14:01:52 PST

Test case: <http://nypop.com/~ap/webkit/xmlhttprequestenc/xmlhttprequestenc.html>. Works in Firefox; both Safari and Opera fail pretty spectacularly.

Alexey Proskuryakov

Comment 5 2005-11-30 13:50:07 PST

Created attachment 4885 [details] proposed patch Fixes a bunch of issues related to applying the correct encoding to XMLHTTPRequest responses, including a complete inability to parse a response if overrideMimeType('text/xml; charset=<some_charset>') was used. There are still some FIXMEs concerning request encoding; I'm going to investigate these separately.

Eric Seidel (no email)

Comment 6 2005-11-30 16:57:57 PST

Comment on attachment 4885 [details] proposed patch One style violation: + if (encoding.isNull() && job) { + encoding = job->queryMetaData("charset"); + } (no {} around one line ifs) Are you sure you want mimeType.isNull() instead of: mimeType.isEmpty() ? pos += 7; A comment such as // strlen("charset") might be helpful here. Curious, did you adapt getCharset from somewhere? It seems like functionality which has to already exist somewhere else.

Alexey Proskuryakov

Comment 7 2005-11-30 21:37:21 PST

(In reply to comment #6) > One style violation: > (no {} around one line ifs) There's a sequence of ifs, some single-line, some not... To me, it's easier to read if they all folow a single style, even if it violates the guideline. Are you sure it's not a legitimate exception? > Are you sure you want > mimeType.isNull() > instead of: > mimeType.isEmpty() > ? No, I'm not sure... The existing code uses both, and they appear equivalent here, but I may misunderstand something. > A comment such as // strlen("charset") might be helpful here. Agreed. I'm not updating the patch just for this, but will add the comment after the more substantial concerns are resolved. > Curious, did you adapt getCharset from somewhere? It seems like functionality > which has to already exist somewhere else. Yes, this worries me, too. I started from code in Decoder::decode(), but it doesn't look like it can be easily factored out and directly reused here. Another implementation is in Foundation ([NSURLResponse textEncodingName]), but I couldn't find a good way to reuse that, as well.

Alexey Proskuryakov

Comment 8 2005-12-02 13:40:52 PST

Created attachment 4911 [details] proposed patch - Updated to cleanly apply to ToT. - Replaced isNull() with isEmpty(), for better or worse :). - Removed an unrelated change that sneaked into the previous version. - Addressed stylistic comments.

Darin Adler

Comment 9 2005-12-03 11:21:56 PST

Comment on attachment 4911 [details] proposed patch I suggest you should change getResponseHeader to return a QString or DOMString, get rid of the JSLocks and change the getResponseHeader callers that need a JavaScript string to call jsString explicitly. Otherwise, this patch looks great.

Alexey Proskuryakov

Comment 10 2005-12-03 13:17:40 PST

Created attachment 4920 [details] updated patch Changed getResponseHeader to: bool getResponseHeader(const QString& name, QString& value) const;

Eric Seidel (no email)

Comment 11 2005-12-03 15:34:59 PST

Comment on attachment 4920 [details] updated patch alexey and I talked about this extensively on IRC. Mostly about ways to clean up the getCharset() function and make it simpler. I'm going to r- this patch for now, and he'll post a better one tomorrow.

Alexey Proskuryakov

Comment 12 2005-12-04 01:03:29 PST

Created attachment 4925 [details] updated patch Slightly streamlined getCharset().

Alexey Proskuryakov

Comment 13 2005-12-04 01:13:49 PST

Comment on attachment 4925 [details] updated patch We have discussed code reuse in getCharset() with MacDome on IRC. Similar code is available in khtml::Decoder and as an SPI in Foundation. However, the Foundation version is much less permissive (doesn't even seem to handle quoted values), and the Decoder one has somewhat different logic that makes code reuse quite problematic IMO. So, I'm proposing this version - while a longer-term solution may be to fix and leverage that Foundation SPI. And/or to split out the Decoder logic to be reusable.

Darin Adler

Comment 14 2005-12-04 04:29:12 PST

Comment on attachment 4925 [details] updated patch The algorithm in getCharset isn't quite right. It's not correct to look for "charset" and then look for a "=" after it -- there might be the letter "x" just before the string "charset", and this code would allow that. You should fix that aspect of the getCharset function. For getResponseHeader, I do think a separate bool is nice and clear, but there is the distinct "null" value in QString that you could have used instead. I think I'd prefer that because it makes the JavaScript version more consistent with the C++ version. Future direction: To fit in with the rest of the DOM bindings, the way this class should work some day is that there is an XMLHttpRequest class that is not part of the JavaScript bindings, but rather part of the "DOM" implementation. Functions like getResponseHeader which are exported should both take and return DOM strings, just like the functions in classes like DOM::NodeImpl and DOM::DocumentImpl.

Alexey Proskuryakov

Comment 15 2005-12-04 07:06:16 PST

Created attachment 4937 [details] updated patch Addressed the comments (except for future direction).

Darin Adler

Comment 16 2005-12-04 12:39:17 PST

Comment on attachment 4937 [details] updated patch Still looks wrong to me. This looks for charset, and if it's not at the beginning of a word, just returns from the entire function. That doesn't make sense to me. I think maybe you want to do "continue" in that case to look for the next occurence of the string "charset". I also think the test for word boundary seems strange. Why include all control characters, semicolon, and quote marks? This would parse this: ignoreString="charset=UTF-8" as a charset option, and there's no reason to parse in a way that allows that. The skip whitespace code actually skips all characters <= ' ' which includes things other than whitespace. There's no reason to make this function strange in these respects. It could be coded to match RFC 2616 instead.

Alexey Proskuryakov

Comment 17 2005-12-04 13:39:12 PST

Created attachment 4948 [details] updated patch Sorry, stupid bugs (thought about writing "continue", but then forgot). As for testing for <= ' ' - well, that's what khtml::Decoder does, and it sounds like a good idea to make the logic similar. Actually, Decoder also permits "xcharset"...

Darin Adler

Comment 18 2005-12-05 06:51:00 PST

Comment on attachment 4948 [details] updated patch If this function is going to share quirks with the decoder, then I think it should be moved to be a static member function of Decoder so we can fix both at the same time. I know you were discussing where it should go, and I think that would be the right thing to do. In general I'm not happy with these getMIMEType and getCharset functions. getMIMEType splits the entire string into a list and then just uses the first piece -- that's a bit inefficient since you really just want to search for the first ";" and get the whitespace-stripped sequence up until that. But then getCharset goes the other way and instead of parsing the semicolon-separated string into pieces and looking at each one for charset, it scans the entire string for the substring "charset" and then tries to figure things out from there. In both cases we could write something that was both correct and relatively efficient. But for this patch, I think we're OK. This is good enough ot fix the bug. Lets land it. r=me

Alexey Proskuryakov

Comment 19 2005-12-05 13:07:24 PST

Filed Content-Type parsing cleanup as bug 5954.

Alexey Proskuryakov

Comment 20 2005-12-16 12:28:05 PST

Created attachment 5118 [details] updated patch Same as 4948, but updated for the recent ValueImp->JSValue renaming.

Darin Adler

Comment 21 2005-12-18 14:12:49 PST

Is there some kind of automated test we can do? Perhaps using XMLHttpRequest to load a local file?

Alexey Proskuryakov

Comment 22 2005-12-19 03:52:21 PST

(In reply to comment #21) Obviously, only some aspects of the behavior could be tested with local files... And I'm not sure if expecting XMLHttpRequest to work with local files is reasonable (my quick test worked in ToT WebKit, but not in Firefox or Opera).

Michael Kaye

Comment 23 2005-12-19 04:33:16 PST

re testing, you can test against the now live site that originally had the problem. goto to the following site : 1) http://idp.bl.uk 2) Choose Show Catalogues from left menu. 3) Select "Enoki_1962" from catalogue list and click submit button. 4) The page will load and make note of the text in the bottom right area entitled Catalogue Entry. 5) Click "C1..." from the list on the left (Catalogue List) and check whether the data is still encoded correctly

Alexey Proskuryakov

Comment 24 2005-12-19 11:05:37 PST

Created attachment 5155 [details] ChangeLog entry

Alexey Proskuryakov

Comment 25 2005-12-19 11:14:09 PST

Created attachment 5156 [details] archived test suite To be served by Apache; don't overlook the .htaccess file.

Michael Kaye

Comment 26 2006-07-05 15:57:45 PDT

Hi I'm the original reporter of this bug. Although it was marked as fixed a while ago, I note the problem still remains with the latest version of Safari (2.0.4 (419.3)), Webkit and Mac OS X (10.4.7). Should this be the case of have the fixes not been rolled into the latest builds.

Alexey Proskuryakov

Comment 27 2006-07-05 21:43:46 PDT

Yes, the fix is not in latest released MacOS yet. If I were to guess (Apple doesn't give details on such decisions), it might be partially because a workaround exists - as seen at <http://nypop.com/~ap/webkit/xmlhttprequestenc/xmlhttprequestenc.html>, some cases do work. You may try asking for a fix on Apple's release branch via <http://bugreport.apple.com>. The fix should be present in OmniWeb 5.5 beta (again, no idea about its release date).

Michael Kaye

Comment 28 2006-07-06 04:10:37 PDT

Alexey, Thanks for you comments. I have registered the bug with apple (hopefully they will action it). Regards, Michael.

Note You need to log in before you can comment on or make changes to this bug.