Bug 34653 - Charset parameter set by setRequestHeader as "KOI8-R" is NOT changed to UTF-8
Summary: Charset parameter set by setRequestHeader as "KOI8-R" is NOT changed to UTF-8
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: XML (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Chang Shu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-05 10:51 PST by sangeetha.sugavanam
Modified: 2010-08-31 17:36 PDT (History)
5 users (show)

See Also:


Attachments
Test files (1.55 KB, application/x-zip-compressed)
2010-02-05 11:01 PST, sangeetha.sugavanam
no flags Details
fix patch (2.28 KB, patch)
2010-03-11 11:33 PST, Chang Shu
ap: review-
Details | Formatted Diff | Diff
fix patch 2 (5.41 KB, patch)
2010-03-11 14:11 PST, Chang Shu
ap: review-
Details | Formatted Diff | Diff
fix patch 3 (5.45 KB, patch)
2010-03-12 07:42 PST, Chang Shu
ap: commit-queue-
Details | Formatted Diff | Diff
fix patch 4 (8.59 KB, patch)
2010-03-18 12:54 PDT, Chang Shu
no flags Details | Formatted Diff | Diff
fix patch 4 (8.57 KB, patch)
2010-03-18 12:58 PDT, Chang Shu
fishd: review-
Details | Formatted Diff | Diff
fix patch 5 (9.80 KB, patch)
2010-04-05 15:17 PDT, Chang Shu
fishd: review+
commit-queue: commit-queue-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description sangeetha.sugavanam 2010-02-05 10:51:20 PST
Steps to Reproduce:
1. Launch QtLauncher.
2. Load the webpage attached to the bug report(from the zip file).
3. Click the Button to execute the test.
4. Content-Type Header received from the server is displayed in the green box above the button.  

Expected Result:
===============
- Should display content-type and charset as "application/xml; charset=UTF-8".

Actual Result:
==============
- content-type and charset displayed as "application/xml; charset=KOI8-R".


Additional Info:
================
1. Test Assertion : Invoke Send with in parameter Data as Document - 
   If Content-Type header is set using setRequestHeader() and its value is not
malformed, 
   set the charset parameter of the header, by either changing the charset
parameter(if one is present) 
   or appending one, to the  encoding used to encode the document

2. The encoding used to encode the document is UTF-8


3. W3c spec link -
  
http://dev.w3.org/cvsweb/~checkout~/2006/webapi/XMLHttpRequest/Overview.html?rev=1.209&content-type=text/html;%20charset=iso-8859-1

4. Pls refer : Section 4.6.3 - Alogorithm 3 under "data is a Document"
Comment 1 sangeetha.sugavanam 2010-02-05 11:01:21 PST
Created attachment 48243 [details]
Test files
Comment 2 Alexey Proskuryakov 2010-02-05 21:52:13 PST
> - content-type and charset displayed as "application/xml; charset=KOI8-R".

The attached test doesn't mention KOI8-R anywhere. Is it the right test?
Comment 3 Tor Arne Vestbø 2010-03-10 06:30:38 PST
Please follow the QtWebKit bug reporting guidelines when reporting bugs.

See http://trac.webkit.org/wiki/QtWebKitBugs

Specifically:

  - The 'QtWebKit' component should only be used for bugs/features in the
    public QtWebKit API layer, not to signify that the bug is specific to
    the Qt port of WebKit

      http://trac.webkit.org/wiki/QtWebKitBugs#Component

  - Add the keyword 'Qt' to signal that it's a Qt-related bug

      http://trac.webkit.org/wiki/QtWebKitBugs#Keywords
Comment 4 Chang Shu 2010-03-11 11:33:49 PST
Created attachment 50520 [details]
fix patch
Comment 5 Chang Shu 2010-03-11 11:35:23 PST
(In reply to comment #2)
> > - content-type and charset displayed as "application/xml; charset=KOI8-R".
> 
> The attached test doesn't mention KOI8-R anywhere. Is it the right test?

KIO8-R can be found in the attached htm file around line 54:
client.setRequestHeader("Content-Type", "text/plain;charset=KOI8-R");
Comment 6 WebKit Review Bot 2010-03-11 11:36:30 PST
Attachment 50520 [details] did not pass style-queue:

Failed to run "WebKitTools/Scripts/check-webkit-style" exit_code: 1
WebCore/xml/XMLHttpRequest.cpp:435:  An else should appear on the same line as the preceding }  [whitespace/newline] [4]
Total errors found: 1 in 2 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 7 Alexey Proskuryakov 2010-03-11 11:58:43 PST
Comment on attachment 50520 [details]
fix patch

> KIO8-R can be found in the attached htm file around line 54:
>client.setRequestHeader("Content-Type", "text/plain;charset=KOI8-R");

Sorry, I was probably looking at some other test.

This bug has all Qt flags set on it, but the change is in cross-platform code. Please correct the bug's component and keywords.

Whether we want to make this change depends on what Firefox and IE do, not just on what the spec says. Please research their behavior, and post the results in the bug.

+        }
+        else {

The should be on the same line.

+            contentType = contentType.stripWhiteSpace();
+            int start = contentType.find("charset", 0);

HTTP header modification should abstracted out into a separate function. We don't have a great place to put it currently - HTTPParsers.{h,cpp} is the closest, but it's only about parsing. Maybe you should just make it a static funciton in XMLHttpRequest.cpp for now.

-                setRequestHeaderInternal("Content-Type", "application/xml");
+                setRequestHeaderInternal("Content-Type", "text/plain;charset=UTF-8");

This is a change that is not asked for in this bug, and one that is not explained in ChangeLog. Is there a reason you are making it in the same patch? It will need to be discussed on its own.

+        No new tests have been created. The orginal test case, 
+        http://waplabdc.nokia-boston.com/browser/users/akancher/sf_XHR/Send/Send_strData_Charset_Not_UTF8.htm,
+        requires ALL-HTTP echoing back from an asp server. However, cgi does not support this.

It is certainly possible to access a request Content-Type from a cgi, see e.g. LayoutTests/http/tests/xmlhttprequest/print-content-type.cgi.
Comment 8 Alexey Proskuryakov 2010-03-11 12:00:30 PST
> This bug has all Qt flags set on it, but the change is in cross-platform code.
> Please correct the bug's component and keywords.

I see that you did this while I was reviewing. Also changing platform now.
Comment 9 Chang Shu 2010-03-11 12:05:06 PST
> 
> +            contentType = contentType.stripWhiteSpace();
> +            int start = contentType.find("charset", 0);
> 
> HTTP header modification should abstracted out into a separate function. We
> don't have a great place to put it currently - HTTPParsers.{h,cpp} is the
> closest, but it's only about parsing. Maybe you should just make it a static
> funciton in XMLHttpRequest.cpp for now.

Sure. will do.

> 
> -                setRequestHeaderInternal("Content-Type", "application/xml");
> +                setRequestHeaderInternal("Content-Type",
> "text/plain;charset=UTF-8");
> 
> This is a change that is not asked for in this bug, and one that is not
> explained in ChangeLog. Is there a reason you are making it in the same patch?
> It will need to be discussed on its own.

Right. I planned to take this line out but forgot to do it while submitting the patch. I understand this is addressed in another bug
https://bugs.webkit.org/show_bug.cgi?id=11049

> 
> +        No new tests have been created. The orginal test case, 
> +       
> http://waplabdc.nokia-boston.com/browser/users/akancher/sf_XHR/Send/Send_strData_Charset_Not_UTF8.htm,
> +        requires ALL-HTTP echoing back from an asp server. However, cgi does
> not support this.
> 
> It is certainly possible to access a request Content-Type from a cgi, see e.g.
> LayoutTests/http/tests/xmlhttprequest/print-content-type.cgi.

wow, I will create a new test case then. thanks!
Comment 10 Chang Shu 2010-03-11 14:11:24 PST
Created attachment 50538 [details]
fix patch 2
Comment 11 Alexey Proskuryakov 2010-03-11 14:40:46 PST
Comment on attachment 50538 [details]
fix patch 2

+            m_requestHeaders.remove("Content-Type");
+            m_requestHeaders.set("Content-Type", contentType);

Is there actually necessary to remove before setting?

+    req.open("POST", "print-content-type.cgi", false);

This can't work - the cgi script is in resources/ subdirectory. Please fix and make the test resilient against such errors.

This can't go in without an answer to my question about other browsers' behavior.
Comment 12 Chang Shu 2010-03-11 19:03:32 PST
> +    req.open("POST", "print-content-type.cgi", false);
> 
> This can't work - the cgi script is in resources/ subdirectory. Please fix and
> make the test resilient against such errors.

For some reason, this cgi script is under xmlhttprequest, along with a couple of others. Most other cgi scripts are under xmlhttprequest/resources indeed.

> 
> This can't go in without an answer to my question about other browsers'
> behavior.

I tried on Firefox and IE. Firefox shows UTF-8, which is as specified, but IE shows KOI8-R.
Comment 13 Alexey Proskuryakov 2010-03-11 19:58:03 PST
> For some reason, this cgi script is under xmlhttprequest, along with a couple

Indeed it is!
Comment 14 Alexey Proskuryakov 2010-03-11 20:00:39 PST
OK. Siding with Firefox seems fine here.

Does it match the behavior of this patch in all these cases?

- setRequestHeader wasn't called;
- setRequestHeader was called, and Content-Type without a charset was passed;
- setRequestHeader was called, and Content-Type with a different charset was passed.
Comment 15 Chang Shu 2010-03-12 06:49:52 PST
Thanks for the comments! Please see my findings below:

> Does it match the behavior of this patch in all these cases?
> 
> - setRequestHeader wasn't called;

This is not covered by this patch. The current webkit behavior is showing "application/xml". But Firefox shows "text/plain; charset=UTF-8".

> - setRequestHeader was called, and Content-Type without a charset was passed;
> - setRequestHeader was called, and Content-Type with a different charset was
> passed.

Yes, after my patch, webkit shows "text/plain; charset=UTF-8" in both cases and so does firefox. I will go ahead submit the new patch after removing the "remove" line. Thanks!
Comment 16 Chang Shu 2010-03-12 07:42:26 PST
Created attachment 50598 [details]
fix patch 3
Comment 17 Alexey Proskuryakov 2010-03-12 13:42:38 PST
Comment on attachment 50598 [details]
fix patch 3

1. We usually don't fix coding style for code that isn't touched by a change, like isValidToken here. But it's fine to keep these, I'm just commenting in general.

2. "setCharsetinContentType" should be "setCharsetInMediaType". Note the capitalization, and matching the naming of existing functions.

3. This function doesn't do what it says. Instead of setting charset, it re-creates the media type string, potentially losing lots of other parameters.

#2 must be fixed before landing. But I can't decide if #3 is important enough to be fixed right away. This part regresses behavior and doesn't match the spec, but it may be harmless in practice.
Comment 18 Chang Shu 2010-03-12 13:55:40 PST
(In reply to comment #17)
> (From update of attachment 50598 [details])
> 1. We usually don't fix coding style for code that isn't touched by a change,
> like isValidToken here. But it's fine to keep these, I'm just commenting in
> general.

I was running check-webkit-style against the cpp file. So looks like I should run against the patch file.

> 
> 2. "setCharsetinContentType" should be "setCharsetInMediaType". Note the
> capitalization, and matching the naming of existing functions.

Sure.

> 3. This function doesn't do what it says. Instead of setting charset, it
> re-creates the media type string, potentially losing lots of other parameters.

Yeah, I assumed the media type is composed of two parts. Just wanted to simplified the code, otherwise I have to almost copy the whole extractcharsetfrommediatype over.

> 
> #2 must be fixed before landing. But I can't decide if #3 is important enough
> to be fixed right away. This part regresses behavior and doesn't match the
> spec, but it may be harmless in practice.

Is it ok I don't touch #3 in the next patch?
Comment 19 Alexey Proskuryakov 2010-03-12 14:14:22 PST
> Is it ok I don't touch #3 in the next patch?

I don't understand the question. But anyway, I'd like someone who is more familiar with real world HTTP usage (Darin Fisher?) to make this call.
Comment 20 Chang Shu 2010-03-12 14:18:44 PST
(In reply to comment #19)
> > Is it ok I don't touch #3 in the next patch?
> 
> I don't understand the question. But anyway, I'd like someone who is more
> familiar with real world HTTP usage (Darin Fisher?) to make this call.

I asked if I really needed to re-implement the function. I will wait for Darin Fisher' comments then. Thanks,
Comment 21 Darin Fisher (:fishd, Google) 2010-03-12 19:42:48 PST
Yeah, completely overriding the Content-Type header seems risky to me.  That said, I don't know of any real world examples where someone is using custom modifiers in the Content-Type header, but there is nothing to prevent someone from doing so today.

Does Firefox also set the charset on the request header?  How do they handle this case?
Comment 22 Chang Shu 2010-03-15 08:14:29 PDT
(In reply to comment #21)
> Yeah, completely overriding the Content-Type header seems risky to me.  That
> said, I don't know of any real world examples where someone is using custom
> modifiers in the Content-Type header, but there is nothing to prevent someone
> from doing so today.
> 
> Does Firefox also set the charset on the request header?  How do they handle
> this case?

Thanks for the comment, Darin. I have no problem to reimplement the setcharset function. I guess we should always favor a solid implementation over a simplified one. :)

From the test result, I can see Firefox always set the charset on the request header. I also read the code: https://hg.mozilla.org/mozilla-central/file/249428fdc4a8/content/base/src/nsXMLHttpRequest.cpp and believe they modified the header based on certain logic. However, since I didn't debug through the code, I wasn't 100% sure if we match their implementation exactly.
Comment 23 Chang Shu 2010-03-18 12:54:53 PDT
Created attachment 51076 [details]
fix patch 4
Comment 24 WebKit Review Bot 2010-03-18 12:55:42 PDT
Attachment 51076 [details] did not pass style-queue:

Failed to run "WebKitTools/Scripts/check-webkit-style" exit_code: 1
WebCore/xml/XMLHttpRequest.cpp:136:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
WebCore/xml/XMLHttpRequest.cpp:141:  One line control clauses should not use braces.  [whitespace/braces] [4]
Total errors found: 2 in 7 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 25 Chang Shu 2010-03-18 12:58:19 PDT
Created attachment 51079 [details]
fix patch 4
Comment 26 Alexey Proskuryakov 2010-03-18 13:00:04 PDT
Note that XHR spec explicitly says that all occurrences of charset in media type string should be replaced.
Comment 27 Darin Fisher (:fishd, Google) 2010-04-01 23:53:23 PDT
Comment on attachment 51079 [details]
fix patch 4

Sorry for the super slow turnaround on this code review.


> Index: WebCore/platform/network/HTTPParsers.cpp
...
> +void findCharsetInMediaType(const String& mediaType, int& charsetStart, int& charsetLen)
> +{
> +    charsetStart = 0;
> +    charsetLen = 0;
> +
>      int pos = 0;
>      int length = (int)mediaType.length();
>      
>      while (pos < length) {
>          pos = mediaType.find("charset", pos, false);
> +        if (pos <= 0) {
> +            charsetLen = 0;
> +            return;
> +        }
>          
>          // is what we found a beginning of a word?
>          if (mediaType[pos-1] > ' ' && mediaType[pos-1] != ';') {
> @@ -214,10 +226,12 @@ String extractCharsetFromMediaType(const
>          int endpos = pos;
>          while (pos != length && mediaType[endpos] > ' ' && mediaType[endpos] != '"' && mediaType[endpos] != '\'' && mediaType[endpos] != ';')
>              ++endpos;
> +
> +        charsetStart = pos;
> +        charsetLen = endpos - pos;
> +        return;
>      }
> +
> +    charsetLen = 0;
>  }

^^^ it looks likes it is unnecessary to assign 0 here since contentLen is
initialized to 0 at the top of the function.


> Index: WebCore/xml/XMLHttpRequest.cpp
> ===================================================================
> --- WebCore/xml/XMLHttpRequest.cpp	(revision 56125)
> +++ WebCore/xml/XMLHttpRequest.cpp	(working copy)
> @@ -128,6 +128,31 @@ static bool isSetCookieHeader(const Atom
>      return equalIgnoringCase(name, "set-cookie") || equalIgnoringCase(name, "set-cookie2");
>  }
>  
> +static void setCharsetInMediaType(String& mediaType, const String& charsetValue)
> +{
> +    int start, len;
> +    findCharsetInMediaType(mediaType, start, len);
> +
> +    if (!len) {
> +        // When no charset found, append new charset.
> +        mediaType.stripWhiteSpace();
> +        if (mediaType[mediaType.length() - 1] != ';')
> +            mediaType.append(";");
> +        mediaType.append(" charset=");
> +        mediaType.append(charsetValue);
> +    } else {
> +        // Found existing charset, replace with new charset.
> +        String newMediaType = mediaType.substring(0, start);
> +        newMediaType.append(charsetValue);
> +
> +        if (start + len < (int)mediaType.length()) {

^^^ nit: please use static_cast<int>(...) instead of the C-style cast.


> +            newMediaType.append("; ");
> +            newMediaType.append(mediaType.substring(start + len));

maybe it would be better to use the String::replace method?

another approach here might be to strip out all occurrences of a charset
attribute and then append the desired charset attribute.


> Index: LayoutTests/http/tests/xmlhttprequest/request-encoding2.html
...
> +    function test1()
> +    {
> +        req = new XMLHttpRequest;
> +        req.open("POST", "print-content-type.cgi", false);
> +
> +        req.send("");
> +
> +        document.getElementById("result1").firstChild.data = "Test1 (setRequestHeader was not called):";
> +        if (req.responseText == "application/xml\n")

Can you explain why this case does not have an implicit charset?
Comment 28 Darin Fisher (:fishd, Google) 2010-04-01 23:54:22 PDT
Please also include tests for the case where there are other Content-Type attributes bracketing the charset attribute since the code attempts to deal with such.
Comment 29 Chang Shu 2010-04-02 08:36:46 PDT
(In reply to comment #28)
> Please also include tests for the case where there are other Content-Type
> attributes bracketing the charset attribute since the code attempts to deal
> with such.

Thanks for the review, Darin. I will submit a new patch based on your comments.
For the above comment, I am not sure how to come up with a real-life case with attributes following charset. Can you give me an example, or I just use an imaginary content string, like content="text/html; charset=iso-8859-1; future-extension"?
Comment 30 Alexey Proskuryakov 2010-04-02 08:47:38 PDT
>> +        if (req.responseText == "application/xml\n")
>Can you explain why this case does not have an implicit charset?

Perhaps because application/xml defaults to UTF-8 anyway, and since browsers didn't send charset before, it's safer not to send one. I don't know what Firefox does now though.
Comment 31 Darin Fisher (:fishd, Google) 2010-04-02 08:57:15 PDT
(In reply to comment #30)
> >> +        if (req.responseText == "application/xml\n")
> >Can you explain why this case does not have an implicit charset?
> 
> Perhaps because application/xml defaults to UTF-8 anyway, and since browsers
> didn't send charset before, it's safer not to send one. I don't know what
> Firefox does now though.

Well, I noticed that with this patch if the user sets a Content-Type header without an explicit charset, we will add a charset attribute, but if they do not set a Content-Type header at all, then we will not add a charset attribute.  Is that how Firefox behaves?  Doesn't it seem a bit odd to be inconsistent in these two cases?  I agree that excluding UTF-8 shouldn't really matter though.
Comment 32 Darin Fisher (:fishd, Google) 2010-04-02 08:58:11 PDT
(In reply to comment #29)
...
> attributes following charset. Can you give me an example, or I just use an
> imaginary content string, like content="text/html; charset=iso-8859-1;
> future-extension"?

Yeah, making up a fake attribute was my intent.  That way you can test some of the cases the code is attempting to handle.
Comment 33 Chang Shu 2010-04-05 15:17:17 PDT
Created attachment 52579 [details]
fix patch 5

This patch implements multiple charset and extra fields handling.
Comment 34 Chang Shu 2010-04-05 15:18:47 PDT
(In reply to comment #31)
> (In reply to comment #30)
> > >> +        if (req.responseText == "application/xml\n")
> > >Can you explain why this case does not have an implicit charset?
> > 
> > Perhaps because application/xml defaults to UTF-8 anyway, and since browsers
> > didn't send charset before, it's safer not to send one. I don't know what
> > Firefox does now though.
> 
> Well, I noticed that with this patch if the user sets a Content-Type header
> without an explicit charset, we will add a charset attribute, but if they do
> not set a Content-Type header at all, then we will not add a charset attribute.
>  Is that how Firefox behaves?  Doesn't it seem a bit odd to be inconsistent in
> these two cases?  I agree that excluding UTF-8 shouldn't really matter though.

This can probably be addressed in this bug:
https://bugs.webkit.org/show_bug.cgi?id=11049
Comment 35 WebKit Commit Bot 2010-04-13 14:39:22 PDT
Comment on attachment 52579 [details]
fix patch 5

Rejecting patch 52579 from commit-queue.

Failed to run "['/Users/eseidel/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', '--reviewer', 'Darin Fisher', '--force']" exit_code: 1
Last 500 characters of output:
form/network/HTTPParsers.h
Hunk #1 FAILED at 48.
1 out of 1 hunk FAILED -- saving rejects to file WebCore/platform/network/HTTPParsers.h.rej
patching file WebCore/xml/XMLHttpRequest.cpp
Hunk #1 succeeded at 129 (offset 1 line).
Hunk #2 succeeded at 456 (offset 1 line).
patching file LayoutTests/ChangeLog
Hunk #1 succeeded at 1 with fuzz 3.
patching file LayoutTests/http/tests/xmlhttprequest/request-encoding2-expected.txt
patching file LayoutTests/http/tests/xmlhttprequest/request-encoding2.html

Full output: http://webkit-commit-queue.appspot.com/results/1584466
Comment 36 Chang Shu 2010-04-13 16:08:41 PDT
The commit was rejected due to conflict on merging.
I manually landed the patch after resolving conflict.

Committed revision: 57544
Comment 38 Chang Shu 2010-04-13 20:27:22 PDT
(In reply to comment #37)
> This broke tests on the build bots, such as
> <http://build.webkit.org/results/Tiger%20Intel%20Release/r57547%20(10780)/http/tests/xmlhttprequest/access-control-basic-whitelist-request-headers-pretty-diff.html>.

Hi, Mark, I see you have updated the test results for me. Thanks!
Comment 39 Alexey Proskuryakov 2010-08-31 17:36:35 PDT
This has caused bug 40947.