87351 – [BlackBerry] Cookie and Location header should be converted to latin-1/utf-8 in the same place.

RESOLVED INVALID 87351

https://bugs.webkit.org/show_bug.cgi?id=87351

Jason Liu

Reported 2012-05-24 00:50:32 PDT

Other headers may be set as UTF-8 too, so I think we should check all headers.

Attachments
Patch (4.86 KB, patch) 2012-05-24 02:05 PDT, Jason Liu, abarth: review-

Jason Liu

Comment 1 2012-05-24 02:05:55 PDT

Created attachment 143764 [details] Patch

Jason Liu

Comment 2 2012-05-24 02:11:25 PDT

Hi Joe, we need to talk about this issue.

Joe Mason

Comment 3 2012-05-24 08:31:33 PDT

I definitely think this change is a good idea, since the main objection to doing it for all headers was cookies, and it turns out we do want to do it for cookies. Most well-defined headers (such as authentication) are defined to take only ASCII anyway. The only thing I'm not sure about is that this patch still does the Latin-1 or UTF-8 check in two different places, and does it in a different way each time. Can we just use fromUTF8WithLatin1Fallback in initializePlatformRequest as well?
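For readers unfamiliar with the "UTF-8 or Latin-1" check discussed here, the following is a minimal standalone sketch of the general technique: treat the bytes as UTF-8 if they are well-formed, otherwise reinterpret them as Latin-1. It is not the WebKit implementation of fromUTF8WithLatin1Fallback; the function names `isValidUtf8`, `latin1ToUtf8`, and `decodeUtf8WithLatin1Fallback` are hypothetical, and the result is returned as UTF-8 bytes rather than a WebKit String.

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8. Rejects overlong forms,
// surrogates, and code points above U+10FFFF.
bool isValidUtf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {
            ++i;
            continue;
        }
        std::size_t extra = 0;
        unsigned long codePoint = 0;
        unsigned long minimum = 0;
        if ((lead & 0xE0) == 0xC0) {
            extra = 1; codePoint = lead & 0x1F; minimum = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; codePoint = lead & 0x0F; minimum = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; codePoint = lead & 0x07; minimum = 0x10000;
        } else {
            return false; // stray continuation byte or invalid lead byte
        }
        if (bytes.size() - i <= extra)
            return false; // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;
            codePoint = (codePoint << 6) | (c & 0x3F);
        }
        if (codePoint < minimum || codePoint > 0x10FFFF
            || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

// Re-encodes Latin-1 bytes (one byte per character) as UTF-8.
std::string latin1ToUtf8(const std::string& bytes)
{
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Keeps well-formed UTF-8 as is; otherwise assumes the input is Latin-1.
std::string decodeUtf8WithLatin1Fallback(const std::string& bytes)
{
    return isValidUtf8(bytes) ? bytes : latin1ToUtf8(bytes);
}
```

Under this sketch, a lone byte 0xE9 is not valid UTF-8, so it is treated as Latin-1 "é" and re-encoded as 0xC3 0xA9, while input that is already valid UTF-8 passes through unchanged.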

Adam Barth

Comment 4 2012-05-24 10:36:18 PDT

Comment on attachment 143764 [details] Patch

This patch is wrong. HTTP headers are not UTF-8.

Joe Mason

Comment 5 2012-05-24 11:28:00 PDT

(In reply to comment #4)
> (From update of attachment 143764 [details])
> This patch is wrong. HTTP headers are not UTF-8.

Then we should at least make a whitelist (currently containing Location and Cookie, since those are the ones where we've found real-world sites sending UTF-8) and put them in the same place.
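The whitelist idea proposed in this comment could look roughly like the sketch below. It is only an illustration: the helper names `equalIgnoringAsciiCase` and `headerNeedsCharsetSniffing` are hypothetical and do not come from the patch, and the list contains just the two headers named in the comment.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Case-insensitive comparison, since HTTP header names are case-insensitive.
bool equalIgnoringAsciiCase(const std::string& a, const std::string& b)
{
    return a.size() == b.size()
        && std::equal(a.begin(), a.end(), b.begin(), [](char x, char y) {
               return std::tolower(static_cast<unsigned char>(x))
                   == std::tolower(static_cast<unsigned char>(y));
           });
}

// Hypothetical single place to decide which headers get the
// UTF-8/Latin-1 check. Only the headers named in the comment are listed.
bool headerNeedsCharsetSniffing(const std::string& name)
{
    static const char* const whitelist[] = { "Location", "Cookie" };
    for (const char* candidate : whitelist) {
        if (equalIgnoringAsciiCase(name, candidate))
            return true;
    }
    return false;
}
```

Keeping the list in one function means both call sites mentioned in the thread would consult the same rule instead of each carrying its own check.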

Adam Barth

Comment 6 2012-05-24 11:36:34 PDT

http://tools.ietf.org/html/rfc6265 specifies how to process the Cookie and Set-Cookie headers. It's not correct to decode the whole header value as UTF-8. The correct way is to process the header as ASCII and then use UTF-8 to decode portions of the header.
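To illustrate the order of operations described in this comment (split on ASCII delimiters first, decode afterwards), here is a minimal sketch that extracts the name-value pair from a Set-Cookie value while keeping everything as raw octets. It covers only the first parsing step and is not a full RFC 6265 implementation; the names `CookiePair`, `trimWsp`, and `parseSetCookiePair` are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Name and value are kept as raw octets; no character decoding happens here.
struct CookiePair {
    std::string name;
    std::string value;
};

// Strips leading and trailing spaces and tabs.
static std::string trimWsp(const std::string& s)
{
    const char* wsp = " \t";
    std::size_t begin = s.find_first_not_of(wsp);
    if (begin == std::string::npos)
        return std::string();
    std::size_t end = s.find_last_not_of(wsp);
    return s.substr(begin, end - begin + 1);
}

// Splits on the ASCII delimiters ';' and '=' only. Bytes >= 0x80 are never
// interpreted, so they pass through untouched.
bool parseSetCookiePair(const std::string& headerValue, CookiePair& out)
{
    // Everything before the first ';' is the name-value pair.
    std::string pair = headerValue.substr(0, headerValue.find(';'));
    std::size_t equals = pair.find('=');
    if (equals == std::string::npos)
        return false; // no '=': ignore the header
    out.name = trimWsp(pair.substr(0, equals));
    out.value = trimWsp(pair.substr(equals + 1));
    return !out.name.empty();
}
```

Only after this step would a caller try to decode `name` or `value` as UTF-8, for example for display, and that decoding is allowed to fail without affecting how the cookie was parsed.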

Jason Liu

Comment 7 2012-05-24 19:31:11 PDT

Closing this, since we won't do this check.

Joe Mason

Comment 8 2012-05-25 07:35:51 PDT

Disagree. We already have a check that converts an entire Cookie header (but not Set-Cookie) to UTF-8 or Latin-1 depending on contents, in ResourceRequestBlackBerry, and a different check that does the same conversion for Location headers, in NetworkJob. If we're going to keep the Cookie check, we should at least merge this code to do the checks in the same place using the same method. Or if what Adam says is true, we need to fix the Cookie check to only work on part of the header at a time. But I argue that it's not important:

(In reply to comment #6)
> http://tools.ietf.org/html/rfc6265 specifies how to process the Cookie and Set-Cookie headers. It's not correct to decode the whole header value as UTF-8. The correct way is to process the header as ASCII and then use UTF-8 to decode portions of the header.

ASCII is a subset of UTF-8, so I don't see the difference between processing it as ASCII and then using UTF-8 to decode bytes which are not valid ASCII, and just decoding as UTF-8.

All it says in RFC6265 is:

   NOTE: Despite its name, the cookie-string is actually a sequence of
   octets, not a sequence of characters. To convert the cookie-string
   (or components thereof) into a sequence of characters (e.g., for
   presentation to the user), the user agent might wish to try using the
   UTF-8 character encoding [RFC3629] to decode the octet sequence.
   This decoding might fail, however, because not every sequence of
   octets is valid UTF-8.

Which implies that we can decode the whole header as UTF-8.

This contradicts the BNF, in fact, which defines cookie-octet to only allow ASCII, but some sites do send UTF-8 and others send Latin-1, so we have to deal.

It's possible that some of the components of a Set-Cookie header, like the domain, should cause the cookie to be rejected if they're not plain ASCII, but we're not doing this check for Set-Cookie (yet?)

Joe Mason

Comment 9 2012-05-25 07:37:56 PDT

(In reply to comment #8)
> This contradicts the BNF, in fact, which defines cookie-octet to only allow ASCII, but some sites do send UTF-8 and others send Latin-1, so we have to deal.

Sorry, I shouldn't say "send" - we're talking about Cookie, not Set-Cookie. I mean "expect".

Adam Barth

Comment 10 2012-05-25 15:55:28 PDT

> ASCII is a subset of UTF-8, so I don't see the difference between processing it as ASCII and then using UTF-8 to decode bytes which are not valid ASCII, and just decoding as UTF-8.

Those two operations are different. For example, consider a sequence of octets (like a BOM) in UTF-8 that, when decoded, doesn't produce any Unicode characters. If you first decode the header using UTF-8 and then attempt to parse it, you can get the wrong answer because that sequence of octets will have disappeared.

For this reason, it's not possible to correctly process HTTP headers, be they Cookie, Set-Cookie, or otherwise, in Unicode. You need to process them as sequences of octets in order to get the correct behavior. The design of handleNotifyHeaderReceived is broken and cannot be fixed without changing its type:

    void NetworkJob::handleNotifyHeaderReceived(const String& key, const String& value)

Specifically, the key and the value need to be changed from Unicode strings to sequences of octets. I'm repeating myself, but it is not possible to correctly process HTTP headers in Unicode.

> All it says in RFC6265 is:
>
> NOTE: Despite its name, the cookie-string is actually a sequence of
> octets, not a sequence of characters. To convert the cookie-string
> (or components thereof) into a sequence of characters (e.g., for
> presentation to the user), the user agent might wish to try using the
> UTF-8 character encoding [RFC3629] to decode the octet sequence.
> This decoding might fail, however, because not every sequence of
> octets is valid UTF-8.

Yes, I know what it says because I wrote it.

> Which implies that we can decode the whole header as UTF-8.

No, that's not what it says. It says explicitly that cookie-string is actually a sequence of octets, not a sequence of characters. If a user agent wishes to display the cookie-string to the user (e.g., using a font whose glyphs represent Unicode codepoints), then the user agent can try using UTF-8. However, nothing in that note says that it's possible to meet the requirements in the spec by processing the cookie-string in Unicode. It doesn't say that because it's not possible.

> This contradicts the BNF, in fact, which defines cookie-octet to only allow ASCII, but some sites do send UTF-8 and others send Latin-1, so we have to deal.

Correct. Not all servers send Set-Cookie headers that comply with the BNF. That's why the RFC defines the precise handling of all sequences of octets that might be sent by servers.

> It's possible that some of the components of a Set-Cookie header, like the domain, should cause the cookie to be rejected if they're not plain ASCII, but we're not doing this check for Set-Cookie (yet?)

The design of this code is broken. The only way to correctly process HTTP headers is as sequences of octets. Any attempt to process them in Unicode will not be correct. Period.
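The BOM example in this comment can be made concrete with a small sketch. It assumes a decoder that silently drops a leading UTF-8 BOM (the octets EF BB BF), which is one common decoder behavior but not necessarily what the BlackBerry port did; `decodeDroppingLeadingBom` and `cookieName` are hypothetical helpers written only for this illustration.

```cpp
#include <cstddef>
#include <string>

// Stand-in for a "decode as UTF-8" step that drops a leading BOM.
// The result is kept as UTF-8 bytes so the two paths can be compared.
std::string decodeDroppingLeadingBom(const std::string& octets)
{
    static const std::string bom = "\xEF\xBB\xBF";
    if (octets.compare(0, bom.size(), bom) == 0)
        return octets.substr(bom.size());
    return octets;
}

// Octet-level extraction of the cookie name: everything before the first '='
// within the part of the header that precedes the first ';'.
std::string cookieName(const std::string& headerValue)
{
    std::string pair = headerValue.substr(0, headerValue.find(';'));
    return pair.substr(0, pair.find('='));
}
```

For the header value EF BB BF followed by `id=1`, parsing the octets directly yields a five-byte name that still starts with EF BB BF, while decoding first and then parsing yields the two-byte name `id`. The two orders of operation disagree about which cookie was named, which is the kind of mismatch the comment describes.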

Adam Barth

Comment 11 2012-05-25 16:04:05 PDT

Eric points out on chat that you can get pretty close to correct behavior using Unicode. Most of the time, you'll get the right answer, but the problem is that you'll never be able to get everything right. Rather than mess around with a broken architecture, you should just stop trying to use Unicode to process HTTP and work in octets.

Status RESOLVED

Resolution INVALID

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Other

OS Other

Product WebKit

Component WebKit BlackBerry

Assignee Nobody

Reported 2012-05-24 00:50 PDT

Modified 2015-01-17 20:41 PST

CC List 7 users
