Bug 178257 (RESOLVED INVALID)
Content-Type parameter value parsing: backslash
https://bugs.webkit.org/show_bug.cgi?id=178257
Anne van Kesteren
Reported 2017-10-13 03:49:09 PDT
A document labeled as text/html;charset="\g\b\k" ends up with the "default" encoding, rather than GBK. This is wrong per HTTP and per the specification I'm working on that's a bit more tolerant than HTTP to accept cases such as text/html; More context: https://github.com/whatwg/mimesniff/issues/38.
Alexey Proskuryakov
Comment 1 2017-10-13 09:39:21 PDT
Is this about HTTP, or about the Http-Equiv meta? If HTTP, this is implemented below WebKit, so the right place for the report is bugreport.apple.com.
Anne van Kesteren
Comment 2 2017-10-13 09:52:26 PDT
Are there guidelines somewhere about which product to use for HTTP library issues?
Alexey Proskuryakov
Comment 3 2017-10-13 09:56:58 PDT
I am not aware of such documentation, and don't have the same view into the system to confirm what the available options are. Please feel free to post the Apple bug report number here, and I'll keep an eye on it to make sure that it reaches the right component.
Anne van Kesteren
Comment 4 2017-10-13 10:05:48 PDT
34980046
Alexey Proskuryakov
Comment 5 2017-10-13 11:12:24 PDT
Thank you! Marking invalid as a non-WebKit issue. Please file a new bug for clarity if this also applies to Http-Equiv.
Anne van Kesteren
Comment 6 2017-10-14 06:42:09 PDT
It doesn't. Despite the name, they're not actually equivalent in their processing model. One final question: for a MIME type such as text/html;charset=" gbk", it's WebKit that would be responsible for trimming the whitespace around the encoding label, right? Or would I have to file that against the HTTP library as well?
Alexey Proskuryakov
Comment 7 2017-10-16 09:24:51 PDT
I think that this would be WebKit (why do these need to be trimmed, though?). On second thought, I'm actually not sure if unescaping can be performed at the networking level, given what the interface to it looks like. Maybe it needs to be WebKit. But also, why would we want to add tolerance where it's not needed for compatibility? The spec situation is a mess anyway.
Anne van Kesteren
Comment 8 2017-10-17 00:12:18 PDT
Escaping in quoted-string is a required part of the HTTP specification since forever. I guess we could try to object to it, if browsers consistently don't support it in quoted-string and it's basically always fiction... Stripping whitespace around a label is part of https://encoding.spec.whatwg.org/#concept-encoding-get.
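For illustration, the whitespace stripping and case-insensitive match in that "get an encoding" step could be sketched in Python roughly as follows. The label table is a tiny hand-picked subset of the real one, and the names `LABEL_TO_ENCODING` and `get_encoding` are hypothetical, not anything in WebKit.

```python
import string

# ASCII whitespace: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "

# Lowercase only A-Z, so non-ASCII look-alikes (e.g. U+212A KELVIN SIGN) never match.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Hypothetical, deliberately incomplete subset of the label table.
LABEL_TO_ENCODING = {
    "gbk": "GBK",
    "gb2312": "GBK",
    "x-gbk": "GBK",
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
}


def get_encoding(label):
    """Return the encoding name for a label, or None if the label is unknown."""
    key = label.strip(ASCII_WHITESPACE).translate(ASCII_LOWER)
    return LABEL_TO_ENCODING.get(key)


print(get_encoding(" gbk"))  # GBK
```

With this behavior, the label from text/html;charset=" gbk" resolves to GBK instead of falling back to the default encoding.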
Keith Rollin
Comment 9 2017-10-24 12:53:26 PDT
<rdar://34980046> ended back up with us. This seems appropriate, since WebKit is what parses the HTTP headers. As noted in the radar, the networking layer doesn't look at this. Copying my radar comment here for visibility, since I don't think Anne sees those through bugs.apple.com.
----------
See also <https://bugs.webkit.org/show_bug.cgi?id=178257>. Alexey advised Anne to post to bugs.apple.com, which caused it to end up here.
Per that bug and the GitHub issue, it sounds like this overall concerns adherence to <https://encoding.spec.whatwg.org/#concept-encoding-get>. I think Anne is saying that we don’t follow it correctly.
On the other hand, from our source code (ParsedContentType.cpp, assuming that’s the right file), it looks like we follow <http://tools.ietf.org/html/rfc2045#section-5.1> closely. rfc2045 lays out the rules for parsing Content-Type. But it refers to a “quoted string”, which is not defined, other than to mention that it begins and ends with quotes, which are not part of the final value. From that, it looks like we don’t do anything to treat the contents specially other than to consider the two characters — \” — as part of the content rather than a delimiter. That is, any backslash and the character following it are included in the final value.
However, <https://tools.ietf.org/html/rfc7230#section-3.2.6> does provide a definition of quoted string:
“A string of text is parsed as a single value if it is quoted using double-quote marks. …
“The backslash octet ("\") can be used as a single-octet quoting mechanism within quoted-string and comment constructs. Recipients that process the value of a quoted-string MUST handle a quoted-pair as if it were replaced by the octet following the backslash.”
So perhaps we’re not following this part of the HTTP spec.
Note that ParsedContentType.cpp was checked in as r86289 in 2011, and so is definitely a pre-existing condition. Lowering priority based on the age of this issue.
I’m not sure who should handle this. The code was checked in by someone on the Chromium team. Starting with Geoff’s group.
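For illustration only, the quoted-pair rule quoted above from RFC 7230 could be sketched in Python like this. `unquote_quoted_string` is a hypothetical helper, not the code in ParsedContentType.cpp, and how it treats a lone trailing backslash is an assumption.

```python
def unquote_quoted_string(value):
    """Strip the surrounding double quotes and resolve backslash quoted-pairs.

    Input that is not wrapped in double quotes is returned unchanged.
    """
    if len(value) < 2 or not (value.startswith('"') and value.endswith('"')):
        return value
    out = []
    i = 1
    end = len(value) - 1  # index of the closing quote
    while i < end:
        ch = value[i]
        if ch == "\\" and i + 1 < end:
            # Quoted-pair: drop the backslash and keep the octet after it.
            i += 1
            ch = value[i]
        # Assumption: a backslash with nothing after it is kept literally.
        out.append(ch)
        i += 1
    return "".join(out)


print(unquote_quoted_string(r'"\g\b\k"'))  # gbk
```

If the backslashes are instead left in the value, as described above, the charset parameter is the literal \g\b\k, which is not a known encoding label, so the document ends up with the default encoding.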
Anne van Kesteren
Comment 10 2017-10-25 00:11:48 PDT
Okay, it sounds like not supporting backslashes is a problem in the HTTP layer of the OS, not WebKit. So the rdar should be good for that. I'll file a new WebKit bug in due course on not trimming whitespace from an encoding label when trying to determine the corresponding encoding and update this bug with the new bug number.
Anne van Kesteren
Comment 11 2017-10-25 00:12:29 PDT
I forgot to mention, I much appreciate you updating this bug report with that information. Thanks!
Anne van Kesteren
Comment 12 2017-11-20 04:34:03 PST
Finally got around to filing bug 179882.