Bug 214860 - [EventSource] WebKit fails to UTF-8 encode the Last-Event-ID HTTP header value
Summary: [EventSource] WebKit fails to UTF-8 encode the Last-Event-ID HTTP header value
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Chris Dumez
URL: https://html.spec.whatwg.org/multipag...
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-07-27 16:48 PDT by Chris Dumez
Modified: 2020-07-28 16:15 PDT (History)
8 users (show)

See Also:


Attachments
Patch (4.11 KB, patch)
2020-07-27 16:49 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (4.11 KB, patch)
2020-07-27 16:50 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (6.77 KB, patch)
2020-07-28 12:54 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (6.83 KB, patch)
2020-07-28 14:37 PDT, Chris Dumez
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Dumez 2020-07-27 16:48:23 PDT
WebKit fails to UTF-8 encode the Last-Event-ID HTTP header value for EventSource.

The specification is here:
https://html.spec.whatwg.org/multipage/server-sent-events.html#reestablish-the-connection (step 5.3.)

This is causing us to fail a couple of EventSource web-platform-tests which are passing is both Chrome & Firefox.
Comment 1 Chris Dumez 2020-07-27 16:49:57 PDT
Created attachment 405329 [details]
Patch
Comment 2 Chris Dumez 2020-07-27 16:50:34 PDT
Created attachment 405330 [details]
Patch
Comment 4 Darin Adler 2020-07-27 16:53:11 PDT
Comment on attachment 405330 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=405330&action=review

> Source/WebCore/page/EventSource.cpp:106
> +        // Specification says this needs to be UTF8-encoded.
> +        auto lastEventID = m_lastEventId.utf8();
> +        request.setHTTPHeaderField(HTTPHeaderName::LastEventID, String(reinterpret_cast<const LChar*>(lastEventID.data()), lastEventID.length()));

This is not the correct way to do this. We should never have a WTF::String with characters that are UTF-8 encoded. It’s *defined* as Latin-1.
Comment 5 Darin Adler 2020-07-27 16:53:52 PDT
Comment on attachment 405330 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=405330&action=review

>> Source/WebCore/page/EventSource.cpp:106
>> +        request.setHTTPHeaderField(HTTPHeaderName::LastEventID, String(reinterpret_cast<const LChar*>(lastEventID.data()), lastEventID.length()));
> 
> This is not the correct way to do this. We should never have a WTF::String with characters that are UTF-8 encoded. It’s *defined* as Latin-1.

Where is the code that converts the String into either a CFString or into bytes that go on the wire?
Comment 6 Chris Dumez 2020-07-27 16:55:41 PDT
(In reply to Darin Adler from comment #5)
> Comment on attachment 405330 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=405330&action=review
> 
> >> Source/WebCore/page/EventSource.cpp:106
> >> +        request.setHTTPHeaderField(HTTPHeaderName::LastEventID, String(reinterpret_cast<const LChar*>(lastEventID.data()), lastEventID.length()));
> > 
> > This is not the correct way to do this. We should never have a WTF::String with characters that are UTF-8 encoded. It’s *defined* as Latin-1.
> 
> Where is the code that converts the String into either a CFString or into
> bytes that go on the wire?

I believe this would be ResourceRequest::doUpdatePlatformRequest() in ResourceRequestCocoa.mm:

    for (const auto& header : httpHeaderFields())
        [nsRequest setValue:header.value forHTTPHeaderField:header.key];
Comment 7 Chris Dumez 2020-07-27 16:57:47 PDT
(In reply to Chris Dumez from comment #6)
> (In reply to Darin Adler from comment #5)
> > Comment on attachment 405330 [details]
> > Patch
> > 
> > View in context:
> > https://bugs.webkit.org/attachment.cgi?id=405330&action=review
> > 
> > >> Source/WebCore/page/EventSource.cpp:106
> > >> +        request.setHTTPHeaderField(HTTPHeaderName::LastEventID, String(reinterpret_cast<const LChar*>(lastEventID.data()), lastEventID.length()));
> > > 
> > > This is not the correct way to do this. We should never have a WTF::String with characters that are UTF-8 encoded. It’s *defined* as Latin-1.
> > 
> > Where is the code that converts the String into either a CFString or into
> > bytes that go on the wire?
> 
> I believe this would be ResourceRequest::doUpdatePlatformRequest() in
> ResourceRequestCocoa.mm:
> 
>     for (const auto& header : httpHeaderFields())
>         [nsRequest setValue:header.value forHTTPHeaderField:header.key];

Do you think we should special-case Last-Event-ID in ResourceRequest::doUpdatePlatformRequest() instead?
Comment 8 Darin Adler 2020-07-27 17:02:55 PDT
I started talking about this in Slack:

The bug seems to be in the CFNetwork layer.

When converting the WTF::String to the CFString the characters are not in any particular encoding; conceptually they are represented as a sequence of UTF-16 code units (although CFString can store them in whatever way it likes internally).

Converting the CFString to bytes to go on the wire is done inside CFNetwork; apparently incorrectly if it’s not using UTF-8!

Making a CFString that has UTF-16 code units that are actually UTF-8 code units with the high bytes set to zero is a possible way to work around that CFNetwork issue, I guess, but a dangerous workaround. It would be broken if CFNetwork was ever fixed to correctly encode strings as UTF-8 (the string would be double-encoded).
Comment 9 Chris Dumez 2020-07-27 17:12:58 PDT
(In reply to Darin Adler from comment #8)
> I started talking about this in Slack:
> 
> The bug seems to be in the CFNetwork layer.
> 
> When converting the WTF::String to the CFString the characters are not in
> any particular encoding; conceptually they are represented as a sequence of
> UTF-16 code units (although CFString can store them in whatever way it likes
> internally).
> 
> Converting the CFString to bytes to go on the wire is done inside CFNetwork;
> apparently incorrectly if it’s not using UTF-8!
> 
> Making a CFString that has UTF-16 code units that are actually UTF-8 code
> units with the high bytes set to zero is a possible way to work around that
> CFNetwork issue, I guess, but a dangerous workaround. It would be broken if
> CFNetwork was ever fixed to correctly encode strings as UTF-8 (the string
> would be double-encoded).

I am not convinced this is a CFNetwork bug. Headers are ISO-8859-1 encoded as per RFC:
- https://tools.ietf.org/html/rfc7230#section-3.2.4
"""
Historically, HTTP has allowed field content with text in the
   ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
   through use of [RFC2047] encoding.  In practice, most HTTP header
   field values use only a subset of the US-ASCII charset [USASCII].
   Newly defined header fields SHOULD limit their field values to
   US-ASCII octets.  A recipient SHOULD treat other octets in field
   content (obs-text) as opaque data.
"""

Here, it is the HTML specification that is enforcing that we use UTF-8 encoding to this particular header value.
Comment 10 Alexey Proskuryakov 2020-07-27 18:06:35 PDT
rdar://problem/13799963
Comment 11 Alexey Proskuryakov 2020-07-27 18:08:29 PDT
I don't think that it's a CFNetwork *bug*, but I think that there should be a way to get this via API cleanly, see rdar://problem/13800011.
Comment 12 Chris Dumez 2020-07-28 12:54:15 PDT
Created attachment 405397 [details]
Patch
Comment 13 Chris Dumez 2020-07-28 12:59:46 PDT
As discussed on Slack, it is not safe to UTF-8 encode all header and is causing some test failures (Those tests seem to pass various octet sequences for a custom header and do not expect the value to be UTF-8 encoded).

The HTTP RFC suggests that any octet is valid in header field value (in theory), that historically it was Latin1, and that new header fields SHOULD only use US-ASCII
Comment 14 Alex Christensen 2020-07-28 14:20:53 PDT
Comment on attachment 405397 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=405397&action=review

> Source/WebCore/platform/network/cocoa/ResourceRequestCocoa.mm:198
> +        [nsRequest setValue:encodedValue.bridgingAutorelease() forHTTPHeaderField:header.key];

This will cause a spike in the autoreleasepool size.  Why not just encodedValue.get()?  Then it's immediately released after use.
Comment 15 Darin Adler 2020-07-28 14:24:13 PDT
Comment on attachment 405397 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=405397&action=review

> Source/WebCore/platform/network/cf/ResourceRequestCFNet.h:91
> +        // HTTP headers are normally latin1 but the HTML specification says to use UTF-8 encoding for the Last-Event-ID header value.

Latin-1, not "latin1".

Missing second half of the comment. Something like this:

// Constructing a string with the UTF-8 bytes but claiming that it’s Latin-1 is the way to get CFNetwork to put those UTF-8 bytes on the wire.

> Source/WebCore/platform/network/cocoa/ResourceRequestCocoa.mm:198
> +        [nsRequest setValue:encodedValue.bridgingAutorelease() forHTTPHeaderField:header.key];

Slightly wasteful to use autorelease here. We do need bridging, but not autorelease. Could write (__bridge CFStringRef)encodedValue.get() but it would be even better if it could be done more elegantly.
Comment 16 Chris Dumez 2020-07-28 14:37:48 PDT
Created attachment 405415 [details]
Patch
Comment 17 EWS 2020-07-28 16:15:17 PDT
Committed r265014: <https://trac.webkit.org/changeset/265014>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 405415 [details].