RESOLVED FIXED 214860
[EventSource] WebKit fails to UTF-8 encode the Last-Event-ID HTTP header value
https://bugs.webkit.org/show_bug.cgi?id=214860
Summary [EventSource] WebKit fails to UTF-8 encode the Last-Event-ID HTTP header value
Chris Dumez
Reported 2020-07-27 16:48:23 PDT
WebKit fails to UTF-8 encode the Last-Event-ID HTTP header value for EventSource. The specification is here: https://html.spec.whatwg.org/multipage/server-sent-events.html#reestablish-the-connection (step 5.3.) This is causing us to fail a couple of EventSource web-platform-tests which are passing is both Chrome & Firefox.
Attachments
Patch (4.11 KB, patch)
2020-07-27 16:49 PDT, Chris Dumez
no flags
Patch (4.11 KB, patch)
2020-07-27 16:50 PDT, Chris Dumez
no flags
Patch (6.77 KB, patch)
2020-07-28 12:54 PDT, Chris Dumez
no flags
Patch (6.83 KB, patch)
2020-07-28 14:37 PDT, Chris Dumez
no flags
Chris Dumez
Comment 1 2020-07-27 16:49:57 PDT
Chris Dumez
Comment 2 2020-07-27 16:50:34 PDT
Darin Adler
Comment 4 2020-07-27 16:53:11 PDT
Comment on attachment 405330 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=405330&action=review > Source/WebCore/page/EventSource.cpp:106 > + // Specification says this needs to be UTF8-encoded. > + auto lastEventID = m_lastEventId.utf8(); > + request.setHTTPHeaderField(HTTPHeaderName::LastEventID, String(reinterpret_cast<const LChar*>(lastEventID.data()), lastEventID.length())); This is not the correct way to do this. We should never have a WTF::String with characters that are UTF-8 encoded. It’s *defined* as Latin-1.
Darin Adler
Comment 5 2020-07-27 16:53:52 PDT
Comment on attachment 405330 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=405330&action=review >> Source/WebCore/page/EventSource.cpp:106 >> + request.setHTTPHeaderField(HTTPHeaderName::LastEventID, String(reinterpret_cast<const LChar*>(lastEventID.data()), lastEventID.length())); > > This is not the correct way to do this. We should never have a WTF::String with characters that are UTF-8 encoded. It’s *defined* as Latin-1. Where is the code that converts the String into either a CFString or into bytes that go on the wire?
Chris Dumez
Comment 6 2020-07-27 16:55:41 PDT
(In reply to Darin Adler from comment #5) > Comment on attachment 405330 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=405330&action=review > > >> Source/WebCore/page/EventSource.cpp:106 > >> + request.setHTTPHeaderField(HTTPHeaderName::LastEventID, String(reinterpret_cast<const LChar*>(lastEventID.data()), lastEventID.length())); > > > > This is not the correct way to do this. We should never have a WTF::String with characters that are UTF-8 encoded. It’s *defined* as Latin-1. > > Where is the code that converts the String into either a CFString or into > bytes that go on the wire? I believe this would be ResourceRequest::doUpdatePlatformRequest() in ResourceRequestCocoa.mm: for (const auto& header : httpHeaderFields()) [nsRequest setValue:header.value forHTTPHeaderField:header.key];
Chris Dumez
Comment 7 2020-07-27 16:57:47 PDT
(In reply to Chris Dumez from comment #6) > (In reply to Darin Adler from comment #5) > > Comment on attachment 405330 [details] > > Patch > > > > View in context: > > https://bugs.webkit.org/attachment.cgi?id=405330&action=review > > > > >> Source/WebCore/page/EventSource.cpp:106 > > >> + request.setHTTPHeaderField(HTTPHeaderName::LastEventID, String(reinterpret_cast<const LChar*>(lastEventID.data()), lastEventID.length())); > > > > > > This is not the correct way to do this. We should never have a WTF::String with characters that are UTF-8 encoded. It’s *defined* as Latin-1. > > > > Where is the code that converts the String into either a CFString or into > > bytes that go on the wire? > > I believe this would be ResourceRequest::doUpdatePlatformRequest() in > ResourceRequestCocoa.mm: > > for (const auto& header : httpHeaderFields()) > [nsRequest setValue:header.value forHTTPHeaderField:header.key]; Do you think we should special-case Last-Event-ID in ResourceRequest::doUpdatePlatformRequest() instead?
Darin Adler
Comment 8 2020-07-27 17:02:55 PDT
I started talking about this in Slack: The bug seems to be in the CFNetwork layer. When converting the WTF::String to the CFString the characters are not in any particular encoding; conceptually they are represented as a sequence of UTF-16 code units (although CFString can store them in whatever way it likes internally). Converting the CFString to bytes to go on the wire is done inside CFNetwork; apparently incorrectly if it’s not using UTF-8! Making a CFString that has UTF-16 code units that are actually UTF-8 code units with the high bytes set to zero is a possible way to work around that CFNetwork issue, I guess, but a dangerous workaround. It would be broken if CFNetwork was ever fixed to correctly encode strings as UTF-8 (the string would be double-encoded).
Chris Dumez
Comment 9 2020-07-27 17:12:58 PDT
(In reply to Darin Adler from comment #8) > I started talking about this in Slack: > > The bug seems to be in the CFNetwork layer. > > When converting the WTF::String to the CFString the characters are not in > any particular encoding; conceptually they are represented as a sequence of > UTF-16 code units (although CFString can store them in whatever way it likes > internally). > > Converting the CFString to bytes to go on the wire is done inside CFNetwork; > apparently incorrectly if it’s not using UTF-8! > > Making a CFString that has UTF-16 code units that are actually UTF-8 code > units with the high bytes set to zero is a possible way to work around that > CFNetwork issue, I guess, but a dangerous workaround. It would be broken if > CFNetwork was ever fixed to correctly encode strings as UTF-8 (the string > would be double-encoded). I am not convinced this is a CFNetwork bug. Headers are ISO-8859-1 encoded as per RFC: - https://tools.ietf.org/html/rfc7230#section-3.2.4 """ Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data. """ Here, it is the HTML specification that is enforcing that we use UTF-8 encoding to this particular header value.
Alexey Proskuryakov
Comment 10 2020-07-27 18:06:35 PDT
Alexey Proskuryakov
Comment 11 2020-07-27 18:08:29 PDT
I don't think that it's a CFNetwork *bug*, but I think that there should be a way to get this via API cleanly, see rdar://problem/13800011.
Chris Dumez
Comment 12 2020-07-28 12:54:15 PDT
Chris Dumez
Comment 13 2020-07-28 12:59:46 PDT
As discussed on Slack, it is not safe to UTF-8 encode all header and is causing some test failures (Those tests seem to pass various octet sequences for a custom header and do not expect the value to be UTF-8 encoded). The HTTP RFC suggests that any octet is valid in header field value (in theory), that historically it was Latin1, and that new header fields SHOULD only use US-ASCII
Alex Christensen
Comment 14 2020-07-28 14:20:53 PDT
Comment on attachment 405397 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=405397&action=review > Source/WebCore/platform/network/cocoa/ResourceRequestCocoa.mm:198 > + [nsRequest setValue:encodedValue.bridgingAutorelease() forHTTPHeaderField:header.key]; This will cause a spike in the autoreleasepool size. Why not just encodedValue.get()? Then it's immediately released after use.
Darin Adler
Comment 15 2020-07-28 14:24:13 PDT
Comment on attachment 405397 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=405397&action=review > Source/WebCore/platform/network/cf/ResourceRequestCFNet.h:91 > + // HTTP headers are normally latin1 but the HTML specification says to use UTF-8 encoding for the Last-Event-ID header value. Latin-1, not "latin1". Missing second half of the comment. Something like this: // Constructing a string with the UTF-8 bytes but claiming that it’s Latin-1 is the way to get CFNetwork to put those UTF-8 bytes on the wire. > Source/WebCore/platform/network/cocoa/ResourceRequestCocoa.mm:198 > + [nsRequest setValue:encodedValue.bridgingAutorelease() forHTTPHeaderField:header.key]; Slightly wasteful to use autorelease here. We do need bridging, but not autorelease. Could write (__bridge CFStringRef)encodedValue.get() but it would be even better if it could be done more elegantly.
Chris Dumez
Comment 16 2020-07-28 14:37:48 PDT
EWS
Comment 17 2020-07-28 16:15:17 PDT
Committed r265014: <https://trac.webkit.org/changeset/265014> All reviewed patches have been landed. Closing bug and clearing flags on attachment 405415 [details].
Note You need to log in before you can comment on or make changes to this bug.