Bug 224566 - REGRESSION(r275921-r275922): [ wk2 ] http/tests/security/webgl-remote-read-remote-image-allowed-with-credentials.html is flakey crashing
Summary: REGRESSION(r275921-r275922): [ wk2 ] http/tests/security/webgl-remote-read-re...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Chris Dumez
URL:
Keywords: InRadar
Duplicates: 224612
Depends on:
Blocks: 224556
 
Reported: 2021-04-14 11:39 PDT by Robert Jenner
Modified: 2021-04-20 09:44 PDT
CC List: 10 users

See Also:


Attachments
Patch (10.59 KB, patch)
2021-04-14 23:08 PDT, Chris Dumez
ews-feeder: commit-queue-

Description Robert Jenner 2021-04-14 11:39:17 PDT
http/tests/security/webgl-remote-read-remote-image-allowed-with-credentials.html

is flakey crashing in wk2 on iOS 14 Simulator, Catalina and BigSur. 

HISTORY:
https://results.webkit.org/?suite=layout-tests&test=http%2Ftests%2Fsecurity%2Fwebgl-remote-read-remote-image-allowed-with-credentials.html


No crash log was generated for this test. However, I was able to find STDERR output.

STDERR TEXT:
stderr:
2021-04-14 07:23:46.915 com.apple.WebKit.Networking.Development[91141:40671502] DiskCookieStorage changing policy from 0 to 2, cookie file: file:///Users/buildbot/Library/HTTPStorages/com.apple.WebKit.WebKitTestRunner.binarycookies
LEAK: 1 WebPageProxy
Comment 1 Radar WebKit Bug Importer 2021-04-14 11:40:48 PDT
<rdar://problem/76657007>
Comment 2 Robert Jenner 2021-04-14 11:49:04 PDT
Updated test expectations to Pass Crash here:

https://trac.webkit.org/changeset/275955/webkit

Updated before investigating because this test was causing issues for EWS and commit-queue.
Comment 3 Robert Jenner 2021-04-14 15:12:25 PDT
(In reply to Robert Jenner from comment #2)
> Updated test expectations to Pass Crash here:
> 
> https://trac.webkit.org/changeset/275955/webkit
> 
> Updated before investigating because this test was causing issues for EWS
> and commit-queue.

Forgot to add expectations to Mac wk2. Updated here:
https://trac.webkit.org/changeset/275974/webkit
Comment 4 Chris Dumez 2021-04-14 15:52:58 PDT
No crash log? Weird.
Comment 5 Chris Dumez 2021-04-14 16:29:36 PDT
I was able to reproduce the crash locally on macOS with the command that was provided. I see this:
16:24:02.300 6056 com.apple.WebKit.WebContent.Development crash, pid = 6113
16:24:02.318 6056 killed pid 6059
16:26:17.392 6056 worker/26 http/tests/security/webgl-remote-read-remote-image-allowed-with-credentials.html crashed, (stderr lines):
16:26:17.392 6056   LEAK: 1 WebPageProxy
16:26:17.394 5731 [2000/2000] http/tests/security/webgl-remote-read-remote-image-allowed-with-credentials.html failed unexpectedly (com.apple.WebKit.WebContent.Development crashed [pid=6113])

But again, no generated crash log.
Comment 6 Chris Dumez 2021-04-14 16:33:24 PDT
Looking at the system logs I found:
com.apple.WebKit	IPC	fault	16:32:20.017530-0700	9721	com.apple.WebKit.GPU.Development	Exiting: Received XPC event type: XPC_ERROR_CONNECTION_INVALID
com.apple.WebKit	IPC	fault	16:32:20.447207-0700	9728	com.apple.WebKit.GPU.Development	Exiting: Received XPC event type: XPC_ERROR_CONNECTION_INVALID
com.apple.WebKit	IPC	fault	16:32:20.592596-0700	9729	com.apple.WebKit.GPU.Development	Exiting: Received XPC event type: XPC_ERROR_CONNECTION_INVALID
com.apple.WebKit	IPC	fault	16:32:20.593085-0700	9658	com.apple.WebKit.GPU.Development	Exiting: Received XPC event type: XPC_ERROR_CONNECTION_INVALID
com.apple.WebKit	IPC	fault	16:32:20.593305-0700	9790	com.apple.WebKit.WebContent.Development	Exiting: Received XPC event type: XPC_ERROR_CONNECTION_INVALID
Comment 7 Chris Dumez 2021-04-14 16:39:46 PDT
It is super odd that r275921 would have regressed this on macOS given that it was an iOS patch (VISIBILITY_PROPAGATION_VIEW is only 1 on iOS).
Comment 8 Chris Dumez 2021-04-14 16:50:51 PDT
On the flakiness dashboard, we can see a crash as early as r273654. That said, it is true that the crash became super frequent recently (r275922).

r275922 is not platform-specific and is actually more likely to have caused this.
Comment 9 Chris Dumez 2021-04-14 16:55:53 PDT
As I suspected, the regression point is wrong. I can still reproduce the crash after reverting r275921 locally. I will try reverting r275922 next.
Comment 10 Chris Dumez 2021-04-14 17:05:15 PDT
(In reply to Chris Dumez from comment #9)
> As I suspected, the regression point is wrong. I can still reproduce the
> crash after reverting r275921 locally. I will try reverting r275922 next.

I can still reproduce after reverting r275922. I am thus not sure what caused this.
Comment 11 Chris Dumez 2021-04-14 17:16:13 PDT
Could be just flakiness but it looks like reverting both r275921 and r275922 makes the flaky crash go away. Both patches may impact when we decide to launch the GPUProcess, so the issue could be timing-related. I'll investigate more tomorrow.
Comment 12 Chris Dumez 2021-04-14 17:23:42 PDT
(In reply to Chris Dumez from comment #11)
> Could be just flakiness but it looks like reverting both r275921 and r275922
> makes the flaky crash go away. Both patches may impact when we decide to
> launch to GPUProcess so the issue could be timing-related. I'll investigate
> more tomorrow.

I think this has to do with when we decide to initiate the connection to the GPUProcess.

r275921 is mostly an iOS patch but it does one thing on macOS:
- It delays the construction of the RemoteRenderingBackendProxy until we actually need it. It used to get constructed as soon as we constructed the WebPage; now it only gets constructed when the GPUProcess launches or when we start doing canvas rendering.

r275922 is related too. Before r275922, constructing the RemoteRenderingBackendProxy would initiate the GPUProcess connection right away (in the constructor). After r275922, eagerly constructing a RemoteRenderingBackendProxy does not actually initiate the GPUProcess connection. We only launch the GPUProcess when we actually do canvas rendering via the RemoteRenderingBackendProxy.

I think this is why reverting r275921 without reverting r275922 does not fix the crash.

That said, now I have to figure out why launching the GPUProcess when we start canvas drawing instead of on WebPage creation actually causes this flaky crash. Even weirder, it is not a real crash but an XPC error (XPC_ERROR_CONNECTION_INVALID).
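For illustration only, here is a minimal standalone sketch of the lazy connection setup described above. The class and function names are hypothetical stand-ins, not actual WebKit code:
```
#include <iostream>
#include <memory>

// Hypothetical stand-in types -- a heavily simplified model, not WebKit code.
struct GPUProcessConnection {
    GPUProcessConnection() { std::cout << "GPUProcess connection initiated\n"; }
};

class RemoteRenderingBackendProxyModel {
public:
    // Post-r275922 behavior (as described above): the connection is only initiated
    // right before the first use, not when the proxy is constructed.
    void drawToCanvas()
    {
        ensureGPUProcessConnection();
        std::cout << "canvas rendering IPC would be sent here\n";
    }

private:
    void ensureGPUProcessConnection()
    {
        if (!m_connection)
            m_connection = std::make_unique<GPUProcessConnection>();
    }

    std::unique_ptr<GPUProcessConnection> m_connection;
};

int main()
{
    RemoteRenderingBackendProxyModel proxy; // no GPUProcess connection yet
    proxy.drawToCanvas();                   // the connection is initiated lazily here
}
```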
Comment 13 Robert Jenner 2021-04-14 18:58:43 PDT
I found two more tests that fail at the r275921 and r275922 revisions, but not r275920:

imported/w3c/web-platform-tests/webmessaging/with-ports/011.html
webgl/1.0.3/conformance/glsl/matrices/glsl-mat3-construction.html

As such, I have added these tests to the previous test expectations I set:
https://trac.webkit.org/changeset/275987/webkit
Comment 14 Chris Dumez 2021-04-14 22:05:40 PDT
I am not sure if this is the root cause but there is definitely an IPC ordering issue.

I see the WebProcess sending to the GPUProcess:
1. RemoteRenderingBackend::CreateImageBuffer
2. RemoteRenderingBackend::GetDataURLForImageBuffer

And the GPUProcess receiving them in the opposite order. As a result, the second IPC handling is unable to find the image buffer.

The first message is async and the second one is sync, so this is likely where the ordering issue comes from. Oddly though, we are using the IPC::SendOption::DispatchMessageEvenWhenWaitingForSyncReply flag when sending the CreateImageBuffer IPC, which *should* maintain ordering with sync messages.

That said, the receiver seems to be a threaded message receiver, which is not super common so it is possible we have a bug with this configuration.
Comment 15 Chris Dumez 2021-04-14 22:33:07 PDT
Adding Kimmo in cc as this seems to be using his new MessageReceiveQueue code (since RemoteRenderingBackend is a WorkQueueMessageReceiver).

With some local debugging, I can see that there seems to be an issue here in void Connection::processIncomingMessage(std::unique_ptr<Decoder> message):
```
if (auto* receiveQueue = m_receiveQueues.get(*message)) {
    receiveQueue->enqueueMessage(*this, WTFMove(message));
    return;
}
```

I would expect all RemoteRenderingBackend IPC messages to use this code path, since RemoteRenderingBackend is a WorkQueueMessageReceiver. However, I can see locally that the first couple of RemoteRenderingBackend IPC messages get a receiveQueue that is null. Starting with the 3rd RemoteRenderingBackend message, receiveQueue becomes non-null. Seems like we may have a race somewhere if my logging is correct.
Comment 16 Chris Dumez 2021-04-14 22:43:36 PDT
My patches did not actually change the timing of the GPUProcess launch in the context of the tests. There is something during the window.internals initialization that ends up launching the GPUProcess.

What did change with my patches is the timing of when the GPUConnectionToWebProcess::CreateRenderingBackend IPC gets sent. It used to get sent as soon as the RemoteRenderingBackendProxy got created and now it only gets sent right before the first IPC from the RemoteRenderingBackendProxy to the GPUProcess.

I think we've had an IPC synchronization issue on the GPUProcess side that became obvious after my patches because the IPCs are now so close together.

In particular, we send a GPUConnectionToWebProcess::CreateRenderingBackend async IPC that creates the RemoteRenderingBackend on the GPUProcess side, on the main thread. When constructed, the RemoteRenderingBackend registers itself as a WorkQueueMessageReceiver.

Right after that, the WebProcess sends IPC messages to the RemoteRenderingBackend. When we get those IPC messages on the IPC thread, the RemoteRenderingBackend may not have registered itself as a WorkQueueMessageReceiver yet on the main thread. As a result, RemoteRenderingBackend IPC messages may get processed out of order because some of them use the MessageReceiveQueue and some early IPC messages don't.
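To illustrate the race described above, here is a heavily simplified standalone model (hypothetical types, not the real IPC::Connection): an "IPC thread" can hand messages to processIncomingMessage() before the main thread has attached the work queue, so early messages take the default path while later ones go to the work queue.
```
#include <cstdio>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Simplified stand-in for IPC::Connection; only models which queue a message lands on.
struct ConnectionModel {
    std::mutex lock;
    std::map<std::string, std::vector<std::string>> receiveQueues; // receiver name -> work queue
    std::vector<std::string> incomingMessages;                     // default main-thread queue

    void processIncomingMessage(const std::string& receiverName, const std::string& message)
    {
        std::lock_guard<std::mutex> locker(lock);
        auto it = receiveQueues.find(receiverName);
        if (it != receiveQueues.end()) {
            it->second.push_back(message); // would be dispatched on the work queue
            return;
        }
        incomingMessages.push_back(message); // falls back to the default queue
    }

    void addMessageReceiveQueue(const std::string& receiverName)
    {
        std::lock_guard<std::mutex> locker(lock);
        receiveQueues.emplace(receiverName, std::vector<std::string>());
    }
};

int main()
{
    ConnectionModel connection;

    std::thread ipcThread([&] {
        // Messages from the WebProcess may arrive before the main thread below has
        // registered the RemoteRenderingBackend as a work-queue message receiver.
        connection.processIncomingMessage("RemoteRenderingBackend", "CreateImageBuffer");
        connection.processIncomingMessage("RemoteRenderingBackend", "GetDataURLForImageBuffer");
    });

    // Main thread: handling CreateRenderingBackend registers the receiver.
    connection.addMessageReceiveQueue("RemoteRenderingBackend");
    ipcThread.join();

    // Depending on timing, the two messages can end up split across the two queues,
    // which is how they can later be dispatched out of order.
    std::printf("default queue: %zu, work queue: %zu\n",
        connection.incomingMessages.size(),
        connection.receiveQueues["RemoteRenderingBackend"].size());
}
```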
Comment 17 Chris Dumez 2021-04-14 23:08:25 PDT
Created attachment 426076 [details]
Patch
Comment 18 Kimmo Kinnunen 2021-04-14 23:16:33 PDT
Yes, good debugging. That's the reason for the reordering.

There were multiple of these reordering bugs.
The series of changes related to MessageReceiveQueue fixed most of them, but this one remains, as the FIXME mentions. So basically the Connection implementation is still such that messages are delivered in random order.

I have a wip patch locally that fixes this reordering. However, it's not entirely simple to solve, since the Connection behavior when dispatching messages while waiting, with respect to all global Connection instances, is a bit underspecified. E.g. it's a complex patch if we want to preserve the behavior that *ALL* Connections get their messages dispatched on the main thread while one connection waits on a particular message.
Comment 19 Kimmo Kinnunen 2021-04-14 23:22:18 PDT
The FIXME in:
// FIXME: The matches here come from
// m_messagesToDispatchWhileWaitingForSyncReply.
Comment 20 Chris Dumez 2021-04-14 23:42:31 PDT
(In reply to Kimmo Kinnunen from comment #19)
> The FIXME in:
> // FIXME: The matches here come from
> // m_messagesToDispatchWhileWaitingForSyncReply.

Just focusing on this particular WorkQueueMessageReceiver issue, I think one possible fix at the IPC::Connection layer might be in Connection::processIncomingMessage(std::unique_ptr<Decoder> message):
```
if (auto* receiveQueue = m_receiveQueues.get(*message)) {
    receiveQueue->enqueueMessage(*this, WTFMove(message));
    return;
}
```

In the if scope, we should check if there are any pending messages in m_incomingMessages for this receiver name, if there are, then we could remove them from m_incomingMessages and then enqueue them on the receiveQueue *before* the current message. I think this would preserve ordering for async messages at least. We'd likely need a similar fix for sync messages (maybe checking m_messagesToDispatchWhileWaitingForSyncReply instead of m_incomingMessages).

Let me know what you think. Again though, I'd want to land the current patch (a simple fix that makes the IPC sync) first to address the crashes quickly. We can follow up with a better / more generic solution.
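As a rough standalone sketch of the idea above (stand-in types rather than the real Decoder / MessageReceiveQueue classes; it only illustrates the "drain earlier matching messages first" ordering, not a complete fix):
```
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in types for illustration only.
struct MessageModel {
    std::string receiverName;
    std::string name;
};

struct ReceiveQueueModel {
    std::vector<MessageModel> queued;
    void enqueueMessage(MessageModel&& message) { queued.push_back(std::move(message)); }
};

void enqueuePreservingOrder(ReceiveQueueModel& receiveQueue, std::deque<MessageModel>& incomingMessages,
    MessageModel&& currentMessage)
{
    // First move over any messages for the same receiver that are still sitting in the
    // default incoming-message queue, so they reach the work queue in their original order...
    for (auto it = incomingMessages.begin(); it != incomingMessages.end();) {
        if (it->receiverName == currentMessage.receiverName) {
            receiveQueue.enqueueMessage(std::move(*it));
            it = incomingMessages.erase(it);
        } else
            ++it;
    }
    // ...then enqueue the message that just arrived.
    receiveQueue.enqueueMessage(std::move(currentMessage));
}

int main()
{
    ReceiveQueueModel workQueue;
    std::deque<MessageModel> incomingMessages { { "RemoteRenderingBackend", "CreateImageBuffer" } };
    enqueuePreservingOrder(workQueue, incomingMessages, { "RemoteRenderingBackend", "GetDataURLForImageBuffer" });
    // workQueue.queued now holds CreateImageBuffer followed by GetDataURLForImageBuffer.
}
```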
Comment 21 EWS 2021-04-15 00:37:13 PDT
Committed r276007 (236559@main): <https://commits.webkit.org/236559@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 426076 [details].
Comment 22 Kimmo Kinnunen 2021-04-15 01:25:35 PDT
(In reply to Chris Dumez from comment #20)
> In the if scope, we should check if there are any pending messages in
> m_incomingMessages for this receiver name, if there are, then we could
> remove them from m_incomingMessages and then enqueue them on the
> receiveQueue *before* the current message.

I interpret your suggestion to mean the normal operation, e.g. the server is not waiting on a specific message. 

So the normal operation is already covered, or should be covered. If you're seeing the behavior in normal delivery, then I think something is still missing there and could be addressed simply. However, I tried to make this covered, so I don't know if this is the case.

In normal operation, if messages are scheduled to m_incomingMessages *before* the work queue is attached, this is business-as-usual. The m_incomingMessages are delivered when the work queue is attached, right at that moment, holding a lock. 

So if there are pending messages in m_incomingMessages, they should be processed at work queue install time and no additional changes should be needed in Connection::processIncomingMessage. (as per the intention, there could of course be bugs)



The reordering happens when the server starts waiting for messages.
1) Normal operation queues messages to m_incomingMessages (regardless of options)
2) Wait starts
3a) Messages during wait are queued to m_incomingMessages, if they don't have options
3b) Messages during wait are queued to m_messagesToDispatchWhileWaitingForSyncReply if they have the specific option or if they're sync messages


So it is impossible to order messages correctly if they pseudo-randomly get queued to two different queues. So even messages sent with DispatchMessageEvenWhenWaitingForSyncReply can end up reordered relative to the other messages.
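A tiny standalone model of the three cases above (simplified stand-in types, not the real Connection code), showing how two messages can end up on different queues once a wait starts:
```
#include <cstdio>
#include <deque>
#include <string>

enum class SendOption { None, DispatchMessageEvenWhenWaitingForSyncReply };

// Simplified model: only tracks which holding queue a message ends up on.
struct QueueModel {
    bool waitingForSyncReply = false;
    std::deque<std::string> incomingMessages;
    std::deque<std::string> messagesToDispatchWhileWaitingForSyncReply;

    void enqueue(const std::string& message, SendOption option, bool isSyncMessage)
    {
        // Cases 1 and 3a: by default, messages go to the regular incoming queue.
        // Case 3b: while a wait is in progress, sync messages and messages carrying
        // DispatchMessageEvenWhenWaitingForSyncReply go to the other queue.
        if (waitingForSyncReply
            && (isSyncMessage || option == SendOption::DispatchMessageEvenWhenWaitingForSyncReply)) {
            messagesToDispatchWhileWaitingForSyncReply.push_back(message);
            return;
        }
        incomingMessages.push_back(message);
    }
};

int main()
{
    QueueModel model;
    model.enqueue("CreateImageBuffer", SendOption::DispatchMessageEvenWhenWaitingForSyncReply, false);
    model.waitingForSyncReply = true; // a wait starts
    model.enqueue("GetDataURLForImageBuffer", SendOption::None, true);
    // The two messages now sit on different queues, so their relative order is lost.
    std::printf("incoming: %zu, dispatch-while-waiting: %zu\n",
        model.incomingMessages.size(), model.messagesToDispatchWhileWaitingForSyncReply.size());
}
```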
Comment 23 Kimmo Kinnunen 2021-04-15 01:52:00 PDT
(In reply to Chris Dumez from comment #14)
> I am not sure if this is the root cause but there is definitely an IPC
> ordering issue.
> 
> I see the WebProcess sending to the GPUProcess:
> 1. RemoteRenderingBackend::CreateImageBuffer
> 2. RemoteRenderingBackend::GetDataURLForImageBuffer
> 
> And the GPUProcess receiving them in the opposite order

This interpretation is a bit suspicious and perhaps warrants a double check.

The normal logical operation of creating an image buffer on the Web Content process side is currently:

1) send(CreateImageBuffer())
2) waitForDidCreateImageBufferBackend()
3) do other operations

If we apply this logic to GetDataURL, without looking at the implementation it should be:

1) send(CreateImageBuffer())
2) waitForDidCreateImageBufferBackend()
3) send(GetDataURLForImageBuffer)


If this is the case, it should simply not be possible to observe any reordering, as the Web Content process will not send the GetDataURL message before the GPU Process has received and replied to the CreateImageBuffer.


However, there's no such requirement for CreateRemoteRenderingBackend + CreateImageBuffer, so I could imagine those being reordered.
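For clarity, the ordering guarantee described here can be sketched as a trivial standalone program; the functions are hypothetical stubs, not the real WebKit API:
```
#include <cstdio>

// Hypothetical stubs standing in for the real IPC calls.
static void sendCreateImageBuffer() { std::puts("-> CreateImageBuffer"); }
static void waitForDidCreateImageBufferBackend() { std::puts("<- DidCreateImageBufferBackend (blocking wait)"); }
static void sendGetDataURLForImageBuffer() { std::puts("-> GetDataURLForImageBuffer"); }

int main()
{
    // Because the sender blocks until the backend confirms the buffer was created,
    // the GPU process cannot observe GetDataURLForImageBuffer before CreateImageBuffer.
    sendCreateImageBuffer();
    waitForDidCreateImageBufferBackend();
    sendGetDataURLForImageBuffer();
}
```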
Comment 24 Chris Dumez 2021-04-15 07:43:35 PDT
(In reply to Kimmo Kinnunen from comment #22)
> (In reply to Chris Dumez from comment #20)
> > In the if scope, we should check if there are any pending messages in
> > m_incomingMessages for this receiver name, if there are, then we could
> > remove them from m_incomingMessages and then enqueue them on the
> > receiveQueue *before* the current message.
> 
> I interpret your suggestion to mean the normal operation, e.g. the server is
> not waiting on a specific message. 
> 
> So the normal operation is already covered, or should be covered. If you're
> seeing the behavior in normal delivery, then I think something is still
> missing there and could be addressed simply. However, I tried to make this
> covered, so I don't know if this is the case.
> 
> In normal operation, if messages are scheduled to m_incomingMessages
> *before* the work queue is attached, this is business-as-usual. The
> m_incomingMessages are delivered when the work queue is attached, right at
> that moment, holding a lock. 

I guess you are talking about this code:
```
void Connection::addMessageReceiveQueue(MessageReceiveQueue& receiveQueue, ReceiverName receiverName, uint64_t destinationID)
{
    auto locker = holdLock(m_incomingMessagesMutex);
    m_incomingMessages = filterWithMessageReceiveQueue(*this, receiveQueue, receiverName, destinationID, WTFMove(m_incomingMessages));
    m_receiveQueues.add(receiveQueue, receiverName, destinationID);
}
```

In particular the `filterWithMessageReceiveQueue()` part. I did not realize we already did that and it is great.

However, the RemoteRenderingBackend::CreateImageBuffer IPC gets sent with the IPC::SendOption::DispatchMessageEvenWhenWaitingForSyncReply flag. As a result, the message does not get appended to m_incomingMessages. Instead, it gets appended to m_messagesToDispatchWhileWaitingForSyncReply :)

As I mentioned earlier in one of my comments, we have to take care of both regular async messages (in m_incomingMessages) and sync messages / async messages with DispatchMessageEvenWhenWaitingForSyncReply flag (in m_messagesToDispatchWhileWaitingForSyncReply).
Seems like our code currently only deals with regular async messages?
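One possible shape for the symmetric handling suggested here, as a simplified standalone sketch (stand-in types, not the real Connection code). Note that, as discussed in comment 22, draining both holding queues at attach time still cannot by itself recover the original ordering across the two queues:
```
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in types for illustration only.
struct PendingMessage {
    std::string receiverName;
    std::string name;
};

struct ConnectionSketch {
    std::mutex incomingMessagesMutex;
    std::deque<PendingMessage> incomingMessages;
    std::deque<PendingMessage> messagesToDispatchWhileWaitingForSyncReply;

    // When a work queue is attached, drain matching messages from *both* holding queues,
    // so that messages sent with DispatchMessageEvenWhenWaitingForSyncReply are not left behind.
    void addMessageReceiveQueue(std::vector<PendingMessage>& receiveQueue, const std::string& receiverName)
    {
        std::lock_guard<std::mutex> locker(incomingMessagesMutex);
        drainMatching(incomingMessages, receiveQueue, receiverName);
        drainMatching(messagesToDispatchWhileWaitingForSyncReply, receiveQueue, receiverName);
    }

    static void drainMatching(std::deque<PendingMessage>& from, std::vector<PendingMessage>& to,
        const std::string& receiverName)
    {
        for (auto it = from.begin(); it != from.end();) {
            if (it->receiverName == receiverName) {
                to.push_back(std::move(*it));
                it = from.erase(it);
            } else
                ++it;
        }
    }
};

int main()
{
    ConnectionSketch connection;
    connection.incomingMessages.push_back({ "RemoteRenderingBackend", "CreateImageBuffer" });
    connection.messagesToDispatchWhileWaitingForSyncReply.push_back({ "RemoteRenderingBackend", "GetDataURLForImageBuffer" });

    std::vector<PendingMessage> workQueue;
    connection.addMessageReceiveQueue(workQueue, "RemoteRenderingBackend");
    // Both pending messages now reach the work queue, though ordering across the two
    // original queues still has to be reconstructed some other way.
}
```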
Comment 25 Chris Dumez 2021-04-15 14:00:22 PDT
(In reply to Chris Dumez from comment #24)
> (In reply to Kimmo Kinnunen from comment #22)
> > (In reply to Chris Dumez from comment #20)
> > > In the if scope, we should check if there are any pending messages in
> > > m_incomingMessages for this receiver name, if there are, then we could
> > > remove them from m_incomingMessages and then enqueue them on the
> > > receiveQueue *before* the current message.
> > 
> > I interpret your suggestion to mean the normal operation, e.g. the server is
> > not waiting on a specific message. 
> > 
> > So the normal operation is already covered, or should be covered. If you're
> > seeing the behavior in normal delivery, then I think something is still
> > missing there and could be addressed simply. However, I tried to make this
> > covered, so I don't know if this is the case.
> > 
> > In normal operation, if messages are scheduled to m_incomingMessages
> > *before* the work queue is attached, this is business-as-usual. The
> > m_incomingMessages are delivered when the work queue is attached, right at
> > that moment, holding a lock. 
> 
> I guess you are talking about this code:
> ```
> void Connection::addMessageReceiveQueue(MessageReceiveQueue& receiveQueue,
> ReceiverName receiverName, uint64_t destinationID)
> {
>     auto locker = holdLock(m_incomingMessagesMutex);
>     m_incomingMessages = filterWithMessageReceiveQueue(*this, receiveQueue,
> receiverName, destinationID, WTFMove(m_incomingMessages));
>     m_receiveQueues.add(receiveQueue, receiverName, destinationID);
> }
> ```
> 
> In particular the `filterWithMessageReceiveQueue()` part. I did not realize
> we already did that and it is great.
> 
> However, the RemoteRenderingBackend::CreateImageBuffer IPC gets sent with
> the IPC::SendOption::DispatchMessageEvenWhenWaitingForSyncReply flag. As a
> result, the message does not get appended to m_incomingMessages. Instead, it
> gets appended to m_messagesToDispatchWhileWaitingForSyncReply :)
> 
> As I mentioned earlier in one of my comments, we have to take care of both
> regular async messages (in m_incomingMessages) and sync messages / async
> messages with DispatchMessageEvenWhenWaitingForSyncReply flag (in
> m_messagesToDispatchWhileWaitingForSyncReply).
> Seems like our code currently only deals with regular async messages?

Looks like I figured out a proper way to fix the ordering issue. I will address it via https://bugs.webkit.org/show_bug.cgi?id=224623 and make the CreateImageBuffer IPC async again.
Comment 26 Alexey Proskuryakov 2021-04-20 09:44:25 PDT
*** Bug 224612 has been marked as a duplicate of this bug. ***