Bug 224566 - REGRESSION(r275921-r275922): [ wk2 ] http/tests/security/webgl-remote-read-remote-image-allowed-with-credentials.html is flakey crashing
Summary: REGRESSION(r275921-r275922): [ wk2 ] http/tests/security/webgl-remote-read-re...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Chris Dumez
URL:
Keywords: InRadar
Duplicates: 224612
Depends on:
Blocks: 224556
 
Reported: 2021-04-14 11:39 PDT by Robert Jenner
Modified: 2021-04-20 09:44 PDT
CC List: 10 users

See Also:


Attachments
Patch (10.59 KB, patch)
2021-04-14 23:08 PDT, Chris Dumez
ews-feeder: commit-queue-

Description Robert Jenner 2021-04-14 11:39:17 PDT
http/tests/security/webgl-remote-read-remote-image-allowed-with-credentials.html

is flakey crashing in wk2 on iOS 14 Simulator, Catalina and BigSur. 

HISTORY:
https://results.webkit.org/?suite=layout-tests&test=http%2Ftests%2Fsecurity%2Fwebgl-remote-read-remote-image-allowed-with-credentials.html


No crash log was generated for this test. However, I was able to find STDERR output.

STDERR TEXT:
stderr:
2021-04-14 07:23:46.915 com.apple.WebKit.Networking.Development[91141:40671502] DiskCookieStorage changing policy from 0 to 2, cookie file: file:///Users/buildbot/Library/HTTPStorages/com.apple.WebKit.WebKitTestRunner.binarycookies
LEAK: 1 WebPageProxy
Comment 1 Radar WebKit Bug Importer 2021-04-14 11:40:48 PDT
<rdar://problem/76657007>
Comment 2 Robert Jenner 2021-04-14 11:49:04 PDT
Updated test expectations to Pass Crash here:

https://trac.webkit.org/changeset/275955/webkit

Updated before investigating because this test was causing issues for EWS and commit-queue.
Comment 3 Robert Jenner 2021-04-14 15:12:25 PDT
(In reply to Robert Jenner from comment #2)
> Updated test expectations to Pass Crash here:
> 
> https://trac.webkit.org/changeset/275955/webkit
> 
> Updated before investigating because this test was causing issues for EWS
> and commit-queue.

Forgot to add expectations to Mac wk2. Updated here:
https://trac.webkit.org/changeset/275974/webkit
Comment 4 Chris Dumez 2021-04-14 15:52:58 PDT
No crash log? Weird.
Comment 5 Chris Dumez 2021-04-14 16:29:36 PDT
I was able to reproduce the crash locally on macOS with the command that was provided. I see this:
16:24:02.300 6056 com.apple.WebKit.WebContent.Development crash, pid = 6113
16:24:02.318 6056 killed pid 6059
16:26:17.392 6056 worker/26 http/tests/security/webgl-remote-read-remote-image-allowed-with-credentials.html crashed, (stderr lines):
16:26:17.392 6056   LEAK: 1 WebPageProxy
16:26:17.394 5731 [2000/2000] http/tests/security/webgl-remote-read-remote-image-allowed-with-credentials.html failed unexpectedly (com.apple.WebKit.WebContent.Development crashed [pid=6113])

But again, no generated crash log.
Comment 6 Chris Dumez 2021-04-14 16:33:24 PDT
Looking at the system logs I found:
com.apple.WebKit	IPC	fault	16:32:20.017530-0700	9721	com.apple.WebKit.GPU.Development	Exiting: Received XPC event type: XPC_ERROR_CONNECTION_INVALID
com.apple.WebKit	IPC	fault	16:32:20.447207-0700	9728	com.apple.WebKit.GPU.Development	Exiting: Received XPC event type: XPC_ERROR_CONNECTION_INVALID
com.apple.WebKit	IPC	fault	16:32:20.592596-0700	9729	com.apple.WebKit.GPU.Development	Exiting: Received XPC event type: XPC_ERROR_CONNECTION_INVALID
com.apple.WebKit	IPC	fault	16:32:20.593085-0700	9658	com.apple.WebKit.GPU.Development	Exiting: Received XPC event type: XPC_ERROR_CONNECTION_INVALID
com.apple.WebKit	IPC	fault	16:32:20.593305-0700	9790	com.apple.WebKit.WebContent.Development	Exiting: Received XPC event type: XPC_ERROR_CONNECTION_INVALID
Comment 7 Chris Dumez 2021-04-14 16:39:46 PDT
It is super odd that r275921 would have regressed this on macOS given that it was an iOS patch (VISIBILITY_PROPAGATION_VIEW is only 1 on iOS).
Comment 8 Chris Dumez 2021-04-14 16:50:51 PDT
On the flakiness dashboard, we can see a crash as early as r273654. That said, it is true that the crash became super frequent recently (r275922).

r275922 is not platform-specific and is actually more likely to have caused this.
Comment 9 Chris Dumez 2021-04-14 16:55:53 PDT
As I suspected, the regression point is wrong. I can still reproduce the crash after reverting r275921 locally. I will try reverting r275922 next.
Comment 10 Chris Dumez 2021-04-14 17:05:15 PDT
(In reply to Chris Dumez from comment #9)
> As I suspected, the regression point is wrong. I can still reproduce the
> crash after reverting r275921 locally. I will try reverting r275922 next.

I can still reproduce after reverting r275922. I am thus not sure what caused this.
Comment 11 Chris Dumez 2021-04-14 17:16:13 PDT
Could be just flakiness but it looks like reverting both r275921 and r275922 makes the flaky crash go away. Both patches may impact when we decide to launch the GPUProcess, so the issue could be timing-related. I'll investigate more tomorrow.
Comment 12 Chris Dumez 2021-04-14 17:23:42 PDT
(In reply to Chris Dumez from comment #11)
> Could be just flakiness but it looks like reverting both r275921 and r275922
> makes the flaky crash go away. Both patches may impact when we decide to
> launch to GPUProcess so the issue could be timing-related. I'll investigate
> more tomorrow.

I think this has to do with when we decide to initiate the connection to the GPUProcess.

r275921 is mostly an iOS patch but it does one thing on macOS:
- It delays the construction of the RemoteRenderingBackendProxy until we actually need it. It used to get constructed as soon as we constructed the WebPage; now it only gets constructed when the GPUProcess launches or when we start doing canvas rendering.

r275922 is related too. Before r275922, constructing the RemoteRenderingBackendProxy would initiate the GPUProcess connection right away (in the constructor). After r275922, eagerly constructing a RemoteRenderingBackendProxy does not actually initiate the GPUProcess connection. We only launch the GPUProcess when we actually do canvas rendering via the RemoteRenderingBackendProxy.

I think this is why reverting r275921 without reverting r275922 does not fix the crash.

That said, now I have to figure out why launching the GPUProcess when we start canvas drawing instead of on WebPage creation actually causes this flaky crash. Even weirder, it is not a real crash but an XPC error (XPC_ERROR_CONNECTION_INVALID).
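For illustration only, here is a minimal standalone sketch of the lazy connection setup described above. The class and function names are hypothetical stand-ins, not actual WebKit code:
```
#include <iostream>
#include <memory>

// Hypothetical stand-in types -- a heavily simplified model, not WebKit code.
struct GPUProcessConnection {
    GPUProcessConnection() { std::cout << "GPUProcess connection initiated\n"; }
};

class RemoteRenderingBackendProxyModel {
public:
    // Post-r275922 behavior (as described above): the connection is only initiated
    // right before the first use, not when the proxy is constructed.
    void drawToCanvas()
    {
        ensureGPUProcessConnection();
        std::cout << "canvas rendering IPC would be sent here\n";
    }

private:
    void ensureGPUProcessConnection()
    {
        if (!m_connection)
            m_connection = std::make_unique<GPUProcessConnection>();
    }

    std::unique_ptr<GPUProcessConnection> m_connection;
};

int main()
{
    RemoteRenderingBackendProxyModel proxy; // no GPUProcess connection yet
    proxy.drawToCanvas();                   // the connection is initiated lazily here
}
```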
Comment 13 Robert Jenner 2021-04-14 18:58:43 PDT
I found two more tests that fail at the r275921 and r275922 revisions, but not r275920:

imported/w3c/web-platform-tests/webmessaging/with-ports/011.html
webgl/1.0.3/conformance/glsl/matrices/glsl-mat3-construction.html

As such, I have added these tests to the previous test expectations I set:
https://trac.webkit.org/changeset/275987/webkit
Comment 14 Chris Dumez 2021-04-14 22:05:40 PDT
I am not sure if this is the root cause but there is definitely an IPC ordering issue.

I see the WebProcess sending to the GPUProcess:
1. RemoteRenderingBackend::CreateImageBuffer
2. RemoteRenderingBackend::GetDataURLForImageBuffer

And the GPUProcess receiving them in the opposite order. As a result, the second IPC handling is unable to find the image buffer.

The first message is async and the second one is sync, so this is likely where the ordering issue comes from. Oddly though, we are using the IPC::SendOption::DispatchMessageEvenWhenWaitingForSyncReply flag when sending the CreateImageBuffer IPC, which *should* maintain ordering with sync messages.

That said, the receiver seems to be a threaded message receiver, which is not super common so it is possible we have a bug with this configuration.
Comment 15 Chris Dumez 2021-04-14 22:33:07 PDT
Adding Kimmo in cc as this seems to be using his new MessageReceiveQueue code (since RemoteRenderingBackend is a WorkQueueMessageReceiver).

With some local debugging, I can see that there seems to be an issue here in void Connection::processIncomingMessage(std::unique_ptr<Decoder> message):
```
if (auto* receiveQueue = m_receiveQueues.get(*message)) {
    receiveQueue->enqueueMessage(*this, WTFMove(message));
    return;
}
```

I would expect all RemoteRenderingBackend IPC messages to use this code path, since RemoteRenderingBackend is a WorkQueueMessageReceiver. However, I can see locally that the first couple of RemoteRenderingBackend IPC messages get a receiveQueue that is null. Starting with the 3rd RemoteRenderingBackend message, receiveQueue becomes non-null. Seems like we may have a race somewhere if my logging is correct.
Comment 16 Chris Dumez 2021-04-14 22:43:36 PDT
My patches did not actually change the timing of the GPUProcess launch in the context of the tests. There is something during the window.internals initialization that ends up launching the GPUProcess.

What did change with my patches is the timing of when the GPUConnectionToWebProcess::CreateRenderingBackend IPC gets sent. It used to get sent as soon as the RemoteRenderingBackendProxy got created and now it only gets sent right before the first IPC from the RemoteRenderingBackendProxy to the GPUProcess.

I think we've had an IPC synchronization issue on the GPUProcess side that became obvious after my patches because the IPCs are now so close together.

In particular, we send a GPUConnectionToWebProcess::CreateRenderingBackend async IPC that creates the RemoteRenderingBackend on the GPUProcess side, on the main thread. When constructed, the RemoteRenderingBackend registers itself as a WorkQueueMessageReceiver.

Right after that, the WebProcess sends IPC messages to the RemoteRenderingBackend. When we get those IPC messages on the IPC thread, the RemoteRenderingBackend may not have registered itself as a WorkQueueMessageReceiver yet on the main thread. As a result, RemoteRenderingBackend IPC messages may get processed out of order because some of them use the MessageReceiveQueue and some early IPC messages don't.
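To illustrate the race described above, here is a heavily simplified standalone model (hypothetical types, not the real IPC::Connection): an "IPC thread" can hand messages to processIncomingMessage() before the main thread has attached the work queue, so early messages take the default path while later ones go to the work queue.
```
#include <cstdio>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Simplified stand-in for IPC::Connection; only models which queue a message lands on.
struct ConnectionModel {
    std::mutex lock;
    std::map<std::string, std::vector<std::string>> receiveQueues; // receiver name -> work queue
    std::vector<std::string> incomingMessages;                     // default main-thread queue

    void processIncomingMessage(const std::string& receiverName, const std::string& message)
    {
        std::lock_guard<std::mutex> locker(lock);
        auto it = receiveQueues.find(receiverName);
        if (it != receiveQueues.end()) {
            it->second.push_back(message); // would be dispatched on the work queue
            return;
        }
        incomingMessages.push_back(message); // falls back to the default queue
    }

    void addMessageReceiveQueue(const std::string& receiverName)
    {
        std::lock_guard<std::mutex> locker(lock);
        receiveQueues.emplace(receiverName, std::vector<std::string>());
    }
};

int main()
{
    ConnectionModel connection;

    std::thread ipcThread([&] {
        // Messages from the WebProcess may arrive before the main thread below has
        // registered the RemoteRenderingBackend as a work-queue message receiver.
        connection.processIncomingMessage("RemoteRenderingBackend", "CreateImageBuffer");
        connection.processIncomingMessage("RemoteRenderingBackend", "GetDataURLForImageBuffer");
    });

    // Main thread: handling CreateRenderingBackend registers the receiver.
    connection.addMessageReceiveQueue("RemoteRenderingBackend");
    ipcThread.join();

    // Depending on timing, the two messages can end up split across the two queues,
    // which is how they can later be dispatched out of order.
    std::printf("default queue: %zu, work queue: %zu\n",
        connection.incomingMessages.size(),
        connection.receiveQueues["RemoteRenderingBackend"].size());
}
```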
Comment 17 Chris Dumez 2021-04-14 23:08:25 PDT
Created attachment 426076 [details]
Patch
Comment 18 Kimmo Kinnunen 2021-04-14 23:16:33 PDT
Yes, good debugging. That's the reason for the reordering.

There were multiple of these reordering bugs.
The series of changes related to MessageReceiveQueue fixed most of them, but this one remains, as the FIXME mentions. So basically the Connection implementation is still such that messages are delivered in random order.

I have a wip patch locally that fixes this reordering. However, it's not entirely simple to solve, since the Connection behavior when dispatching messages while waiting, with respect to all global Connection instances, is a bit underspecified. E.g. it's a complex patch if we want to preserve the behavior that *ALL* Connections get their messages dispatched on the main thread while one connection waits on a particular message.
Comment 19 Kimmo Kinnunen 2021-04-14 23:22:18 PDT
The FIXME in:
// FIXME: The matches here come from
// m_messagesToDispatchWhileWaitingForSyncReply.
Comment 20 Chris Dumez 2021-04-14 23:42:31 PDT
(In reply to Kimmo Kinnunen from comment #19)
> The FIXME in:
> // FIXME: The matches here come from
> // m_messagesToDispatchWhileWaitingForSyncReply.

Just focusing on this particular WorkQueueMessageReceiver issue, I think one possible fix at the IPC::Connection layer might be in Connection::processIncomingMessage(std::unique_ptr<Decoder> message):
```
if (auto* receiveQueue = m_receiveQueues.get(*message)) {
    receiveQueue->enqueueMessage(*this, WTFMove(message));
    return;
}
```

In the if scope, we should check if there are any pending messages in m_incomingMessages for this receiver name, if there are, then we could remove them from m_incomingMessages and then enqueue them on the receiveQueue *before* the current message. I think this would preserve ordering for async messages at least. We'd likely need a similar fix for sync messages (maybe checking m_messagesToDispatchWhileWaitingForSyncReply instead of m_incomingMessages).

Let me know what you think. Again though, I'd want to land the current patch (a simple fix that makes the IPC sync) first to address the crashes quickly. We can follow up with a better / more generic solution.
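As a rough standalone sketch of the idea above (stand-in types rather than the real Decoder / MessageReceiveQueue classes; it only illustrates the "drain earlier matching messages first" ordering, not a complete fix):
```
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in types for illustration only.
struct MessageModel {
    std::string receiverName;
    std::string name;
};

struct ReceiveQueueModel {
    std::vector<MessageModel> queued;
    void enqueueMessage(MessageModel&& message) { queued.push_back(std::move(message)); }
};

void enqueuePreservingOrder(ReceiveQueueModel& receiveQueue, std::deque<MessageModel>& incomingMessages,
    MessageModel&& currentMessage)
{
    // First move over any messages for the same receiver that are still sitting in the
    // default incoming-message queue, so they reach the work queue in their original order...
    for (auto it = incomingMessages.begin(); it != incomingMessages.end();) {
        if (it->receiverName == currentMessage.receiverName) {
            receiveQueue.enqueueMessage(std::move(*it));
            it = incomingMessages.erase(it);
        } else
            ++it;
    }
    // ...then enqueue the message that just arrived.
    receiveQueue.enqueueMessage(std::move(currentMessage));
}

int main()
{
    ReceiveQueueModel workQueue;
    std::deque<MessageModel> incomingMessages { { "RemoteRenderingBackend", "CreateImageBuffer" } };
    enqueuePreservingOrder(workQueue, incomingMessages, { "RemoteRenderingBackend", "GetDataURLForImageBuffer" });
    // workQueue.queued now holds CreateImageBuffer followed by GetDataURLForImageBuffer.
}
```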
Comment 21 EWS 2021-04-15 00:37:13 PDT
Committed r276007 (236559@main): <https://commits.webkit.org/236559@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 426076 [details].
Comment 22 Kimmo Kinnunen 2021-04-15 01:25:35 PDT
(In reply to Chris Dumez from comment #20)
> In the if scope, we should check if there are any pending messages in
> m_incomingMessages for this receiver name, if there are, then we could
> remove them from m_incomingMessages and then enqueue them on the
> receiveQueue *before* the current message.

I interpret your suggestion to mean the normal operation, e.g. the server is not waiting on a specific message. 

So the normal operation is already covered, or should be covered. If you're seeing the behavior in normal delivery, then I think something is still missing there and could be addressed simply. However, I tried to make this covered, so I don't know if this is the case.

In normal operation, if messages are scheduled to m_incomingMessages *before* the work queue is attached, this is business-as-usual. The m_incomingMessages are delivered when the work queue is attached, right at that moment, holding a lock. 

So if there are pending messages in m_incomingMessages, they should be processed at work queue install time and no additional changes should be needed in Connection::processIncomingMessage. (as per the intention, there could of course be bugs)



The reordering happens when the server starts waiting for messages.
1) Normal operation queues messages to m_incomingMessages (regardless of options)
2) Wait starts
3a) Messages during wait are queued to m_incomingMessages, if they don't have options
3b) Messages during wait are queued to m_messagesToDispatchWhileWaitingForSyncReply if they have the specific option or if they're sync messages


So it is impossible to order messages correctly if they pseudo-randomly get queued to two different queues. So even messages sent with DispatchMessageEvenWhenWaitingForSyncReply can end up reordered relative to the other messages.
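A tiny standalone model of the three cases above (simplified stand-in types, not the real Connection code), showing how two messages can end up on different queues once a wait starts:
```
#include <cstdio>
#include <deque>
#include <string>

enum class SendOption { None, DispatchMessageEvenWhenWaitingForSyncReply };

// Simplified model: only tracks which holding queue a message ends up on.
struct QueueModel {
    bool waitingForSyncReply = false;
    std::deque<std::string> incomingMessages;
    std::deque<std::string> messagesToDispatchWhileWaitingForSyncReply;

    void enqueue(const std::string& message, SendOption option, bool isSyncMessage)
    {
        // Cases 1 and 3a: by default, messages go to the regular incoming queue.
        // Case 3b: while a wait is in progress, sync messages and messages carrying
        // DispatchMessageEvenWhenWaitingForSyncReply go to the other queue.
        if (waitingForSyncReply
            && (isSyncMessage || option == SendOption::DispatchMessageEvenWhenWaitingForSyncReply)) {
            messagesToDispatchWhileWaitingForSyncReply.push_back(message);
            return;
        }
        incomingMessages.push_back(message);
    }
};

int main()
{
    QueueModel model;
    model.enqueue("CreateImageBuffer", SendOption::DispatchMessageEvenWhenWaitingForSyncReply, false);
    model.waitingForSyncReply = true; // a wait starts
    model.enqueue("GetDataURLForImageBuffer", SendOption::None, true);
    // The two messages now sit on different queues, so their relative order is lost.
    std::printf("incoming: %zu, dispatch-while-waiting: %zu\n",
        model.incomingMessages.size(), model.messagesToDispatchWhileWaitingForSyncReply.size());
}
```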
Comment 23 Kimmo Kinnunen 2021-04-15 01:52:00 PDT
(In reply to Chris Dumez from comment #14)
> I am not sure if this is the root cause but there is definitely an IPC
> ordering issue.
> 
> I see the WebProcess sending to the GPUProcess:
> 1. RemoteRenderingBackend::CreateImageBuffer
> 2. RemoteRenderingBackend::GetDataURLForImageBuffer
> 
> And the GPUProcess receiving them in the opposite order

This interpretation is a bit suspicious and perhaps warrants a double check.

The normal logical operation of creating an image buffer on the Web Content process side is currently:

1) send(CreateImageBuffer())
2) waitForDidCreateImageBufferBackend()
3) do other operations

If we apply this logic to GetDataURL, without looking at the implementation it should be:

1) send(CreateImageBuffer())
2) waitForDidCreateImageBufferBackend()
3) send(GetDataURLForImageBuffer)


If this is the case, it should simply not be possible to observe any reordering, as the Web Content process will not send the GetDataURL message before the GPU Process has received and replied to the CreateImageBuffer.


However, there's no such requirement for CreateRemoteRenderingBackend + CreateImageBuffer, so I could imagine those being reordered.
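For clarity, the ordering guarantee described here can be sketched as a trivial standalone program; the functions are hypothetical stubs, not the real WebKit API:
```
#include <cstdio>

// Hypothetical stubs standing in for the real IPC calls.
static void sendCreateImageBuffer() { std::puts("-> CreateImageBuffer"); }
static void waitForDidCreateImageBufferBackend() { std::puts("<- DidCreateImageBufferBackend (blocking wait)"); }
static void sendGetDataURLForImageBuffer() { std::puts("-> GetDataURLForImageBuffer"); }

int main()
{
    // Because the sender blocks until the backend confirms the buffer was created,
    // the GPU process cannot observe GetDataURLForImageBuffer before CreateImageBuffer.
    sendCreateImageBuffer();
    waitForDidCreateImageBufferBackend();
    sendGetDataURLForImageBuffer();
}
```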
Comment 24 Chris Dumez 2021-04-15 07:43:35 PDT
(In reply to Kimmo Kinnunen from comment #22)
> (In reply to Chris Dumez from comment #20)
> > In the if scope, we should check if there are any pending messages in
> > m_incomingMessages for this receiver name, if there are, then we could
> > remove them from m_incomingMessages and then enqueue them on the
> > receiveQueue *before* the current message.
> 
> I interpret your suggestion to mean the normal operation, e.g. the server is
> not waiting on a specific message. 
> 
> So the normal operation is already covered, or should be covered. If you're
> seeing the behavior in normal delivery, then I think something is still
> missing there and could be addressed simply. However, I tried to make this
> covered, so I don't know if this is the case.
> 
> In normal operation, if messages are scheduled to m_incomingMessages
> *before* the work queue is attached, this is business-as-usual. The
> m_incomingMessages are delivered when the work queue is attached, right at
> that moment, holding a lock. 

I guess you are talking about this code:
```
void Connection::addMessageReceiveQueue(MessageReceiveQueue& receiveQueue, ReceiverName receiverName, uint64_t destinationID)
{
    auto locker = holdLock(m_incomingMessagesMutex);
    m_incomingMessages = filterWithMessageReceiveQueue(*this, receiveQueue, receiverName, destinationID, WTFMove(m_incomingMessages));
    m_receiveQueues.add(receiveQueue, receiverName, destinationID);
}
```

In particular the `filterWithMessageReceiveQueue()` part. I did not realize we already did that and it is great.

However, the RemoteRenderingBackend::CreateImageBuffer IPC gets sent with the IPC::SendOption::DispatchMessageEvenWhenWaitingForSyncReply flag. As a result, the message does not get appended to m_incomingMessages. Instead, it gets appended to m_messagesToDispatchWhileWaitingForSyncReply :)

As I mentioned earlier in one of my comments, we have to take care of both regular async messages (in m_incomingMessages) and sync messages / async messages with DispatchMessageEvenWhenWaitingForSyncReply flag (in m_messagesToDispatchWhileWaitingForSyncReply).
Seems like our code currently only deals with regular async messages?
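One possible shape for the symmetric handling suggested here, as a simplified standalone sketch (stand-in types, not the real Connection code). Note that, as discussed in comment 22, draining both holding queues at attach time still cannot by itself recover the original ordering across the two queues:
```
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in types for illustration only.
struct PendingMessage {
    std::string receiverName;
    std::string name;
};

struct ConnectionSketch {
    std::mutex incomingMessagesMutex;
    std::deque<PendingMessage> incomingMessages;
    std::deque<PendingMessage> messagesToDispatchWhileWaitingForSyncReply;

    // When a work queue is attached, drain matching messages from *both* holding queues,
    // so that messages sent with DispatchMessageEvenWhenWaitingForSyncReply are not left behind.
    void addMessageReceiveQueue(std::vector<PendingMessage>& receiveQueue, const std::string& receiverName)
    {
        std::lock_guard<std::mutex> locker(incomingMessagesMutex);
        drainMatching(incomingMessages, receiveQueue, receiverName);
        drainMatching(messagesToDispatchWhileWaitingForSyncReply, receiveQueue, receiverName);
    }

    static void drainMatching(std::deque<PendingMessage>& from, std::vector<PendingMessage>& to,
        const std::string& receiverName)
    {
        for (auto it = from.begin(); it != from.end();) {
            if (it->receiverName == receiverName) {
                to.push_back(std::move(*it));
                it = from.erase(it);
            } else
                ++it;
        }
    }
};

int main()
{
    ConnectionSketch connection;
    connection.incomingMessages.push_back({ "RemoteRenderingBackend", "CreateImageBuffer" });
    connection.messagesToDispatchWhileWaitingForSyncReply.push_back({ "RemoteRenderingBackend", "GetDataURLForImageBuffer" });

    std::vector<PendingMessage> workQueue;
    connection.addMessageReceiveQueue(workQueue, "RemoteRenderingBackend");
    // Both pending messages now reach the work queue, though ordering across the two
    // original queues still has to be reconstructed some other way.
}
```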
Comment 25 Chris Dumez 2021-04-15 14:00:22 PDT
(In reply to Chris Dumez from comment #24)
> (In reply to Kimmo Kinnunen from comment #22)
> > (In reply to Chris Dumez from comment #20)
> > > In the if scope, we should check if there are any pending messages in
> > > m_incomingMessages for this receiver name, if there are, then we could
> > > remove them from m_incomingMessages and then enqueue them on the
> > > receiveQueue *before* the current message.
> > 
> > I interpret your suggestion to mean the normal operation, e.g. the server is
> > not waiting on a specific message. 
> > 
> > So the normal operation is already covered, or should be covered. If you're
> > seeing the behavior in normal delivery, then I think something is still
> > missing there and could be addressed simply. However, I tried to make this
> > covered, so I don't know if this is the case.
> > 
> > In normal operation, if messages are scheduled to m_incomingMessages
> > *before* the work queue is attached, this is business-as-usual. The
> > m_incomingMessages are delivered when the work queue is attached, right at
> > that moment, holding a lock. 
> 
> I guess you are talking about this code:
> ```
> void Connection::addMessageReceiveQueue(MessageReceiveQueue& receiveQueue,
> ReceiverName receiverName, uint64_t destinationID)
> {
>     auto locker = holdLock(m_incomingMessagesMutex);
>     m_incomingMessages = filterWithMessageReceiveQueue(*this, receiveQueue,
> receiverName, destinationID, WTFMove(m_incomingMessages));
>     m_receiveQueues.add(receiveQueue, receiverName, destinationID);
> }
> ```
> 
> In particular the `filterWithMessageReceiveQueue()` part. I did not realize
> we already did that and it is great.
> 
> However, the RemoteRenderingBackend::CreateImageBuffer IPC gets sent with
> the IPC::SendOption::DispatchMessageEvenWhenWaitingForSyncReply flag. As a
> result, the message does not get appended to m_incomingMessages. Instead, it
> gets appended to m_messagesToDispatchWhileWaitingForSyncReply :)
> 
> As I mentioned earlier in one of my comments, we have to take care of both
> regular async messages (in m_incomingMessages) and sync messages / async
> messages with DispatchMessageEvenWhenWaitingForSyncReply flag (in
> m_messagesToDispatchWhileWaitingForSyncReply).
> Seems like our code currently only deals with regular async messages?

Looks like I figured out a proper way to fix the ordering issue. I will address it via https://bugs.webkit.org/show_bug.cgi?id=224623 and make the CreateImageBuffer IPC async again.
Comment 26 Alexey Proskuryakov 2021-04-20 09:44:25 PDT
*** Bug 224612 has been marked as a duplicate of this bug. ***