Bug 232580

Summary: [GPU Process] Rendering backend may stop processing IPC stream messages after a sync message timeout
Product: WebKit Reporter: Wenson Hsieh <wenson_hsieh>
Component: WebKit2Assignee: Wenson Hsieh <wenson_hsieh>
Status: RESOLVED FIXED    
Severity: Normal CC: cdumez, kkinnunen, thorton, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
For EWS
none
For EWS none

Description Wenson Hsieh 2021-11-01 14:13:42 PDT
It's possible for RemoteRenderingBackend's IPC stream to stop processing altogether after the web process times out when sending a synchronous stream message to the GPU Process.

To reproduce, (1) modify a sync message to RRB such that it intentionally times out (here, I made it so that we only time out every other call).

diff --git a/Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp b/Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp
index b664d8e37640..c4980ea4df71 100644
--- a/Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp
+++ b/Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp
@@ -284,6 +284,9 @@ void RemoteRenderingBackend::getDataForImageBufferWithQualifiedIdentifier(const
     Vector<uint8_t> data;
     if (auto imageBuffer = m_remoteResourceCache.cachedImageBuffer(renderingResourceIdentifier))
         data = imageBuffer->toData(mimeType, quality);
+    static bool shouldSleep = false;
+    if (shouldSleep) sleep(1.5_s);
+    shouldSleep = !shouldSleep;
     completionHandler(WTFMove(data));
 }

(2) Then go to https://whsieh.github.io/examples/canvas-to-blob and click the "Change color" button twice.

The GPU Process should hang the first time, but then recover and (correctly) change color the second time. Instead, it currently hangs after the first click, and then stops processing incoming stream messages entirely.
Comment 1 Wenson Hsieh 2021-11-01 15:07:18 PDT
(In debug builds, we actually hit this assertion:)

```
if (!currentReceiver) {
    // Valid scenario is when receiver has been removed, but there are messages for it in the buffer.
    // FIXME: Since we do not have a receiver, we don't know how to decode the message.
    // This means we must timeout every receiver in the stream connection.
    // Currently we assert that the receivers are empty, as we only have up to one receiver in
    // a stream connection until possibility of skipping is implemented properly.
    Locker locker { m_receiversLock };
    ASSERT(m_receivers.isEmpty());
    return DispatchResult::HasNoMessages;
}
```

…because the currentReceiverName is `IPC`, but the destination is equal that of the `RemoteRenderingBackend`, and the message name itself is `SyncMessageReply`.

What's happening is:

1. Client sends the sync message (RemoteRenderingBackend::GetDataForImageBuffer, in this case).
   Client then begins waiting for a response under `tryAcquireAll()`.

2. Server receives this message but does not reply in time.

3. About a second later, the client times out while waiting in `tryAcquireAll()`, failing the `sendSync()`.
   Soon afterwards, it sends the next stream message and blocks on `clientWaitSemaphore()` inside `tryAcquire`.

4. Server finally finishes processing `GetDataForImageBuffer` and goes to write SyncMessageReply at offset 0.
   Note that WP has already given up on waiting this reply in (3).
   Server then signals the client waiting semaphore.

5. Client writes the new stream message after the SyncMessageReply.

6. Server runs its processing loop from 0, decodes the SyncMessageReply that it just wrote.
   It then fails to find a suitable message receiver for this message and exits the processing loop early.
Comment 2 Kimmo Kinnunen 2021-11-02 01:22:33 PDT
Yeah.
The design is so that timeouts do not make sense.

If WP asks to draw a 4k x 4k image, and that never completes, there's nothing sensible you can do except restart the GPU process.

The rationale:
We define timeout as X seconds, e.g. if message is not replied within X seconds, we declare the GPU process has hung.

You send message.
The message times out.
What do you do now?

A) Maybe GPUP has not hung, we should be able to retry something like 3 times
  - If the GPUP has not hung, why did you define "GPUP is hung if timeout X seconds"?
  - I still want to retry 3 times
  - Why did you not define "GPUP is hung if timeout is 3*X seconds"? What does the 
    retry contribute here?
  - ...

B) We did define X sec timeout means GPU process has hung, so GPU process has hung. Restart it.


"Retry" or "carry on like nothing happened and hope for the best" is ok if the medium is such that it's expected that messages get lost, like a network. IPC isn't like that, nothing should be lost by design.
Comment 3 Wenson Hsieh 2021-11-02 08:28:20 PDT
(In reply to Kimmo Kinnunen from comment #2)
> Yeah.
> The design is so that timeouts do not make sense.
> 
> If WP asks to draw a 4k x 4k image, and that never completes, there's
> nothing sensible you can do except restart the GPU process.
> 
> The rationale:
> We define timeout as X seconds, e.g. if message is not replied within X
> seconds, we declare the GPU process has hung.
> 
> You send message.
> The message times out.
> What do you do now?
> 
> A) Maybe GPUP has not hung, we should be able to retry something like 3 times
>   - If the GPUP has not hung, why did you define "GPUP is hung if timeout X
> seconds"?
>   - I still want to retry 3 times
>   - Why did you not define "GPUP is hung if timeout is 3*X seconds"? What
> does the 
>     retry contribute here?
>   - ...

Agreed! Implementing some kind of retry policy is redundant with timeouts in the first place, since the timeout is arbitrary anyways.

> 
> B) We did define X sec timeout means GPU process has hung, so GPU process
> has hung. Restart it.

That said, it's critical that a web process does not have the capability to cause the GPU Process to "restart" arbitrarily, as the client is untrusted. We wouldn't want a compromised web process to be able to effectively kill and relaunch the GPU Process in a loop.

Now that I think about it more, I believe there's a third option (C) here, which is to simply remove the timeout and just use Seconds::infinity() for these sync messages. The client (web process) trusts the server (GPU process) to be able to handle incoming drawing commands, and knows that it's only a matter of time until the GPU Process either:
    [1] ..finishes executing the command and sends the reply, in which case the `sendSync()` succeeds or
    [2] ..jetsams or crashes in the process, in which case the `sendSync()` fails and we just carry on in the web process

> 
> 
> "Retry" or "carry on like nothing happened and hope for the best" is ok if
> the medium is such that it's expected that messages get lost, like a
> network. IPC isn't like that, nothing should be lost by design.
Comment 4 Wenson Hsieh 2021-11-02 08:36:07 PDT
(In reply to Wenson Hsieh from comment #3)
> (In reply to Kimmo Kinnunen from comment #2)
> > Yeah.
> > The design is so that timeouts do not make sense.
> > 
> > If WP asks to draw a 4k x 4k image, and that never completes, there's
> > nothing sensible you can do except restart the GPU process.
> > 
> > The rationale:
> > We define timeout as X seconds, e.g. if message is not replied within X
> > seconds, we declare the GPU process has hung.
> > 
> > You send message.
> > The message times out.
> > What do you do now?
> > 
> > A) Maybe GPUP has not hung, we should be able to retry something like 3 times
> >   - If the GPUP has not hung, why did you define "GPUP is hung if timeout X
> > seconds"?
> >   - I still want to retry 3 times
> >   - Why did you not define "GPUP is hung if timeout is 3*X seconds"? What
> > does the 
> >     retry contribute here?
> >   - ...
> 
> Agreed! Implementing some kind of retry policy is redundant with timeouts in
> the first place, since the timeout is arbitrary anyways.
> 
> > 
> > B) We did define X sec timeout means GPU process has hung, so GPU process
> > has hung. Restart it.
> 
> That said, it's critical that a web process does not have the capability to
> cause the GPU Process to "restart" arbitrarily, as the client is untrusted.
> We wouldn't want a compromised web process to be able to effectively kill
> and relaunch the GPU Process in a loop.
> 
> Now that I think about it more, I believe there's a third option (C) here,
> which is to simply remove the timeout and just use Seconds::infinity() for
> these sync messages. The client (web process) trusts the server (GPU
> process) to be able to handle incoming drawing commands, and knows that it's
> only a matter of time until the GPU Process either:
>     [1] ..finishes executing the command and sends the reply, in which case
> the `sendSync()` succeeds or
>     [2] ..jetsams or crashes in the process, in which case the `sendSync()`
> fails and we just carry on in the web process

I should also note — using an ∞ timeout is also how most of the regular `IPC::Connection::sendSync()` calls from the web process to the UI process work, and the trust model between the web and UI process is pretty much the same as the trust model between the web and GPU process.

The notion of a sync IPC timeout is really only useful in the other direction (UI -> web, or more generally "trusted" -> "untrusted"), since any sync IPC in this direction (1) will prevent the trusted process from doing anything else in the meantime, and (2) is not guaranteed to complete in a timely manner.

> 
> > 
> > 
> > "Retry" or "carry on like nothing happened and hope for the best" is ok if
> > the medium is such that it's expected that messages get lost, like a
> > network. IPC isn't like that, nothing should be lost by design.
Comment 5 Wenson Hsieh 2021-11-02 11:16:48 PDT
Created attachment 443113 [details]
For EWS
Comment 6 Wenson Hsieh 2021-11-03 11:38:05 PDT
I think we should consider adjusting our web -> GPUP IPC timeout policy in a followup — there might be some reasons to remove the timeout (which would match the behavior of  most sync messages from web -> UI process), but there might also be advantages to preserving the timeout since (in theory) it may be better for some synchronous canvas APIs like `toDataURL()` to fail gracefully instead of hanging the web process forever.

In any case, I think this adjustment is a fairly safe and straightforward way to mitigate this source of instability.
Comment 7 Kimmo Kinnunen 2021-11-03 12:23:53 PDT
Comment on attachment 443113 [details]
For EWS

View in context: https://bugs.webkit.org/attachment.cgi?id=443113&action=review

>  In summary, the sync IPC timeout causes us to enter a state where the web process continues to encode messages
..
>  To fix this, we simply keep the client offset in sync with the server offset in the case of a sync message

I'm a bit torn about this change.
It's not really consistent to say:
A) It is programming error to send messages after a timeout
B) I'm fixing an issue that happens when I send messages after a timeout

The connection state is not well-defined after timeout.
Primarily the timeout means the GPU process receiver is broken.
If it was not broken, it'd make sense to wait longer.

> Source/WebKit/Platform/IPC/StreamClientConnection.h:291
>      // If the transaction commits, server is guaranteed to signal.

The comment above says: 
// This would mean we try to send messages after a timeout. It is a programming error.
Comment 8 Kimmo Kinnunen 2021-11-03 12:26:41 PDT
> synchronous canvas APIs like `toDataURL()` to fail gracefully instead of hanging the web process forever.

But what's the graceful method to do that? I think UI should stop the web page that hangs and show "web page crashed" dialog.
Comment 9 Wenson Hsieh 2021-11-03 12:30:24 PDT
(In reply to Kimmo Kinnunen from comment #7)
> Comment on attachment 443113 [details]
> For EWS
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=443113&action=review
> 
> >  In summary, the sync IPC timeout causes us to enter a state where the web process continues to encode messages
> ..
> >  To fix this, we simply keep the client offset in sync with the server offset in the case of a sync message
> 
> I'm a bit torn about this change.
> It's not really consistent to say:
> A) It is programming error to send messages after a timeout
> B) I'm fixing an issue that happens when I send messages after a timeout

Perhaps this is true of timeouts when waiting in tryAcquire(), but it seems that timing out doesn't leave us in a bad state in tryAcquireAll() (i.e. in the case of sync IPC), with the exception of the client offset not being reset.

Can we just adjust this comment to be more precise?

> 
> The connection state is not well-defined after timeout.
> Primarily the timeout means the GPU process receiver is broken.
> If it was not broken, it'd make sense to wait longer.
> 
> > Source/WebKit/Platform/IPC/StreamClientConnection.h:291
> >      // If the transaction commits, server is guaranteed to signal.
> 
> The comment above says: 
> // This would mean we try to send messages after a timeout. It is a
> programming error.
Comment 10 Wenson Hsieh 2021-11-03 12:42:47 PDT
(In reply to Kimmo Kinnunen from comment #8)
> > synchronous canvas APIs like `toDataURL()` to fail gracefully instead of hanging the web process forever.
> 
> But what's the graceful method to do that? I think UI should stop the web
> page that hangs and show "web page crashed" dialog.

I think some APIs (like getImageData) will simply return null, in the case of timeouts.

After chatting with thorton, I now think that my original approach might be better (i.e. just remove the timeout in these circumstances), since we don't know if the results of this API are going to be (for instance) uploaded and stored on a server, or something.

That said, I don't think there's any tangible downside to the tweak in https://bugs.webkit.org/attachment.cgi?id=443113&action=review. Perhaps we could still consider that change, but in a separate bug?
Comment 11 Wenson Hsieh 2021-11-03 12:59:57 PDT
Created attachment 443228 [details]
For EWS
Comment 12 Tim Horton 2021-11-03 13:36:37 PDT
Comment on attachment 443228 [details]
For EWS

Kimmo's comments (mostly #c2) suggest he's OK with this approach too, if I'm reading them right. It certainly makes sense to me.
Comment 13 Wenson Hsieh 2021-11-03 14:51:27 PDT
Comment on attachment 443228 [details]
For EWS

Thanks for the review!
Comment 14 EWS 2021-11-03 15:03:13 PDT
Committed r285233 (243854@main): <https://commits.webkit.org/243854@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 443228 [details].
Comment 15 Radar WebKit Bug Importer 2021-11-03 15:04:18 PDT
<rdar://problem/84993628>
Comment 16 Kimmo Kinnunen 2021-11-04 02:11:24 PDT
(In reply to Wenson Hsieh from comment #10)
> That said, I don't think there's any tangible downside to the tweak in
> https://bugs.webkit.org/attachment.cgi?id=443113&action=review. Perhaps we
> could still consider that change, but in a separate bug?

When sync message is put to the buffer, it is as if sender says to the receiver:
"You now own the whole buffer. My message starts at the offset you expect it to start."

When sync message is confirmed to be completed, it is as if receiver says to the sender:
"I just processed your message. You now own the whole buffer. My reply starts at offset 0."

So if you don't get the completion confirmation, the receiver still owns the buffer. Sender cannot just start overwriting data and expect correct operation.

If we were able to do this, what would be the point in waiting for sync completion in the first place?

Also, correcting the senders local offset variable is not the full solution. The shared state is still left in undefined state. The local offset variable can be only modified based on the state of the shared variables, interpreted with the communication protocol rules.

Also, correcting any state is pointless, as timeout means the receiver has gone broken.
Comment 17 Kimmo Kinnunen 2021-11-04 02:21:40 PDT
> So if you don't get the completion confirmation, the receiver still owns the buffer. Sender cannot just start overwriting data and expect correct operation.

Concrete example of receiver speed, workload and timeout duration mismatch:
1) Sender sends sync message asking for image data.
2) This times out.
3) Sender reset the local client offset to 0.
4a) Sender then send normal drawImage by writing 50 bytes that to index 0.
4b) At the same time, the receiver replies to the message in 1) by writing the 100kb image data to index 0.

It doesn't work.


> Also, correcting any state is pointless, as timeout means the receiver has gone broken.

Concrete example of a receiver bug:
1) Sender sends sync message asking for image data.
2) This times out.
3) Sender reset the local client offset to 0.
4a) Sender then send normal drawImage by writing 50 bytes that to index 0.
5) Receiver does not receive the drawImage, because it has hung somewhere earlier, as indicated first by 2).

So calls either complete or the receiver is invalid.
Comment 18 Wenson Hsieh 2021-11-04 09:24:02 PDT
(In reply to Kimmo Kinnunen from comment #17)
> > So if you don't get the completion confirmation, the receiver still owns the buffer. Sender cannot just start overwriting data and expect correct operation.
> 
> Concrete example of receiver speed, workload and timeout duration mismatch:
> 1) Sender sends sync message asking for image data.
> 2) This times out.
> 3) Sender reset the local client offset to 0.
> 4a) Sender then send normal drawImage by writing 50 bytes that to index 0.
> 4b) At the same time, the receiver replies to the message in 1) by writing
> the 100kb image data to index 0.
> 
> It doesn't work.

In practice (at least, when I tested locally yesterday), this is not what happens because in step (4a) the sender will wait for the receiver to `releaseAll()` under `tryAcquire()`. So the receiver will first write the 100KB data to 0, and *then* the sender will write its data on top of that at offset 0.

> 
> 
> > Also, correcting any state is pointless, as timeout means the receiver has gone broken.
> 
> Concrete example of a receiver bug:
> 1) Sender sends sync message asking for image data.
> 2) This times out.
> 3) Sender reset the local client offset to 0.
> 4a) Sender then send normal drawImage by writing 50 bytes that to index 0.
> 5) Receiver does not receive the drawImage, because it has hung somewhere
> earlier, as indicated first by 2).
> 
> So calls either complete or the receiver is invalid.

In the case where the receiver is "invalid" (presumably, hanging?), I don't think anything we can do on the sender really matters. It should probably be up to the UI process to detect this and restart the GPU Process, in this circumstance.