Bug 229523

Summary: [Catalina GPU] fast/canvas/webgl/lose-context-on-timeout.html is a flaky timeout
Product: WebKit Reporter: Eric Hutchison <ehutchison>
Component: CanvasAssignee: Dan Glastonbury <djg>
Status: ASSIGNED ---    
Severity: Normal CC: dino, ehutchison, kkinnunen, sabouhallawa, webkit-bot-watchers-bugzilla, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   

Description Eric Hutchison 2021-08-25 15:18:11 PDT
fast/canvas/webgl/lose-context-on-timeout.html

is a flaky timeout on Catalina GPU

History: https://results.webkit.org/?platform=ios&platform=mac&suite=layout-tests&test=fast%2Fcanvas%2Fwebgl%2Flose-context-on-timeout.html

Results: https://build.webkit.org/results/Apple-Catalina-Debug-WK2-GPUProcess-Tests/r281492%20(2091)/results.html

Diff: 
--- /Volumes/Data/worker/catalina-debug-tests-wk2-gpuprocess/build/layout-test-results/fast/canvas/webgl/lose-context-on-timeout-expected.txt
+++ /Volumes/Data/worker/catalina-debug-tests-wk2-gpuprocess/build/layout-test-results/fast/canvas/webgl/lose-context-on-timeout-actual.txt
@@ -1,13 +1,8 @@
-CONSOLE MESSAGE: WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost
+FAIL: Timed out waiting for notifyDone to be called
+
 Checks that a GPU process timeout on a result producing WebGL call will lose the context.
 NOTE: This only passes in the test harness because it requires Internals.
 
 PASS gl.isContextLost() is false
 PASS gl.getError() is gl.NO_ERROR
-PASS gl.isContextLost() is true
-PASS gl.getError() is gl.CONTEXT_LOST_WEBGL
-PASS val is [0, 0, 0, 0]
-PASS successfullyParsed is true
 
-TEST COMPLETE
-
Comment 1 Radar WebKit Bug Importer 2021-08-25 15:19:02 PDT
<rdar://problem/82360416>
Comment 2 Eric Hutchison 2021-08-25 16:58:46 PDT
Updated test expectations at https://trac.webkit.org/changeset/281601/webkit.
Comment 3 Eric Hutchison 2021-09-14 16:17:20 PDT
Reproduced test results locally on r282390 using run-webkit-tests --debug -f --force --iterations 500 --exit-after-n-crashes-or-timeouts 2 fast/canvas/webgl/lose-context-on-timeout.html
Comment 4 Kimmo Kinnunen 2022-05-10 02:40:41 PDT
ebGL GPUP simulated timeout does not apply directly after invocation

Timeout is implemented as 

1) Remove the graphics context from the list of graphics contexts in the WP-GPUP session
     RemoteGraphicsContextGL::simulateEventForTesting

2) As a detail, ignore all out-of-stream messages that do not have a corresponding context in the list
     GPUConnectionToWebProcess::dispatchSyncMessage,  GPUConnectionToWebProcess::dispatchMessage


The test seems to work sometimes because when the simulated context is removed, the GPUP may become idle and is shut down. Further messages to shut down GPUP are correctly detected as failing and context is lost.




Bug 1:
The 1) happens in main thread, where as the stream message dispatch happens in the stream work queue. Removing the stream connection from the stream work queue happens at arbitrary moment in stream work queue perspective. This means that it can process messages to the timed out context after main thread has done the removal.

Bug 2:
Out-of-stream message sync replies of the removed contexts might be sent, and they might be interpreted as correct messages. The correct handling in GPUConnectionToWebProcess::dispatchSyncMessage,  GPUConnectionToWebProcess::dispatchMessage should be to mark the decoders as invalid (decoder.markInvalid()).

Bug 3:
RemoteGraphicsContextGL design is that the sender RemoteGraphicsContextGLProxy should stop sending immediately after context loss. However, given two threads of execution, this is not possible. RemoteGraphicsContextGL might have messages after simulated timeout, simulated context loss or real GPUP side context loss. This means that all the context calls after context loss should be skipped.





 E.g. generated functions should be fixed like so:

    void getFloatv(uint32_t pname, uint64_t valueSize, CompletionHandler<void(IPC::ArrayReference<float>)>&& completionHandler)
    {
        WTFLogAlways("kkimmo GetFloat");
        Vector<GCGLfloat, 16> value(static_cast<size_t>(valueSize), 0);
        assertIsCurrent(workQueue());
        m_context->getFloatv(pname, value);
        completionHandler(IPC::ArrayReference<float>(reinterpret_cast<float*>(value.data()), value.size()));
    }

-->

    void getFloatv(uint32_t pname, uint64_t valueSize, CompletionHandler<void(IPC::ArrayReference<float>)>&& completionHandler)
    {
        WTFLogAlways("kkimmo GetFloat");
        Vector<GCGLfloat, 16> value(static_cast<size_t>(valueSize), 0);
        assertIsCurrent(workQueue());
        if (somecondition)
            m_context->getFloatv(pname, value);
        completionHandler(IPC::ArrayReference<float>(reinterpret_cast<float*>(value.data()), value.size()));
    }


the condition could. be like 
  LIKELY(m_context)

or
  LIKELY(!isLost())

or similar.
Comment 5 Kimmo Kinnunen 2022-05-10 02:43:20 PDT
and to make the test predictable, the test should create one context that is keeping the GPUP alive, and other context to which a simulated timeout is being sent
Comment 6 Dan Glastonbury 2022-05-12 22:51:33 PDT
(In reply to Kimmo Kinnunen from comment #4)
> ebGL GPUP simulated timeout does not apply directly after invocation
> 
> Timeout is implemented as 
> 
> 1) Remove the graphics context from the list of graphics contexts in the
> WP-GPUP session
>      RemoteGraphicsContextGL::simulateEventForTesting
> 
> 2) As a detail, ignore all out-of-stream messages that do not have a
> corresponding context in the list
>      GPUConnectionToWebProcess::dispatchSyncMessage, 
> GPUConnectionToWebProcess::dispatchMessage
> 
> 
> The test seems to work sometimes because when the simulated context is
> removed, the GPUP may become idle and is shut down. Further messages to shut
> down GPUP are correctly detected as failing and context is lost.

After investigating the code, with the current setup for how the timeout is being simulated, I can't see that this test would ever not be flaky. I'm seeing the 3 GL calls arrive on the GPUP before the main thread runs to remove the context. Since the logic to trigger context lost on the WP side relies on the "poll" of sending GL commands across IPC, the WP sits idle never invoking any GL calls and thus the `webglcontextlost` event is never fired.

> Bug 1:
> The 1) happens in main thread, where as the stream message dispatch happens
> in the stream work queue. Removing the stream connection from the stream
> work queue happens at arbitrary moment in stream work queue perspective.
> This means that it can process messages to the timed out context after main
> thread has done the removal.

I can fix the test to run by replacing the context removal with a call to `forceContextLost()`, but this appears to be against the spirit of the test which is trying to simulate that the WP-GPU IPC happens to timeout.

> Bug 2:
> Out-of-stream message sync replies of the removed contexts might be sent,
> and they might be interpreted as correct messages. The correct handling in
> GPUConnectionToWebProcess::dispatchSyncMessage, 
> GPUConnectionToWebProcess::dispatchMessage should be to mark the decoders as
> invalid (decoder.markInvalid()).

In my testing, the code in `GPUConnectionToWebProcess::dispatchMessage` or `GPUConnectionToWebProcess::dispatchSyncMessage` are never called. Even if the main thread code to remove the connection runs before the final `getFloatv`, the message is still dispatched to the `RemoteGraphicsContextGL`

Perhaps the test needs return to the main event loop and put the handling of timeout into an rAF?
Comment 7 Dan Glastonbury 2022-05-12 23:28:02 PDT
The rAF trick works.

Alternatively, setting `RemoteGraphicsContextGLProxy::defaultSendTimeout` to `0_s` caused the IPC send to fail and though the remote connection timed out, also works.

Kimmo, which is the best way to proceed?
Comment 8 Kimmo Kinnunen 2022-05-12 23:48:59 PDT
(In reply to Dan Glastonbury from comment #7)

> After investigating the code, with the current setup for how the timeout is being simulated, I can't see that this test would ever not be flaky.

I think originally the GPU Process idle-exited immediately upon the simulated timeout event, so this is why it worked.

> I'm seeing the 3 GL calls arrive on the GPUP before the main thread runs to remove the context.

I don't think this is the case. The calls "arrive" to the remotegraphicscontextgl work queue, but they're processed in order:

1. simulated timeout
2. clearColor
3. clear
4. getParameter

Since simulated timeout executes a function sync in main thread, the effects are:

1. simulated timeout
1.1 remove context from the list in the main thread
2. clearColor
3. clear
4. getParameter

As far as I understand, of course maybe I'm wrong.

> Since the logic to trigger context lost on the WP side relies on the "poll" of sending GL commands across IPC, the WP sits idle never invoking any GL calls and thus the `webglcontextlost` event is never fired.

Yes, this is correct. But the underlying reason why the GPUP processes calls 2. - 4. is because the contract of removing the stream connection from the processing thread is that the removal happens at some point in time, but not neccessarily immediately (with respect to the messages being processed in the remotegraphicscontextgl work queue.

> In my testing, the code in `GPUConnectionToWebProcess::dispatchMessage` or `GPUConnectionToWebProcess::dispatchSyncMessage` are never called. Even if the main thread code to remove the connection runs before the final `getFloatv`, the message is still dispatched to the `RemoteGraphicsContextGL`

This was just describing an edge case when the message is sent out of stream. Typically they're sent in the stream. This is a separate change, but needed for the consistency for this bug.

> Perhaps the test needs return to the main event loop and put the handling of timeout into an rAF?
> The rAF trick works.

I don't immediately see what you mean. The timeout is handled by webgl context lost event, which is delivered after dropping to event loop.

> 
> Alternatively, setting `RemoteGraphicsContextGLProxy::defaultSendTimeout` to
> `0_s` caused the IPC send to fail and though the remote connection timed
> out, also works.

Nothing completes in 0 seconds, so the timeout isn't really a timeout in this case.
 
> Kimmo, which is the best way to proceed?

I think the best way to proceed is the way I described.
Comment 9 Kimmo Kinnunen 2022-05-13 00:00:20 PDT
We want to test the case where we send to GPUP but receive nothing back in this case. So that's why it sends a simulated timeout and not a context lost.

So I think the method of timing out could be something like 

{
    assertIsCurrent(workQueue());
    // FIXME: only run this in testing mode. https://bugs.webkit.org/show_bug.cgi?id=222544
    if (event == WebCore::GraphicsContextGL::SimulatedEventForTesting::Timeout)
        stopListeningForIPCForTesting()
}


void stopListeningForIPCForTesting() {
    m_streamConnection->stopReceivingMessagesForTesting?(Messages::RemoteGraphicsContextGL::messageReceiverName(), m_graphicsContextGLIdentifier.toUInt64());
}


Then we just need to make sure that for testing purposes, the removal of stream stream receiver from the connection in stopReceivingMessagesForTesting? happens somehow synchronously but still makes sense.
Comment 10 Kimmo Kinnunen 2022-05-13 00:05:10 PDT
Alternatively it could be as simple as

void RemoteGraphicsContextGL::simulateEventForTesting(WebCore::GraphicsContextGL::SimulatedEventForTesting event)
{
    assertIsCurrent(workQueue());
    // FIXME: only run this in testing mode. https://bugs.webkit.org/show_bug.cgi?id=222544
    if (event == WebCore::GraphicsContextGL::SimulatedEventForTesting::Timeout)
        sleep(100000);
    ...
}

But in this case we would need a some sort of mechanism to really shut down the hung GPUP.
Comment 11 Dan Glastonbury 2022-05-13 00:09:28 PDT
(In reply to Kimmo Kinnunen from comment #8)
> (In reply to Dan Glastonbury from comment #7)
> 
> > I'm seeing the 3 GL calls arrive on the GPUP before the main thread runs to remove the context.
> 
> I don't think this is the case. The calls "arrive" to the
> remotegraphicscontextgl work queue, but they're processed in order:
> 
> 1. simulated timeout
> 2. clearColor
> 3. clear
> 4. getParameter
> 
> Since simulated timeout executes a function sync in main thread, the effects
> are:
> 
> 1. simulated timeout
> 1.1 remove context from the list in the main thread
> 2. clearColor
> 3. clear
> 4. getParameter

The order in an example timeout according to logging is:

WP clearColor -->
WP clear -->
--> GPUP clearColor
--> GPUP clear
WP simulateEventForTesting -->
WP clearColor -->
--> GPUP simulateEventForTesting(Timeout)
WP clear -->
--> GPUP clearColor
--> GPUP clear
WP getFloatv -->
--> GPUP - releaseGraphicsContextGLForTesting (mainThread)
--> GPUP RemoteGraphicsContextGL::stopListeningForIPC
--> GPUP getFloatv

Even though the connection and context has been removed the message for `getFloatv` is still delivered the RemoteGraphicsContextGL. `getFloatv` checked that `m_context` is nullptr and does nothing. Once a message has been sent, there appears to be no way to signal failure from the receiver to the sender.

Because there are no failures, the context is never lost and the event is never raised and the test times out waiting on the promise to resolve.

> This was just describing an edge case when the message is sent out of
> stream. Typically they're sent in the stream. This is a separate change, but
> needed for the consistency for this bug.

Ok. I'll add the changes, although I've been unable to test that it's right.
Comment 12 Dan Glastonbury 2022-05-13 00:11:30 PDT
(In reply to Kimmo Kinnunen from comment #10)
> Alternatively it could be as simple as
> 
> void
> RemoteGraphicsContextGL::simulateEventForTesting(WebCore::GraphicsContextGL::
> SimulatedEventForTesting event)
> {
>     assertIsCurrent(workQueue());
>     // FIXME: only run this in testing mode.
> https://bugs.webkit.org/show_bug.cgi?id=222544
>     if (event ==
> WebCore::GraphicsContextGL::SimulatedEventForTesting::Timeout)
>         sleep(100000);
>     ...
> }

That's more elegant than my idea to crash the GPUP. If we reduced the timeout on the caller side, we could sleep for, say 10s, and that would be less than the test timeout threshold of 30s. Or mark the test as slow and wait for 30s (ish).
Comment 13 Kimmo Kinnunen 2022-05-13 02:03:03 PDT
(In reply to Dan Glastonbury from comment #11)
> (In reply to Kimmo Kinnunen from comment #8)
> > (In reply to Dan Glastonbury from comment #7)
> > 
> > > I'm seeing the 3 GL calls arrive on the GPUP before the main thread runs to remove the context.
> > 
> > I don't think this is the case. The calls "arrive" to the
> > remotegraphicscontextgl work queue, but they're processed in order:
>
> WP clear -->
> --> GPUP clearColor
> --> GPUP clear
> WP simulateEventForTesting -->
> WP clearColor -->
> --> GPUP simulateEventForTesting(Timeout)
> WP clear -->
> --> GPUP clearColor
> --> GPUP clear
> WP getFloatv -->
> --> GPUP - releaseGraphicsContextGLForTesting (mainThread)
> --> GPUP RemoteGraphicsContextGL::stopListeningForIPC
> --> GPUP getFloatv


You are right! I imagined I used the callOnMainRunLoop correctly. The intention was to call callOnMainRunLoopAndWait.
Comment 14 Kimmo Kinnunen 2022-05-13 02:04:29 PDT
(In reply to Dan Glastonbury from comment #11)
> > This was just describing an edge case when the message is sent out of
> > stream. Typically they're sent in the stream. This is a separate change, but
> > needed for the consistency for this bug.
> 
> Ok. I'll add the changes, although I've been unable to test that it's right.

Maybe submit this as a separate change.
Comment 15 Dan Glastonbury 2022-05-15 19:56:19 PDT
Pull request: https://github.com/WebKit/WebKit/pull/631