WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
ASSIGNED
229523
[Catalina GPU] fast/canvas/webgl/lose-context-on-timeout.html is a flaky timeout
https://bugs.webkit.org/show_bug.cgi?id=229523
Summary
[Catalina GPU] fast/canvas/webgl/lose-context-on-timeout.html is a flaky timeout
Eric Hutchison
Reported
2021-08-25 15:18:11 PDT
fast/canvas/webgl/lose-context-on-timeout.html is a flaky timeout on Catalina GPU History:
https://results.webkit.org/?platform=ios&platform=mac&suite=layout-tests&test=fast%2Fcanvas%2Fwebgl%2Flose-context-on-timeout.html
Results:
https://build.webkit.org/results/Apple-Catalina-Debug-WK2-GPUProcess-Tests/r281492%20(2091)/results.html
Diff: --- /Volumes/Data/worker/catalina-debug-tests-wk2-gpuprocess/build/layout-test-results/fast/canvas/webgl/lose-context-on-timeout-expected.txt +++ /Volumes/Data/worker/catalina-debug-tests-wk2-gpuprocess/build/layout-test-results/fast/canvas/webgl/lose-context-on-timeout-actual.txt @@ -1,13 +1,8 @@ -CONSOLE MESSAGE: WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost +FAIL: Timed out waiting for notifyDone to be called + Checks that a GPU process timeout on a result producing WebGL call will lose the context. NOTE: This only passes in the test harness because it requires Internals. PASS gl.isContextLost() is false PASS gl.getError() is gl.NO_ERROR -PASS gl.isContextLost() is true -PASS gl.getError() is gl.CONTEXT_LOST_WEBGL -PASS val is [0, 0, 0, 0] -PASS successfullyParsed is true -TEST COMPLETE -
Attachments
Add attachment
proposed patch, testcase, etc.
Radar WebKit Bug Importer
Comment 1
2021-08-25 15:19:02 PDT
<
rdar://problem/82360416
>
Eric Hutchison
Comment 2
2021-08-25 16:58:46 PDT
Updated test expectations at
https://trac.webkit.org/changeset/281601/webkit
.
Eric Hutchison
Comment 3
2021-09-14 16:17:20 PDT
Reproduced test results locally on
r282390
using run-webkit-tests --debug -f --force --iterations 500 --exit-after-n-crashes-or-timeouts 2 fast/canvas/webgl/lose-context-on-timeout.html
Kimmo Kinnunen
Comment 4
2022-05-10 02:40:41 PDT
ebGL GPUP simulated timeout does not apply directly after invocation Timeout is implemented as 1) Remove the graphics context from the list of graphics contexts in the WP-GPUP session RemoteGraphicsContextGL::simulateEventForTesting 2) As a detail, ignore all out-of-stream messages that do not have a corresponding context in the list GPUConnectionToWebProcess::dispatchSyncMessage, GPUConnectionToWebProcess::dispatchMessage The test seems to work sometimes because when the simulated context is removed, the GPUP may become idle and is shut down. Further messages to shut down GPUP are correctly detected as failing and context is lost. Bug 1: The 1) happens in main thread, where as the stream message dispatch happens in the stream work queue. Removing the stream connection from the stream work queue happens at arbitrary moment in stream work queue perspective. This means that it can process messages to the timed out context after main thread has done the removal. Bug 2: Out-of-stream message sync replies of the removed contexts might be sent, and they might be interpreted as correct messages. The correct handling in GPUConnectionToWebProcess::dispatchSyncMessage, GPUConnectionToWebProcess::dispatchMessage should be to mark the decoders as invalid (decoder.markInvalid()). Bug 3: RemoteGraphicsContextGL design is that the sender RemoteGraphicsContextGLProxy should stop sending immediately after context loss. However, given two threads of execution, this is not possible. RemoteGraphicsContextGL might have messages after simulated timeout, simulated context loss or real GPUP side context loss. This means that all the context calls after context loss should be skipped. E.g. generated functions should be fixed like so: void getFloatv(uint32_t pname, uint64_t valueSize, CompletionHandler<void(IPC::ArrayReference<float>)>&& completionHandler) { WTFLogAlways("kkimmo GetFloat"); Vector<GCGLfloat, 16> value(static_cast<size_t>(valueSize), 0); assertIsCurrent(workQueue()); m_context->getFloatv(pname, value); completionHandler(IPC::ArrayReference<float>(reinterpret_cast<float*>(value.data()), value.size())); } --> void getFloatv(uint32_t pname, uint64_t valueSize, CompletionHandler<void(IPC::ArrayReference<float>)>&& completionHandler) { WTFLogAlways("kkimmo GetFloat"); Vector<GCGLfloat, 16> value(static_cast<size_t>(valueSize), 0); assertIsCurrent(workQueue()); if (somecondition) m_context->getFloatv(pname, value); completionHandler(IPC::ArrayReference<float>(reinterpret_cast<float*>(value.data()), value.size())); } the condition could. be like LIKELY(m_context) or LIKELY(!isLost()) or similar.
Kimmo Kinnunen
Comment 5
2022-05-10 02:43:20 PDT
and to make the test predictable, the test should create one context that is keeping the GPUP alive, and other context to which a simulated timeout is being sent
Dan Glastonbury
Comment 6
2022-05-12 22:51:33 PDT
(In reply to Kimmo Kinnunen from
comment #4
)
> ebGL GPUP simulated timeout does not apply directly after invocation > > Timeout is implemented as > > 1) Remove the graphics context from the list of graphics contexts in the > WP-GPUP session > RemoteGraphicsContextGL::simulateEventForTesting > > 2) As a detail, ignore all out-of-stream messages that do not have a > corresponding context in the list > GPUConnectionToWebProcess::dispatchSyncMessage, > GPUConnectionToWebProcess::dispatchMessage > > > The test seems to work sometimes because when the simulated context is > removed, the GPUP may become idle and is shut down. Further messages to shut > down GPUP are correctly detected as failing and context is lost.
After investigating the code, with the current setup for how the timeout is being simulated, I can't see that this test would ever not be flaky. I'm seeing the 3 GL calls arrive on the GPUP before the main thread runs to remove the context. Since the logic to trigger context lost on the WP side relies on the "poll" of sending GL commands across IPC, the WP sits idle never invoking any GL calls and thus the `webglcontextlost` event is never fired.
> Bug 1: > The 1) happens in main thread, where as the stream message dispatch happens > in the stream work queue. Removing the stream connection from the stream > work queue happens at arbitrary moment in stream work queue perspective. > This means that it can process messages to the timed out context after main > thread has done the removal.
I can fix the test to run by replacing the context removal with a call to `forceContextLost()`, but this appears to be against the spirit of the test which is trying to simulate that the WP-GPU IPC happens to timeout.
> Bug 2: > Out-of-stream message sync replies of the removed contexts might be sent, > and they might be interpreted as correct messages. The correct handling in > GPUConnectionToWebProcess::dispatchSyncMessage, > GPUConnectionToWebProcess::dispatchMessage should be to mark the decoders as > invalid (decoder.markInvalid()).
In my testing, the code in `GPUConnectionToWebProcess::dispatchMessage` or `GPUConnectionToWebProcess::dispatchSyncMessage` are never called. Even if the main thread code to remove the connection runs before the final `getFloatv`, the message is still dispatched to the `RemoteGraphicsContextGL` Perhaps the test needs return to the main event loop and put the handling of timeout into an rAF?
Dan Glastonbury
Comment 7
2022-05-12 23:28:02 PDT
The rAF trick works. Alternatively, setting `RemoteGraphicsContextGLProxy::defaultSendTimeout` to `0_s` caused the IPC send to fail and though the remote connection timed out, also works. Kimmo, which is the best way to proceed?
Kimmo Kinnunen
Comment 8
2022-05-12 23:48:59 PDT
(In reply to Dan Glastonbury from
comment #7
)
> After investigating the code, with the current setup for how the timeout is being simulated, I can't see that this test would ever not be flaky.
I think originally the GPU Process idle-exited immediately upon the simulated timeout event, so this is why it worked.
> I'm seeing the 3 GL calls arrive on the GPUP before the main thread runs to remove the context.
I don't think this is the case. The calls "arrive" to the remotegraphicscontextgl work queue, but they're processed in order: 1. simulated timeout 2. clearColor 3. clear 4. getParameter Since simulated timeout executes a function sync in main thread, the effects are: 1. simulated timeout 1.1 remove context from the list in the main thread 2. clearColor 3. clear 4. getParameter As far as I understand, of course maybe I'm wrong.
> Since the logic to trigger context lost on the WP side relies on the "poll" of sending GL commands across IPC, the WP sits idle never invoking any GL calls and thus the `webglcontextlost` event is never fired.
Yes, this is correct. But the underlying reason why the GPUP processes calls 2. - 4. is because the contract of removing the stream connection from the processing thread is that the removal happens at some point in time, but not neccessarily immediately (with respect to the messages being processed in the remotegraphicscontextgl work queue.
> In my testing, the code in `GPUConnectionToWebProcess::dispatchMessage` or `GPUConnectionToWebProcess::dispatchSyncMessage` are never called. Even if the main thread code to remove the connection runs before the final `getFloatv`, the message is still dispatched to the `RemoteGraphicsContextGL`
This was just describing an edge case when the message is sent out of stream. Typically they're sent in the stream. This is a separate change, but needed for the consistency for this bug.
> Perhaps the test needs return to the main event loop and put the handling of timeout into an rAF? > The rAF trick works.
I don't immediately see what you mean. The timeout is handled by webgl context lost event, which is delivered after dropping to event loop.
> > Alternatively, setting `RemoteGraphicsContextGLProxy::defaultSendTimeout` to > `0_s` caused the IPC send to fail and though the remote connection timed > out, also works.
Nothing completes in 0 seconds, so the timeout isn't really a timeout in this case.
> Kimmo, which is the best way to proceed?
I think the best way to proceed is the way I described.
Kimmo Kinnunen
Comment 9
2022-05-13 00:00:20 PDT
We want to test the case where we send to GPUP but receive nothing back in this case. So that's why it sends a simulated timeout and not a context lost. So I think the method of timing out could be something like { assertIsCurrent(workQueue()); // FIXME: only run this in testing mode.
https://bugs.webkit.org/show_bug.cgi?id=222544
if (event == WebCore::GraphicsContextGL::SimulatedEventForTesting::Timeout) stopListeningForIPCForTesting() } void stopListeningForIPCForTesting() { m_streamConnection->stopReceivingMessagesForTesting?(Messages::RemoteGraphicsContextGL::messageReceiverName(), m_graphicsContextGLIdentifier.toUInt64()); } Then we just need to make sure that for testing purposes, the removal of stream stream receiver from the connection in stopReceivingMessagesForTesting? happens somehow synchronously but still makes sense.
Kimmo Kinnunen
Comment 10
2022-05-13 00:05:10 PDT
Alternatively it could be as simple as void RemoteGraphicsContextGL::simulateEventForTesting(WebCore::GraphicsContextGL::SimulatedEventForTesting event) { assertIsCurrent(workQueue()); // FIXME: only run this in testing mode.
https://bugs.webkit.org/show_bug.cgi?id=222544
if (event == WebCore::GraphicsContextGL::SimulatedEventForTesting::Timeout) sleep(100000); ... } But in this case we would need a some sort of mechanism to really shut down the hung GPUP.
Dan Glastonbury
Comment 11
2022-05-13 00:09:28 PDT
(In reply to Kimmo Kinnunen from
comment #8
)
> (In reply to Dan Glastonbury from
comment #7
) > > > I'm seeing the 3 GL calls arrive on the GPUP before the main thread runs to remove the context. > > I don't think this is the case. The calls "arrive" to the > remotegraphicscontextgl work queue, but they're processed in order: > > 1. simulated timeout > 2. clearColor > 3. clear > 4. getParameter > > Since simulated timeout executes a function sync in main thread, the effects > are: > > 1. simulated timeout > 1.1 remove context from the list in the main thread > 2. clearColor > 3. clear > 4. getParameter
The order in an example timeout according to logging is: WP clearColor --> WP clear --> --> GPUP clearColor --> GPUP clear WP simulateEventForTesting --> WP clearColor --> --> GPUP simulateEventForTesting(Timeout) WP clear --> --> GPUP clearColor --> GPUP clear WP getFloatv --> --> GPUP - releaseGraphicsContextGLForTesting (mainThread) --> GPUP RemoteGraphicsContextGL::stopListeningForIPC --> GPUP getFloatv Even though the connection and context has been removed the message for `getFloatv` is still delivered the RemoteGraphicsContextGL. `getFloatv` checked that `m_context` is nullptr and does nothing. Once a message has been sent, there appears to be no way to signal failure from the receiver to the sender. Because there are no failures, the context is never lost and the event is never raised and the test times out waiting on the promise to resolve.
> This was just describing an edge case when the message is sent out of > stream. Typically they're sent in the stream. This is a separate change, but > needed for the consistency for this bug.
Ok. I'll add the changes, although I've been unable to test that it's right.
Dan Glastonbury
Comment 12
2022-05-13 00:11:30 PDT
(In reply to Kimmo Kinnunen from
comment #10
)
> Alternatively it could be as simple as > > void > RemoteGraphicsContextGL::simulateEventForTesting(WebCore::GraphicsContextGL:: > SimulatedEventForTesting event) > { > assertIsCurrent(workQueue()); > // FIXME: only run this in testing mode. >
https://bugs.webkit.org/show_bug.cgi?id=222544
> if (event == > WebCore::GraphicsContextGL::SimulatedEventForTesting::Timeout) > sleep(100000); > ... > }
That's more elegant than my idea to crash the GPUP. If we reduced the timeout on the caller side, we could sleep for, say 10s, and that would be less than the test timeout threshold of 30s. Or mark the test as slow and wait for 30s (ish).
Kimmo Kinnunen
Comment 13
2022-05-13 02:03:03 PDT
(In reply to Dan Glastonbury from
comment #11
)
> (In reply to Kimmo Kinnunen from
comment #8
) > > (In reply to Dan Glastonbury from
comment #7
) > > > > > I'm seeing the 3 GL calls arrive on the GPUP before the main thread runs to remove the context. > > > > I don't think this is the case. The calls "arrive" to the > > remotegraphicscontextgl work queue, but they're processed in order:
>
> WP clear --> > --> GPUP clearColor > --> GPUP clear > WP simulateEventForTesting --> > WP clearColor --> > --> GPUP simulateEventForTesting(Timeout) > WP clear --> > --> GPUP clearColor > --> GPUP clear > WP getFloatv --> > --> GPUP - releaseGraphicsContextGLForTesting (mainThread) > --> GPUP RemoteGraphicsContextGL::stopListeningForIPC > --> GPUP getFloatv
You are right! I imagined I used the callOnMainRunLoop correctly. The intention was to call callOnMainRunLoopAndWait.
Kimmo Kinnunen
Comment 14
2022-05-13 02:04:29 PDT
(In reply to Dan Glastonbury from
comment #11
)
> > This was just describing an edge case when the message is sent out of > > stream. Typically they're sent in the stream. This is a separate change, but > > needed for the consistency for this bug. > > Ok. I'll add the changes, although I've been unable to test that it's right.
Maybe submit this as a separate change.
Dan Glastonbury
Comment 15
2022-05-15 19:56:19 PDT
Pull request:
https://github.com/WebKit/WebKit/pull/631
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug