Bug 219838 - Web process crashes during MotionMark Images when GPU Process for DOM is enabled
Summary: Web process crashes during MotionMark Images when GPU Process for DOM is enabled
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit2 (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Wenson Hsieh
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-12-13 11:52 PST by Wenson Hsieh
Modified: 2020-12-14 10:04 PST (History)
5 users (show)

See Also:


Attachments
Patch (3.39 KB, patch)
2020-12-13 12:09 PST, Wenson Hsieh
no flags Details | Formatted Diff | Diff
Patch (6.21 KB, patch)
2020-12-13 20:57 PST, Wenson Hsieh
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Wenson Hsieh 2020-12-13 11:52:45 PST
Another crash due to backend creation failure:

Termination Signal: Segmentation fault: 11
Termination Reason: Namespace SIGNAL, Code 0xb
Terminating Process: exc handler [2237]
Triggered by Thread:  0

Thread 0 name:  Dispatch queue: com.apple.main-thread
Thread 0 Crashed:
0   WebKit                        	0x000000010352857c WebKit::ImageBufferShareableIOSurfaceBackend::createImageBufferBackendHandle() const + 24
1   WebKit                        	0x0000000103206474 WebKit::RemoteLayerBackingStore::encode(IPC::Encoder&) const + 316
2   WebKit                        	0x0000000103206474 WebKit::RemoteLayerBackingStore::encode(IPC::Encoder&) const + 316
3   WebKit                        	0x000000010320a050 WebKit::RemoteLayerTreeTransaction::LayerProperties::encode(IPC::Encoder&) const + 1184
4   WebKit                        	0x000000010320ac84 WebKit::RemoteLayerTreeTransaction::encode(IPC::Encoder&) const + 164
5   WebKit                        	0x000000010308fee4 WebKit::RemoteLayerTreeDrawingArea::updateRendering() + 552
6   WebCore                       	0x00000001091d42e4 WebCore::ThreadTimers::sharedTimerFiredInternal() + 224
7   WebCore                       	0x00000001091faa34 WebCore::timerFired(__CFRunLoopTimer*, void*) + 32
8   CoreFoundation                	0x00000001a782734c __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ + 32
9   CoreFoundation                	0x00000001a7826f48 __CFRunLoopDoTimer + 1076
10  CoreFoundation                	0x00000001a7826398 __CFRunLoopDoTimers + 328
11  CoreFoundation                	0x00000001a782014c __CFRunLoopRun + 1944
12  CoreFoundation                	0x00000001a781f480 CFRunLoopRunSpecific + 600
13  Foundation                    	0x00000001a8ceccac -[NSRunLoop(NSRunLoop) runMode:beforeDate:] + 232
14  Foundation                    	0x00000001a8d203b8 -[NSRunLoop(NSRunLoop) run] + 92
15  libxpc.dylib                  	0x00000001f560e2cc _xpc_objc_main + 688
16  libxpc.dylib                  	0x00000001f56105e8 xpc_main + 180
17  WebKit                        	0x00000001031ce0b4 WebKit::XPCServiceMain(int, char const**) + 124
18  libdyld.dylib                 	0x00000001a74dd38c start + 4
Comment 1 Wenson Hsieh 2020-12-13 12:09:00 PST
Created attachment 416124 [details]
Patch
Comment 2 Sam Weinig 2020-12-13 15:31:45 PST
Comment on attachment 416124 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=416124&action=review

> Source/WebKit/ChangeLog:21
> +        B subsequently finishes waiting for backend creation because it incorrectly believes that its backend has been
> +        initialized, and we end up crashing downstream in `RemoteLayerBackingStore::encode()` because
> +        `ensureBackendCreated()` for B returned null.

Seems like we will still have this issue, just perhaps less often, since we will still return null from ensureBackendCreated() if enough callers call ensureBackendCreated() at the same time, or am I missing something?

> Source/WebKit/WebProcess/GPU/graphics/RemoteImageBufferProxy.h:140
> +        static constexpr unsigned maximumNumberOfTimeouts = 3;
> +        unsigned numberOfTimeouts = 0;
> +        while (!m_backend && numberOfTimeouts < maximumNumberOfTimeouts) {
> +            if (!m_remoteRenderingBackendProxy->waitForDidCreateImageBufferBackend())
> +                ++numberOfTimeouts;
> +        }

It seems a bit odd to call these timeouts if they also represent other sync messages that have been delivered in the interim (which is what I think you are saying they do). Can we differentiate between actual timeouts and other things that would cause waitForDidCreateImageBufferBackend to return prematurely?
Comment 3 Wenson Hsieh 2020-12-13 15:50:20 PST
Comment on attachment 416124 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=416124&action=review

>> Source/WebKit/ChangeLog:21
>> +        `ensureBackendCreated()` for B returned null.
> 
> Seems like we will still have this issue, just perhaps less often, since we will still return null from ensureBackendCreated() if enough callers call ensureBackendCreated() at the same time, or am I missing something?

After this change, we'll only return null here if we time out 3 times. As explained in the changelog entry below, we only increment the counter upon IPC timeout (or any other kind of IPC communication failure).

So if we receive, say, 10 `RemoteRenderingBackendProxy::DidCreateImageBufferBackend` responses where the 10th response confirms our RemoteImageBufferProxy, we'll spin in the while loop a total of 10 times and then return successfully. However, if the GPU process, say, hits an infinite loop, we'll subsequently loop 3 times here and then give up and return null.

I think I also want to refactor `ensureBackendCreated` so that it doesn't have a max timeout limit and instead returns an `ImageBufferBackend&`, but I was planning to do that separately.

>> Source/WebKit/WebProcess/GPU/graphics/RemoteImageBufferProxy.h:140
>> +        }
> 
> It seems a bit odd to call these timeouts if they also represent other sync messages that have been delivered in the interim (which is what I think you are saying they do). Can we differentiate between actual timeouts and other things that would cause waitForDidCreateImageBufferBackend to return prematurely?

Hm…is this addressed by my comment above?
Comment 4 Wenson Hsieh 2020-12-13 15:57:41 PST
(In reply to Sam Weinig from comment #2)
> Comment on attachment 416124 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=416124&action=review
> 
> > Source/WebKit/ChangeLog:21
> > +        B subsequently finishes waiting for backend creation because it incorrectly believes that its backend has been
> > +        initialized, and we end up crashing downstream in `RemoteLayerBackingStore::encode()` because
> > +        `ensureBackendCreated()` for B returned null.
> 
> Seems like we will still have this issue, just perhaps less often, since we
> will still return null from ensureBackendCreated() if enough callers call
> ensureBackendCreated() at the same time, or am I missing something?

I should also clarify — the DidCreateImageBufferBackend message is both sent by the GPU process asynchronously and received by the web process asynchronously. We only hit this sync codepath in the web process (ensureBackendCreated) in the case where we definitely require the backend to have been created in the web process.

Since all callers of `ensureBackendCreated()` are on the main thread, there should only ever be one caller of `ensureBackendCreated()` at a time in the web process — the issue is that there could be multiple `DidCreateImageBufferBackend` messages that have been dispatched by the GPU process, and are waiting to be handled in the web process at the point where we call `ensureBackendCreated()`.

> 
> > Source/WebKit/WebProcess/GPU/graphics/RemoteImageBufferProxy.h:140
> > +        static constexpr unsigned maximumNumberOfTimeouts = 3;
> > +        unsigned numberOfTimeouts = 0;
> > +        while (!m_backend && numberOfTimeouts < maximumNumberOfTimeouts) {
> > +            if (!m_remoteRenderingBackendProxy->waitForDidCreateImageBufferBackend())
> > +                ++numberOfTimeouts;
> > +        }
> 
> It seems a bit odd to call these timeouts if they also represent other sync
> messages that have been delivered in the interim (which is what I think you
> are saying they do). Can we differentiate between actual timeouts and other
> things that would cause waitForDidCreateImageBufferBackend to return
> prematurely?
Comment 5 Sam Weinig 2020-12-13 18:27:16 PST
(In reply to Wenson Hsieh from comment #3)
> Comment on attachment 416124 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=416124&action=review
> 
> >> Source/WebKit/ChangeLog:21
> >> +        `ensureBackendCreated()` for B returned null.
> > 
> > Seems like we will still have this issue, just perhaps less often, since we will still return null from ensureBackendCreated() if enough callers call ensureBackendCreated() at the same time, or am I missing something?
> 
> After this change, we'll only return null here if we time out 3 times. As
> explained in the changelog entry below, we only increment the counter upon
> IPC timeout (or any other kind of IPC communication failure).
> 
> So if we receive, say, 10
> `RemoteRenderingBackendProxy::DidCreateImageBufferBackend` responses where
> the 10th response confirms our RemoteImageBufferProxy, we'll spin in the
> while loop a total of 10 times and then return successfully. However, if the
> GPU process, say, hits an infinite loop, we'll subsequently loop 3 times
> here and then give up and return null.

I'm having trouble following the logic in the code, but I think I get it. If RemoteRenderingBackendProxy::waitForDidCreateImageBufferBackend() returns true, which happens when a sync message is received (and not a timeout) we just loop.

I think the real issue I am having is that the function RemoteRenderingBackendProxy::waitForDidCreateImageBufferBackend() returns a bool and gives no indication what that bool means (thought this seems to extend down in Connection::waitForAndDispatchImmediately as well). Ideally we would only return bools from functions that were clearly predicates. Can we change RemoteRenderingBackendProxy::waitForDidCreateImageBufferBackend() to return an enum to make it make it clear what it is saying. Otherwise, this code is just mysterious.
Comment 6 Wenson Hsieh 2020-12-13 18:40:38 PST
(In reply to Sam Weinig from comment #5)
> (In reply to Wenson Hsieh from comment #3)
> > Comment on attachment 416124 [details]
> > Patch
> > 
> > View in context:
> > https://bugs.webkit.org/attachment.cgi?id=416124&action=review
> > 
> > >> Source/WebKit/ChangeLog:21
> > >> +        `ensureBackendCreated()` for B returned null.
> > > 
> > > Seems like we will still have this issue, just perhaps less often, since we will still return null from ensureBackendCreated() if enough callers call ensureBackendCreated() at the same time, or am I missing something?
> > 
> > After this change, we'll only return null here if we time out 3 times. As
> > explained in the changelog entry below, we only increment the counter upon
> > IPC timeout (or any other kind of IPC communication failure).
> > 
> > So if we receive, say, 10
> > `RemoteRenderingBackendProxy::DidCreateImageBufferBackend` responses where
> > the 10th response confirms our RemoteImageBufferProxy, we'll spin in the
> > while loop a total of 10 times and then return successfully. However, if the
> > GPU process, say, hits an infinite loop, we'll subsequently loop 3 times
> > here and then give up and return null.
> 
> I'm having trouble following the logic in the code, but I think I get it. If
> RemoteRenderingBackendProxy::waitForDidCreateImageBufferBackend() returns
> true, which happens when a sync message is received (and not a timeout) we
> just loop.

Ah, yes — this was indeed the key bit.

> 
> I think the real issue I am having is that the function
> RemoteRenderingBackendProxy::waitForDidCreateImageBufferBackend() returns a
> bool and gives no indication what that bool means (thought this seems to
> extend down in Connection::waitForAndDispatchImmediately as well). Ideally
> we would only return bools from functions that were clearly predicates. Can
> we change RemoteRenderingBackendProxy::waitForDidCreateImageBufferBackend()
> to return an enum to make it make it clear what it is saying. Otherwise,
> this code is just mysterious.

Yep, that sounds good! Will change this to return something along the lines of:

enum class DidReceiveBackendCreationMessage : bool {
    No,
    Yes
};
Comment 7 Sam Weinig 2020-12-13 20:01:12 PST
(In reply to Wenson Hsieh from comment #6)
> (In reply to Sam Weinig from comment #5)
> > (In reply to Wenson Hsieh from comment #3)
> > > Comment on attachment 416124 [details]
> > > Patch
> > > 
> > > View in context:
> > > https://bugs.webkit.org/attachment.cgi?id=416124&action=review
> > > 
> > > >> Source/WebKit/ChangeLog:21
> > > >> +        `ensureBackendCreated()` for B returned null.
> > > > 
> > > > Seems like we will still have this issue, just perhaps less often, since we will still return null from ensureBackendCreated() if enough callers call ensureBackendCreated() at the same time, or am I missing something?
> > > 
> > > After this change, we'll only return null here if we time out 3 times. As
> > > explained in the changelog entry below, we only increment the counter upon
> > > IPC timeout (or any other kind of IPC communication failure).
> > > 
> > > So if we receive, say, 10
> > > `RemoteRenderingBackendProxy::DidCreateImageBufferBackend` responses where
> > > the 10th response confirms our RemoteImageBufferProxy, we'll spin in the
> > > while loop a total of 10 times and then return successfully. However, if the
> > > GPU process, say, hits an infinite loop, we'll subsequently loop 3 times
> > > here and then give up and return null.
> > 
> > I'm having trouble following the logic in the code, but I think I get it. If
> > RemoteRenderingBackendProxy::waitForDidCreateImageBufferBackend() returns
> > true, which happens when a sync message is received (and not a timeout) we
> > just loop.
> 
> Ah, yes — this was indeed the key bit.
> 
> > 
> > I think the real issue I am having is that the function
> > RemoteRenderingBackendProxy::waitForDidCreateImageBufferBackend() returns a
> > bool and gives no indication what that bool means (thought this seems to
> > extend down in Connection::waitForAndDispatchImmediately as well). Ideally
> > we would only return bools from functions that were clearly predicates. Can
> > we change RemoteRenderingBackendProxy::waitForDidCreateImageBufferBackend()
> > to return an enum to make it make it clear what it is saying. Otherwise,
> > this code is just mysterious.
> 
> Yep, that sounds good! Will change this to return something along the lines
> of:
> 
> enum class DidReceiveBackendCreationMessage : bool {
>     No,
>     Yes
> };

I think it would be even better if the reason could be in there. Perhaps something like:

enum class DidReceiveBackendCreationResult : bool {
    Success,
    TimeoutOrIPCFailure
};

I am not super happy with "Success", since it might be success for a different backend, though I am not sure which word would indicate the truth succinctly.

(Sorry for not being more specific before).
Comment 8 Wenson Hsieh 2020-12-13 20:57:26 PST
Created attachment 416135 [details]
Patch
Comment 9 Wenson Hsieh 2020-12-14 09:55:07 PST
Comment on attachment 416135 [details]
Patch

Thanks for the review!
Comment 10 EWS 2020-12-14 10:03:30 PST
Committed r270775: <https://trac.webkit.org/changeset/270775>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 416135 [details].
Comment 11 Radar WebKit Bug Importer 2020-12-14 10:04:19 PST
<rdar://problem/72302200>