Bug 63189

Summary: Fix latch deadlock when GPU process crashes or context is lost
Product: WebKit Reporter: John Bates <jbates>
Component: New BugsAssignee: John Bates <jbates>
Status: RESOLVED FIXED    
Severity: Normal CC: enne, jamesr, kbr, vangelis, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch none

Description John Bates 2011-06-22 14:56:28 PDT
Fix latch deadlock when GPU process crashes or context is lost
Comment 1 John Bates 2011-06-22 15:15:21 PDT
Created attachment 98248 [details]
Patch
Comment 2 John Bates 2011-06-22 15:21:28 PDT
This patch solves the deadlock problem with latches when the GPU process crashes or in some cases when a context is lost because of errors.

This does not recreate the failed child contexts, because that is already done by WebGLRenderingContext when the application supports it.

This also does not reparent the child contexts to the new compositor context when the LayerRendererChromium is recreated. A separate patch needs to do that.
Comment 3 James Robinson 2011-06-22 15:40:45 PDT
Comment on attachment 98248 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=98248&action=review

> Source/WebCore/ChangeLog:9
> +
> +        Test: open particles WebGL demo in chrome, kill GPU process from Task Manager; observe no deadlock.
> +

you need at least a little bit of explanation here of what's going on

> Source/WebCore/platform/graphics/chromium/LayerRendererChromium.cpp:-379
> -
> -    // After updateCompositorResources, set/wait latches for all child
> -    // contexts. This will prevent the compositor from using any of the child
> -    // parent textures while WebGL commands are executing from javascript *and*
> -    // while the final parent texture is being blit'd. copyTexImage2D
> -    // uses the parent texture as a temporary resolve buffer, so that's why the
> -    // waitLatch is below, to block the compositor from using the parent texture
> -    // until the next WebGL SwapBuffers (or copyTextureToParentTexture for
> -    // Canvas2D).
> -    if (hardwareCompositing() && m_contextSupportsLatch) {
> -        m_childContextsWereCopied = true;
> -        // For each child context:
> -        //   glSetLatch(Offscreen->Compositor);
> -        //   glWaitLatch(Compositor->Offscreen);
> -        ChildContextMap::iterator i = m_childContexts.begin();
> -        for (; i != m_childContexts.end(); ++i) {
> -            Extensions3DChromium* ext = static_cast<Extensions3DChromium*>(i->first->getExtensions());
> -            GC3Duint childToParentLatchId, parentToChildLatchId;
> -            ext->getParentToChildLatchCHROMIUM(&parentToChildLatchId);
> -            ext->getChildToParentLatchCHROMIUM(&childToParentLatchId);
> -            ext->setLatchCHROMIUM(childToParentLatchId);
> -            ext->waitLatchCHROMIUM(parentToChildLatchId);
> -        }
> -    }

why's this going away?
Comment 4 John Bates 2011-06-22 17:51:08 PDT
Comment on attachment 98248 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=98248&action=review

>> Source/WebCore/ChangeLog:9
>> +
> 
> you need at least a little bit of explanation here of what's going on

Will do.

>> Source/WebCore/platform/graphics/chromium/LayerRendererChromium.cpp:-379
>> -    }
> 
> why's this going away?

It turned out to be premature to try and plan ahead for the threaded compositor with lots of nasty comments. For now, it's much safer to do all latching in one place (above in updateAndDrawLayers). It's also best to do the corresponding set before the wait, so consolidating the latching code above allows us to do that. When the threaded compositor comes online, we need to redo the latching code.
Comment 5 John Bates 2011-06-22 18:11:01 PDT
Comment on attachment 98248 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=98248&action=review

>>> Source/WebCore/ChangeLog:9
>>> +
>> 
>> you need at least a little bit of explanation here of what's going on
> 
> Will do.

This is what I'm adding:
The main bug fix is to only set/wait latches if the child context has no errors.
Additionally, the LayerChromium classes needed to be modified to not continue drawing when their corresponding contexts have errors. Otherwise, they would draw with invalid texture ids.
Comment 6 Kenneth Russell 2011-06-22 18:45:00 PDT
Comment on attachment 98248 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=98248&action=review

The logic changes look fine but there's an important documentation update that needs to be incorporated into this patch.

> Source/WebCore/platform/graphics/chromium/Canvas2DLayerChromium.cpp:68
> +            && (context->getExtensions()->getGraphicsResetStatusARB() == GraphicsContext3D::NO_ERROR));

I made a mistake when originally adding getGraphicsResetStatusARB to Extensions3D.h. In the real OpenGL API, this function only returns the error state once; subsequent calls generally return NO_ERROR. In WebKit, this entry point's state needs to be sticky; it needs to continue to return the error state essentially forever, because context recovery is implemented by throwing away the GraphicsContext3D instance and creating a new one. Please update the comments in Extensions3D.h indicating that any error returned needs to be persistent for the lifetime of the GraphicsContext3D object, unlike the behavior of the actual GL_ARB_robustness extension.
Comment 7 John Bates 2011-06-23 15:59:38 PDT
Created attachment 98423 [details]
Patch
Comment 8 John Bates 2011-06-23 16:00:53 PDT
Comment on attachment 98248 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=98248&action=review

>> Source/WebCore/platform/graphics/chromium/Canvas2DLayerChromium.cpp:68
>> +            && (context->getExtensions()->getGraphicsResetStatusARB() == GraphicsContext3D::NO_ERROR));
> 
> I made a mistake when originally adding getGraphicsResetStatusARB to Extensions3D.h. In the real OpenGL API, this function only returns the error state once; subsequent calls generally return NO_ERROR. In WebKit, this entry point's state needs to be sticky; it needs to continue to return the error state essentially forever, because context recovery is implemented by throwing away the GraphicsContext3D instance and creating a new one. Please update the comments in Extensions3D.h indicating that any error returned needs to be persistent for the lifetime of the GraphicsContext3D object, unlike the behavior of the actual GL_ARB_robustness extension.

Done.
Comment 9 James Robinson 2011-06-23 16:01:44 PDT
Comment on attachment 98423 [details]
Patch

Seems good
Comment 10 Kenneth Russell 2011-06-23 17:01:00 PDT
Comment on attachment 98423 [details]
Patch

As we've discussed offline, I have some remaining concerns about the asynchronous nature of the loss of contexts and whether a synchronous check before issuing the latch command will really be sufficient. However this seems to be a step forward. cq=me
Comment 11 WebKit Review Bot 2011-06-23 17:12:04 PDT
Comment on attachment 98423 [details]
Patch

Clearing flags on attachment: 98423

Committed r89635: <http://trac.webkit.org/changeset/89635>
Comment 12 WebKit Review Bot 2011-06-23 17:12:08 PDT
All reviewed patches have been landed.  Closing bug.