Bug 205707 - REGRESSION (r253926): Flaky crashes on webgl/1.0.3/conformance/textures/texture-copying-feedback-loops.html and webgl/2.0.0/conformance/textures/misc/texture-copying-feedback-loops.html
Summary: REGRESSION (r253926): Flaky crashes on webgl/1.0.3/conformance/textures/textu...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebGL (show other bugs)
Version: WebKit Local Build
Hardware: Unspecified Unspecified
: P1 Normal
Assignee: Kenneth Russell
URL:
Keywords: InRadar
Depends on:
Blocks: 205483 205843
  Show dependency treegraph
 
Reported: 2020-01-02 19:46 PST by Alexey Proskuryakov
Modified: 2020-01-06 17:26 PST (History)
8 users (show)

See Also:


Attachments
Patch (4.47 KB, patch)
2020-01-05 13:28 PST, Kenneth Russell
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alexey Proskuryakov 2020-01-02 19:46:21 PST
rdar://problem/58227967

conformance/textures/texture-copying-feedback-loops.html started being a flaky crash after switching to ANGLE.

https://results.webkit.org/?suite=layout-tests&test=webgl%2F1.0.3%2Fconformance%2Ftextures%2Ftexture-copying-feedback-loops.html

Exception Type:        EXC_CRASH (SIGABRT)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Application Specific Information:
abort() called

Application Specific Signatures:
Graphics kernel error: 0xfffffffe


Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	0x00007fff6ab9765a __pthread_kill + 10
1   libsystem_pthread.dylib       	0x00007fff6ac53e60 pthread_kill + 430
2   libsystem_c.dylib             	0x00007fff6ab1e808 abort + 120
3   libGPUSupportMercury.dylib    	0x00007fff53a380a5 gpusGenerateCrashLog.cold.1 + 95
4   libGPUSupportMercury.dylib    	0x00007fff53a2f18f gpusGenerateCrashLog + 89
5   com.apple.driver.AppleIntelKBLGraphicsGLDriver	0x00007fff29795e58 gpusKillClientExt + 9
6   libGPUSupportMercury.dylib    	0x00007fff53a3055b gpusSubmitDataBuffers + 164
7   com.apple.driver.AppleIntelKBLGraphicsGLDriver	0x00007fff28a0da36 IntelCommandBuffer::getNew(GLDContextRec*) + 48
8   com.apple.driver.AppleIntelKBLGraphicsGLDriver	0x00007fff28a0d844 intelSubmitCommands + 171
Comment 1 Kenneth Russell 2020-01-03 14:14:27 PST
Looks like a crash inside Intel's graphics driver. There aren't any suppressions for this test in Chromium on macOS so it should be possible to make WebKit's WebGL validation behave exactly the same. I'll investigate locally.
Comment 2 Kenneth Russell 2020-01-03 18:27:47 PST
More complete crash stack:

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	0x00007fff6d24d2c6 __pthread_kill + 10
1   libsystem_pthread.dylib       	0x00007fff6d308bf1 pthread_kill + 284
2   libsystem_c.dylib             	0x00007fff6d1b76a6 abort + 127
3   libGPUSupportMercury.dylib    	0x00007fff5c222240 gpusGenerateCrashLog + 166
4   com.apple.driver.AppleIntelKBLGraphicsGLDriver	0x00007fff3a511578 gpusKillClientExt + 9
5   libGPUSupportMercury.dylib    	0x00007fff5c2235ea gpusSubmitDataBuffers + 164
6   com.apple.driver.AppleIntelKBLGraphicsGLDriver	0x00007fff3996aea8 IntelCommandBuffer::getNew(GLDContextRec*) + 48
7   com.apple.driver.AppleIntelKBLGraphicsGLDriver	0x00007fff3996acba intelSubmitCommands + 171
8   GLEngine                      	0x00007fff4af5f671 gleUnbindTextureObject + 58
9   GLEngine                      	0x00007fff4af30477 gleUnbindDeleteHashNamesAndObjects + 159
10  GLEngine                      	0x00007fff4ae3eaae glDeleteTextures_Exec + 697
11  libANGLE-shared.dylib         	0x000000010b494ece rx::StateManagerGL::deleteTexture(unsigned int) + 278 (StateManagerGL.cpp:246)
12  libANGLE-shared.dylib         	0x000000010b4aef61 rx::TextureGL::onDestroy(gl::Context const*) + 31 (TextureGL.cpp:156)
13  libANGLE-shared.dylib         	0x000000010b4a9b0b gl::Texture::onDestroy(gl::Context const*) + 171 (Texture.cpp:675)
14  libANGLE-shared.dylib         	0x000000010b4624c6 angle::RefCountObject<gl::Context, angle::Result>::release(gl::Context const*) + 266 (RefCountObject.h:45)
15  libANGLE-shared.dylib         	0x000000010b46fd2a gl::TypedResourceManager<gl::Texture, gl::HandleAllocator, gl::TextureManager, gl::TextureID>::deleteObject(gl::Context const*, gl::TextureID) + 78 (ResourceManager.cpp:102)
16  libANGLE-shared.dylib         	0x000000010b263cd6 gl::Context::deleteTextures(int, gl::TextureID const*) + 42 (Context.cpp:6210)
17  libANGLE-shared.dylib         	0x000000010b2c2773 gl::DeleteTextures(int, unsigned int const*) + 142 (entry_points_gles_2_0_autogen.cpp:799)
18  com.apple.WebCore             	0x000000010f64ef35 WebCore::GraphicsContext3D::deleteTexture(unsigned int) + 245 (GraphicsContext3DANGLE.cpp:1763)
19  com.apple.WebCore             	0x00000001106e9cdc WebCore::WebGLObject::deleteObject(WebCore::GraphicsContext3D*) + 76 (WebGLObject.cpp:63)
20  com.apple.WebCore             	0x000000011070f800 WebCore::WebGLSharedObject::detachContextGroup() + 32 (WebGLSharedObject.cpp:52)
21  com.apple.WebCore             	0x00000001106e8838 WebCore::WebGLContextGroup::loseContextGroup(WebCore::WebGLRenderingContextBase::LostContextMode) + 184 (WebGLContextGroup.cpp:88)
22  com.apple.WebCore             	0x0000000110709560 non-virtual thunk to WebCore::WebGLRenderingContextBase::stop() + 96 (WebGLRenderingContextBase.cpp:5180)
23  com.apple.WebCore             	0x00000001104af3cd WTF::Detail::CallableWrapper<WebCore::ScriptExecutionContext::stopActiveDOMObjects()::$_7, WebCore::ScriptExecutionContext::ShouldContinue, WebCore::ActiveDOMObject&>::call(WebCore::ActiveDOMObject&) + 13 (Function.h:52)
24  com.apple.WebCore             	0x00000001104aaa59 WebCore::ScriptExecutionContext::forEachActiveDOMObject(WTF::Function<WebCore::ScriptExecutionContext::ShouldContinue (WebCore::ActiveDOMObject&)> const&) const + 489 (ScriptExecutionContext.cpp:244)
25  com.apple.WebCore             	0x00000001104aac06 WebCore::ScriptExecutionContext::stopActiveDOMObjects() + 70 (ScriptExecutionContext.cpp:298)
26  com.apple.WebCore             	0x000000011085c042 WebCore::FrameLoader::frameDetached() + 178 (FrameLoader.cpp:2817)
27  com.apple.WebCore             	0x00000001105fcabf WebCore::HTMLFrameOwnerElement::disconnectContentFrame() + 31 (HTMLFrameOwnerElement.cpp:81)
28  com.apple.WebCore             	0x00000001103f300a WebCore::disconnectSubframes(WebCore::ContainerNode&, WebCore::SubframeDisconnectPolicy) + 218 (ContainerNodeAlgorithms.cpp:304)
29  com.apple.WebCore             	0x00000001103efc3c WebCore::ContainerNode::removeChildren() + 540 (ContainerNode.cpp:104)
30  com.apple.WebCore             	0x000000011059c74c WebCore::replaceChildrenWithFragment(WebCore::ContainerNode&, WTF::Ref<WebCore::DocumentFragment, WTF::DumbPtrTraits<WebCore::DocumentFragment> >&&) + 92 (markup.cpp:1334)


The crash is occurring during test teardown. I can reproduce this running in WebKit locally.
Comment 3 Kenneth Russell 2020-01-04 16:02:50 PST
This crash is strange. All of ANGLE's state appears to be set up correctly. The egl::Display is initialized, and the CGLContextObj that it refers to internally is still current, per CGLGetCurrentContext(). It doesn't look like the CGLContextObj has been deleted out from under the library - assuming that would have caused the current context to be null'ed out. Still investigating.
Comment 4 Kenneth Russell 2020-01-04 18:40:44 PST
Still trying to figure out what is going on. Have checked out two copies of WebKit at this revision:

Provide pid to crashing service worker process and GPU process
617c711307e3fb84a29000b2010bbf95b43773cc

one using ANGLE, the other not. Built both Debug, running both with minibrowser in WK1 mode (for easier debugging).

Can see both versions allocating and deleting texture objects. The ANGLE version seems to be allocating a few more textures per iteration, but that may be because the instrumentation I added is at a lower level in ANGLE.

The crash is somehow related to running the test in an iframe. With the ANGLE backend, when loading LayoutTests/webgl/1.0.3/conformance/textures/texture-copying-feedback-loops.html (served up via a local HTTP server), it reliably crashes upon the first or second reload. The iframe never displays correctly with the ANGLE backend, even if attempting to resize the window to force a repaint. Loading LayoutTests/webgl/1.0.3/resources/webgl_test_files/conformance/textures/texture-copying-feedback-loops.html, the page can be reloaded over and over again reliably.

Wondering whether some other part of WebKit which is using raw OpenGL is coming in and setting the CGLContextObj out from under ANGLE, causing ANGLE to inadvertently modify OpenGL context state belonging to WebKit, and ultimately leading to corrupted OpenGL state and a crash. I've tried modifying EGL_MakeCurrent (Source/ThirdParty/ANGLE/src/libGLESv2/entry_points_egl.cpp) to always call SetContextCurrent (and GraphicsContext3D::makeContextCurrent to always call EGL_MakeCurrent), but the crash persists.

Still investigating.
Comment 5 Kenneth Russell 2020-01-05 00:56:50 PST
Looking more deeply into ANGLE, it assumes no other code in the process is making OpenGL calls - not a good assumption when it's being embedded as a library (but not the base OpenGL ES implementation) in a larger application. Have tried modifying GraphicsContext3D::makeContextCurrent (Source/WebCore/platform/graphics/cocoa/GraphicsContext3DCocoa.mm) to forcibly set ANGLE's internal CGLContextObj to be current, via CGLSetCurrentContext:

...
#elif USE(ANGLE)
    bool res = EGL_MakeCurrent(m_displayObj, EGL_NO_SURFACE, EGL_NO_SURFACE, m_contextObj);
    if (res) {
        EGLDeviceEXT device = nullptr;
        EGL_QueryDisplayAttribEXT(m_displayObj, EGL_DEVICE_EXT, reinterpret_cast<EGLAttrib*>(&device));
        CGLContextObj cglContext = nullptr;
        EGL_QueryDeviceAttribEXT(device, EGL_CGL_CONTEXT_ANGLE, reinterpret_cast<EGLAttrib*>(&cglContext));
        CGLSetCurrentContext(cglContext);
    }
    return res;
#endif
...

Have added some logging and confirmed that CGLSetCurrentContext is being called for each WebGL entry point. Unfortunately, this doesn't work around the problem; the same crash happens after reloading the page once or twice. I'd like to figure out a way to reset the CGLContextObj to null upon returning from the JavaScript callback that's making WebGL API calls, to see whether that helps isolate ANGLE from the rest of WebKit's rendering.

Have looked a bit into RenderIFrame and superclasses, PlatformCALayerCocoa and the layer types it uses, GraphicsLayerCA, TileController, WidgetMac.mm, and some others, to try to understand how iframes are rendered. Are they another CALayer in the hierarchy? Do they have their own NSView? The question remains why the iframe running this test is disappearing when the ANGLE backend is enabled, and what interference there is between ANGLE's OpenGL context and the rest of WebKit's rendering that's causing this crash in the OpenGL driver.

Still investigating - would appreciate any feedback on the questions above.
Comment 6 Kenneth Russell 2020-01-05 01:35:57 PST
In GraphicsContext3D::markLayerComposited (GraphicsContext3DANGLE.cpp), tried resetting ANGLE's notion of the EGL context to EGL_NO_CONTEXT, and calling CGLSetCurrentContext(nullptr). WebGL still works, but the crash still persists. If a workaround like this is to be attempted then it needs to be bounded more tightly to the JavaScript callback currently running WebGL.
Comment 7 Kenneth Russell 2020-01-05 13:23:30 PST
It turns out the disappearance of the iframe was unrelated. Apple's layout test wrapper for the WebGL conformance suite removes the iframe if all sub-tests pass, and for this test, the switch to ANGLE fixes the remaining two preexisting failures.

Remembering that a workaround was recently added to Chromium for an Intel graphics driver bug where operations were being reordered with respect to glBindFramebuffer calls, I dug in the version history and found this:

Fixed WebGL flush bug for old Intel driver
https://chromium-review.googlesource.com/1912790

Applying this workaround in ANGLE conclusively works around this crash in the OpenGL driver! Running the test in the Minibrowser, it can be reloaded 40+ times reliably, where it used to crash within the first couple of reloads.

It's likely to solve several of the other layout test flakes caused by the switch to ANGLE. Likely not all, though - will have to analyze them all case-by-case.

Patch incoming. It will be upstreamed to ANGLE under this bug:

Need flush workaround before/after bindFramebuffer on Intel drivers on macOS
https://bugs.chromium.org/p/angleproject/issues/detail?id=4267
Comment 8 Kenneth Russell 2020-01-05 13:28:55 PST
Created attachment 386796 [details]
Patch
Comment 9 EWS Watchlist 2020-01-05 13:29:49 PST
Note that there are important steps to take when updating ANGLE. See http://trac.webkit.org/wiki/UpdatingANGLE
Comment 10 Kenneth Russell 2020-01-05 14:04:00 PST
This patch is being upstreamed to ANGLE under:
https://chromium-review.googlesource.com/1987932

I think it's worth adding to WebKit's copy of ANGLE sooner rather than later because this may fix many flaky failures of the WebGL layout / conformance tests.

Raising bug to P1 because this was a crash.
Comment 11 WebKit Commit Bot 2020-01-05 15:22:51 PST
Comment on attachment 386796 [details]
Patch

Clearing flags on attachment: 386796

Committed r254045: <https://trac.webkit.org/changeset/254045>
Comment 12 WebKit Commit Bot 2020-01-05 15:22:53 PST
All reviewed patches have been landed.  Closing bug.
Comment 13 Kenneth Russell 2020-01-06 17:13:26 PST
Upon review of this patch upstream in ANGLE in https://chromium-review.googlesource.com/1987932 , more effort was devoted to creating a reduced test case, and it came to light that this workaround is not really sufficient.

Will file a new bug, block it on this one, and propose a revised workaround.