RESOLVED FIXED 217655
[GTK][WPE] PlatformDisplay::terminateEglDisplays() is never called
https://bugs.webkit.org/show_bug.cgi?id=217655
Summary [GTK][WPE] PlatformDisplay::terminateEglDisplays() is never called
Claudio Saavedra
Reported 2020-10-13 07:09:55 PDT
Despite it being installed as a std::atexit() handler, it seems that on web process termination the shutDownEglDisplays is not called at all. Is it possible that the web process is exiting in a way that prevents std::atexit() handlers from being called?
Attachments
patch (7.05 KB, patch)
2022-01-24 04:34 PST, Pablo Saavedra
ews-feeder: commit-queue-
patch (7.26 KB, patch)
2022-01-24 06:08 PST, Pablo Saavedra
no flags
patch (7.33 KB, patch)
2022-01-24 08:00 PST, Pablo Saavedra
no flags
patch (7.45 KB, patch)
2022-01-24 11:30 PST, Pablo Saavedra
ews-feeder: commit-queue-
Simpler patch (2.22 KB, patch)
2022-02-15 07:29 PST, Carlos Garcia Campos
no flags
psaavedra patch version 2 (4.60 KB, patch)
2022-02-18 10:46 PST, Pablo Saavedra
no flags
psaavedra patch version 3 (5.04 KB, patch)
2022-02-21 10:16 PST, Pablo Saavedra
no flags
psaavedra patch version 3 (7.66 KB, patch)
2022-02-22 00:22 PST, Pablo Saavedra
no flags
psaavedra patch version 3 (8.51 KB, patch)
2022-02-22 00:24 PST, Pablo Saavedra
no flags
psaavedra patch version 3 (8.51 KB, patch)
2022-02-22 00:25 PST, Pablo Saavedra
no flags
psaavedra patch version 4 (4.64 KB, patch)
2022-02-22 04:00 PST, Pablo Saavedra
no flags
psaavedra patch version 5 (4.76 KB, patch)
2022-02-22 06:00 PST, Pablo Saavedra
no flags
Claudio Saavedra
Comment 1 2021-09-29 07:22:25 PDT
I've been looking at this again due to a downstream bug with wpewebkit. Turns out that most of the times, the webprocess exists when one of the IPC connections to the ui process is closed. This happens via a couple of handlers that are installed in both WebProcess and AuxiliaryProcess. These handlers use _exit(2) to ensure that, when the connection with the UI process is gone, the process is not left lingering. What is important about this is that _exit(2) does NOT run the handlers that are installed via on_exit(3) or atexit(3). My observations indicate that most of the times the web process is exiting via _exit(2), which would explain why this method (now a lambda, btw, after r261905) is not always called.
Adrian Perez
Comment 2 2021-09-29 07:53:22 PDT
(In reply to Claudio Saavedra from comment #1) > I've been looking at this again due to a downstream bug with wpewebkit. > Turns out that most of the times, the webprocess exists when one of the IPC > connections to the ui process is closed. This happens via a couple of > handlers that are installed in both WebProcess and AuxiliaryProcess. These > handlers use _exit(2) to ensure that, when the connection with the UI > process is gone, the process is not left lingering. > > What is important about this is that _exit(2) does NOT run the handlers that > are installed via on_exit(3) or atexit(3). My observations indicate that > most of the times the web process is exiting via _exit(2), which would > explain why this method (now a lambda, btw, after r261905) is not always > called. This means that the whole idea of using an atexit() handler to termite the EGL displays is only working sometimes due to the cases in which _exit() is used and the handlers (and display termination) get skipped. What was the reason to begin with to EGL display termination here instead of somewhere else like WebProcessMain::platformTerminate()? Also at the point where the atexit() handlers are invoked, surely we do not have many things around anymore, so for example what would happen if the PlatformDisplay implementation would needs to, let's say, spin some iterations of a RunLoop?
Michael Catanzaro
Comment 3 2021-09-29 10:11:49 PDT
I wonder what would break if we were to simply change _exit() to normal exit()? Might need to keep using _exit() #if PLATFORM(COCOA) is defined.
Claudio Saavedra
Comment 4 2021-09-29 11:27:14 PDT
(In reply to Michael Catanzaro from comment #3) > I wonder what would break if we were to simply change _exit() to normal > exit()? The bug I am analyzing currently is that the eglTerminate() call happening in the aforementioned lambda is actually hanging on a poll() call and the web process never exists. I'll report that later in a separate bug if I find it out to be a bug in WebKit (could be a downstream driver issue or wpe backend, etc), but right now what I am wondering is whether this lambda is serving any purpose at all. Tomorrow I'll be testing using exit(3) instead, I'll let you know what I find out.
Claudio Saavedra
Comment 5 2021-09-29 11:28:42 PDT
(In reply to Adrian Perez from comment #2) > > What was the reason to begin with to EGL display termination here instead > of somewhere else like WebProcessMain::platformTerminate()? Also at the > point where the atexit() handlers are invoked, surely we do not have many > things around anymore, so for example what would happen if the > PlatformDisplay > implementation would needs to, let's say, spin some iterations of a RunLoop? This is explained in a comment in PlatformDisplay: // EGL registers atexit handlers to cleanup its global display list. // Since the global PlatformDisplay instance is created before, // when the PlatformDisplay destructor is called, EGL has already removed the // display from the list, causing eglTerminate() to crash. So, here we register // our own atexit handler, after EGL has been initialized and after the global // instance has been created to ensure we call eglTerminate() before the other // EGL atexit handlers and the PlatformDisplay destructor. // See https://bugs.webkit.org/show_bug.cgi?id=157973. I have my doubts that this is doing anything useful in practice, but tomorrow we'll know more.
Claudio Saavedra
Comment 6 2021-10-06 09:48:24 PDT
I have not been able to get to the bottom of this, but I did the following test with interesting results. I changed both AuxiliarProcess and WebProcess classes to actually call exit(2) on disconnection instead of _exit(2), to make sure that the exit handlers are actually called. What happens is that very often the web process crashes during termination. The crashes happen all in Wayland-related calls as a result of the eglTerminate() call in the exit handler that is now getting called. I have suspicions that some of the issues that were addressed in #157973 and #174789 are still there, just that they are hidden because very often the web process is exiting without calling this exit handler. Unfortunately I am not very familiar with all of the display/egl/wayland logic behind this, so analyzing this by myself is probably going to take some time.
Michael Catanzaro
Comment 7 2021-10-06 11:46:07 PDT
(In reply to Claudio Saavedra from comment #5) > I have my doubts that this is doing anything useful in practice I'm pretty sure we should not need to manually deinitialize EGL, so it should be OK and practical to just drop all the exit handlers and sidestep the issue. At least, I think so. Trying to make exit handlers work as expected is hard and probably not worthwhile. That said, I would worry that this may have been added as a workaround for some embedded device bug. Say some proprietary graphics driver fails to release a global resource if the process exits without first calling eglTerminate (or if it crashes). If so, Miguel would probably know about it.
Carlos Garcia Campos
Comment 8 2021-10-07 00:43:35 PDT
(In reply to Michael Catanzaro from comment #7) > (In reply to Claudio Saavedra from comment #5) > > I have my doubts that this is doing anything useful in practice > > I'm pretty sure we should not need to manually deinitialize EGL, so it > should be OK and practical to just drop all the exit handlers and sidestep > the issue. At least, I think so. Trying to make exit handlers work as > expected is hard and probably not worthwhile. > > That said, I would worry that this may have been added as a workaround for > some embedded device bug. Say some proprietary graphics driver fails to > release a global resource if the process exits without first calling > eglTerminate (or if it crashes). If so, Miguel would probably know about it. The problem is that mesa install atexit handlers to shutdown EGL, IIRC the original problem was that we were getting crashes when EGL was uninitialized by mesa because we had already destroyed things. Just by adding our own exit handlers things happened earlier and crashes didn't happen. I don't remember the details, but I think it was something like this.
Michael Catanzaro
Comment 9 2021-10-07 07:07:43 PDT
Ugh, so then another option would be to ensure we always exit using _exit() rather than exit(). One thing's for sure: switching between _exit() and exit() depending on how the web process quits is inconsistent and not great!
Adrian Perez
Comment 10 2021-10-07 07:52:40 PDT
(In reply to Michael Catanzaro from comment #9) > Ugh, so then another option would be to ensure we always exit using _exit() > rather than exit(). Well, then the EGL displays will never be terminated because _exit() directly does the syscall() to tell the kernel to kill the process without running any atexit() handlers _at all_. > One thing's for sure: switching between _exit() and exit() depending on how > the web process quits is inconsistent and not great! Yes, it's confusing. If there is some good reason (at all) to use _exit() IMO it should be explained with a comment next to the code.
Claudio Saavedra
Comment 11 2021-10-08 01:47:36 PDT
Some git history: - The _exit() call in AuxiliaryProcess was added by cdumez as part of bug #177972 to prevent processes from taking too long to exit. It is not clear to me whether using exit(2) instead would have been acceptable too, I think it should be OK. - The one in WebProcess is basically a specialization of what should be done on connection closure and was added as part of bug #190294, also by cdumez. Similarly, I don't see a reason why exit() can't be used. Chris, do you think we could safely switch _exit() to exit() in these classes? There are also other uses of _exit() for similar reasons that are not relevant to this bug (in PluginProcess and NetworkProcess), that I think should probably also be changed to exit(), but I don't have a strong opinion there since we don't have exit handlers that need to be honored there.
Claudio Saavedra
Comment 12 2021-10-18 08:46:03 PDT
What I have learned since my last update on this topic. - IPCConnection provides two different mechanisms for clients to be made aware of a connection getting closed: 1. A didClose() client method that gets called from the main thread (via the main loop). This is implemented by AuxilaryProcess and _exit() gets called there. 2. Also, clients can connect a callback to be called directly from the worker thread *before* the aforementioned didClose() method. This was added to make sure that interested clients can quickly be made aware of disconnection, even if the main thread is stuck for whichever reason. WebProcess (subclass of AuxiliaryProcess) installs this callback and calls _exit() directly so that as soon as disconnection happens, the web process exits. Both of these cases are problematic for the web process, given that the web process relies on an atexit() handler being called on shutdown to be able to terminate EGL displays properly, via eglTerminate(). Directly replacing _exit() with exit() in those two methods does NOT work straight away, because it is possible to run into concurrency shutdown issues. In particular, calling exit() from a thread different from the main one can conflict with the program terminating concurrently from the main thread (via exiting main() or the exit() call in AuxiliaryProcess), which can cause use-after-free errors and all kind of nasty behavior. A correct solution for this, preserving the atexit() handler used to call eglTeminate() would involve the following: - Termination of the web process should always invoke exit handlers. Not only in order to terminate EGL displays, but also because EGL might have its own exit handlers that need to be called. - This call should always happen from the main thread. Web processes in ports that rely on exit handlers cannot bypass exit handlers via _exit(). A two-fold solution for these ports would be: 1. Do not call connection->setDidCloseOnConnectionWorkQueueCallback(callExit); in WebProcess::initializeConnection() to avoid any handling right in the worker thread, and 2. AuxiliaryProcess::didClose() should use exit() instead of _exit().
Pablo Saavedra
Comment 13 2021-12-02 07:57:02 PST
Testing the Claudio's proposal I've noticed the WebProcess is hanged in the eglTerminate() called from the PlatformDisplay::terminateEGLDisplay during the exit handler execution for the WebProcess. This is because the EGL resources wasn't correctly released at that point. One way to avoid is to call the `.close()` method for all the WebPages associated to the closed WebProcess. This eventually will invalidate all the ThreadedCompositor objects associated to those WebPages what it will end-up in the releasing of the EGLS resources and it will prevent the WebProcess hang in the eglTerminate(). For this purpose something like this works: https://paste.debian.net/1221681/
Pablo Saavedra
Comment 14 2022-01-07 04:07:05 PST
Also I've noticed that this commit (https://bugs.webkit.org/show_bug.cgi?id=214307) have removed some specific logic that affects to the WPE and GTK ports: ``` commit 813abc09e84c9a406836319bfa8f7fb83d378605 Author: cdumez@apple.com <cdumez@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc> Date: Wed Jul 15 18:23:22 2020 +0000 [WK2] Make establishing a connection between the WebProcess and the Network process more robust https://bugs.webkit.org/show_bug.cgi?id=214307 <rdar://problem/64592340> Reviewed by Alex Christensen. ``` ``` diff diff --git a/Source/WebKit/WebProcess/WebProcess.cpp b/Source/WebKit/WebProcess/WebProcess.cpp index 6f2d8fad165f..c73c43e9857c 100644 --- a/Source/WebKit/WebProcess/WebProcess.cpp +++ b/Source/WebKit/WebProcess/WebProcess.cpp @@ -1100,23 +1100,28 @@ void WebProcess::setInjectedBundleParameters(const IPC::DataReference& value) static NetworkProcessConnectionInfo getNetworkProcessConnection(IPC::Connection& connection) { NetworkProcessConnectionInfo connectionInfo; - if (!connection.sendSync(Messages::WebProcessProxy::GetNetworkProcessConnection(), Messages::WebProcessProxy::GetNetworkProcessConnection::Reply(connectionInfo), 0)) { - // If we failed the first time, retry once. The attachment may have become invalid - // before it was received by the web process if the network process crashed. - if (!connection.sendSync(Messages::WebProcessProxy::GetNetworkProcessConnection(), Messages::WebProcessProxy::GetNetworkProcessConnection::Reply(connectionInfo), 0)) { -#if PLATFORM(GTK) || PLATFORM(WPE) - // GTK+ and WPE ports don't exit on send sync message failure. - // In this particular case, the network process can be terminated by the UI process while the - // Web process is still initializing, so we always want to exit instead of crashing. This can - // happen when the WebView is created and then destroyed quickly. - // See https://bugs.webkit.org/show_bug.cgi?id=183348. + auto requestConnection = [&] { + if (!connection.isValid()) { + // Connection to UIProcess has been severed, exit cleanly. exit(0); -#else - CRASH(); -#endif ... ``` The change removes the specific logic for WPE and GTK added in: ``` commit f0627f5287186ec975c3439e9bcfe119493adc95 Author: carlosgc@webkit.org <carlosgc@webkit.org@268f45cc-cd09-0410-ab3c-d52691b4dbfc> Date: Thu May 3 08:45:32 2018 +0000 REGRESSION(r222772): [GTK][WPE] WebProcess from WebKitGtk+ 2.19.9x SIGSEVs in WebKit::WebProcess::ensureNetworkProcessConnection() at Source/WebKit/WebProcess/WebProcess.cpp:1127 https://bugs.webkit.org/show_bug.cgi?id=183348 ``` The comment in https://bugs.webkit.org/show_bug.cgi?id=183348#c40 offers a justification for that code.
Pablo Saavedra
Comment 15 2022-01-14 02:26:30 PST
(In reply to Pablo Saavedra from comment #14) > Also I've noticed that this commit > (https://bugs.webkit.org/show_bug.cgi?id=214307) have removed some specific > logic that affects to the WPE and GTK ports: > ... Moved aside to https://bugs.webkit.org/show_bug.cgi?id=235224
Pablo Saavedra
Comment 16 2022-01-24 04:34:55 PST
Pablo Saavedra
Comment 17 2022-01-24 06:08:23 PST
Pablo Saavedra
Comment 18 2022-01-24 08:00:35 PST
Michael Catanzaro
Comment 19 2022-01-24 09:21:52 PST
Comment on attachment 449816 [details] patch View in context: https://bugs.webkit.org/attachment.cgi?id=449816&action=review > Source/WebKit/WebProcess/WebProcess.cpp:362 > +#if !PLATFORM(WPE) > // We call _exit() directly from the background queue in case the main thread is unresponsive > // and AuxiliaryProcess::didClose() does not get called. > connection->setDidCloseOnConnectionWorkQueueCallback(callExit); > +#endif Is this change really WPE-specific? Can it go into the next group of preprocessor guards below? The platform-specific termination behavior that we currently have is already very confusing. :/
Pablo Saavedra
Comment 20 2022-01-24 11:29:34 PST
(In reply to Michael Catanzaro from comment #19) > Comment on attachment 449816 [details] > patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=449816&action=review > > > Source/WebKit/WebProcess/WebProcess.cpp:362 > > +#if !PLATFORM(WPE) > > // We call _exit() directly from the background queue in case the main thread is unresponsive > > // and AuxiliaryProcess::didClose() does not get called. > > connection->setDidCloseOnConnectionWorkQueueCallback(callExit); > > +#endif > > Is this change really WPE-specific? Can it go into the next group of > preprocessor guards below? > > The platform-specific termination behavior that we currently have is already > very confusing. :/ No, it is not. Thanks for point to this. It should be both glib ports: WPE and GTK. Regarding your second comment, not sure if it could be incorporated to the `#if !PLATFORM(GTK) && !PLATFORM(WPE) && !ENABLE(IPC_TESTING_API)` block. I'd say it could be but I don't know what implications exactly have IPC_TESTING_API (added in r268239) but I tempted to say both things have nothing to do one with the another. Let's manage them separately.
Pablo Saavedra
Comment 21 2022-01-24 11:30:13 PST
Carlos Garcia Campos
Comment 22 2022-02-03 03:04:31 PST
Comment on attachment 449837 [details] patch View in context: https://bugs.webkit.org/attachment.cgi?id=449837&action=review > Source/WebCore/platform/graphics/PlatformDisplay.cpp:277 > + eglMakeCurrent(m_eglDisplay, EGL_NO_SURFACE, EGL_NO_SURFACE, EGL_NO_CONTEXT); This is clearly right, so we can probably split this patch and land this in a separate bug. > Source/WebKit/Shared/AuxiliaryProcess.cpp:63 > - _exit(EXIT_SUCCESS); > + exit(EXIT_SUCCESS); I think other platforms don't want to exit() because _exit() should be faster. > Source/WebKit/WebProcess/WebProcess.cpp:888 > + platformTerminate(); > + > + AuxiliaryProcess::terminate(); Why do we call terminate here? > Source/WebKit/WebProcess/WebProcess.cpp:895 > + for (auto& webPage : m_pageMap.values()) > + webPage->close(); I think it's expected that page map is empty when terminate is called. Why isn't empty here? I think we need a way to reproduce this or a test to fully understand the problem. > Source/WebKit/WebProcess/WebProcess.cpp:897 > + AuxiliaryProcess::didClose(connection); terminate() stops the run loop, so the exit() from the main function will be called and shouldn't need to call didClose here, no?
Carlos Garcia Campos
Comment 23 2022-02-15 07:29:03 PST
Created attachment 452022 [details] Simpler patch Pablo, could you check if this patch is enough? I see terminate egl display always called after applying it.
Michael Catanzaro
Comment 24 2022-02-15 08:48:19 PST
Comment on attachment 452022 [details] Simpler patch View in context: https://bugs.webkit.org/attachment.cgi?id=452022&action=review > Source/WebKit/Shared/AuxiliaryProcess.cpp:67 > +#if PLATFORM(GTK) || PLATFORM(WPE) > + stopRunLoop(); > +#else > _exit(EXIT_SUCCESS); > +#endif Really requires a comment to explain our different behavior. > Source/WebKit/WebProcess/WebProcess.cpp:260 > +#if !PLATFORM(GTK) && !PLATFORM(WPE) Ditto. > Source/WebKit/WebProcess/WebProcess.cpp:368 > +#if !PLATFORM(GTK) && !PLATFORM(WPE) > // We call _exit() directly from the background queue in case the main thread is unresponsive > // and AuxiliaryProcess::didClose() does not get called. > connection->setDidCloseOnConnectionWorkQueueCallback(callExit); > +#endif Ditto. Might be able to convert callExit into a lambda, then that would be one fewer place that would need the platform-specific bits.
Pablo Saavedra
Comment 25 2022-02-17 04:02:25 PST
(In reply to Carlos Garcia Campos from comment #23) > Created attachment 452022 [details] > Simpler patch > > Pablo, could you check if this patch is enough? I see terminate egl display > always called after applying it. We are experimenting the hang only for a specific case with the EGL implementation of the Galcore Vivante driver (6.4.0.p2.4-aarch32 ) [1], when we added eglMakeCurrent() ( https://bugs.webkit.org/show_bug.cgi?id=236766 ). Only in this specific scenario we are able to reproduce the hang. This points to a defective leaky of the driver on this regard since one could not a hang even if there are still EGL resources not released yet. After an exhaustive investigation what I found is the way to prevent this hang is by ensuring all the pages are closed before the eglMakeCurrent() is invoked. + // Closing WebPages before exit. This ends-up in the invalidation of the + // ThreadedCompositor objects associated to those WebPages. This action + // will release the EGL resources preventing the WebProcess hang in the + // eglTerminate() during a exit() call. + for (auto& webPage : m_pageMap.values()) + webPage->close(); So, long story short the problem manifested in a specific situation using a specific driver and a proprietary driver but, at the same times, it shows what things depends on others during a WebProcess "terminate" situation. [1] https://www.nxp.com/docs/en/release-note/i.MX_Linux_RN.pdf
Pablo Saavedra
Comment 26 2022-02-17 04:39:06 PST
(In reply to Carlos Garcia Campos from comment #22) > Comment on attachment 449837 [details] > patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=449837&action=review > > > Source/WebCore/platform/graphics/PlatformDisplay.cpp:277 > > + eglMakeCurrent(m_eglDisplay, EGL_NO_SURFACE, EGL_NO_SURFACE, EGL_NO_CONTEXT); > > This is clearly right, so we can probably split this patch and land this in > a separate bug. Splitted off: https://bugs.webkit.org/show_bug.cgi?id=236766 > > > Source/WebKit/Shared/AuxiliaryProcess.cpp:63 > > - _exit(EXIT_SUCCESS); > > + exit(EXIT_SUCCESS); > > I think other platforms don't want to exit() because _exit() should be > faster. Let's guard this with `#if PLATFORM(GTK) && PLATFORM(WPE)` > > > Source/WebKit/WebProcess/WebProcess.cpp:888 > > + platformTerminate(); > > + > > + AuxiliaryProcess::terminate(); > > Why do we call terminate here? No reason. At the beginning mimicry with the WebProcess::terminate() method. If this could be useful I suppose the right way ti should be to add something like WebProcess::platformDidClose() or something like that. The true is there is the glib platform does implement nothing specific for this platform so this call into this didClose() function has no effects in the scope of this path. Source/WebKit/WebProcess/glib/WebProcessGLib.cpp:void WebProcess::platformTerminate() { } > > > Source/WebKit/WebProcess/WebProcess.cpp:895 > > + for (auto& webPage : m_pageMap.values()) > > + webPage->close(); > > I think it's expected that page map is empty when terminate is called. Why > isn't empty here? I think we need a way to reproduce this or a test to fully > understand the problem. > Any idea or suggestion on this regard. > > Source/WebKit/WebProcess/WebProcess.cpp:897 > > + AuxiliaryProcess::didClose(connection); > > terminate() stops the run loop, so the exit() from the main function will be > called and shouldn't need to call didClose here, no? ... (I will reply this in another post)
Pablo Saavedra
Comment 27 2022-02-17 05:29:53 PST
> > Source/WebKit/WebProcess/WebProcess.cpp:897 > > + AuxiliaryProcess::didClose(connection); > > terminate() stops the run loop, so the exit() from the main function will be > called and shouldn't need to call didClose here, no? The didClose() method is also called in the main thread if the connection is closed, for example, if there is a problem in the IPC::Connection or if the WebProcess is replace by another for example due to CSP. For example, in situations where a page (domain.A) loads a new content from another domain (domain.B) using `window.location = "domain.B"`: ``` - Starting page on domain.A - Web extension initializing for pid 590 ... - Load started on domain.A - Load committed on domain.A ... (Later when the redirection starts ...) ... - Load started on domain.B ... The PlatformDisplay:atExit() is invoked for the pid 590 ... AuxiliaryProcess::didClose() exiting here for the pid 590 ... PlatformDisplay:atExit(): calling eglTerminate() - PID: 590 ``` The PlatformDisplay:atExit() is invoked in the previous old WebProcess because the AuxiliaryProcess::didClose() is called.
Pablo Saavedra
Comment 28 2022-02-18 01:02:19 PST
Comment on attachment 452022 [details] Simpler patch View in context: https://bugs.webkit.org/attachment.cgi?id=452022&action=review >> Source/WebKit/Shared/AuxiliaryProcess.cpp:67 >> +#endif > > Really requires a comment to explain our different behavior. e.g: // _exit() does not call the atexit handlers that we are using in some places. // We rely on exit handlers to cleanup EGL resources, and this handler needs to // be called from the main thread. // // For glib based ports we call the stopRunLoop() what will endup in a normal exit. >> Source/WebKit/WebProcess/WebProcess.cpp:260 >> +#if !PLATFORM(GTK) && !PLATFORM(WPE) > > Ditto. The only place where the callExit() function is called is now already safe-guarded for non WPE/GTK ports so it it is clear this is not being in use for these ports: #if !PLATFORM(GTK) && !PLATFORM(WPE) // We call _exit() directly from the background queue in case the main thread is unresponsive // and AuxiliaryProcess::didClose() does not get called. connection->setDidCloseOnConnectionWorkQueueCallback(callExit); #endif > Source/WebKit/WebProcess/WebProcess.cpp:363 > I'd suggest some comment here like this: // For GTK and WPE ports, do not fast-track an _exit() from a worker thread on IPC disconnection
Carlos Garcia Campos
Comment 29 2022-02-18 03:13:03 PST
(In reply to Pablo Saavedra from comment #25) > (In reply to Carlos Garcia Campos from comment #23) > > Created attachment 452022 [details] > > Simpler patch > > > > Pablo, could you check if this patch is enough? I see terminate egl display > > always called after applying it. > > We are experimenting the hang only for a specific case with the EGL > implementation of the Galcore Vivante driver (6.4.0.p2.4-aarch32 > ) [1], when we added eglMakeCurrent() ( > https://bugs.webkit.org/show_bug.cgi?id=236766 ). Only in this specific > scenario we are able to reproduce the hang. > > This points to a defective leaky of the driver on this regard since one > could not a hang even if there are still EGL resources not released yet. > > After an exhaustive investigation what I found is the way to prevent this > hang is by ensuring all the pages are closed before the eglMakeCurrent() is > invoked. > > + // Closing WebPages before exit. This ends-up in the invalidation of the > + // ThreadedCompositor objects associated to those WebPages. This action > + // will release the EGL resources preventing the WebProcess hang in the > + // eglTerminate() during a exit() call. > + for (auto& webPage : m_pageMap.values()) > + webPage->close(); But at this point the webpage shuld havce already been destroyed, and that's the case in my tests. So we need to figure out why pages are still alive for you at this point. > > So, long story short the problem manifested in a specific situation using a > specific driver and a proprietary driver but, at the same times, it shows > what things depends on others during a WebProcess "terminate" situation. > > [1] https://www.nxp.com/docs/en/release-note/i.MX_Linux_RN.pdf
Pablo Saavedra
Comment 30 2022-02-18 10:36:50 PST
(In reply to Carlos Garcia Campos from comment #29) > (In reply to Pablo Saavedra from comment #25) > > (In reply to Carlos Garcia Campos from comment #23) > > > Created attachment 452022 [details] > > > Simpler patch > > > > > > Pablo, could you check if this patch is enough? I see terminate egl display > > > always called after applying it. > > > > We are experimenting the hang only for a specific case with the EGL > > implementation of the Galcore Vivante driver (6.4.0.p2.4-aarch32 > > ) [1], when we added eglMakeCurrent() ( > > https://bugs.webkit.org/show_bug.cgi?id=236766 ). Only in this specific > > scenario we are able to reproduce the hang. > > > > This points to a defective leaky of the driver on this regard since one > > could not a hang even if there are still EGL resources not released yet. > > > > After an exhaustive investigation what I found is the way to prevent this > > hang is by ensuring all the pages are closed before the eglMakeCurrent() is > > invoked. > > > > + // Closing WebPages before exit. This ends-up in the invalidation of the > > + // ThreadedCompositor objects associated to those WebPages. This action > > + // will release the EGL resources preventing the WebProcess hang in the > > + // eglTerminate() during a exit() call. > > + for (auto& webPage : m_pageMap.values()) > > + webPage->close(); > > But at this point the webpage shuld havce already been destroyed, and that's > the case in my tests. So we need to figure out why pages are still alive for > you at this point. > I agree but at this point I can not tell you why. Instead of, what I can show you is the evidence. The following traces are based on your simplified version of the patch but with `fprintf(stderr, )` before and after the EGL calls in the exit handler and still with the specialization of the didClose() method that I did for the WebProcess: Trace 1: +void WebProcess::didClose(IPC::Connection& connection) +{ + // Closing WebPages before exit. This ends-up in the invalidation of the + // ThreadedCompositor objects associated to those WebPages. This action + // will release the EGL resources preventing the WebProcess hang in the + // eglTerminate() during a exit() call. + for (auto& webPage : m_pageMap.values()) + { + fprintf("XXX: WebProcess::didClose(): Detected page in m_pageMap. Executed webPage->close() in the didClose()\n"); + webPage->close(); + } + AuxiliaryProcess::didClose(connection); +} Trace 2: +void WebProcess::didClose(IPC::Connection& connection) +{ + // Closing WebPages before exit. This ends-up in the invalidation of the + // ThreadedCompositor objects associated to those WebPages. This action + // will release the EGL resources preventing the WebProcess hang in the + // eglTerminate() during a exit() call. + for (auto& webPage : m_pageMap.values()) + { + fprintf(" WebProcess::didClose(): Detected page in m_pageMap. NOT executed webPage->close() in the didClose()\n"); + // webPage->close(); + } + AuxiliaryProcess::didClose(connection); +} Trace 1: ``` Feb 18 17:19:02 board launcher[932]: XXX: WebProcess::didClose(): Detected page in m_pageMap. Executed webPage->close() in the didClose() Feb 18 17:19:03 board launcher[932]: XXX: AuxiliaryProcess::didClose() Feb 18 17:19:03 board launcher[932]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:19:03 board launcher[932]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:19:03 board launcher[932]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated Feb 18 17:19:56 board launcher[932]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:19:56 board launcher[932]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:19:56 board launcher[932]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated ... Feb 18 17:20:31 board launcher[932]: XXX: WebProcess::didClose(): Detected page in m_pageMap. Executed webPage->close() in the didClose() Feb 18 17:20:31 board launcher[932]: XXX: AuxiliaryProcess::didClose() Feb 18 17:20:31 board launcher[932]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:20:31 board launcher[932]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:20:31 board launcher[932]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated Feb 18 17:20:37 board launcher[932]: XXX: WebProcess::didClose(): Detected page in m_pageMap. Executed webPage->close() in the didClose() Feb 18 17:20:38 board launcher[932]: XXX: AuxiliaryProcess::didClose() Feb 18 17:20:38 board launcher[932]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:20:38 board launcher[932]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:20:38 board launcher[932]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated Feb 18 17:20:45 board launcher[932]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:20:45 board launcher[932]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:20:45 board launcher[932]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated Feb 18 17:20:52 board launcher[932]: XXX: WebProcess::didClose(): Detected page in m_pageMap. Executed webPage->close() in the didClose() Feb 18 17:20:52 board launcher[932]: XXX: AuxiliaryProcess::didClose() Feb 18 17:20:52 board launcher[932]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:20:52 board launcher[932]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:20:52 board launcher[932]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated Feb 18 17:20:58 board launcher[932]: XXX: WebProcess::didClose(): Detected page in m_pageMap. Executed webPage->close() in the didClose() Feb 18 17:20:59 board launcher[932]: XXX: AuxiliaryProcess::didClose() Feb 18 17:20:59 board launcher[932]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:20:59 board launcher[932]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:20:59 board launcher[932]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated Feb 18 17:21:04 board launcher[932]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:21:04 board launcher[932]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:21:04 board launcher[932]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated Feb 18 17:21:11 board launcher[932]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:21:11 board launcher[932]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:21:11 board launcher[932]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated ``` Trace 2: ``` user 973 0.9 2.1 264288 21512 ? SLl 17:23 0:01 \_ /usr/libexec/wpe-webkit-1.0/WPENetworkProcess 4 42 user 1135 4.3 6.4 324284 64252 ? SLl 17:24 0:06 \_ /usr/libexec/wpe-webkit-1.0/WPEWebProcess 10 56 user 1584 13.2 5.4 317024 54480 ? SLl 17:26 0:04 \_ /usr/libexec/wpe-webkit-1.0/WPEWebProcess 52 74 <<<<< root 1698 0.0 0.0 1836 348 pts/0 S+ 17:27 0:00 \_ grep WPE Feb 18 17:25:59 board launcher[936]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:25:59 board launcher[936]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:25:59 board launcher[936]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated Feb 18 17:26:09 board launcher[936]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:26:09 board launcher[936]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:26:09 board launcher[936]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated Feb 18 17:26:18 board launcher[936]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:26:18 board launcher[936]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:26:18 board launcher[936]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated Feb 18 17:26:26 board launcher[936]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:26:26 board launcher[936]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:26:26 board launcher[936]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated Feb 18 17:26:32 board launcher[936]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:26:32 board launcher[936]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:26:32 board launcher[936]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated Feb 18 17:26:39 board launcher[936]: XXX: WebProcess::didClose(): Detected page in m_pageMap. NOT executed webPage->close() in the didClose() Feb 18 17:26:39 board launcher[936]: XXX: AuxiliaryProcess::didClose() Feb 18 17:26:39 board launcher[936]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:26:39 board launcher[936]: XXX: eglMakeCurrent() 2 of 3 ... KO. exit hang in the eglTerminate() call !!!!!!!!!!!!!!!!!!! Prevents the termination of the PID 1584 ... Feb 18 17:26:51 board launcher[936]: XXX: WebProcess::didClose(): Detected page in m_pageMap. NOT executed webPage->close() in the didClose() Feb 18 17:26:51 board launcher[936]: XXX: AuxiliaryProcess::didClose() Feb 18 17:26:51 board launcher[936]: XXX: eglMakeCurrent() 1 of 3 Feb 18 17:26:51 board launcher[936]: XXX: eglMakeCurrent() 2 of 3 Feb 18 17:26:51 board launcher[936]: XXX: eglTerminate() 3 of 3. OK. WebProcess terminated ``` Observations: * The hang happens in the situation where there are still pages in the m_pageMap and we don't invoke the `close()` method for those pages. * The key point, I think, is that the webPage.close() triggers at last the ThreadedCompositor.invalidate() in ThreadedCompositor associated to the webPage: WebPage->m_drawingArea<DrawingAreaCoordinatedGraphics>->m_layerTreeHost<LayerTreeHost>->m_compositor(ThreadedCompositor).invalidate() The invalidate() method is the one that purges GL resources and destroy the ContextGL created by the ThreadedCompositor: m_scene->purgeGLResources(); m_context = nullptr; m_client.didDestroyGLContext(); * The reason why the page is still NOT close when the didClose() is called in that situation is simple because I don't see nowhere where all the webpages are closed before the exit. Is that an design decision?
Pablo Saavedra
Comment 31 2022-02-18 10:46:32 PST
Created attachment 452542 [details] psaavedra patch version 2
Pablo Saavedra
Comment 32 2022-02-21 10:16:04 PST
Created attachment 452743 [details] psaavedra patch version 3 This version adds an explanation of why the iterator with the webPage.close() is needed in in the didClose(): // Normally the didClose() is called through the // WebPageProxy::close() -> dispatch() callback but there is the // possibility of the didClose() being called due to an error in the // IPC connection, for example, in the sceneario of a fast webview // open/close action. In this situation the didClose() is invoked before the // webpage.close() is being called because the webkitWebViewDispose(). // By closing all the WebPages we will ensure that all the EGL resources // will be released before calling the eglTerminate() in the terminateEGLDisplay() // because this ends-up in the invalidation of the ThreadedCompositor EGL resources // associated to those WebPages. // // WebPage::WebPage() // webkitWebViewDispose() // WebPageProxy::close() -> dispatch() callback protectedProcess->send() // // // ... (Connection::open() being processed before the WebPage::close()) // Connection::open() // // Error in the connection: G_IO_HUP || condition & G_IO_ERR || condition & G_IO_NVAL -> connectionDidClose() // Connection::connectionDidClose() RunLoop::main().dispatch() -- protectedThis->m_client.didClose() // // WebProcess::didClose(): For each webPage in m_pageMap: webPage->close() // // ... (WebPage::close() being called to lated) // WebPage::close() // WebProcess::removeWebPage() // WebPage::~WebPage() // // PlatformDisplay::terminateEGLDisplay() for (auto& webPage : m_pageMap.values()) webPage->close();
Pablo Saavedra
Comment 33 2022-02-22 00:22:36 PST
Created attachment 452835 [details] psaavedra patch version 3
Pablo Saavedra
Comment 34 2022-02-22 00:24:01 PST
Created attachment 452836 [details] psaavedra patch version 3
Pablo Saavedra
Comment 35 2022-02-22 00:25:22 PST
Created attachment 452837 [details] psaavedra patch version 3
Pablo Saavedra
Comment 36 2022-02-22 04:00:59 PST
Created attachment 452853 [details] psaavedra patch version 4
Carlos Garcia Campos
Comment 37 2022-02-22 04:40:45 PST
Comment on attachment 452853 [details] psaavedra patch version 4 View in context: https://bugs.webkit.org/attachment.cgi?id=452853&action=review > Source/WebKit/WebProcess/WebProcess.cpp:364 > +// For GTK and WPE ports, do not fast-track an _exit() from a worker thread on IPC disconnection Do not call exit in background queue for GTK and WPE because we need to ensure atexit handlers are called in the main thread to cleanup resources like EGL displays. > Source/WebKit/WebProcess/glib/WebProcessGLib.cpp:83 > + for (auto& webPage : m_pageMap.values()) I've now realized that WebPage::close() calls WebProcess::removeWebPage(), so m_pageMap is modified while being iterated. Use copyToVector to copy the values() to a Vector instead of using the iterator directly. for (auto& webPage : copyToVector(m_pageMap.values())) webPage->close();
Pablo Saavedra
Comment 38 2022-02-22 06:00:28 PST
Created attachment 452859 [details] psaavedra patch version 5
Pablo Saavedra
Comment 39 2022-02-22 06:47:35 PST
(In reply to Carlos Garcia Campos from comment #37) > Comment on attachment 452853 [details] > psaavedra patch version 4 > I've now realized that WebPage::close() calls WebProcess::removeWebPage(), > so m_pageMap is modified while being iterated. Use copyToVector to copy the > values() to a Vector instead of using the iterator directly. > > for (auto& webPage : copyToVector(m_pageMap.values())) > webPage->close(); Good point. I was not about that WebProcess::removeWebPage() call into the WebPage::close().
Jonathan Bedard
Comment 40 2022-02-22 16:46:10 PST
Stopped https://ews-build.webkit.org/#/builders/70/builds/1030, the 3 failures it flagged are already known.
EWS
Comment 41 2022-02-23 01:33:24 PST
Committed r290360 (247678@main): <https://commits.webkit.org/247678@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 452859 [details].
Philippe Normand
Comment 42 2022-02-23 04:34:07 PST
Thanks for fixing this Pablo! It seems that we could now deinitialize GStreamer as well when terminating the WebProcess. That would allow things like the GStreamer leak tracer to be useful for us.
Pablo Saavedra
Comment 43 2022-02-23 06:16:06 PST
(In reply to Philippe Normand from comment #42) > Thanks for fixing this Pablo! It seems that we could now deinitialize > GStreamer as well when terminating the WebProcess. That would allow things > like the GStreamer leak tracer to be useful for us. Thanks Philippe. The idea was to avoid abrupt exits allowing then the exit handlers to be called. So yes, I think you can use a handler for that purpose now too.
Philippe Normand
Comment 44 2022-02-23 06:51:55 PST
robert.nagy
Comment 45 2022-03-28 03:41:19 PDT
Hi After upgrading to webkitgtk 2.36.0 from 2.34 we have noticed a crash on ever exit in all binaries using webkitgtk and reverting this commit fixes the issue. I assume there is either a race or somethig gets called twice and it is trying to destroy an already destroyed resource. Backtrace: #0 native () at /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/graphics/x11/PlatformDisplayX11.h:48 #1 sharedDisplay () at /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/graphics/x11/XUniqueResource.cpp:46 #2 deleteXResource () at /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/graphics/x11/XUniqueResource.cpp:79 #3 0x00000f5eea945e8c in reset () at /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/graphics/x11/XUniqueResource.h:86 #4 ~XUniqueResource () at /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/graphics/x11/XUniqueResource.h:77 #5 ~GLContextGLX () at /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/graphics/glx/GLContextGLX.cpp:341 #6 0x00000f5eea945f3f in ~GLContextGLX () at /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/graphics/glx/GLContextGLX.cpp:323 #7 0x00000f5eea955161 in operator() () at /usr/include/c++/v1/__memory/unique_ptr.h:57 #8 reset () at /usr/include/c++/v1/__memory/unique_ptr.h:318 #9 operator= () at /usr/include/c++/v1/__memory/unique_ptr.h:276 #10 ~PlatformDisplayX11 () at /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/graphics/x11/PlatformDisplayX11.cpp:76 #11 ~PlatformDisplayX11 () at /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/graphics/x11/PlatformDisplayX11.cpp:73 #12 0x00000f5ed8272df1 in _libc___cxa_finalize (dso=0x0) at /usr/src/lib/libc/stdlib/atexit.c:177 #13 0x00000f5ed826b1a1 in _libc_exit (status=0) at /usr/src/lib/libc/stdlib/exit.c:54 #14 0x00000f5c422ffa09 in _start ()
Carlos Garcia Campos
Comment 46 2022-03-29 01:49:18 PDT
(In reply to robert.nagy from comment #45) > Hi > > After upgrading to webkitgtk 2.36.0 from 2.34 we have noticed a crash on > ever exit > in all binaries using webkitgtk and reverting this commit fixes the issue. > > I assume there is either a race or somethig gets called twice and it is > trying > to destroy an already destroyed resource. > > Backtrace: > > #0 native () at > /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/ > graphics/x11/PlatformDisplayX11.h:48 > #1 sharedDisplay () at > /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/ > graphics/x11/XUniqueResource.cpp:46 > #2 deleteXResource () at > /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/ > graphics/x11/XUniqueResource.cpp:79 > #3 0x00000f5eea945e8c in reset () at > /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/ > graphics/x11/XUniqueResource.h:86 > #4 ~XUniqueResource () at > /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/ > graphics/x11/XUniqueResource.h:77 > #5 ~GLContextGLX () at > /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/ > graphics/glx/GLContextGLX.cpp:341 > #6 0x00000f5eea945f3f in ~GLContextGLX () at > /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/ > graphics/glx/GLContextGLX.cpp:323 > #7 0x00000f5eea955161 in operator() () at > /usr/include/c++/v1/__memory/unique_ptr.h:57 > #8 reset () at /usr/include/c++/v1/__memory/unique_ptr.h:318 > #9 operator= () at /usr/include/c++/v1/__memory/unique_ptr.h:276 > #10 ~PlatformDisplayX11 () at > /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/ > graphics/x11/PlatformDisplayX11.cpp:76 > #11 ~PlatformDisplayX11 () at > /home/pobj/webkitgtk4-2.36.0/webkitgtk-2.36.0/Source/WebCore/platform/ > graphics/x11/PlatformDisplayX11.cpp:73 > #12 0x00000f5ed8272df1 in _libc___cxa_finalize (dso=0x0) at > /usr/src/lib/libc/stdlib/atexit.c:177 > #13 0x00000f5ed826b1a1 in _libc_exit (status=0) at > /usr/src/lib/libc/stdlib/exit.c:54 > #14 0x00000f5c422ffa09 in _start () Let's track this in a new bug, see https://bugs.webkit.org/show_bug.cgi?id=238494
Note You need to log in before you can comment on or make changes to this bug.