Bug 176490

Summary: [GTK] Crash in WebCore::PlatformDisplayWayland::~PlatformDisplayWayland
Product: WebKit Reporter: Michael Catanzaro <mcatanzaro>
Component: WebKitGTKAssignee: Michael Catanzaro <mcatanzaro>
Status: RESOLVED FIXED    
Severity: Normal CC: bugs-noreply, cgarcia, clopez, commit-queue, magomez, mcatanzaro, wk, zan
Priority: P2    
Version: Other   
Hardware: PC   
OS: Linux   
See Also: https://bugs.webkit.org/show_bug.cgi?id=157973
https://bugzilla.gnome.org/show_bug.cgi?id=794228
https://bugs.webkit.org/show_bug.cgi?id=184405
Attachments:
Description Flags
Patch
none
Patch none

Description Michael Catanzaro 2017-09-06 18:31:51 PDT
This is an Epiphany search provider crash:

#0  0x00007f22ec477af7 in wl_list_insert_list (list=0x2524078, other=0x2524068)
    at /home/mcatanzaro/Projects/GNOME/wayland/src/wayland-util.c:89
No locals.
#1  0x00007f22e6e81389 in wl_priv_signal_emit (signal=0x2524068, data=0x2524000)
    at /home/mcatanzaro/Projects/GNOME/wayland/src/wayland-server.c:1955
        l = 0x2524000
        pos = 0x7f230640dfb0 <initial+944>
#2  0x00007f22e6e80215 in wl_display_destroy (display=0x2524000)
    at /home/mcatanzaro/Projects/GNOME/wayland/src/wayland-server.c:1077
        s = 0x7f2307aa8000
        next = 0x0
        global = 0x7f22e0ff7000
        gnext = 0x252c620
#3  0x00007f22fe89cf3c in WebCore::PlatformDisplayWayland::~PlatformDisplayWayland (
    this=0x7f22e0ff7000, __in_chrg=<optimized out>)
    at ../../Source/WebCore/platform/graphics/wayland/PlatformDisplayWayland.cpp:71
No locals.
#4  0x00007f22fe89cf84 in WebCore::PlatformDisplayWayland::~PlatformDisplayWayland (
    this=0x7f22e0ff7000, __in_chrg=<optimized out>)
    at ../../Source/WebCore/platform/graphics/wayland/PlatformDisplayWayland.cpp:72
No locals.
#5  0x00007f22feec6bf8 in std::default_delete<WebCore::PlatformDisplay>::operator() (
    this=0x7f2304293908 <WebCore::PlatformDisplay::sharedDisplay()::display>, __ptr=0x7f22e0ff7000)
    at /usr/include/c++/7/bits/unique_ptr.h:78
No locals.
#6  0x00007f22feec6869 in std::unique_ptr<WebCore::PlatformDisplay, std::default_delete<WebCore::PlatformDisplay> >::~unique_ptr (this=0x7f2304293908 <WebCore::PlatformDisplay::sharedDisplay()::display>, 
    __in_chrg=<optimized out>) at /usr/include/c++/7/bits/unique_ptr.h:268
        __ptr = @0x7f2304293908: 0x7f22e0ff7000
#7  0x00007f230607bc68 in __run_exit_handlers () from /lib64/libc.so.6
No symbol table info available.
#8  0x00007f230607bcba in exit () from /lib64/libc.so.6
No symbol table info available.
#9  0x00007f2306061511 in __libc_start_main () from /lib64/libc.so.6
No symbol table info available.
#10 0x00000000004041ca in _start ()
No symbol table info available.

I'll continue to vote in favor of getting rid of these buggy exit handlers altogether. I doubt we really need to be calling wl_display_destroy().
Comment 1 Michael Catanzaro 2017-09-11 17:16:30 PDT
Unless graphics folks have any ideas, we need to remove the PlatformDisplay exit handlers.
Comment 2 Michael Catanzaro 2017-10-25 20:31:42 PDT
(In reply to Michael Catanzaro from comment #1)
> Unless graphics folks have any ideas, we need to remove the PlatformDisplay
> exit handlers.

The exit handlers are not needed for program correctness as far as I'm aware, and they've been causing many, many crashes over the years. So I think it's time for them to go. At least this particular one.
Comment 3 Michael Catanzaro 2017-10-26 04:32:54 PDT
Created attachment 324987 [details]
Patch
Comment 4 Carlos Garcia Campos 2017-10-26 04:57:11 PDT
I think we need to understand why it's crashing and only if it's not possible to fix it we can remove the exit handlers. I don't remember why, but I think we shut down the egl displays for a reason. IIRC mesa has its own exit handlers to clean up stuff too.
Comment 5 Michael Catanzaro 2017-10-26 17:14:07 PDT
(In reply to Carlos Garcia Campos from comment #4)
> I think we need to understand why it's crashing and only if it's not
> possible to fix it we can remove the exit handlers. I don't remember why,
> but I think we shut down the egl displays for a reason.

If nobody knows why, we should stop doing it and see if anyone notices a problem.
Comment 6 Christian Stadelmann 2017-10-30 05:52:32 PDT
I am seeing this crash with liferea when running liferea headless to get GIR symbols, see https://github.com/lwindolf/liferea/issues/552 for details.
Comment 7 Michael Catanzaro 2018-02-23 11:10:37 PST
Comment on attachment 324987 [details]
Patch

Carlos, I don't understand why it's crashing, and the graphics people don't seem to have been able to debug it. Strongly suggest pushing this in now. If the exit handlers are needed to work around OS or driver bugs, then they should be reintroduced once they can be done without crashing.
Comment 9 Carlos Alberto Lopez Perez 2018-03-05 05:29:00 PST
(In reply to Carlos Garcia Campos from comment #4)
> I think we need to understand why it's crashing and only if it's not
> possible to fix it we can remove the exit handlers. I don't remember why,
> but I think we shut down the egl displays for a reason. IIRC mesa has its
> own exit handlers to clean up stuff too.

Does http://trac.webkit.org/r201595 help remembering?
Comment 10 Michael Catanzaro 2018-03-05 05:39:39 PST
(In reply to Carlos Alberto Lopez Perez from comment #9)
> Does http://trac.webkit.org/r201595 help remembering?

Note: it was an exit-time destructor before that patch, too (since the PlatformDisplay was not NeverDestroyed). The goal there was to move it to run earlier at exit time, to avoid a specific crash. But as long as we do work at exit time, there will always be more crashes....
Comment 11 Michael Catanzaro 2018-03-10 09:51:53 PST
(In reply to Michael Catanzaro from comment #8)
> Crash stats:
> 
> https://retrace.fedoraproject.org/faf/problems/bthash/
> ?bth=7712e893327fdae3c939ee81aa54d082ebbdc717
> 
> https://retrace.fedoraproject.org/faf/problems/bthash/
> ?bth=8adf67e9e11bb1765726e37b1f3d7028ca66a8db

I really underestimated the total number of crashes here. Those links are currently showing 704 total crashes. I missed this one:

https://retrace.fedoraproject.org/faf/reports/1793321/

which shows 2024 reports. That's 2728 total reports of this crash from Fedora alone.
Comment 12 Michael Catanzaro 2018-03-10 10:16:24 PST
(In reply to Michael Catanzaro from comment #11)
> which shows 2024 reports. That's 2728 total reports of this crash from
> Fedora alone.

I should add: only 688 unique reports (reports from users who haven't reported the crash before). So about four reports per unique user.
Comment 13 Michael Catanzaro 2018-03-10 17:24:35 PST
ephy-search-provider crashes 100% for me... I didn't notice for so long because the search provider was broken for most of the 3.26 cycle, and the update fixing it just hit Fedora 27.

Some debugging:

Since this is the search provider, we wind up taking this path in PlatformDisplay::createPlatformDisplay:

#if PLATFORM(WAYLAND)
    if (auto platformDisplay = PlatformDisplayWayland::create())
        return platformDisplay;
#endif

Notably, that's the only path where m_nativeDisplayOwned is true, so it's the only path where the native display is destroyed. That's why the crash only occurs for the search provider. So we destroy it here:

PlatformDisplayWayland::~PlatformDisplayWayland()
{
    if (m_nativeDisplayOwned == NativeDisplayOwned::Yes)
        wl_display_destroy(m_display);
}

That's definitely wrong, because this wl_display was created with wl_display_connect(), which is Wayland client API, but wl_display_destroy() is Wayland server API. So we must never call wl_display_destroy(). The right function to use is wl_display_disconnect(), which parallels the call to wl_display_connect() that we used to create the wl_display in PlatformDisplayWayland::create. It's confusing, though, because the UI process is both a Wayland client of the desktop compositor, and also and a Wayland server for the web process.

Now, with that fixed locally, there's no longer any crash inside wl_display_destroy(), and wl_display_disconnect() works fine. But I'm seeing a similar crash immediately after that, when the wl_compositor is destroyed. I suspect this indicates the root of the problem: we're trying to use the same wl_display as both a Wayland client and Wayland server, and I guess that isn't right. It doesn't happen when there's a GtkWindow because then GDK's wl_display is used as the client display instead.

Thread 1 "epiphany-search" received signal SIGSEGV, Segmentation fault.
0x00007fffed7fc919 in wl_map_insert_at (map=<optimized out>, 
    flags=<optimized out>, i=<optimized out>, data=<optimized out>)
    at src/wayland-util.c:247
247	src/wayland-util.c: No such file or directory.
(gdb) bt
#0  0x00007fffed7fc919 in wl_map_insert_at (map=<optimized out>, flags=<optimized out>, i=<optimized out>, data=<optimized out>) at src/wayland-util.c:247
#1  0x00007fffed7f8ad3 in proxy_destroy (proxy=0x5555557a2d60)
    at src/wayland-client.c:506
#2  0x00007fffed7f8ad3 in wl_proxy_destroy (proxy=0x5555557a2d60)
    at src/wayland-client.c:537
#3  0x00007ffff3c0c536 in wl_compositor_destroy (wl_compositor=<optimized out>) at /usr/include/wayland-client-protocol.h:1194
#4  0x00007ffff3c0c536 in WebCore::WlPtrDeleter<wl_compositor>::operator()(wl_compositor*) const (this=<optimized out>, ptr=<optimized out>)
    at /buildstream/build/Source/WebCore/platform/graphics/wayland/WlUniquePtr.h:55
#5  0x00007ffff3c0c536 in std::unique_ptr<wl_compositor, WebCore::WlPtrDeleter<wl_compositor> >::~unique_ptr() (this=0x7fffdf9f5040, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/unique_ptr.h:268
#6  0x00007ffff3c0c536 in WebCore::PlatformDisplayWayland::~PlatformDisplayWayland() (this=0x7fffdf9f5000, __in_chrg=<optimized out>)
    at /buildstream/build/Source/WebCore/platform/graphics/wayland/PlatformDisplayWayland.cpp:68
#7  0x00007ffff3c0c569 in WebCore::PlatformDisplayWayland::~PlatformDisplayWayland() (this=0x7fffdf9f5000, __in_chrg=<optimized out>)
    at /buildstream/build/Source/WebCore/platform/graphics/wayland/PlatformDisplayWayland.cpp:72
#8  0x00007ffff6056248 in __run_exit_handlers (status=1, listp=0x7ffff63cf6f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:83
#9  0x00007ffff605629a in __GI_exit (status=<optimized out>) at exit.c:105
#10 0x00007ffff603ff31 in __libc_start_main (main=
    0x55555555d974 <main>, argc=1, argv=0x7fffffffea58, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffea48)
    at ../csu/libc-start.c:344
#11 0x00005555555599fa in _start ()
Comment 14 Michael Catanzaro 2018-03-10 17:37:19 PST
(In reply to Michael Catanzaro from comment #13)
> we're trying to use the
> same wl_display as both a Wayland client and Wayland server, and I guess
> that isn't right.

That's not true, I got confused by wl_compositor. That's client API too. Perhaps it should be destroyed before its wl_display.
Comment 15 Michael Catanzaro 2018-03-10 18:03:31 PST
Created attachment 335523 [details]
Patch
Comment 16 Michael Catanzaro 2018-03-10 18:05:33 PST
Comment on attachment 335523 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=335523&action=review

> Source/WebCore/platform/graphics/wayland/PlatformDisplayWayland.cpp:73
> +        wl_display_disconnect(m_display);

Note: this does, in fact, destroy the wl_display.

WlUniquePtr<wl_display> is a bit of a footgun, as it's not safe to use on any wl_display created with wl_display_connect().
Comment 17 Michael Catanzaro 2018-03-11 16:35:41 PDT
Comment on attachment 335523 [details]
Patch

(cq? for cgarcia)
Comment 18 Carlos Garcia Campos 2018-03-12 01:49:04 PDT
(In reply to Michael Catanzaro from comment #16)
> Comment on attachment 335523 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=335523&action=review
> 
> > Source/WebCore/platform/graphics/wayland/PlatformDisplayWayland.cpp:73
> > +        wl_display_disconnect(m_display);
> 
> Note: this does, in fact, destroy the wl_display.
> 
> WlUniquePtr<wl_display> is a bit of a footgun, as it's not safe to use on
> any wl_display created with wl_display_connect().

Good catch! this is exactly what I wanted, once we understand the problem we can easily fix the actual bug. WlUniquePtr<wl_display> is only used once in Waylandcompositor, which is a singleton, so I think we can simply remove it to avoid confusions and use destroy/disconnect directly.
Comment 19 Carlos Garcia Campos 2018-03-12 01:49:34 PDT
Comment on attachment 335523 [details]
Patch

Thank you!
Comment 20 WebKit Commit Bot 2018-03-12 02:14:23 PDT
Comment on attachment 335523 [details]
Patch

Clearing flags on attachment: 335523

Committed r229525: <https://trac.webkit.org/changeset/229525>
Comment 21 WebKit Commit Bot 2018-03-12 02:14:25 PDT
All reviewed patches have been landed.  Closing bug.
Comment 22 Michael Catanzaro 2018-03-17 16:25:09 PDT
One week later, we're now up to 3958 reports. I think this might be our biggest WebKit crasher ever....