Bug 199572 - REGRESSION(r246671): [WPE][GTK] Crash in NetworkProcess since the DNS cache landed
Summary: REGRESSION(r246671): [WPE][GTK] Crash in NetworkProcess since the DNS cache l...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKitGTK (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Claudio Saavedra
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-07-08 05:02 PDT by Claudio Saavedra
Modified: 2019-07-08 11:06 PDT (History)
4 users (show)

See Also:


Attachments
Patch (3.81 KB, patch)
2019-07-08 08:01 PDT, Claudio Saavedra
no flags Details | Formatted Diff | Diff
Patch (2.24 KB, patch)
2019-07-08 09:00 PDT, Claudio Saavedra
no flags Details | Formatted Diff | Diff
Patch for landing (2.25 KB, patch)
2019-07-08 09:11 PDT, Claudio Saavedra
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Claudio Saavedra 2019-07-08 05:02:55 PDT
I've noticed that several tests have started crashing the network process sporadically and all they have in common is that the crashing starts right after the DNS cache was landed. Opening this bug to track this potential regression.

Some stacktraces to exemplify this:

https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20(Tests)/r247200%20(10899)/imported/w3c/web-platform-tests/html/dom/interfaces.worker-crash-log.txt
https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20(Tests)/r247200%20(10899)/imported/w3c/web-platform-tests/cors/redirect-preflight-2-crash-log.txt
https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20(Tests)/r247200%20(10899)/imported/w3c/web-platform-tests/websockets/interfaces/WebSocket/readyState/005-crash-log.txt
https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20(Tests)/r247200%20(10899)/imported/w3c/web-platform-tests/websockets/interfaces/WebSocket/url/006-crash-log.txt
https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20(Tests)/r247200%20(10899)/imported/w3c/web-platform-tests/websockets/keeping-connection-open/001-crash-log.txt
https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20(Tests)/r247200%20(10899)/imported/w3c/web-platform-tests/websockets/opening-handshake/005-crash-log.txt
https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20(Tests)/r247204%20(10901)/imported/w3c/web-platform-tests/content-security-policy/connect-src/connect-src-websocket-blocked.sub-crash-log.txt

The stacktraces don't tell much at first sight, but since it seems to be happening deep in glib, there might be some memory corruption somewhere, not sure if related to thread-safety issues in the dns cache resolver (or libsoup even?).
Comment 2 Michael Catanzaro 2019-07-08 06:27:51 PDT
At least most of these (anything involving magazine_chain_pop_head) are definitely memory corruption. Probably the rest are too. The backtraces are only going to lead you on a false chase. What we need are backtraces from asan or valgrind to point to the source of the memory corruption, not these gdb traces that only point to where it eventually blew up.
Comment 3 Michael Catanzaro 2019-07-08 06:30:06 PDT
BTW if we don't solve it soon, then we should roll it out. The DNS cache is only a small thing, not worth a serious regression.
Comment 4 Claudio Saavedra 2019-07-08 07:55:53 PDT
==27300== Thread 1:
==27300== Invalid read of size 8
==27300==    at 0xE4B5029: g_error_free (gerror.c:493)
==27300==    by 0x6428719: webkitCachedResolverLookupByNameAsync(_GResolver*, char const*, _GCancellable*, void (*)(_GObject*, _GAsyncResult*, void*), void*)::{lambda(_GObject*, _GAsyncResult*, void*)#1}::_FUN(_
GObject*, _GAsyncResult*, void*) (in /home/claudio/git/gnome/WebKit/WebKitBuild/Release/lib/libwebkit2gtk-4.0.so.37.38.1)
==27300==    by 0xED40AF8: g_task_return_now (gtask.c:1148)
==27300==    by 0xED40B38: complete_in_idle_cb (gtask.c:1162)
==27300==    by 0xE4C99B7: g_main_dispatch (gmain.c:3182)
==27300==    by 0xE4C99B7: g_main_context_dispatch (gmain.c:3847)
==27300==    by 0xE4C9D77: g_main_context_iterate.isra.26 (gmain.c:3920)
==27300==    by 0xE4CA061: g_main_loop_run (gmain.c:4116)
==27300==    by 0xAAEEA2F: WTF::RunLoop::run() (in /home/claudio/git/gnome/WebKit/WebKitBuild/Release/lib/libjavascriptcoregtk-4.0.so.18.14.1)
==27300==    by 0x6430329: int WebKit::AuxiliaryProcessMain<WebKit::NetworkProcess, WebKit::NetworkProcessMain>(int, char**) (in /home/claudio/git/gnome/WebKit/WebKitBuild/Release/lib/libwebkit2gtk-4.0.so.37.38.
1)
==27300==    by 0xFC7B09A: (below main) (libc-start.c:308)
==27300==  Address 0x5c749bd8 is 8 bytes inside a block of size 16 free'd
==27300==    at 0x48369AB: free (vg_replace_malloc.c:530)
==27300==    by 0xE205350: lookup_resolved (soup-address.c:762)
==27300==    by 0xED40AF8: g_task_return_now (gtask.c:1148)
==27300==    by 0xED41575: g_task_return (gtask.c:1206)
==27300==    by 0x6428744: webkitCachedResolverLookupByNameAsync(_GResolver*, char const*, _GCancellable*, void (*)(_GObject*, _GAsyncResult*, void*), void*)::{lambda(_GObject*, _GAsyncResult*, void*)#1}::_FUN(_
GObject*, _GAsyncResult*, void*) (in /home/claudio/git/gnome/WebKit/WebKitBuild/Release/lib/libwebkit2gtk-4.0.so.37.38.1)
==27300==    by 0xED40AF8: g_task_return_now (gtask.c:1148)
==27300==    by 0xED40B38: complete_in_idle_cb (gtask.c:1162)
==27300==    by 0xE4C99B7: g_main_dispatch (gmain.c:3182)
==27300==    by 0xE4C99B7: g_main_context_dispatch (gmain.c:3847)
==27300==    by 0xE4C9D77: g_main_context_iterate.isra.26 (gmain.c:3920)
==27300==    by 0xE4CA061: g_main_loop_run (gmain.c:4116)
==27300==    by 0xAAEEA2F: WTF::RunLoop::run() (in /home/claudio/git/gnome/WebKit/WebKitBuild/Release/lib/libjavascriptcoregtk-4.0.so.18.14.1)
==27300==    by 0x6430329: int WebKit::AuxiliaryProcessMain<WebKit::NetworkProcess, WebKit::NetworkProcessMain>(int, char**) (in /home/claudio/git/gnome/WebKit/WebKitBuild/Release/lib/libwebkit2gtk-4.0.so.37.38.
1)
==27300==    by 0xFC7B09A: (below main) (libc-start.c:308)
==27300==  Block was alloc'd at
==27300==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==27300==    by 0xE4CF2E0: g_malloc (gmem.c:99)
==27300==    by 0xE4E6782: g_slice_alloc (gslice.c:1024)
==27300==    by 0xE4B4E2C: g_error_new_valist (gerror.c:410)
==27300==    by 0xED42077: g_task_return_new_error (gtask.c:1811)
==27300==    by 0xED44C75: do_lookup_by_name (gthreadedresolver.c:123)
==27300==    by 0xED416A2: g_task_thread_pool_thread (gtask.c:1331)
==27300==    by 0xE4F19B2: g_thread_pool_thread_proxy (gthreadpool.c:307)
==27300==    by 0xE4F1054: g_thread_proxy (gthread.c:784)
==27300==    by 0xDD5BFA2: start_thread (pthread_create.c:486)
==27300==    by 0xFD504CE: clone (clone.S:95)
Comment 5 Claudio Saavedra 2019-07-08 08:01:12 PDT
Created attachment 373625 [details]
Patch
Comment 6 Michael Catanzaro 2019-07-08 08:22:57 PDT
Comment on attachment 373625 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=373625&action=review

Score +1 for valgrind

> Source/WebKit/NetworkProcess/glib/WebKitCachedResolver.cpp:111
> -            g_task_return_error(task.get(), error.get());
> +            g_task_return_error(task.get(), error);

My preference is to keep using GUniqueOutPtr -- so that we never hold ownership in a raw pointer -- and use error.release() here instead. But this is fine too.

> Source/WebKit/NetworkProcess/glib/WebKitCachedResolver.cpp:173
> -            g_task_return_error(task.get(), error.get());
> +            g_task_return_error(task.get(), error);

Ditto.
Comment 7 Carlos Garcia Campos 2019-07-08 08:36:13 PDT
Comment on attachment 373625 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=373625&action=review

>> Source/WebKit/NetworkProcess/glib/WebKitCachedResolver.cpp:111
>> +            g_task_return_error(task.get(), error);
> 
> My preference is to keep using GUniqueOutPtr -- so that we never hold ownership in a raw pointer -- and use error.release() here instead. But this is fine too.

Good catch, thanks Claudio! I would also keep the smart pointer leaking it only when transferred.

> Source/WebKit/NetworkProcess/glib/WebKitCachedResolver.cpp:166
> +        GError* error;

This is missing the nullptr initialization.
Comment 8 Michael Catanzaro 2019-07-08 08:59:39 PDT
Comment on attachment 373625 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=373625&action=review

>>> Source/WebKit/NetworkProcess/glib/WebKitCachedResolver.cpp:111
>>> +            g_task_return_error(task.get(), error);
>> 
>> My preference is to keep using GUniqueOutPtr -- so that we never hold ownership in a raw pointer -- and use error.release() here instead. But this is fine too.
> 
> Good catch, thanks Claudio! I would also keep the smart pointer leaking it only when transferred.

Claudio noticed that GUniqueOutPtr::release returns a GUniquePtr, so it would have to be error.release().release(), which is silly. So I think the smart pointer is undesirable here.

I wonder if we have much code that really benefits from that behavior, or if we could change GUniqueOutPtr::release to return the raw pointer directly, as I would have expected.

>> Source/WebKit/NetworkProcess/glib/WebKitCachedResolver.cpp:166
>> +        GError* error;
> 
> This is missing the nullptr initialization.

Oh no, very good catch.
Comment 9 Claudio Saavedra 2019-07-08 09:00:28 PDT
Created attachment 373629 [details]
Patch
Comment 10 Michael Catanzaro 2019-07-08 09:07:57 PDT
Comment on attachment 373629 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=373629&action=review

cq+ is going to fail because you don't have  the Reviewed by

> Source/WebKit/ChangeLog:10
> +        but passed onto the caller.
> +        * NetworkProcess/glib/WebKitCachedResolver.cpp:

Missing blank line here.

> Source/WebKit/NetworkProcess/glib/WebKitCachedResolver.cpp:111
> -            g_task_return_error(task.get(), error.get());
> +            g_task_return_error(task.get(), error.release().release());

Carlos, what do you think?
Comment 11 Claudio Saavedra 2019-07-08 09:11:35 PDT
Created attachment 373631 [details]
Patch for landing
Comment 12 Carlos Garcia Campos 2019-07-08 09:12:18 PDT
(In reply to Michael Catanzaro from comment #8)
> Comment on attachment 373625 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=373625&action=review
> 
> >>> Source/WebKit/NetworkProcess/glib/WebKitCachedResolver.cpp:111
> >>> +            g_task_return_error(task.get(), error);
> >> 
> >> My preference is to keep using GUniqueOutPtr -- so that we never hold ownership in a raw pointer -- and use error.release() here instead. But this is fine too.
> > 
> > Good catch, thanks Claudio! I would also keep the smart pointer leaking it only when transferred.
> 
> Claudio noticed that GUniqueOutPtr::release returns a GUniquePtr, so it
> would have to be error.release().release(), which is silly. So I think the
> smart pointer is undesirable here.
> 
> I wonder if we have much code that really benefits from that behavior, or if
> we could change GUniqueOutPtr::release to return the raw pointer directly,
> as I would have expected.

I think the idea here was to convert an OutPtr (that exists only because outPtr is not possible with std::unique_ptr) into a Ptr which is actually a std::unique_ptr. I don't know if we are actually using it that way. We could probably add leakPtr() for the release().release() case.

> >> Source/WebKit/NetworkProcess/glib/WebKitCachedResolver.cpp:166
> >> +        GError* error;
> > 
> > This is missing the nullptr initialization.
> 
> Oh no, very good catch.
Comment 13 Claudio Saavedra 2019-07-08 09:40:09 PDT
Committed r247209: <https://trac.webkit.org/changeset/247209>
Comment 14 Michael Catanzaro 2019-07-08 11:06:24 PDT
(In reply to Carlos Garcia Campos from comment #12)
> I think the idea here was to convert an OutPtr (that exists only because
> outPtr is not possible with std::unique_ptr) into a Ptr which is actually a
> std::unique_ptr. I don't know if we are actually using it that way. We could
> probably add leakPtr() for the release().release() case.

There were only three places using release before now. Two of them also want the raw pointer (they're doing release().release()) and the other is test code. Making a GUniquePtr from the raw pointer in that case is trivial:

diff --git a/Tools/TestWebKitAPI/Tests/WTF/glib/GUniquePtr.cpp b/Tools/TestWebKitAPI/Tests/WTF/glib/GUniquePtr.cpp
index 11f02c7b0f7..288591c8e3f 100644
--- a/Tools/TestWebKitAPI/Tests/WTF/glib/GUniquePtr.cpp
+++ b/Tools/TestWebKitAPI/Tests/WTF/glib/GUniquePtr.cpp
@@ -184,7 +184,7 @@ TEST(WTF_GUniquePtr, OutPtr)
     {
         GUniqueOutPtr<char> a;
         returnOutChar(&a.outPtr());
-        GUniquePtr<char> b = a.release();
+        GUniquePtr<char> b(a.release());
         ASSERT_STREQ(actual.str().c_str(), takeLogStr().c_str());
         actual << "g_free(" << b.get() << ");";
     }

So my vote is to just change release().
Comment 15 Michael Catanzaro 2019-07-08 11:06:35 PDT
Bug #199579.