Bug 183346

Summary: [GTK] NetworkProcess from WebKitGtk+ 2.19.9x SIGSEVs in NetworkStorageSession (secret search callback)
Product: WebKit Reporter: Andres Gomez Garcia <agomez>
Component: WebKitGTKAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: berto, bugs-noreply, cgarcia, ews-watchlist, gustavo, mcatanzaro
Priority: P2    
Version: WebKit Local Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=1534860
Attachments:
Description Flags
BT from gdb for the NetworkProcess
none
BT from gdb for the NetworkProcess 2.19.92
none
Another backtrace
none
Patch mcatanzaro: review+

Description Andres Gomez Garcia 2018-03-05 14:24:14 PST
Created attachment 335027 [details]
BT from gdb for the NetworkProcess

I'm using epiphany 3.27.90-22-ge89d8de with WebKitGtk+ 2.19.91 from gnome-nightly's flatpak.

I see often crashed tabs. When inspecting in my system with coredumpctl I've found several cores.

In this case, the NetworkProcess is SIGSEVing. I do not know the conditions for this.
Comment 1 Andres Gomez Garcia 2018-03-05 14:32:07 PST
FTR, I have several duplicated cores of this type.
Comment 2 Michael Catanzaro 2018-03-05 17:07:42 PST
*** Bug 183348 has been marked as a duplicate of this bug. ***
Comment 3 Michael Catanzaro 2018-03-05 17:43:59 PST
It's crashing when retrieving a particular HTTP auth password from the keyring. It's probably triggered by visiting a particular website. If you know which website is triggering this (e.g. our intranet?) then you could investigate with seahorse and see if there is any weird data in the keyring.

I don't know about this one. The trap here is that passwordData might not be null-terminated if the data in the keyring has been modified from what WebKit originally set. I think the current code should be safe against that, because it's careful to use the size of the returned data, and it looks like it should also be safe if secret_value_get() returns null or has zero size. I guess I must be wrong about something here, though.

We could try null-checking passwordData. We could also try using secret_value_get_text() instead of secret_value_get() in order to get a null-terminated string and not have to use the length overload of String::fromUTF8. I think I would try both, for good measure. The downside of using flatpak is that now it's much harder to give you a debugging patch to try....

Let's see what Carlos thinks.
Comment 4 Andres Gomez Garcia 2018-03-06 01:19:45 PST
(In reply to Michael Catanzaro from comment #3)
> It's crashing when retrieving a particular HTTP auth password from the
> keyring. It's probably triggered by visiting a particular website. If you
> know which website is triggering this (e.g. our intranet?) then you could
> investigate with seahorse and see if there is any weird data in the keyring.

I didn't ever have this problem before until moving to 2.19.91. The previous version I was using was 2.19.2
Comment 5 Carlos Garcia Campos 2018-03-06 01:31:57 PST
(In reply to Andres Gomez Garcia from comment #4)
> (In reply to Michael Catanzaro from comment #3)
> > It's crashing when retrieving a particular HTTP auth password from the
> > keyring. It's probably triggered by visiting a particular website. If you
> > know which website is triggering this (e.g. our intranet?) then you could
> > investigate with seahorse and see if there is any weird data in the keyring.
> 
> I didn't ever have this problem before until moving to 2.19.91. The previous
> version I was using was 2.19.2

Did you also switch to use flatpak for 2.19.91? or did you still use flatpak with 2.19.2?
Comment 6 Andres Gomez Garcia 2018-03-06 02:46:53 PST
(In reply to Andres Gomez Garcia from comment #4)

[...]

> I didn't ever have this problem before until moving to 2.19.91. The previous
> version I was using was 2.19.2

I meant 2.19.6
Comment 7 Andres Gomez Garcia 2018-03-06 02:48:06 PST
(In reply to Carlos Garcia Campos from comment #5)

[...]

> Did you also switch to use flatpak for 2.19.91? or did you still use flatpak
> with 2.19.2?

I moved from my own JHBuild-ed 2.19.6 to flatpak.

I still have 2.19.6 in my system, in my JHBuild env.
Comment 8 Carlos Garcia Campos 2018-03-06 23:57:51 PST
It seems to me this is a libsecret bug related with flatpak.
Comment 9 Michael Catanzaro 2018-03-07 09:39:19 PST
It works fine for me. Problem is that Flatpak makes debugging much too hard. To debug this, we'll have to apply a test patch to the official GNOME Flatpak runtime, then wait a day for the build to complete, then test it, then try again if we need to try another debug patch.

There is an alternative: flapjack. It's not fun. There are no incremental builds in flapjack: every build is a full rebuild. But it has worked well for me in the past when needing to debug flatpak-specific issues. Setting up a flapjack build environment might be more work than Andres might want to do, but that seems like possibly the easiest patch forward. Andres, are you willing to try building WebKit with Flapjack, to see if you can reproduce there? (You would then call 'flapjack run org.gnome.Epiphany' to test the result.)

Alternatively, we could try to debug this on the keyring side of things. It's clear from the code that there is a problem with one particular secret on one particular website. So you should look with seahorse to see if there are any obvious problems with the secret (e.g. is it empty?). Make a copy of your login keyring somewhere safe, then delete the secret and see if the crash goes away.
Comment 10 Michael Catanzaro 2018-03-07 09:39:44 PST
And you should know what website in question is triggering this. It should be a website that requires HTTP auth.
Comment 11 Andres Gomez Garcia 2018-03-08 02:29:17 PST
(In reply to Michael Catanzaro from comment #9)

[...]
 
> There is an alternative: flapjack. It's not fun. There are no incremental
> builds in flapjack: every build is a full rebuild. But it has worked well
> for me in the past when needing to debug flatpak-specific issues. Setting up
> a flapjack build environment might be more work than Andres might want to
> do, but that seems like possibly the easiest patch forward. Andres, are you
> willing to try building WebKit with Flapjack, to see if you can reproduce
> there? (You would then call 'flapjack run org.gnome.Epiphany' to test the
> result.)

Nope, I'm sorry. The point for me to use the flatpak version was to make this process easier. Flatpak already has some inconveniences for me, if you add this up to it, I'd rather go back to my JHBuild-ed version.

> Alternatively, we could try to debug this on the keyring side of things.
> It's clear from the code that there is a problem with one particular secret
> on one particular website. So you should look with seahorse to see if there
> are any obvious problems with the secret (e.g. is it empty?). Make a copy of
> your login keyring somewhere safe, then delete the secret and see if the
> crash goes away.

The problem may be in the keyring but I have not (consciously) changed anything and the cores have stopped happening.
Comment 12 Michael Catanzaro 2018-03-08 09:15:24 PST
OK... if you see these crashes start up again, let us know and we can try posting a debug patch.

The GNOME SDK is totally broken right now, and we need Alex to fix it, so it would be several days before we could get a new build for you to test anyway.
Comment 13 Michael Catanzaro 2018-03-08 09:18:49 PST
(In reply to Michael Catanzaro from comment #12)
> OK... if you see these crashes start up again, let us know and we can try
> posting a debug patch.

I mean: I can add a debug patch to the GNOME SDK. That's the only way we'll be able to figure out what's going on.
Comment 14 Andres Gomez Garcia 2018-03-09 02:49:25 PST
*** Bug 183509 has been marked as a duplicate of this bug. ***
Comment 15 Andres Gomez Garcia 2018-03-09 02:50:25 PST
Created attachment 335410 [details]
BT from gdb for the NetworkProcess 2.19.92

Still happening with epiphany 3.27.90-38-gdb76a7f with WebKitGtk+ 2.19.92 from gnome-nightly's flatpak
Comment 16 Andres Gomez Garcia 2018-03-09 02:56:19 PST
(In reply to Michael Catanzaro from comment #12)
> OK... if you see these crashes start up again, let us know and we can try
> posting a debug patch.

So ... the crashes have not disappeared.

Good news is WKGTK is so stable that probably I didn't realize because I've not closed the instance in several days.

Bad news is the SIGSEV only happens when starting ephy. Meaning, it crashes when restoring a session with multiple tabs, many of them for pages with HTTP auth. If I reload the crashed tabs, the NetworkProcess doesn't SIGSEV any more ... meaning, it smells like a concurrency problem or a race condition.
Comment 17 Michael Catanzaro 2018-03-28 17:35:45 PDT
*** Bug 176100 has been marked as a duplicate of this bug. ***
Comment 18 Michael Catanzaro 2018-03-28 17:37:58 PDT
Created attachment 336739 [details]
Another backtrace
Comment 19 Carlos Garcia Campos 2018-04-02 02:08:47 PDT
Could someone confirm this only happens in flatpak? I can't reproduce this, and looking at the bt I don't understand what the problem is. It's crashing in 

GRefPtr<SecretItem> secretItem = adoptGRef(static_cast<SecretItem*>(elements->data));

which happens after an early return:

if (error || !elements || !elements->data)

and secret_service_search_finish is transfer full.
Comment 20 Carlos Garcia Campos 2018-04-02 02:44:53 PDT
I didn't notice the bt was from 2.18 not, 2.20 so the crash happens when trying to call the completion handler. That makes more sense. The report in red hat bz, happens when loasing a page without http auth, so my guess is that the auth is from a previous request that was cancelled while the secret was searched.
Comment 21 Carlos Garcia Campos 2018-04-02 03:24:44 PDT
Created attachment 336974 [details]
Patch

I can't reproduce it, but I think this patch might fix the issue.
Comment 22 EWS Watchlist 2018-04-02 03:27:05 PDT
Attachment 336974 [details] did not pass style-queue:


ERROR: Source/WebCore/platform/network/soup/NetworkStorageSessionSoup.cpp:188:  Extra space before ( in function call  [whitespace/parens] [4]
ERROR: Source/WebCore/platform/network/soup/NetworkStorageSessionSoup.cpp:197:  Extra space before ( in function call  [whitespace/parens] [4]
ERROR: Source/WebCore/platform/network/soup/NetworkStorageSessionSoup.cpp:201:  Extra space before ( in function call  [whitespace/parens] [4]
ERROR: Source/WebCore/platform/network/NetworkStorageSession.h:127:  Extra space before ( in function call  [whitespace/parens] [4]
Total errors found: 4 in 6 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 23 Michael Catanzaro 2018-04-02 08:16:01 PDT
Comment on attachment 336974 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=336974&action=review

The Fedora bug report is not for a Flatpak.

> Source/WebCore/ChangeLog:8
> +        This might happen if a request is cancelled right after the password request starts and before it finishes. We

Aha!

> Source/WebCore/platform/network/soup/NetworkStorageSessionSoup.cpp:240
> +            g_list_foreach(elements.get(), reinterpret_cast<GFunc>(g_object_unref), nullptr);

It took me a while to convince myself that there was no leak in the early return up above. GUniquePtr<GList> is similar to the WlUniquePtr<wl_display> case, in that we really need more flexibility in choosing the deleter func to use. I wonder if it would be possible to make a syntax like GUniquePtr<GList, g_object_unref> do the right thing.
Comment 24 Carlos Garcia Campos 2018-04-02 08:25:37 PDT
(In reply to Michael Catanzaro from comment #23)
> Comment on attachment 336974 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=336974&action=review
> 
> The Fedora bug report is not for a Flatpak.
> 
> > Source/WebCore/ChangeLog:8
> > +        This might happen if a request is cancelled right after the password request starts and before it finishes. We
> 
> Aha!
> 
> > Source/WebCore/platform/network/soup/NetworkStorageSessionSoup.cpp:240
> > +            g_list_foreach(elements.get(), reinterpret_cast<GFunc>(g_object_unref), nullptr);
> 
> It took me a while to convince myself that there was no leak in the early
> return up above. GUniquePtr<GList> is similar to the WlUniquePtr<wl_display>
> case, in that we really need more flexibility in choosing the deleter func
> to use. I wonder if it would be possible to make a syntax like
> GUniquePtr<GList, g_object_unref> do the right thing.

It's not the same, in the case of WlUniquePtr<wl_display> there are two api calls to release a wl_display depending on whether it's a client or server connection. Here, GUniquePtr<GList> always frees *only* the container. We could add a way to also release the contents, but I don't know if it's easy to add since GUniquePtr is a template alias of std::unique_ptr nowadays.
Comment 25 Michael Catanzaro 2018-04-02 09:54:12 PDT
(In reply to Carlos Garcia Campos from comment #24)
> It's not the same, in the case of WlUniquePtr<wl_display> there are two api
> calls to release a wl_display depending on whether it's a client or server
> connection. Here, GUniquePtr<GList> always frees *only* the container. We
> could add a way to also release the contents, but I don't know if it's easy
> to add since GUniquePtr is a template alias of std::unique_ptr nowadays.

Ideally it would use g_list_free_full and allow specifying the GDestroyNotify as a template parameter. I *think* it should be possible.

Similarly, ideally WlUniquePtr would make it easy to specify the deleter func for wl_display. (Alternatively: just delete the instantiation for wl_display and force developers to use std::unique_ptr directly.)

That's a matter for a future patch, though, not this one.

BTW, there is also MallocPtr, which uses bmalloc instead, but it's probably not needed here since allocations should be minimal.
Comment 26 Carlos Garcia Campos 2018-04-03 00:09:48 PDT
Committed r230196: <https://trac.webkit.org/changeset/230196>