RESOLVED FIXED 107451
[SOUP] Use SoupSession instead of SoupSessionAsync
https://bugs.webkit.org/show_bug.cgi?id=107451
Summary [SOUP] Use SoupSession instead of SoupSessionAsync
Dominik Röttsches (drott)
Reported 2013-01-21 08:00:36 PST
As the subject says, Dan suggested in https://bugs.webkit.org/show_bug.cgi?id=104684#c3 to use the unified SoupSession.
Attachments
Backtrace showing the deadlock (7.07 KB, text/plain)
2018-11-30 14:18 PST, Michael Catanzaro
no flags
Patch (4.40 KB, patch)
2018-11-30 14:27 PST, Michael Catanzaro
no flags
youenn fablet
Comment 1 2014-02-27 07:14:12 PST
Minimum required version of libsoup should be 2.44.1 due to libsoup bug #707711
Martin Robinson
Comment 2 2014-02-27 08:23:56 PST
I think we decided that it was simpler to just the asynchronous SoupSession for synchronous requests. We have been working out all the bugs and the code is completely shared between the different paths, something that might be harder with another type of SoupSession.
Michael Catanzaro
Comment 3 2018-11-30 14:18:18 PST
With glib-networking 2.57.1, WebKit is no longer able to load TLS error pages. The problem is a network process deadlock caused by a change in how glib-networking performs certificate verification. Previously it verified certificates *after* the TLS handshake had completed, which was weirdly late, but previously not problematic. But now that TLS 1.3 exists, application data can be sent before certificate verification occurs, which is no good. So I moved verification to occur during the handshake. I needed to do this regardless because I need to add a new callback in GnuTLS for another feature. This introduced a deadlock in WebKit: * glib-networking detects an unacceptable certificate, emits accept-certificate signal * NetworkDataTaskSoup::tlsConnectionAcceptCertificate calls NetworkDataTaskSoup::invalidateAndCancel calls NetworkDataTaskSoup::clearRequest * NetworkDataTaskSoup::clearRequest calls soup_session_cancel_message The problem is that, in the deprecated SoupSessionAsync used by WebKit, cancellation is always *synchronous* despite the name of the class. So soup_session_cancel_message winds up doing its thing to close everything out, and that eventually ends up in a synchronous call to g_tls_connection_gnutls_close. The close operation can't proceed until the TLS handshake is finished, and the handshake is blocked waiting for WebKit to return from its accept-certificate handler. So the close operation winds up polling forever waiting for the handshake to finish (the handshake normally runs on a secondary thread, just except for accept-certificate). Deadlock. I've attached a backtrace showing the callstack at the time of deadlock. We can try to fix this in glib-networking or in WebKit. I've actually stalled on this for several weeks just because I'm not sure which approach is best. WebKit's use of soup_session_cancel_message seems *really* innocuous and even for someone familiar with the libsoup API, it's really unexpected that using it in the accept-certificate handler could cause a deadlock, so that favored fixing it in glib-networking. That could be done in several ways, e.g. queuing the close operation to occur at a later time and returning a fake success exit status, or returning an error code instead of polling until the handshake completes. Ultimately, I decided glib-networking is already much too complicated, and solving this in WebKit is so much easier. All we need to do is switch from SoupSessionAsync to the modern SoupSession API. Ironically, with modern SoupSession, cancellation is asynchronous, unlike SoupSessionAsync, so the problem goes away. Solving this in WebKit has the major disadvantage that other applications could be broken by the behavior change, but I've used Debian codesearch to verify that WebKit is the only software likely to be affected. The only changes required in WebKit are adjustments for the new default property values. Most of the properties we used to set explicitly are now defaults and should be removed. Additionally, SoupSession has default timeouts, which we want to override to allow NetworkDataTaskSoup to implement its own timeouts. Some final notes: * We already require libsoup 2.42, the first version to introduce the modern SoupSession, so there is no dependency bump required. Youenn noted that this version is kinda broken, but it will build. * This deadlock bug is super annoying, so if we wind up accepting this solution, it would be nice to do a 2.23.2 release. * There is a moderately-high risk of unexpected regression. Do not backport!
Michael Catanzaro
Comment 4 2018-11-30 14:18:58 PST
Created attachment 356235 [details] Backtrace showing the deadlock
Michael Catanzaro
Comment 5 2018-11-30 14:27:31 PST
Michael Catanzaro
Comment 6 2018-11-30 14:37:14 PST
BTW Dan had favored hiding this change in glib-networking, which we can still try to do. This WebKit change is far simpler, though. There is another glib-networking behavior change that cannot be hidden: certificate verification will now stall forever if the thread that started the handshake is not iterating the main context that was the thread-default main context at the time of the operation that triggered the handshake. This caused a couple glib-networking tests to break. That is much less worrisome, though, because unlike the soup_session_cancel_message deadlock, it's quite unlikely that any real application would be affected by that change.
Carlos Garcia Campos
Comment 7 2018-12-03 00:00:08 PST
Comment on attachment 356239 [details] Patch I think this is good to stop using deprecated API, but the actual bug should be fixed in glib-networking too IMO.
Michael Catanzaro
Comment 8 2018-12-03 10:43:36 PST
Problem is glib-networking is actually functioning as designed. It's only a bug once you combine three different libraries with different expectations. That's why I hesitated so long to decide what to do. You could say that a design change is needed in glib-networking to change this there: either to make the close call return a fake success code and queue the real close to occur later, or else to fail the close call with an error.
Michael Catanzaro
Comment 9 2018-12-03 10:45:16 PST
(In reply to Michael Catanzaro from comment #8) > You could say that a design change is needed in glib-networking to change this there: either to > make the close call return a fake success code and queue the real close to > occur later, or else to fail the close call with an error. And neither of those seem like great ideas to me, though either one would be OK and doable if needed.
WebKit Commit Bot
Comment 10 2018-12-03 11:11:37 PST
Comment on attachment 356239 [details] Patch Clearing flags on attachment: 356239 Committed r238805: <https://trac.webkit.org/changeset/238805>
WebKit Commit Bot
Comment 11 2018-12-03 11:11:39 PST
All reviewed patches have been landed. Closing bug.
Radar WebKit Bug Importer
Comment 12 2018-12-03 11:12:31 PST
Michael Catanzaro
Comment 13 2018-12-03 19:57:00 PST
(In reply to Michael Catanzaro from comment #8) > or else to fail the close call with an error. This actually works well, fixing the hang without requiring any changes in WebKit. The test I wrote for this isn't working quite right though. If I can fix my test, I'll land that change. This SoupSessionAsync -> SoupSession change is desirable regardless, of course.
Note You need to log in before you can comment on or make changes to this bug.