Bug 233269

Summary: [SOUP] DNS Issues and resource timeouts
Product: WebKit Reporter: Trev <trev>
Component: WebKitGTKAssignee: Nobody <webkit-unassigned>
Status: RESOLVED MOVED    
Severity: Major CC: bugs-noreply, mcatanzaro, pgriffis
Priority: P2    
Version: Other   
Hardware: PC   
OS: Linux   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=2024330
Attachments:
Description Flags
An animated gif that illustrates my problem.
none
Another test script none

Description Trev 2021-11-17 12:15:12 PST
Created attachment 444547 [details]
An animated gif that illustrates my problem.

* Browser: Epiphany (Gnome Web) version 41, RPM version.
* FlatPak from Flathub also tested, bug was reproduced
* WebKitGTK 2.34.1
* Fedora 35 using Gnome 41 - 5.14.17-301.fc35.x86_64

Please see Fedora Project forum thread for reference: https://ask.fedoraproject.org/t/epiphany-gnome-web-webkit-intermittently-breaks

I am getting browser-breaking DNS failures and content timeouts that are rendering Epiphany unusable. These problems are not occurring with Firefox or any Blink (Chromium) based browser. I am unsure of how to troubleshoot this issue and am happy to follow any steps required.
Comment 1 Trev 2021-11-17 15:31:16 PST
I've done some troubleshooting with Red Hat to try and rule out my system configuration as a problem. It looks like there's decent enough reason to believe that this is some sort of problem with Epiphany/WebKit:

https://bugzilla.redhat.com/show_bug.cgi?id=2024330
Comment 2 Michael Catanzaro 2021-11-18 08:45:44 PST
(In reply to Trev from comment #0)
> I am getting browser-breaking DNS failures and content timeouts that are
> rendering Epiphany unusable. These problems are not occurring with Firefox
> or any Blink (Chromium) based browser. I am unsure of how to troubleshoot
> this issue and am happy to follow any steps required.

What we've established is that GResolver is working fine outside WebKit, but WebKit is nevertheless blocking for very long periods of time when making DNS requests. So whatever is going wrong is happening at either libsoup level or WebKit level. I guess WebKit is more likely.

I wonder if the NetworkProcess is somehow struggling or blocking on something unexpected, but tbh I don't know how to debug this either.
Comment 3 Michael Catanzaro 2021-11-18 08:47:00 PST
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1842443 is the proof that something is going wrong with DNS.
Comment 4 Michael Catanzaro 2021-11-18 09:38:20 PST
So looking at where the timings come from, I found NetworkDataTaskSoup.cpp, which is just watching GSocketClientEvents. Next step might be to add some debug there to see what it's doing. Alternatively, we could write a test script that just tests GSocketClient on its own, though I have a suspicion that it will run fine, just like the GResolver test did.
Comment 5 Michael Catanzaro 2021-11-18 09:40:00 PST
BTW just to be certain: no network proxies configured, yes?
Comment 6 Trev 2021-11-18 12:47:45 PST
(In reply to Michael Catanzaro from comment #5)
> BTW just to be certain: no network proxies configured, yes?

Other than using public DNS, no proxies are being used. Problem was reproduced with both Google and Cloudflare public DNS services. As you know, Fedora tends to serve resolutions from a local stub after the fact.
Comment 7 Michael Catanzaro 2021-11-18 13:07:54 PST
Yeah, no proxies then. Network proxies are different. ;)
Comment 8 Michael Catanzaro 2021-11-22 08:52:53 PST
Created attachment 444970 [details]
Another test script

One more test script. This is to see if we can reproduce the problem using GSocketClient directly.

Note this test script discovered a strange bug in GSocketClient: it is emitting the G_SOCKET_CLIENT_CONNECTING event twice per connection, which is weird because we ought to have tests to notice that.
Comment 9 Michael Catanzaro 2021-11-22 10:33:33 PST
(In reply to Michael Catanzaro from comment #8) 
> Note this test script discovered a strange bug in GSocketClient: it is
> emitting the G_SOCKET_CLIENT_CONNECTING event twice per connection, which is
> weird because we ought to have tests to notice that.

Apparently it's actually allowed, per the documentation. This happens when connecting to a particular IP address times out, and the next one in the list is attempted.
Comment 10 Trev 2021-11-22 11:07:34 PST
(In reply to Michael Catanzaro from comment #8)
> Created attachment 444970 [details]
> Another test script
> 
> One more test script. This is to see if we can reproduce the problem using
> GSocketClient directly.
> 
> Note this test script discovered a strange bug in GSocketClient: it is
> emitting the G_SOCKET_CLIENT_CONNECTING event twice per connection, which is
> weird because we ought to have tests to notice that.

Output:

```
λ ./test-socket-client
Gjs-Message: 11:03:57.847: JS LOG: Connecting to example.com...
Gjs-Message: 11:03:57.849: JS LOG: Resolving example.com...
Gjs-Message: 11:03:57.850: JS LOG: Connecting to discussion.fedoraproject.org...
Gjs-Message: 11:03:57.850: JS LOG: Resolving discussion.fedoraproject.org...
Gjs-Message: 11:04:01.651: JS LOG: Resolved example.com after 3802.394 ms
Gjs-Message: 11:04:01.652: JS LOG: TCP/IP connecting to example.com...
Gjs-Message: 11:04:01.673: JS LOG: TCP/IP connected to example.com after 21.029 ms
Gjs-Message: 11:04:01.730: JS LOG: TLS handshaking for example.com...
Gjs-Message: 11:04:01.756: JS LOG: TLS handshaked for example.com after 26.555 ms
Gjs-Message: 11:04:01.757: JS LOG: Finished connection to example.com after 3909.739 ms total
Gjs-Message: 11:04:05.401: JS LOG: Resolved discussion.fedoraproject.org after 7550.626 ms
Gjs-Message: 11:04:05.401: JS LOG: TCP/IP connecting to discussion.fedoraproject.org...
Gjs-Message: 11:04:05.417: JS LOG: TCP/IP connected to discussion.fedoraproject.org after 15.541 ms
Gjs-Message: 11:04:05.420: JS LOG: TLS handshaking for discussion.fedoraproject.org...
Gjs-Message: 11:04:05.468: JS LOG: TLS handshaked for discussion.fedoraproject.org after 48.148 ms
Gjs-Message: 11:04:05.468: JS LOG: Finished connection to discussion.fedoraproject.org after 7618.382 ms total
```
Comment 11 Michael Catanzaro 2021-11-22 11:38:46 PST
OK, so not WebKit after all. Wasn't expecting that.

This is weird: how is it that GResolver can resolve the hostnames quickly, but GSocketClient goes super slow? That doesn't make sense, because it's just using GResolver.

Whatever. This is actually good because debugging GSocketClient should be way easier than debugging WebKit. Let's continue back in https://bugzilla.redhat.com/show_bug.cgi?id=2024330.