Bug 262794

Summary: [WPE][GTK] Web process cache suspend/resume does not work, WebProcessProxy::processIdentifier is not the pid of the actual web process
Product: WebKit Reporter: Michael Catanzaro <mcatanzaro>
Component: WebKitGTKAssignee: Diego Pino <dpino>
Status: RESOLVED FIXED    
Severity: Normal CC: bugs-noreply, mcatanzaro, pgriffis
Priority: P2    
Version: WebKit Nightly Build   
Hardware: PC   
OS: Linux   
See Also: https://bugs.webkit.org/show_bug.cgi?id=266489
https://bugs.webkit.org/show_bug.cgi?id=279026
https://bugs.webkit.org/show_bug.cgi?id=280014

Description Michael Catanzaro 2023-10-06 09:27:13 PDT
In https://commits.webkit.org/247615@main we added code to send SIGSTOP and SIGCONT to the web process.

However, it's broken because the pid tracked is actually the pid of the bwrap or flatpak-spawn process, not the real web process. The actual web process is never stopped unless the sandbox is disabled using WEBKIT_DISABLE_SANDBOX_THIS_IS_DANGEROUS=1.
Comment 1 Michael Catanzaro 2023-10-06 09:33:07 PDT
I think every function that uses WebProcessProxy::processIdentifier is broken if it attempts to use the identifier as a pid rather than just a unique identifier.

P.S. This explains why I have a huge number of stopped flatpak-spawn processes on my system. :)
Comment 2 Patrick Griffis 2023-10-06 10:06:56 PDT
So for the bubblewrap sandbox we can get the real pid fairly easily.

`bwrap` takes `--info-fd` which outputs something like this:

   {
        "child-pid": 310369,
        "cgroup-namespace": 4026533658,
        "ipc-namespace": 4026533656,
        "mnt-namespace": 4026533654,
        "net-namespace": 4026533661,
        "pid-namespace": 4026533657,
        "uts-namespace": 4026533655
    }

We can then use this PID for some operations. I don't think it necessarily makes sense to set `WebProcessProxy::processIdentifier` to this PID. Sometimes we may want to directly do things to the child PID, sometimes we may want to do it to bwrap. We'd just have to audit the current usage.

---

For Flatpak this is also easy, but I think it requires we stop using `flatpak-spawn`. It was the easy solution but we need to use the `org.freedesktop.portal.Flatpak` DBus API directly. It exposes a `SpawnSignal()` method.

Again I don't think we will have a directly usable PID for everything `WebProcessProxy::processIdentifier` represents but we can send it signals.
Comment 3 Michael Catanzaro 2023-12-15 08:13:27 PST
I'm not sure what to do here.

Auditing every single use of WebProcessProxy::processIdentifier is going to be disruptive. To confidently prevent bugs, we would indeed need to change it to ensure it cannot be converted to a pid. It would be *much* easier if we could instead just ensure we have the right pid of the WebKit child process (as seen within the UI process's pid namespace). I doubt we are ever going to want to perform operations on the bwrap process, because that's an implementation detail that the rest of WebKit does not know about.

Problem is org.freedesktop.portal.Flatpak is not going to work because the documentation says we cannot assume the returned pid is valid in the caller's pid namespace. So I guess we really do need to either (a) perform this audit and hope we do not actually need real pids except for things that can be handled by org.freedesktop.portal.Flatpak, or (b) change flatpak to provide us a usable pid.
Comment 4 Michael Catanzaro 2023-12-15 08:24:40 PST
Wait, WebProcessProxy::processIdentifier is not a pid. It's just an int that increments each time a new auxiliary process is created. There has been some refactoring since 247615@main. Now we have a new WTF::ProcessID class to represent pids. We need to audit this instead, not WebProcessProxy::processIdentifier.
Comment 5 Michael Catanzaro 2023-12-15 08:50:30 PST
WebCore::processIdentifier is fine too.

I see a *lot* of things using WTF::ProcessID, but it's mostly used after calling getCurrentProcessID(), which is fine. The problem is specifically AuxiliaryProcessProxy::processID, which is where the bogus pids are coming from. Still, I doubt fixing this is realistic. We probably need to change flatpak. Is it really not possible for it to return results in the caller's pid namespace? :/

Until we are able to fix this, I think we should change ProcessLauncherGLib to set m_processID = 0 instead of setting the wrong pid. There are just too many possible bugs otherwise. This will either prevent or expose them all. E.g. the first one I found: ProcessLauncher::terminateProcess will currently only terminate the flatpak-spawn supervisor process and not the actual WebKit child process; presumably that will cause the WebKit process to leak, rather than terminate?
Comment 6 Michael Catanzaro 2023-12-15 09:38:00 PST
Temporary workaround in bug #266489
Comment 7 Michael Catanzaro 2024-05-16 08:25:58 PDT
We might want to disable web process cache altogether, see bug #274261.
Comment 8 Michael Catanzaro 2024-05-23 11:06:05 PDT
(In reply to Michael Catanzaro from comment #3)
> Problem is org.freedesktop.portal.Flatpak is not going to work because the
> documentation says we cannot assume the returned pid is valid in the
> caller's pid namespace.

Actually there is a flag FLATPAK_SPAWN_FLAGS_EXPOSE_PIDS that we should be able to use!
Comment 9 Michael Catanzaro 2024-06-10 09:07:18 PDT
(In reply to Patrick Griffis from comment #2)
> So for the bubblewrap sandbox we can get the real pid fairly easily.
> 
> `bwrap` takes `--info-fd` which outputs something like this:
> 
>    {
>         "child-pid": 310369,
>         "cgroup-namespace": 4026533658,
>         "ipc-namespace": 4026533656,
>         "mnt-namespace": 4026533654,
>         "net-namespace": 4026533661,
>         "pid-namespace": 4026533657,
>         "uts-namespace": 4026533655
>     }
> 
> We can then use this PID for some operations. I don't think it necessarily
> makes sense to set `WebProcessProxy::processIdentifier` to this PID.
> Sometimes we may want to directly do things to the child PID, sometimes we
> may want to do it to bwrap. We'd just have to audit the current usage.

Carlos Garcia noticed that this pid is actually the pid of the intermediate bwrap process that gets used as pid 1 in the container, not the pid of the actual web process. We might need to add a new way to get the pid of the actual child.
Comment 10 Michael Catanzaro 2024-06-10 10:24:02 PDT
Somebody solved this problem already: https://github.com/containers/bubblewrap/pull/576
Comment 11 Michael Catanzaro 2024-06-13 09:47:49 PDT
Pull request: https://github.com/WebKit/WebKit/pull/29787
Comment 12 Michael Catanzaro 2024-06-13 10:23:36 PDT
(In reply to Michael Catanzaro from comment #8)
> Actually there is a flag FLATPAK_SPAWN_FLAGS_EXPOSE_PIDS that we should be
> able to use!

So this is necessary but not sufficient. It allows the parent process to see pids from the child's pid namespace, but we still need a guarantee that the pid returned is valid in the parent namespace. So the org.freedesktop.portal.Flatpak API is not going to work. I'm out of ideas.

It's possible to fix just suspend/resume if we use this pid with the SpawnSignal method call, instead of calling kill() manually. But that's surely not good enough as there could be plenty of other bugs and broken features if WebKit's understanding of the child process pid is not correct. I think we're going to need changes to flatpak to fix this.
Comment 13 Michael Catanzaro 2024-06-13 15:10:59 PDT
What that bwrap pull request does to get the command-pid is send Unix credentials via a socket. The kernel then translates the pid between pid namespaces automatically. We could possibly do that, but it would have to be done by the sandboxed process *after* exec() to avoid requiring a child setup function.
Comment 14 Michael Catanzaro 2024-06-20 13:00:30 PDT
Solution: instead of depending on a new bubblewrap, just have the WebKit subprocess manually send credentials to the parent process via a socket. Problem solved.
Comment 15 Michael Catanzaro 2024-07-14 11:30:11 PDT
Good news: this bug is the cause of our longstanding bwrap and xdg-dbus-proxy process leaks. The subprocesses are not actually being leaked as I had thought; they're just being stopped with SIGSTOP in error. They quit as expected if I manually send SIGCONT. So fixing this should solve that.
Comment 16 EWS 2024-07-29 01:11:09 PDT
Committed 281488@main (c3c75352f0ee): <https://commits.webkit.org/281488@main>

Reviewed commits have been landed. Closing PR #29787 and removing active labels.
Comment 17 Diego Pino 2024-08-01 06:23:21 PDT
Re-opening for pull request https://github.com/webkit/webkit/pull/31596
Comment 18 EWS 2024-08-01 06:39:36 PDT
Committed 281704@main (04cc3a119676): <https://commits.webkit.org/281704@main>

Reviewed commits have been landed. Closing PR #31596 and removing active labels.