Bug 262794
Summary: | [WPE][GTK] Web process cache suspend/resume does not work, WebProcessProxy::processIdentifier is not the pid of the actual web process | ||
---|---|---|---|
Product: | WebKit | Reporter: | Michael Catanzaro <mcatanzaro> |
Component: | WebKitGTK | Assignee: | Diego Pino <dpino> |
Status: | RESOLVED FIXED | ||
Severity: | Normal | CC: | bugs-noreply, mcatanzaro, pgriffis |
Priority: | P2 | ||
Version: | WebKit Nightly Build | ||
Hardware: | PC | ||
OS: | Linux | ||
See Also: |
https://bugs.webkit.org/show_bug.cgi?id=266489 https://bugs.webkit.org/show_bug.cgi?id=279026 https://bugs.webkit.org/show_bug.cgi?id=280014 |
Michael Catanzaro
In https://commits.webkit.org/247615@main we added code to send SIGSTOP and SIGCONT to the web process.
However, it's broken because the pid tracked is actually the pid of the bwrap or flatpak-spawn process, not the real web process. The actual web process is never stopped unless the sandbox is disabled using WEBKIT_DISABLE_SANDBOX_THIS_IS_DANGEROUS=1.
Attachments | ||
---|---|---|
Add attachment proposed patch, testcase, etc. |
Michael Catanzaro
I think every function that uses WebProcessProxy::processIdentifier is broken if it attempts to use the identifier as a pid rather than just a unique identifier.
P.S. This explains why I have a huge number of stopped flatpak-spawn processes on my system. :)
Patrick Griffis
So for the bubblewrap sandbox we can get the real pid fairly easily.
`bwrap` takes `--info-fd` which outputs something like this:
{
"child-pid": 310369,
"cgroup-namespace": 4026533658,
"ipc-namespace": 4026533656,
"mnt-namespace": 4026533654,
"net-namespace": 4026533661,
"pid-namespace": 4026533657,
"uts-namespace": 4026533655
}
We can then use this PID for some operations. I don't think it necessarily makes sense to set `WebProcessProxy::processIdentifier` to this PID. Sometimes we may want to directly do things to the child PID, sometimes we may want to do it to bwrap. We'd just have to audit the current usage.
---
For Flatpak this is also easy, but I think it requires we stop using `flatpak-spawn`. It was the easy solution but we need to use the `org.freedesktop.portal.Flatpak` DBus API directly. It exposes a `SpawnSignal()` method.
Again I don't think we will have a directly usable PID for everything `WebProcessProxy::processIdentifier` represents but we can send it signals.
Michael Catanzaro
I'm not sure what to do here.
Auditing every single use of WebProcessProxy::processIdentifier is going to be disruptive. To confidently prevent bugs, we would indeed need to change it to ensure it cannot be converted to a pid. It would be *much* easier if we could instead just ensure we have the right pid of the WebKit child process (as seen within the UI process's pid namespace). I doubt we are ever going to want to perform operations on the bwrap process, because that's an implementation detail that the rest of WebKit does not know about.
Problem is org.freedesktop.portal.Flatpak is not going to work because the documentation says we cannot assume the returned pid is valid in the caller's pid namespace. So I guess we really do need to either (a) perform this audit and hope we do not actually need real pids except for things that can be handled by org.freedesktop.portal.Flatpak, or (b) change flatpak to provide us a usable pid.
Michael Catanzaro
Wait, WebProcessProxy::processIdentifier is not a pid. It's just an int that increments each time a new auxiliary process is created. There has been some refactoring since 247615@main. Now we have a new WTF::ProcessID class to represent pids. We need to audit this instead, not WebProcessProxy::processIdentifier.
Michael Catanzaro
WebCore::processIdentifier is fine too.
I see a *lot* of things using WTF::ProcessID, but it's mostly used after calling getCurrentProcessID(), which is fine. The problem is specifically AuxiliaryProcessProxy::processID, which is where the bogus pids are coming from. Still, I doubt fixing this is realistic. We probably need to change flatpak. Is it really not possible for it to return results in the caller's pid namespace? :/
Until we are able to fix this, I think we should change ProcessLauncherGLib to set m_processID = 0 instead of setting the wrong pid. There are just too many possible bugs otherwise. This will either prevent or expose them all. E.g. the first one I found: ProcessLauncher::terminateProcess will currently only terminate the flatpak-spawn supervisor process and not the actual WebKit child process; presumably that will cause the WebKit process to leak, rather than terminate?
Michael Catanzaro
Temporary workaround in bug #266489
Michael Catanzaro
We might want to disable web process cache altogether, see bug #274261.
Michael Catanzaro
(In reply to Michael Catanzaro from comment #3)
> Problem is org.freedesktop.portal.Flatpak is not going to work because the
> documentation says we cannot assume the returned pid is valid in the
> caller's pid namespace.
Actually there is a flag FLATPAK_SPAWN_FLAGS_EXPOSE_PIDS that we should be able to use!
Michael Catanzaro
(In reply to Patrick Griffis from comment #2)
> So for the bubblewrap sandbox we can get the real pid fairly easily.
>
> `bwrap` takes `--info-fd` which outputs something like this:
>
> {
> "child-pid": 310369,
> "cgroup-namespace": 4026533658,
> "ipc-namespace": 4026533656,
> "mnt-namespace": 4026533654,
> "net-namespace": 4026533661,
> "pid-namespace": 4026533657,
> "uts-namespace": 4026533655
> }
>
> We can then use this PID for some operations. I don't think it necessarily
> makes sense to set `WebProcessProxy::processIdentifier` to this PID.
> Sometimes we may want to directly do things to the child PID, sometimes we
> may want to do it to bwrap. We'd just have to audit the current usage.
Carlos Garcia noticed that this pid is actually the pid of the intermediate bwrap process that gets used as pid 1 in the container, not the pid of the actual web process. We might need to add a new way to get the pid of the actual child.
Michael Catanzaro
Somebody solved this problem already: https://github.com/containers/bubblewrap/pull/576
Michael Catanzaro
Pull request: https://github.com/WebKit/WebKit/pull/29787
Michael Catanzaro
(In reply to Michael Catanzaro from comment #8)
> Actually there is a flag FLATPAK_SPAWN_FLAGS_EXPOSE_PIDS that we should be
> able to use!
So this is necessary but not sufficient. It allows the parent process to see pids from the child's pid namespace, but we still need a guarantee that the pid returned is valid in the parent namespace. So the org.freedesktop.portal.Flatpak API is not going to work. I'm out of ideas.
It's possible to fix just suspend/resume if we use this pid with the SpawnSignal method call, instead of calling kill() manually. But that's surely not good enough as there could be plenty of other bugs and broken features if WebKit's understanding of the child process pid is not correct. I think we're going to need changes to flatpak to fix this.
Michael Catanzaro
What that bwrap pull request does to get the command-pid is send Unix credentials via a socket. The kernel then translates the pid between pid namespaces automatically. We could possibly do that, but it would have to be done by the sandboxed process *after* exec() to avoid requiring a child setup function.
Michael Catanzaro
Solution: instead of depending on a new bubblewrap, just have the WebKit subprocess manually send credentials to the parent process via a socket. Problem solved.
Michael Catanzaro
Good news: this bug is the cause of our longstanding bwrap and xdg-dbus-proxy process leaks. The subprocesses are not actually being leaked as I had thought; they're just being stopped with SIGSTOP in error. They quit as expected if I manually send SIGCONT. So fixing this should solve that.
EWS
Committed 281488@main (c3c75352f0ee): <https://commits.webkit.org/281488@main>
Reviewed commits have been landed. Closing PR #29787 and removing active labels.
Diego Pino
Re-opening for pull request https://github.com/webkit/webkit/pull/31596
EWS
Committed 281704@main (04cc3a119676): <https://commits.webkit.org/281704@main>
Reviewed commits have been landed. Closing PR #31596 and removing active labels.