Bug 240006

Summary: [GTK][WPE] Expensive atomic operations, overabundant semaphore signaling in GPUProcess streaming IPC
Product: WebKit
Component: New Bugs
Status: NEW
Severity: Normal
Priority: P2
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Reporter: Zan Dobersek <zan>
Assignee: Nobody <webkit-unassigned>
CC: kdwkleung, kkinnunen
Bug Depends on: 239895
Bug Blocks: 238593
Attachments:
  Flattened WebProcess perf report
  Flattened GPUProcess perf report

Description Zan Dobersek 2022-05-03 06:26:19 PDT
As the messages traveling over the streaming IPC channel between the WebProcess and the GPUProcess grow in number (e.g. due to the complexity of the scene), the atomic operations that load and exchange the client and server offset values start taking a considerable chunk of CPU time. Additionally, the WebProcess signals the semaphore aggressively whenever data is flushed for the GPUProcess to pick up and process, to the point where the write syscall becomes visible in the profiling output.

This has only been observed on Linux. The atomic ops might affect Cocoa ports as well, but the signalling of the eventfd-based semaphore is specific to Linux, at least in how it turns into wasted time.

For now this bug just documents the problem; no solution has been devised yet.
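
In rough pseudo-code, the pattern in question looks like this (a minimal sketch with made-up names, not the actual StreamClientConnection/StreamServerConnection code):

    #include <atomic>
    #include <cstdint>
    #include <unistd.h>

    // Illustrative stand-in for the shared stream buffer bookkeeping: two
    // atomic offsets in shared memory, plus an eventfd acting as the
    // wake-up semaphore on Linux.
    struct StreamOffsetsSketch {
        std::atomic<uint64_t> clientOffset { 0 }; // advanced by the WebProcess writer
        std::atomic<uint64_t> serverOffset { 0 }; // advanced by the GPUProcess reader
    };

    // WebProcess side: publish the newly written data, then wake the reader.
    void flushAndSignal(StreamOffsetsSketch& offsets, int eventFD, uint64_t newClientOffset)
    {
        // The loads/exchanges on these shared offsets are the atomic traffic
        // that shows up under tryAcquire/release in the attached profiles.
        offsets.clientOffset.store(newClientOffset, std::memory_order_release);

        // Signalling the eventfd-based semaphore comes down to a write(2);
        // issued on every flush, the syscall becomes visible in perf.
        uint64_t one = 1;
        ssize_t result = write(eventFD, &one, sizeof(one));
        (void)result; // error handling elided in this sketch
    }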
Comment 1 Zan Dobersek 2022-05-03 06:50:56 PDT
CPU load for the same WebGL workload:

GPUProcess disabled:

 Performance counter stats for process id '3599069' (WebProcess):

          5,035.64 msec task-clock                #    0.629 CPUs utilized          
             7,635      context-switches          #    1.516 K/sec                  
                35      cpu-migrations            #    6.950 /sec                   
               180      page-faults               #   35.745 /sec                   
     6,841,684,581      cycles                    #    1.359 GHz                    
     8,039,991,760      instructions              #    1.18  insn per cycle         
     1,787,173,079      branches                  #  354.905 M/sec                  
        52,361,168      branch-misses             #    2.93% of all branches        

       8.003704734 seconds time elapsed

GPUProcess enabled:

 Performance counter stats for process id '3601229' (WebProcess):

          4,166.83 msec task-clock                #    0.521 CPUs utilized          
             4,042      context-switches          #  970.042 /sec                   
                18      cpu-migrations            #    4.320 /sec                   
               209      page-faults               #   50.158 /sec                   
     5,013,351,640      cycles                    #    1.203 GHz                    
     6,129,645,128      instructions              #    1.22  insn per cycle         
     1,387,168,345      branches                  #  332.907 M/sec                  
        50,932,209      branch-misses             #    3.67% of all branches        

       8.004450885 seconds time elapsed

 Performance counter stats for process id '3601322' (GPUProcess):

          2,795.15 msec task-clock                #    0.349 CPUs utilized          
           149,970      context-switches          #   53.654 K/sec                  
                17      cpu-migrations            #    6.082 /sec                   
               105      page-faults               #   37.565 /sec                   
     5,308,481,612      cycles                    #    1.899 GHz                    
     7,038,762,193      instructions              #    1.33  insn per cycle         
     1,542,811,483      branches                  #  551.959 M/sec                  
        16,204,399      branch-misses             #    1.05% of all branches        

       8.003696078 seconds time elapsed
Comment 2 Zan Dobersek 2022-05-03 06:54:56 PDT
Created attachment 458738 [details]
Flattened WebProcess perf report
Comment 3 Zan Dobersek 2022-05-03 06:55:15 PDT
Created attachment 458739 [details]
Flattened GPUProcess perf report
Comment 4 Zan Dobersek 2022-05-03 07:02:23 PDT
(In reply to Zan Dobersek from comment #2)
> Created attachment 458738 [details]
> Flattened WebProcess perf report

(In reply to Zan Dobersek from comment #3)
> Created attachment 458739 [details]
> Flattened GPUProcess perf report

These show, for each process, where time is spent when GPUProcess mode is active.

StreamClientConnection and StreamServerConnection methods operating on the buffer offset atomics are marked never-inline here, to isolate those atomic ops in the profile as much as possible.
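
For reference, the isolation amounts to something like this standalone sketch (an illustrative function, not the real method; WTF's NEVER_INLINE macro boils down to the same attribute on GCC/Clang):

    #include <atomic>
    #include <cstdint>

    static std::atomic<uint64_t> sharedOffset { 0 };

    // Keeping the offset operation out of line means it appears as its own
    // symbol in the perf report instead of being folded into every caller.
    __attribute__((noinline)) uint64_t loadSharedOffsetForProfiling()
    {
        return sharedOffset.load(std::memory_order_acquire);
    }
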
In the WebProcess:
     3.44%     3.42%  WPEWebProcess    libWPEWebKit-1.0.so.3.17.0                         [.] IPC::StreamClientConnection::release
     2.34%     2.32%  WPEWebProcess    libWPEWebKit-1.0.so.3.17.0                         [.] IPC::StreamClientConnection::tryAcquire
In the GPUProcess:
     8.27%     8.13%  xtGL work queue  libWPEWebKit-1.0.so.3.17.0          [.] IPC::StreamServerConnection::release
     2.62%     2.59%  xtGL work queue  libWPEWebKit-1.0.so.3.17.0          [.] IPC::StreamServerConnection::tryAcquire

Then, for semaphore signalling, in the WebProcess:
    17.96%     0.33%  WPEWebProcess    libWPEWebKit-1.0.so.3.17.0                         [.] IPC::Semaphore::signal

I suspect semaphore signalling could be improved more easily than the atomics, but both could use improvement. On Linux there are futexes, which roughly fit this use case, but not completely and not without a large amount of changes around this code.
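
To sketch the sort of futex-based scheme that might fit (hypothetical code, not a drop-in replacement for IPC::Semaphore; real code would also have to re-check the buffer state after waking):

    #include <atomic>
    #include <cstdint>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // Hypothetical futex-based wakeup: the writer pays for a wake syscall
    // only when the reader has actually parked, instead of write(2)-ing the
    // eventfd on every flush.
    static long futexOp(std::atomic<uint32_t>& word, int op, uint32_t value)
    {
        return syscall(SYS_futex, reinterpret_cast<uint32_t*>(&word), op, value, nullptr, nullptr, 0);
    }

    constexpr uint32_t idle = 0;
    constexpr uint32_t waiting = 1;

    void readerWait(std::atomic<uint32_t>& state)
    {
        state.store(waiting, std::memory_order_seq_cst);
        // FUTEX_WAIT returns immediately (EAGAIN) if a wake already flipped
        // the word back to idle, so the wakeup cannot be lost; a real
        // implementation would loop and re-check the ring buffer here.
        futexOp(state, FUTEX_WAIT, waiting);
    }

    void writerSignal(std::atomic<uint32_t>& state)
    {
        // The syscall is skipped entirely when no reader is parked.
        if (state.exchange(idle, std::memory_order_seq_cst) == waiting)
            futexOp(state, FUTEX_WAKE, 1);
    }

The exchange-then-wake dance means the FUTEX_WAKE syscall is only paid when the reader has actually parked itself, whereas the current scheme pays a write(2) on every flush.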
Comment 5 Kimmo Kinnunen 2022-05-23 04:09:38 PDT
You could check whether bug 239895 takes care of some of the semaphore signalling overhead.
Comment 6 Kimmo Kinnunen 2022-05-23 04:41:19 PDT
I believe perf can also show the slow paths inside the acquire, release, and related functions, so it would be interesting to see which parts it thinks are slow.