Bug 240006 - [GTK][WPE] Expensive atomic operations, overabundant semaphore signaling in GPUProcess streaming IPC
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs
Version: WebKit Nightly Build
Hardware: Unspecified, OS: Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 239895
Blocks: 238593
Reported: 2022-05-03 06:26 PDT by Zan Dobersek
Modified: 2023-04-17 02:48 PDT
2 users

See Also:


Attachments
Flattened WebProcess perf report (17.75 KB, text/plain)
2022-05-03 06:54 PDT, Zan Dobersek
Flattened GPUProcess perf report (14.27 KB, text/plain)
2022-05-03 06:55 PDT, Zan Dobersek

Description Zan Dobersek 2022-05-03 06:26:19 PDT
As messages traveling on the streaming IPC channel between WebProcess and GPUProcess increase in count (e.g. due to the complexity of the scene), the atomic operations that load and exchange the client and server offset values start taking a considerable chunk of CPU time. Additionally, the WebProcess signals the semaphore aggressively when all the data has been flushed for the GPUProcess to pick up and process, to the point where the write syscall becomes visible in the profiling output.

This has only been observed on Linux. The atomic ops might affect Cocoa ports as well, but the signalling of the eventfd-based semaphore is specific to Linux, at least in how it turns into wasted time.

For now this bug just documents the problem; no solution has been devised yet.
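To illustrate where the atomic cost comes from, here is a minimal, hypothetical sketch of a streaming ring buffer with shared client/server offsets; it is not WebKit's actual IPC::StreamClientConnection implementation, and the names and layout are assumptions. The point is that every acquire and release step on the hot path involves a cross-core atomic load or store on the shared offsets:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a client/server streaming buffer. The client
// publishes writes through clientOffset; the server publishes consumed
// data through serverOffset. Offsets grow monotonically; the index into
// `data` would be offset % capacity.
struct StreamBuffer {
    static constexpr size_t capacity = 1 << 16;
    std::atomic<uint64_t> clientOffset { 0 }; // written by client, read by server
    std::atomic<uint64_t> serverOffset { 0 }; // written by server, read by client
    uint8_t data[capacity];

    // Client side: how much free space is available right now?
    size_t tryAcquire() const
    {
        // Cross-core acquire load of the other side's offset -- this is
        // the kind of operation that shows up in profiles once the
        // message rate climbs.
        uint64_t read = serverOffset.load(std::memory_order_acquire);
        uint64_t written = clientOffset.load(std::memory_order_relaxed);
        return static_cast<size_t>(capacity - (written - read));
    }

    // Client side: publish `size` freshly written bytes to the server.
    void release(size_t size)
    {
        clientOffset.fetch_add(size, std::memory_order_release);
    }
};
```

With both sides repeating this pair of operations per message, the cache-line ping-pong on the two offsets scales with message count, matching the symptom described above.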
Comment 1 Zan Dobersek 2022-05-03 06:50:56 PDT
CPU load on the same WebGL workload:

Disabled GPUProcess:

 Performance counter stats for process id '3599069' (WebProcess):

          5,035.64 msec task-clock                #    0.629 CPUs utilized          
             7,635      context-switches          #    1.516 K/sec                  
                35      cpu-migrations            #    6.950 /sec                   
               180      page-faults               #   35.745 /sec                   
     6,841,684,581      cycles                    #    1.359 GHz                    
     8,039,991,760      instructions              #    1.18  insn per cycle         
     1,787,173,079      branches                  #  354.905 M/sec                  
        52,361,168      branch-misses             #    2.93% of all branches        

       8.003704734 seconds time elapsed

Enabled GPUProcess:

 Performance counter stats for process id '3601229' (WebProcess):

          4,166.83 msec task-clock                #    0.521 CPUs utilized          
             4,042      context-switches          #  970.042 /sec                   
                18      cpu-migrations            #    4.320 /sec                   
               209      page-faults               #   50.158 /sec                   
     5,013,351,640      cycles                    #    1.203 GHz                    
     6,129,645,128      instructions              #    1.22  insn per cycle         
     1,387,168,345      branches                  #  332.907 M/sec                  
        50,932,209      branch-misses             #    3.67% of all branches        

       8.004450885 seconds time elapsed

 Performance counter stats for process id '3601322' (GPUProcess):

          2,795.15 msec task-clock                #    0.349 CPUs utilized          
           149,970      context-switches          #   53.654 K/sec                  
                17      cpu-migrations            #    6.082 /sec                   
               105      page-faults               #   37.565 /sec                   
     5,308,481,612      cycles                    #    1.899 GHz                    
     7,038,762,193      instructions              #    1.33  insn per cycle         
     1,542,811,483      branches                  #  551.959 M/sec                  
        16,204,399      branch-misses             #    1.05% of all branches        

       8.003696078 seconds time elapsed
Comment 2 Zan Dobersek 2022-05-03 06:54:56 PDT
Created attachment 458738 [details]
Flattened WebProcess perf report
Comment 3 Zan Dobersek 2022-05-03 06:55:15 PDT
Created attachment 458739 [details]
Flattened GPUProcess perf report
Comment 4 Zan Dobersek 2022-05-03 07:02:23 PDT
(In reply to Zan Dobersek from comment #2)
> Created attachment 458738 [details]
> Flattened WebProcess perf report

(In reply to Zan Dobersek from comment #3)
> Created attachment 458739 [details]
> Flattened GPUProcess perf report

These show, for each process, where time is spent when GPUProcess mode is active.

The StreamClientConnection and StreamServerConnection methods operating on the buffer offset atomics are marked never-inline so that the cost of those atomic ops is isolated as much as possible in the profiles.
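The never-inlining can be done with a compiler attribute so that perf attributes samples to a distinct symbol instead of folding them into callers; a minimal sketch (WebKit's WTF provides a NEVER_INLINE macro for this purpose, but the function below is purely illustrative):

```cpp
#include <atomic>
#include <cstddef>

// Portable-ish never-inline attribute for GCC/Clang.
#if defined(__GNUC__)
#define NEVER_INLINE __attribute__((noinline))
#else
#define NEVER_INLINE
#endif

std::atomic<std::size_t> g_offset { 0 };

// Kept out of line so its samples appear under their own symbol in
// `perf report` rather than being merged into the calling function.
NEVER_INLINE std::size_t loadOffset()
{
    return g_offset.load(std::memory_order_acquire);
}
```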
In WebProcess:
     3.44%     3.42%  WPEWebProcess    libWPEWebKit-1.0.so.3.17.0                         [.] IPC::StreamClientConnection::release
     2.34%     2.32%  WPEWebProcess    libWPEWebKit-1.0.so.3.17.0                         [.] IPC::StreamClientConnection::tryAcquire
in GPUProcess:
     8.27%     8.13%  xtGL work queue  libWPEWebKit-1.0.so.3.17.0          [.] IPC::StreamServerConnection::release
     2.62%     2.59%  xtGL work queue  libWPEWebKit-1.0.so.3.17.0          [.] IPC::StreamServerConnection::tryAcquire

Then, for semaphore signalling, in the WebProcess:
    17.96%     0.33%  WPEWebProcess    libWPEWebKit-1.0.so.3.17.0                         [.] IPC::Semaphore::signal

I suspect the semaphore signalling could be improved more easily than the atomics, but both could use improvement. On Linux there are futexes, which roughly fit this use case, but not completely and not without a large amount of changes around this code.
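For context on why the signalling is expensive: an eventfd-backed semaphore pays a write(2) syscall on every signal, unconditionally, which is consistent with IPC::Semaphore::signal reaching ~18% above. The sketch below is a hypothetical illustration of such a semaphore, not WebKit's actual implementation; a futex-style design could instead keep the counter in shared memory and only enter the kernel when a waiter is actually blocked:

```cpp
#include <sys/eventfd.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical eventfd-backed semaphore (Linux-only). Each signal() is
// an unconditional write syscall; each wait() is a blocking read that
// decrements the counter by one (EFD_SEMAPHORE semantics).
class EventFDSemaphore {
public:
    EventFDSemaphore()
        : m_fd(eventfd(0, EFD_SEMAPHORE)) { }
    ~EventFDSemaphore() { close(m_fd); }

    void signal()
    {
        uint64_t value = 1;
        // The syscall happens even if no one is waiting -- this is the
        // overhead a futex fast path could avoid.
        ssize_t rc = write(m_fd, &value, sizeof(value));
        (void)rc; // best-effort in this sketch
    }

    bool wait()
    {
        uint64_t value = 0;
        return read(m_fd, &value, sizeof(value)) == sizeof(value);
    }

private:
    int m_fd;
};
```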
Comment 5 Kimmo Kinnunen 2022-05-23 04:09:38 PDT
You could check whether bug 239895 takes care of some of the semaphore signal overhead.
Comment 6 Kimmo Kinnunen 2022-05-23 04:41:19 PDT
I believe perf can also show the slow paths inside the acquire, release, and related functions, so it would be interesting to see which parts it considers slow.