Bug 193179

Summary: Performance issue with WTF::Lock
Product: WebKit Reporter: stix.dima
Component: PlatformAssignee: Nobody <webkit-unassigned>
Status: RESOLVED INVALID    
Severity: Normal CC: don.olmstead, fpizlo, ggaren, keith_miller, mark.lam, ysuzuki
Priority: P2    
Version: WebKit Nightly Build   
Hardware: All   
OS: All   

Description stix.dima 2019-01-06 12:55:22 PST
I came across this article https://webkit.org/blog/6161/locking-in-webkit/ and I instantly spotted that there is a huge design flaw in SpinLock.
Both Intel and AMD provide guidelines for efficient SpinLock design and one of such articles was even referred in that post, however it was misunderstood.
So I'll take this https://trac.webkit.org/browser/trunk/Source/WTF/benchmarks/ToyLocks.h#L88 as reference code and will try to explain, why it's wrong and how it has to be designed.
while (!m_lock.compareExchangeWeak(0, 1, std::memory_order_acquire))
     asm volatile ("pause");
1. It always tries to do atomic cas, if first attemp has failed, most likely second one will fail too.
xchg instruction might be expensive. Modern CPU will try to lock cache in order to do atomic operation, however if data is not in cache or crosses cache boundaries, it will lock data bus. There is a chance that the first lock attempt will be a cache miss, so there will be a data bus lock.
2. It's platform specific - i.e. x86 instruction pause.

To avoid these issues we have to do a volatile read on lock first and if it's available - we should try to lock it with atomic cas.
i.e.
while (true) {
	if (m_lock.load(std::memory_order_relaxed) == 0) {
		if (m_lock.compareExchangeWeak(0, 1, std::memory_order_acquire)) {
			return;
		}
	}

	BusyWaitYield(); // either pause for x86 or yield for arm
}

3. Now let's move on to back off policy in case of failed attempt to lock spinlock. WTF::Lock uses Thread::yield() which, I assume, is either SwitchToThread on Windows or sched_yield on other systems. As far as I understand SwitchToThread will do a context switch (VERY expensive) if there are other threads waiting to be executed, meanwhile sched_yield will do a context switch only if there are other threads of the same or higher priority. i.e. thread scheduling on Windows and POSIX systems is different and there might be different performance hits. SpinLock should not protect code that takes a lot of time to execute and, thus, you don't want to have a context switch on a code that takes 10 times less to execute.
So locks should not use Thread::Yield, but should use pause or yield asm instructions that were designed specifically for waiting in busy loops.

4. Now let's talk about artificial benchmark from article. It clearly shows performance advantage of yield vs pause. That test was done on Broadwell? chip, where pause instruction takes 12-15 cycles (from the top of my head), while lock cmpxchg takes 20 cycles. It's clear, that CPU spends more time on locking cache than busy waiting. A call to sched_yield most likely lasts for a hundred+ cycles somewhere in kernel checking scheduler queue just to find there are no other threads to run (it's artificial benchmark, where will it find free threads?!), so less time is spent locking cache and giving more time for entire process to advance.

5. Magic number 40 in spin loop. Quote from the article:
"We suspect that this algorithm, including the spin limit set at 40, is portable enough for our needs. JikesRVM experiments found it to be optimal on a 12-way POWER machine in 1999."
This is just a joke. You can't rely on some benchmarks done on absolutely different platform 20 years ago. HW and SF changed a lot in this time. What was good back in the days is evil today.
At this point I assume it's clear that one should use pause\yield instruction in spin loop. So now it's time to decide how long should you spin. And the truth is there is no magical number for the amount of spin counts.
Every CPU has different latency for pause instruction. For example Intel CPUs before Skylake spent 12-15 cycles on pause, starting with Skylake pause takes 150 cycles. And this is why you might have another big performance problem if you'll hardcode any magic number for spin counts. The real jedi solution to this problem is to benchmark duration of pause instruction on process startup and spin for fixed magic number of cycles. This is how one should benchmark individual instructions on x86 https://www.agner.org/optimize/#testp Unfortunately I don't know how to do that on ARM CPUs.

I think that's all, I hope I didn't miss any thing.

Disclaimer. I am not using web kit in any way, so I am not familiar with it's code so I don't give you any patch. Setting up environment for WebKit will take more time than fixing this issue, so I decided to provide detailed description about what's wrong, why and how it can be fixed. However I can roll my own benchmarks on Windows comparing performance of your code and my code on a few different processors.
Discuss!

Regards,
Dmytro Vovk
Comment 1 stix.dima 2019-01-06 12:59:58 PST
I forgot to mention one important thing. Artificial benchmarks doesn't always represent performance behavior in real system. I have seen code that performs great in benchmarks but runs 5 times slower in real system because of low level hardware details. So always benchmark real systems
Comment 2 Geoffrey Garen 2019-01-10 09:57:17 PST
We use empirical measurements to make performance decisions. But there's no measurement of a performance problem in this bug report.