RESOLVED FIXED 204398
[Win] Use thread_local to hold Ref<WTF::Thread> in the thread rather than using FLS
https://bugs.webkit.org/show_bug.cgi?id=204398
Fujii Hironori
Reported 2019-11-19 23:49:59 PST
[Win] Use thread_local to hold Ref<WTF::Thread> in the thread rather than using FLS

thread_local seems faster than FLS.
https://en.cppreference.com/w/cpp/keyword/thread_local

11:09 <yusukesuzuki> fujihiro: do you know the performance difference between FLS and `thread_local` in Windows?
11:12 <fujihiro> yusukesuzuki: I don't know.
11:12 <yusukesuzuki> fujihiro: it would be possible that `thread_local` is faster than FLS. If it is faster, it could be nice using thread_local for Thread for Windows (and maybe Linux?).
11:13 <yusukesuzuki> For Darwin, we are using `_pthread_getspecific_direct`, which is super fast (basically it is the same to TLS Local-Exec semantics IIRC).
11:15 <yusukesuzuki> IIRC, facebook folly is saying thread_local is faster than pthread_getspecific, and IIRC, FLS in Windows is very slow. It is possible that we could get perf improvement if we switch using thread_local for Windows's current Thread holder.
11:22 <fujihiro> yusukesuzuki: Sounds interesting. Will check.
11:28 <fujihiro> yusukesuzuki: Do you mean WTF::ThreadSpecific also should use `thread_local`? If so, how can I ensure destructing WTF::Thread after WTF::ThreadSpecific?
11:29 <yusukesuzuki> fujihiro: I mean, just using `thread_local` only for Thread's holder.
11:29 <yusukesuzuki> fujihiro: not considering about using `thread_local` for the other ThreadSpecific.
11:29 <yusukesuzuki> fujihiro: maybe, for Linux, we should use pthread_getspecific because of ordering thing (pthread works as we expect).
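For readers unfamiliar with the idea discussed in the chat log, here is a minimal sketch of keeping a per-thread object in a `thread_local` holder instead of a key-based API such as FLS. It is only an illustration under stated assumptions: `FakeThread`, `currentFakeThread`, and `otherThreadGetsOwnInstance` are hypothetical names, `std::shared_ptr` stands in for `Ref<WTF::Thread>`, and none of this is the code of the actual patch.

```cpp
#include <cassert>
#include <memory>
#include <thread>

// Hypothetical stand-in for WTF::Thread; not the real class.
struct FakeThread {
    std::thread::id id = std::this_thread::get_id();
};

// Returns the calling thread's FakeThread, creating it on first use.
// The thread_local shared_ptr plays the role of the holder for Ref<WTF::Thread>
// (assumption: the real holder differs in detail).
FakeThread& currentFakeThread()
{
    thread_local std::shared_ptr<FakeThread> holder = std::make_shared<FakeThread>();
    return *holder;
}

// Returns true if a newly spawned thread observes its own instance,
// i.e. one whose recorded id differs from the caller's.
bool otherThreadGetsOwnInstance()
{
    std::thread::id mine = currentFakeThread().id;
    std::thread::id theirs = mine;
    std::thread worker([&] { theirs = currentFakeThread().id; });
    worker.join();
    return theirs != mine;
}
```

Calling `currentFakeThread()` twice on one thread yields the same object, while each new thread lazily constructs its own. The holder is destroyed at thread exit along with the thread's other `thread_local` objects, which is why the destruction-order question raised at 11:28 matters.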
Attachments
WIP patch (5.78 KB, patch)
2019-11-27 22:07 PST, Fujii Hironori
no flags
benchmark of ThreadSpecific (5.07 KB, patch)
2019-12-11 22:08 PST, Fujii Hironori
no flags
benchmark result of comment#2 patch (54.04 KB, application/pdf)
2019-12-11 22:09 PST, Fujii Hironori
no flags
Patch (9.26 KB, patch)
2019-12-16 04:05 PST, Fujii Hironori
no flags
Fujii Hironori
Comment 2 2019-11-27 22:07:54 PST
Created attachment 384438 [details] WIP patch
Fujii Hironori
Comment 3 2019-11-27 23:27:49 PST
Octane 2.0
base: 22402 23199 22336
patched: 22697 22272 22508

Speedometer 2.0
base: 65.0 65.71 65.5
patched: 64.9 65.21 64.9

Does it get slightly slower?
Yusuke Suzuki
Comment 4 2019-11-28 17:14:19 PST
Oh, that’s interesting, but I don’t know whether TLS can affect the performance of these benchmarks. Can we create a silly but somewhat plausible small benchmark app that uses the JSC APIs as an embedder? Every time we use the JSC APIs, we use JSLock, which uses TLS. Another fun thing would be running the locking microbenchmark included in the WebKit tree. Our ParkingLot uses TLS, so it could be affected.
Fujii Hironori
Comment 5 2019-12-01 23:37:06 PST
https://chromium.googlesource.com/chromium/src/+/e8f14977868c566c4d00666db010d1255e931c86
https://chromium-review.googlesource.com/c/chromium/src/+/1873751

> thread_local is already used in Chrome, though not widely as it is still banned
> by the style-guide. Add a benchmark to estimate its performance.
>
> Detailed results below, tldr:
> - On Linux: 2-3x faster than the current TLS implementation for reading, ~10x
>   faster for writing
> - On Android: ~4x faster for reading
> (...)

Re: [chromium-dev] Using C++11 thread_local storage class
https://groups.google.com/a/chromium.org/d/msg/cxx/h7O5BdtWCZw/NUhcP-DBBwAJ
https://groups.google.com/a/chromium.org/d/msg/cxx/h7O5BdtWCZw/Pa8v7RFBCAAJ

> I've done some performance comparisons between using thread_local
> and using WTF::ThreadSpecific, and there's a 2.5x performance
> improvement[1] on both Windows and Mac. The generated code is
> also significantly smaller, which is a nice benefit as well.
Fujii Hironori
Comment 6 2019-12-02 01:13:07 PST
I tried base_perftests on my PC.

Windows 10 Pro Version 10.0.18362 Build 18362
AMD Ryzen 7 1700 Eight-Core Processor, 3000 Mhz, 8 Core(s), 16 Logical Processor(s)

> autoninja -C out/Default base:base_perftests

C:\work\chromium\src>.\out\Default\base_perftests.exe --gtest_filter=ThreadLocalStoragePerfTest.*
Note: Google Test filter = ThreadLocalStoragePerfTest.*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from ThreadLocalStoragePerfTest
[ RUN ] ThreadLocalStoragePerfTest.ThreadLocalStorage
*RESULT TLS read throughput: ThreadLocalStorage= 26301.53127515084 operations/ms
*RESULT TLS read throughput: ThreadLocalStorage= 38 ns/operation
*RESULT TLS write throughput: ThreadLocalStorage= 26061.755936868 operations/ms
*RESULT TLS write throughput: ThreadLocalStorage= 38 ns/operation
*RESULT TLS read-write throughput: ThreadLocalStorage= 12702.880886356217 operations/ms
*RESULT TLS read-write throughput: ThreadLocalStorage= 78 ns/operation
*RESULT TLS read throughput: ThreadLocalStorage 4 threads= 23584.01569752085 operations/ms
*RESULT TLS read throughput: ThreadLocalStorage 4 threads= 42 ns/operation
*RESULT TLS write throughput: ThreadLocalStorage 4 threads= 23240.46443744132 operations/ms
*RESULT TLS write throughput: ThreadLocalStorage 4 threads= 43 ns/operation
*RESULT TLS read-write throughput: ThreadLocalStorage 4 threads= 11242.043443752684 operations/ms
*RESULT TLS read-write throughput: ThreadLocalStorage 4 threads= 88 ns/operation
[ OK ] ThreadLocalStoragePerfTest.ThreadLocalStorage (3311 ms)
[ RUN ] ThreadLocalStoragePerfTest.PlatformFls
*RESULT TLS read throughput: PlatformFls= 142934.7359995426 operations/ms
*RESULT TLS read throughput: PlatformFls= 6 ns/operation
*RESULT TLS write throughput: PlatformFls= 113831.688465435 operations/ms
*RESULT TLS write throughput: PlatformFls= 8 ns/operation
*RESULT TLS read-write throughput: PlatformFls= 64341.78355424013 operations/ms
*RESULT TLS read-write throughput: PlatformFls= 15 ns/operation
*RESULT TLS read throughput: PlatformFls 4 threads= 136150.74610608869 operations/ms
*RESULT TLS read throughput: PlatformFls 4 threads= 7 ns/operation
*RESULT TLS write throughput: PlatformFls 4 threads= 106626.85930585915 operations/ms
*RESULT TLS write throughput: PlatformFls 4 threads= 9 ns/operation
*RESULT TLS read-write throughput: PlatformFls 4 threads= 55406.57343587243 operations/ms
*RESULT TLS read-write throughput: PlatformFls 4 threads= 18 ns/operation
[ OK ] ThreadLocalStoragePerfTest.PlatformFls (677 ms)
[ RUN ] ThreadLocalStoragePerfTest.PlatformTls
*RESULT TLS read throughput: PlatformTls= 289075.8245887896 operations/ms
*RESULT TLS read throughput: PlatformTls= 3 ns/operation
*RESULT TLS write throughput: PlatformTls= 182351.97578365763 operations/ms
*RESULT TLS write throughput: PlatformTls= 5 ns/operation
*RESULT TLS read-write throughput: PlatformTls= 127638.93498072651 operations/ms
*RESULT TLS read-write throughput: PlatformTls= 7 ns/operation
*RESULT TLS read throughput: PlatformTls 4 threads= 263511.55498168594 operations/ms
*RESULT TLS read throughput: PlatformTls 4 threads= 3 ns/operation
*RESULT TLS write throughput: PlatformTls 4 threads= 161124.00103119362 operations/ms
*RESULT TLS write throughput: PlatformTls 4 threads= 6 ns/operation
*RESULT TLS read-write throughput: PlatformTls 4 threads= 116123.7879579632 operations/ms
*RESULT TLS read-write throughput: PlatformTls 4 threads= 8 ns/operation
[ OK ] ThreadLocalStoragePerfTest.PlatformTls (370 ms)
[ RUN ] ThreadLocalStoragePerfTest.Cpp11Tls
*RESULT TLS read throughput: C++ thread_local TLS= 357449.24220760656 operations/ms
*RESULT TLS read throughput: C++ thread_local TLS= 2 ns/operation
*RESULT TLS write throughput: C++ thread_local TLS= 358628.60421747237 operations/ms
*RESULT TLS write throughput: C++ thread_local TLS= 2 ns/operation
*RESULT TLS read-write throughput: C++ thread_local TLS= 224039.43093984542 operations/ms
*RESULT TLS read-write throughput: C++ thread_local TLS= 4 ns/operation
*RESULT TLS read throughput: C++ thread_local TLS 4 threads= 327300.10146303143 operations/ms
*RESULT TLS read throughput: C++ thread_local TLS 4 threads= 3 ns/operation
*RESULT TLS write throughput: C++ thread_local TLS 4 threads= 109206.07185759529 operations/ms
*RESULT TLS write throughput: C++ thread_local TLS 4 threads= 9 ns/operation
*RESULT TLS read-write throughput: C++ thread_local TLS 4 threads= 51652.89256198347 operations/ms
*RESULT TLS read-write throughput: C++ thread_local TLS 4 threads= 19 ns/operation
[ OK ] ThreadLocalStoragePerfTest.Cpp11Tls (428 ms)
[----------] 4 tests from ThreadLocalStoragePerfTest (4787 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (4791 ms total)
[ PASSED ] 4 tests.

C:\work\chromium\src>
Fujii Hironori
Comment 7 2019-12-02 01:36:20 PST
In Comment 6, Cpp11Tls performed badly with 4 threads, but this seems to be an anomaly. I tested several times, and Cpp11Tls performs well.

Note: Google Test filter = ThreadLocalStoragePerfTest.*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from ThreadLocalStoragePerfTest
[ RUN ] ThreadLocalStoragePerfTest.ThreadLocalStorage
*RESULT TLS read throughput: ThreadLocalStorage= 26487.891060601647 operations/ms
*RESULT TLS read throughput: ThreadLocalStorage= 37 ns/operation
*RESULT TLS write throughput: ThreadLocalStorage= 25845.202742692916 operations/ms
*RESULT TLS write throughput: ThreadLocalStorage= 38 ns/operation
*RESULT TLS read-write throughput: ThreadLocalStorage= 12489.306031710348 operations/ms
*RESULT TLS read-write throughput: ThreadLocalStorage= 80 ns/operation
*RESULT TLS read throughput: ThreadLocalStorage 4 threads= 23530.8512991383 operations/ms
*RESULT TLS read throughput: ThreadLocalStorage 4 threads= 42 ns/operation
*RESULT TLS write throughput: ThreadLocalStorage 4 threads= 23237.764155284036 operations/ms
*RESULT TLS write throughput: ThreadLocalStorage 4 threads= 43 ns/operation
*RESULT TLS read-write throughput: ThreadLocalStorage 4 threads= 11322.797093211531 operations/ms
*RESULT TLS read-write throughput: ThreadLocalStorage 4 threads= 88 ns/operation
[ OK ] ThreadLocalStoragePerfTest.ThreadLocalStorage (3324 ms)
[ RUN ] ThreadLocalStoragePerfTest.PlatformFls
*RESULT TLS read throughput: PlatformFls= 153862.72367793455 operations/ms
*RESULT TLS read throughput: PlatformFls= 6 ns/operation
*RESULT TLS write throughput: PlatformFls= 121104.96167027965 operations/ms
*RESULT TLS write throughput: PlatformFls= 8 ns/operation
*RESULT TLS read-write throughput: PlatformFls= 64830.662310046166 operations/ms
*RESULT TLS read-write throughput: PlatformFls= 15 ns/operation
*RESULT TLS read throughput: PlatformFls 4 threads= 137059.52495168653 operations/ms
*RESULT TLS read throughput: PlatformFls 4 threads= 7 ns/operation
*RESULT TLS write throughput: PlatformFls 4 threads= 104171.00712529689 operations/ms
*RESULT TLS write throughput: PlatformFls 4 threads= 9 ns/operation
*RESULT TLS read-write throughput: PlatformFls 4 threads= 56775.2777730465 operations/ms
*RESULT TLS read-write throughput: PlatformFls 4 threads= 17 ns/operation
[ OK ] ThreadLocalStoragePerfTest.PlatformFls (668 ms)
[ RUN ] ThreadLocalStoragePerfTest.PlatformTls
*RESULT TLS read throughput: PlatformTls= 298124.7950392034 operations/ms
*RESULT TLS read throughput: PlatformTls= 3 ns/operation
*RESULT TLS write throughput: PlatformTls= 176619.1561136721 operations/ms
*RESULT TLS write throughput: PlatformTls= 5 ns/operation
*RESULT TLS read-write throughput: PlatformTls= 128373.0005905158 operations/ms
*RESULT TLS read-write throughput: PlatformTls= 7 ns/operation
*RESULT TLS read throughput: PlatformTls 4 threads= 264228.71637689584 operations/ms
*RESULT TLS read throughput: PlatformTls 4 threads= 3 ns/operation
*RESULT TLS write throughput: PlatformTls 4 threads= 169497.2710939354 operations/ms
*RESULT TLS write throughput: PlatformTls 4 threads= 5 ns/operation
*RESULT TLS read-write throughput: PlatformTls 4 threads= 115937.99635954692 operations/ms
*RESULT TLS read-write throughput: PlatformTls 4 threads= 8 ns/operation
[ OK ] ThreadLocalStoragePerfTest.PlatformTls (371 ms)
[ RUN ] ThreadLocalStoragePerfTest.Cpp11Tls
*RESULT TLS read throughput: C++ thread_local TLS= 359828.7215285524 operations/ms
*RESULT TLS read throughput: C++ thread_local TLS= 2 ns/operation
*RESULT TLS write throughput: C++ thread_local TLS= 357193.8848406915 operations/ms
*RESULT TLS write throughput: C++ thread_local TLS= 2 ns/operation
*RESULT TLS read-write throughput: C++ thread_local TLS= 226911.7313365101 operations/ms
*RESULT TLS read-write throughput: C++ thread_local TLS= 4 ns/operation
*RESULT TLS read throughput: C++ thread_local TLS 4 threads= 325595.0249080194 operations/ms
*RESULT TLS read throughput: C++ thread_local TLS 4 threads= 3 ns/operation
*RESULT TLS write throughput: C++ thread_local TLS 4 threads= 325870.8899534005 operations/ms
*RESULT TLS write throughput: C++ thread_local TLS 4 threads= 3 ns/operation
*RESULT TLS read-write throughput: C++ thread_local TLS 4 threads= 206577.42521897206 operations/ms
*RESULT TLS read-write throughput: C++ thread_local TLS 4 threads= 4 ns/operation
[ OK ] ThreadLocalStoragePerfTest.Cpp11Tls (225 ms)
[----------] 4 tests from ThreadLocalStoragePerfTest (4590 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (4597 ms total)
[ PASSED ] 4 tests.
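As a rough illustration of what a "read-write throughput" figure for `thread_local` measures, the following is a minimal single-threaded sketch. It is not the Chromium `base_perftests` code; `tlsCounter` and `measureThreadLocalReadWriteNs` are hypothetical names, and timings from such a loop vary with compiler flags and TLS access model.

```cpp
#include <chrono>
#include <cstdint>

// Per-thread counter. volatile keeps the compiler from collapsing the loop
// into a single addition, so every iteration really reads and writes TLS.
thread_local volatile std::uint64_t tlsCounter = 0;

// Performs `iterations` read-modify-write operations on the thread_local
// counter and returns the average cost in nanoseconds per operation.
double measureThreadLocalReadWriteNs(std::uint64_t iterations)
{
    if (iterations == 0)
        return 0.0;

    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i)
        tlsCounter = tlsCounter + 1; // one TLS read followed by one TLS write
    auto end = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::nano> elapsed = end - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

A caller would typically run something like `measureThreadLocalReadWriteNs(100000000)` several times and compare medians, since a single run can be an outlier, as the anomaly noted in this comment suggests.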
Fujii Hironori
Comment 8 2019-12-02 01:53:21 PST
Yusuke Suzuki
Comment 9 2019-12-02 19:49:14 PST
(In reply to Fujii Hironori from comment #8)
> graph: https://ibb.co/ZKTHn08

Nice. One question: TLS typically has several modes. In UNIX-like environments, we have initial-exec, local-exec, local-dynamic, and general-dynamic. The mode is selected based on how the file is compiled and linked (in the main object file? in a .so file? etc.). Can you check whether the measured TLS’s mode is the same as WebKit’s mode?
Fujii Hironori
Comment 10 2019-12-04 02:31:55 PST
Oh, I didn't know that. I'm reading the following materials:

Thread Local Storage, part 8: Wrap-up « Nynaeve
http://www.nynaeve.net/?p=190

Consequences of using variables declared __declspec(thread) | The Old New Thing
https://devblogs.microsoft.com/oldnewthing/20101122-00/?p=12233

ELF Handling For Thread-Local Storage - Ulrich Drepper
https://akkadia.org/drepper/tls.pdf

/GA (Optimize for Windows Application) | Microsoft Docs
https://docs.microsoft.com/en-us/cpp/build/reference/ga-optimize-for-windows-application?view=vs-2019

base_perftests.exe does not seem to be built with the /GA switch.
Fujii Hironori
Comment 11 2019-12-11 22:08:14 PST
Created attachment 385477 [details] benchmark of ThreadSpecific
Fujii Hironori
Comment 12 2019-12-11 22:09:21 PST
Created attachment 385478 [details] benchmark result of comment#2 patch
Fujii Hironori
Comment 13 2019-12-11 22:10:15 PST
It's faster.
Fujii Hironori
Comment 14 2019-12-13 04:06:03 PST
Windows doesn't support dllexport of thread_local variables, so a thread_local variable can't be accessed from another DLL.

Compiler Error C2492 | Microsoft Docs
https://docs.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2492?redirectedfrom=MSDN&view=vs-2019

Windows uses the init-exec model by default, and the local-exec model if the /GA switch is specified and the module is an EXE.

without /GA
00007FF775A61324 mov ecx,dword ptr [_tls_index (07FF775A68150h)]
00007FF775A6132A mov rax,qword ptr gs:[58h]
00007FF775A61333 mov edx,104h
00007FF775A61338 mov rax,qword ptr [rax+rcx*8]

with /GA
00007FF6D9431324 mov rax,qword ptr gs:[58h]
00007FF6D943132D mov edx,104h
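One common way to live with the dllexport restriction is to keep the `thread_local` variable private to the module that defines it and export an accessor function instead. The sketch below illustrates that pattern only; it is not taken from the landed patch, and `SKETCH_EXPORT`, `perThreadValueRef`, and `valueSeenByNewThread` are hypothetical names.

```cpp
#include <thread>

// Hypothetical export macro for this sketch; real projects use their own.
#if defined(_WIN32)
#define SKETCH_EXPORT __declspec(dllexport)
#else
#define SKETCH_EXPORT
#endif

namespace {
// The thread_local variable itself is never exported (MSVC rejects that with C2492).
thread_local int perThreadValue = 0;
}

// Exported accessor: callers in other modules reach the calling thread's
// slot through this function rather than through the variable.
SKETCH_EXPORT int& perThreadValueRef()
{
    return perThreadValue;
}

// Returns the value a freshly created thread observes, showing that writes
// made on the calling thread are not visible to other threads.
int valueSeenByNewThread()
{
    int seen = -1;
    std::thread worker([&] { seen = perThreadValueRef(); });
    worker.join();
    return seen;
}
```

The trade-off is that callers outside the defining module pay for a function call on each access, so the inlined TLS access sequences shown in the disassembly apply only inside the module that owns the variable.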
Fujii Hironori
Comment 15 2019-12-16 04:05:45 PST
Yusuke Suzuki
Comment 16 2019-12-18 14:11:55 PST
Comment on attachment 385748 [details] Patch r=me
Fujii Hironori
Comment 17 2019-12-18 18:03:07 PST
Comment on attachment 385748 [details] Patch Clearing flags on attachment: 385748 Committed r253730: <https://trac.webkit.org/changeset/253730>
Fujii Hironori
Comment 18 2019-12-18 18:03:11 PST
All reviewed patches have been landed. Closing bug.
Radar WebKit Bug Importer
Comment 19 2019-12-18 18:04:26 PST
Yusuke Suzuki
Comment 20 2019-12-18 18:38:21 PST