RESOLVED FIXED 204192
[WinCairo] dead lock in Thread::destructTLS
https://bugs.webkit.org/show_bug.cgi?id=204192
Summary [WinCairo] dead lock in Thread::destructTLS
Fujii Hironori
Reported 2019-11-13 23:35:38 PST
[WinCairo] dead lock in Thread::destructTLS By running run-webkit-tests, sometimes WebKitNetworkProcess.exe process remains alive. > python ./Tools/Scripts/run-webkit-tests --wincairo --debug --no-new-test-results --no-retry-failures --no-show-results -vf fast By attaching a debugger, I can get the callstack. There is only the main thread. Callstack: > [External Code] > WTF.dll!WTF::WordLock::lockSlow() Line 148 C++ > WTF.dll!WTF::WordLock::lock() Line 62 C++ > WTF.dll!WTF::Locker<WTF::WordLock>::lock() Line 114 C++ > WTF.dll!WTF::Locker<WTF::WordLock>::Locker(WTF::WordLock & lockable) Line 55 C++ > WTF.dll!WTF::holdLock<WTF::WordLock>(WTF::WordLock & lock) Line 127 C++ > WTF.dll!WTF::Thread::destructTLS(void * data) Line 343 C++ > WTF.dll!WTF::Thread::destructTLS::ThreadExitCallback::~ThreadExitCallback() Line 334 C++ > WTF.dll!??__Fcallback@?1??destructTLS@Thread@WTF@@KAXPEAX@Z@YAXXZ() Line 324 C++ > [External Code]
Attachments
screenshot of debugger (249.46 KB, image/png)
2019-11-13 23:56 PST, Fujii Hironori
no flags
screenshot of debugger (WebKitTestRunner.exe) (221.74 KB, image/png)
2019-11-14 01:43 PST, Fujii Hironori
no flags
screenshot of debugger (in WordLock::lockSlow) (279.38 KB, image/png)
2019-11-14 19:06 PST, Fujii Hironori
no flags
Patch (1.58 KB, patch)
2019-11-14 23:06 PST, Fujii Hironori
ysuzuki: review+
ews-watchlist: commit-queue-
Patch to use std::mutex (685 bytes, patch)
2019-11-15 01:28 PST, Fujii Hironori
no flags
Patch to use Mutex and ThreadCondition in WordLock (1.68 KB, patch)
2019-11-15 02:37 PST, Fujii Hironori
no flags
screenshot of debugger (with the patch to use Mutex and ThreadCondition in WordLock) (252.26 KB, image/png)
2019-11-15 02:38 PST, Fujii Hironori
no flags
Archive of layout-test-results from ews214 for win-future (14.83 MB, application/zip)
2019-11-15 02:43 PST, EWS Watchlist
no flags
a test program to test my hypothesis (900 bytes, patch)
2019-11-17 21:23 PST, Fujii Hironori
no flags
screenshot of debugger (326.99 KB, image/png)
2019-11-18 03:54 PST, Fujii Hironori
no flags
Fujii Hironori
Comment 1 2019-11-13 23:56:51 PST
Created attachment 383546 [details] screenshot of debugger
Fujii Hironori
Comment 2 2019-11-14 01:43:36 PST
Created attachment 383549 [details] screenshot of debugger (WebKitTestRunner.exe) I observed a case that WebKitTestRunner.exe didn't die.
Fujii Hironori
Comment 3 2019-11-14 19:06:12 PST
Created attachment 383593 [details] screenshot of debugger (in WordLock::lockSlow) Waited in the parking, but no other threads were alive.
Fujii Hironori
Comment 4 2019-11-14 23:06:01 PST
Yusuke Suzuki
Comment 5 2019-11-15 00:12:06 PST
Comment on attachment 383602 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=383602&action=review > Source/WTF/wtf/win/ThreadingWin.cpp:268 > +static Mutex threadMapMutex; What happens if you use std::mutex? I'm guessing this is the issue.
Fujii Hironori
Comment 6 2019-11-15 01:28:36 PST
Created attachment 383607 [details] Patch to use std::mutex I confirmed using std::mutex also seems solving the issue.
Fujii Hironori
Comment 7 2019-11-15 01:40:21 PST
How about the idea implementing ThreadSpecific in WTF::Thread? It'd be easy to ensure deref-ing WTF::Thread in TLS after destructing other ThreadSpecific. We won't need to defer deref-ing WTF::Thread in TLS.
Yusuke Suzuki
Comment 8 2019-11-15 01:54:15 PST
(In reply to Fujii Hironori from comment #7) > How about the idea implementing ThreadSpecific in WTF::Thread? > It'd be easy to ensure deref-ing WTF::Thread in TLS after destructing other > ThreadSpecific. > We won't need to defer deref-ing WTF::Thread in TLS. For Windows, I think it is OK. Windows TLS / FLS has very weird behavior on destructor calls, reviving the TLS entry when destroying the other TLS etc. We are doing weird workaround for now. Maybe, implementing TLS in WTF::Thread would improve stability, and it would be improving performance too (b/c we are using FLS in Windows, and IIRC, it is slow). For non-Windows (like, Darwin), I think the performance problem needs to be solved. And in Darwin / Linux pthread TLS, it seems that it behaves as we want. So I think we can just use pthread TLS on Darwin / Linux.
Yusuke Suzuki
Comment 9 2019-11-15 01:55:56 PST
(In reply to Yusuke Suzuki from comment #8) > > For Windows, I think it is OK. Windows TLS / FLS has very weird behavior on > destructor calls, reviving the TLS entry when destroying the other TLS etc. > We are doing weird workaround for now. Maybe, implementing TLS in > WTF::Thread would improve stability, and it would be improving performance > too (b/c we are using FLS in Windows, and IIRC, it is slow). > > For non-Windows (like, Darwin), I think the performance problem needs to be > solved. > And in Darwin / Linux pthread TLS, it seems that it behaves as we want. So I > think we can just use pthread TLS on Darwin / Linux. IIRC, we saw some performance problem in our WTF::Lock / JSLock in Windows due to slowness of Windows FLS.
Yusuke Suzuki
Comment 10 2019-11-15 02:00:48 PST
(In reply to Fujii Hironori from comment #6) > Created attachment 383607 [details] > Patch to use std::mutex > > I confirmed using std::mutex also seems solving the issue. This sounds very strange to me. WTF::WordLock is intentionally avoiding dependency to TLS to make it usable when destroying threads. Internally, it only uses stack, atomics, std::mutex, and std::condition_variable. If using std::mutex solved the problem, my guess is that root cause of this problem is not in WordLock, rather in std::condition_variable in Windows, which is used in WordLock. Given that, can you try replacing std::mutex in WordLock to Mutex (and functions), and replacing std::condition_variable in WordLock to ThreadCondition (and functions)? If it solves the problem, I think this is much better solution.
Yusuke Suzuki
Comment 11 2019-11-15 02:06:30 PST
Now, my guess is that using std::condition_variable / ThreadCondition after destroying some TLS is the problem.
Yusuke Suzuki
Comment 12 2019-11-15 02:08:05 PST
(In reply to Yusuke Suzuki from comment #11) > Now, my guess is that using std::condition_variable / ThreadCondition after > destroying some TLS is the problem. Can we know what implementation in Windows STL std::mutex / std::condition_variable is used? They recently released their STL code: https://github.com/microsoft/STL and it includes three implementations, vista, win7, and other one. If win7 version is used, I think the behavior is completely same between std::mutex and Mutex, std::condition_variable and ThreadCondition.
Yusuke Suzuki
Comment 13 2019-11-15 02:11:51 PST
(In reply to Yusuke Suzuki from comment #12) > (In reply to Yusuke Suzuki from comment #11) > > Now, my guess is that using std::condition_variable / ThreadCondition after > > destroying some TLS is the problem. > > Can we know what implementation in Windows STL std::mutex / > std::condition_variable is used? They recently released their STL code: > https://github.com/microsoft/STL and it includes three implementations, > vista, win7, and other one. If win7 version is used, I think the behavior is > completely same between std::mutex and Mutex, std::condition_variable and > ThreadCondition. But if CONCRT implementation is used, maybe it would be different. And if std::mutex / Mutex, and std::condition_variable / ThreadCondition implementations are different, it would be possible that using std::condition_variable when destroying TLS is the problem. Anyway, if we can fix the problem by replacing std::mutex to Mutex and replacing std::condition_variable to ThreadCondition in WordLock, it would be very nice since we can just continue using WordLock when such a weird edge case happens. And I think implementing TLS in WTF::Thread for Windows is nice idea.
Fujii Hironori
Comment 14 2019-11-15 02:37:19 PST
Created attachment 383608 [details] Patch to use Mutex and ThreadCondition in WordLock Unfortunately, using Mutex and ThreadCondition in WordLock doesn't solve the issue.
Fujii Hironori
Comment 15 2019-11-15 02:38:11 PST
Created attachment 383609 [details] screenshot of debugger (with the patch to use Mutex and ThreadCondition in WordLock)
Yusuke Suzuki
Comment 16 2019-11-15 02:41:53 PST
(In reply to Fujii Hironori from comment #15) > Created attachment 383609 [details] > screenshot of debugger (with the patch to use Mutex and ThreadCondition in > WordLock) Interesting. I think, this means that std::condition_variable is the root cause of this problem.
EWS Watchlist
Comment 17 2019-11-15 02:43:31 PST
Comment on attachment 383602 [details] Patch Attachment 383602 [details] did not pass win-ews (win): Output: https://webkit-queues.webkit.org/results/13257920 New failing tests: animations/no-style-recalc-during-accelerated-animation.html
EWS Watchlist
Comment 18 2019-11-15 02:43:34 PST
Created attachment 383610 [details] Archive of layout-test-results from ews214 for win-future The attached test failures were seen while running run-webkit-tests on the win-ews. Bot: ews214 Port: win-future Platform: CYGWIN_NT-10.0-17763-3.0.5-338.x86_64-x86_64-64bit
Yusuke Suzuki
Comment 19 2019-11-15 02:46:01 PST
Comment on attachment 383602 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=383602&action=review r=me >> Source/WTF/wtf/win/ThreadingWin.cpp:268 >> +static Mutex threadMapMutex; > > What happens if you use std::mutex? I'm guessing this is the issue. Can you add a comment that describes why we are using Mutex instead of WordLock here?
Fujii Hironori
Comment 20 2019-11-15 03:16:19 PST
It seems using stl_condition_variable_win7. > msvcp140.dll!Concurrency::details::stl_condition_variable_win7::wait(Concurrency::details::stl_critical_section_interface * lock) Line 180 C++ > msvcp140.dll!do_wait(_Cnd_internal_imp_t * cond, _Mtx_internal_imp_t * mtx, const xtime * target) Line 58 C++ > WTF.dll!std::condition_variable::wait(std::unique_lock<std::mutex> & _Lck) Line 603 C++ > WTF.dll!WTF::WordLock::lockSlow() Line 149 C++ > WTF.dll!WTF::WordLock::lock() Line 62 C++ > WTF.dll!WTF::`anonymous namespace'::enqueue<WTF::`anonymous-namespace'::ThreadData * <lambda>(void)>(const void * address, const WTF::ParkingLot::parkConditionallyImpl::__l2::WTF::`anonymous-namespace'::ThreadData * <lambda>(void) & functor) Line 492 C++ > WTF.dll!WTF::ParkingLot::parkConditionallyImpl(const void * address, const WTF::ScopedLambda<bool __cdecl(void)> & validation, const WTF::ScopedLambda<void __cdecl(void)> & beforeSleep, const WTF::TimeWithDynamicClockType & timeout) Line 578 C++ > WTF.dll!WTF::ParkingLot::parkConditionally<bool <lambda>(void),void <lambda>(void)>(const void * address, const WTF::ParkingLot::compareAndPark::__l2::bool <lambda>(void) & validation, const WTF::ParkingLot::compareAndPark::__l2::void <lambda>(void) & beforeSleep, const WTF::TimeWithDynamicClockType & timeout) Line 82 C++ > WTF.dll!WTF::ParkingLot::compareAndPark<unsigned char,unsigned char>(const WTF::Atomic<unsigned char> * address, unsigned char expected) Line 94 C++ > WTF.dll!WTF::LockAlgorithm<unsigned char,1,2,WTF::EmptyLockHooks<unsigned char>>::lockSlow(WTF::Atomic<unsigned char> & lock) Line 83 C++ > WTF.dll!WTF::Lock::lockSlow() Line 41 C++ > WTF.dll!WTF::Lock::lock() Line 60 C++ > WTF.dll!WTF::Locker<WTF::Lock>::lock() Line 114 C++ > WTF.dll!WTF::Locker<WTF::Lock>::Locker<WTF::Lock>(WTF::Lock & lockable) Line 55 C++ > WTF.dll!WTF::holdLock<WTF::Lock>(WTF::Lock & lock) Line 127 C++ > WTF.dll!WTF::WorkQueue::dispatch(WTF::Function<void __cdecl(void)> && function) Line 104 C++ > WebKit2.dll!IPC::Connection::invokeReadEventHandler() Line 233 C++ > WebKit2.dll!IPC::Connection::open::__l2::<lambda>() Line 253 C++ > WebKit2.dll!WTF::Detail::CallableWrapper<void <lambda>(void),void>::call() Line 52 C++ > WebKit2.dll!WTF::Function<void __cdecl(void)>::operator()() Line 80 C++ > WebKit2.dll!IPC::Connection::EventListener::callback(void * data, unsigned char timerOrWaitFired) Line 338 C++ > [External Code]
Fujii Hironori
Comment 21 2019-11-15 03:19:54 PST
Is it possible to kill a thread which is locking the mutex?
Yusuke Suzuki
Comment 22 2019-11-15 11:56:26 PST
(In reply to Fujii Hironori from comment #21) > Is it possible to kill a thread which is locking the mutex? Not sure.
Fujii Hironori
Comment 23 2019-11-15 16:03:36 PST
This is my wild guess: 1. A detached thread is looking the WordLock 2. the main thread is going to exit 3. The detached thread is killed while holding the lock 4. Windows reclaims locked mutexes the killed thread was holding, but the WordLock's atomic is not a OS-managed mutex. 5. ~ThreadExitCallback is executed in the main thread.
Yusuke Suzuki
Comment 24 2019-11-15 16:40:04 PST
(In reply to Fujii Hironori from comment #23) > This is my wild guess: > > 1. A detached thread is looking the WordLock > 2. the main thread is going to exit > 3. The detached thread is killed while holding the lock > 4. Windows reclaims locked mutexes the killed thread was holding, but the > WordLock's atomic is not a OS-managed mutex. > 5. ~ThreadExitCallback is executed in the main thread. Interesting. Several questions, 1. Can the main thread run after the system kills other threads? 2. Is stucking thread the main thread? If your guess is true, I think the best way is checking `isMainThread()` status in Thread::destructTLS to avoid execution. Wrap `Thread::destructTLS` with a function, which checks `isMainThread()`. (And rename the current destructTLS to destructTLSImpl etc.).
Yusuke Suzuki
Comment 25 2019-11-15 16:49:19 PST
(In reply to Yusuke Suzuki from comment #24) > (In reply to Fujii Hironori from comment #23) > > This is my wild guess: > > > > 1. A detached thread is looking the WordLock > > 2. the main thread is going to exit > > 3. The detached thread is killed while holding the lock > > 4. Windows reclaims locked mutexes the killed thread was holding, but the > > WordLock's atomic is not a OS-managed mutex. > > 5. ~ThreadExitCallback is executed in the main thread. > > Interesting. Several questions, > > 1. Can the main thread run after the system kills other threads? > 2. Is stucking thread the main thread? > > If your guess is true, I think the best way is checking `isMainThread()` > status in Thread::destructTLS to avoid execution. Wrap `Thread::destructTLS` > with a function, which checks `isMainThread()`. (And rename the current > destructTLS to destructTLSImpl etc.). BTW, I really want to remove MainThread abstraction in WTF and move it to WebCore/PAL since this MainThread is meaningless if we are using JavaScriptCore.framework. I think we should invent some way to get system main thread information from Windows system (like, pthread_main_np()) instead of recording the main thread... Like, bool isSystemMainThread()
Fujii Hironori
Comment 26 2019-11-17 21:23:46 PST
Created attachment 383730 [details] a test program to test my hypothesis I created a test program to test my hypothesis (Comment 23). > TestWTF.exe --gtest_filter=WTF_Lock.Test1 The test program also stuck with the same callstack. And, the test program exited promptly if I replace WordLock with Mutex.
Fujii Hironori
Comment 27 2019-11-17 23:54:52 PST
The old-fashioned theory on how processes exit | The Old New Thing https://devblogs.microsoft.com/oldnewthing/20070502-00/?p=27023 Quick overview of how processes exit on Windows XP | The Old New Thing https://devblogs.microsoft.com/oldnewthing/20070503-00/?p=27003 During process termination, the gates are now electrified | The Old New Thing https://devblogs.microsoft.com/oldnewthing/20100122-00/?p=15193 ExitProcess function (processthreadsapi.h) - Win32 apps | Microsoft Docs https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-exitprocess Terminating a Process - Win32 apps | Microsoft Docs https://docs.microsoft.com/en-us/windows/win32/procthread/terminating-a-process DllMain entry point (Process.h) - Win32 apps | Microsoft Docs https://docs.microsoft.com/en-us/windows/win32/dlls/dllmain Here are my understanding: * TLS are destructed on DLL_PROCESS_DETACH and DLL_THREAD_DETACH * ExitProcess terminates all other threads withouth notifying DLL_THREAD_DETACH (TLS of those thread aren't destructed) * ExitProcess sends DLL_PROCESS_DETACH (TLS of the thread are destructed)
Fujii Hironori
Comment 28 2019-11-18 03:05:13 PST
(In reply to Fujii Hironori from comment #27) > Here are my understanding: This seems wrong. Ignore it.
Fujii Hironori
Comment 29 2019-11-18 03:54:54 PST
Created attachment 383737 [details] screenshot of debugger I found a option to show call frames of external module in Visual Studio. This callstack looks better.
Fujii Hironori
Comment 30 2019-11-18 21:47:05 PST
(In reply to Yusuke Suzuki from comment #13) > And I think implementing TLS in WTF::Thread for Windows is nice idea. Filed: Bug 204341 – [Win] Implement WTF::ThreadSpecific in WTF::Thread
Fujii Hironori
Comment 31 2020-01-07 22:56:15 PST
r252687 removed threadMapMutex. I don't see any issue now. Closed.
Yusuke Suzuki
Comment 32 2020-01-08 11:23:28 PST
(In reply to Fujii Hironori from comment #31) > r252687 removed threadMapMutex. I don't see any issue now. Closed. That’s nice. So we could run JSC tests on Windows, right?
Fujii Hironori
Comment 33 2020-01-08 17:38:13 PST
Oh, I just meant I don't observe any other deadlock-during-shutdown issues being described in the article (Comment 27). JSC Windows port still has the suspended thread issue (Bug 191285).
Note You need to log in before you can comment on or make changes to this bug.