Bug 204192 - [WinCairo] dead lock in Thread::destructTLS
Summary: [WinCairo] dead lock in Thread::destructTLS
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Template Framework (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Fujii Hironori
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-11-13 23:35 PST by Fujii Hironori
Modified: 2020-04-24 00:04 PDT (History)
13 users (show)

See Also:


Attachments
screenshot of debugger (249.46 KB, image/png)
2019-11-13 23:56 PST, Fujii Hironori
no flags Details
screenshot of debugger (WebKitTestRunner.exe) (221.74 KB, image/png)
2019-11-14 01:43 PST, Fujii Hironori
no flags Details
screenshot of debugger (in WordLock::lockSlow) (279.38 KB, image/png)
2019-11-14 19:06 PST, Fujii Hironori
no flags Details
Patch (1.58 KB, patch)
2019-11-14 23:06 PST, Fujii Hironori
ysuzuki: review+
ews-watchlist: commit-queue-
Details | Formatted Diff | Diff
Patch to use std::mutex (685 bytes, patch)
2019-11-15 01:28 PST, Fujii Hironori
no flags Details | Formatted Diff | Diff
Patch to use Mutex and ThreadCondition in WordLock (1.68 KB, patch)
2019-11-15 02:37 PST, Fujii Hironori
no flags Details | Formatted Diff | Diff
screenshot of debugger (with the patch to use Mutex and ThreadCondition in WordLock) (252.26 KB, image/png)
2019-11-15 02:38 PST, Fujii Hironori
no flags Details
Archive of layout-test-results from ews214 for win-future (14.83 MB, application/zip)
2019-11-15 02:43 PST, EWS Watchlist
no flags Details
a test program to test my hypothesis (900 bytes, patch)
2019-11-17 21:23 PST, Fujii Hironori
no flags Details | Formatted Diff | Diff
screenshot of debugger (326.99 KB, image/png)
2019-11-18 03:54 PST, Fujii Hironori
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Fujii Hironori 2019-11-13 23:35:38 PST
[WinCairo] dead lock in Thread::destructTLS

By running run-webkit-tests, sometimes WebKitNetworkProcess.exe process remains alive.

> python ./Tools/Scripts/run-webkit-tests --wincairo --debug  --no-new-test-results --no-retry-failures --no-show-results -vf fast

By attaching a debugger, I can get the callstack. There is only the main thread.

Callstack:

> [External Code]	
> WTF.dll!WTF::WordLock::lockSlow() Line 148	C++
> WTF.dll!WTF::WordLock::lock() Line 62	C++
> WTF.dll!WTF::Locker<WTF::WordLock>::lock() Line 114	C++
> WTF.dll!WTF::Locker<WTF::WordLock>::Locker(WTF::WordLock & lockable) Line 55	C++
> WTF.dll!WTF::holdLock<WTF::WordLock>(WTF::WordLock & lock) Line 127	C++
> WTF.dll!WTF::Thread::destructTLS(void * data) Line 343	C++
> WTF.dll!WTF::Thread::destructTLS::ThreadExitCallback::~ThreadExitCallback() Line 334	C++
> WTF.dll!??__Fcallback@?1??destructTLS@Thread@WTF@@KAXPEAX@Z@YAXXZ() Line 324	C++
> [External Code]
Comment 1 Fujii Hironori 2019-11-13 23:56:51 PST
Created attachment 383546 [details]
screenshot of debugger
Comment 2 Fujii Hironori 2019-11-14 01:43:36 PST
Created attachment 383549 [details]
screenshot of debugger (WebKitTestRunner.exe)

I observed a case that WebKitTestRunner.exe didn't die.
Comment 3 Fujii Hironori 2019-11-14 19:06:12 PST
Created attachment 383593 [details]
screenshot of debugger (in WordLock::lockSlow)

Waited in the parking, but no other threads were alive.
Comment 4 Fujii Hironori 2019-11-14 23:06:01 PST
Created attachment 383602 [details]
Patch
Comment 5 Yusuke Suzuki 2019-11-15 00:12:06 PST
Comment on attachment 383602 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=383602&action=review

> Source/WTF/wtf/win/ThreadingWin.cpp:268
> +static Mutex threadMapMutex;

What happens if you use std::mutex? I'm guessing this is the issue.
Comment 6 Fujii Hironori 2019-11-15 01:28:36 PST
Created attachment 383607 [details]
Patch to use std::mutex

I confirmed using std::mutex also seems solving the issue.
Comment 7 Fujii Hironori 2019-11-15 01:40:21 PST
How about the idea implementing ThreadSpecific in WTF::Thread?
It'd be easy to ensure deref-ing WTF::Thread in TLS after destructing other ThreadSpecific.
We won't need to defer deref-ing WTF::Thread in TLS.
Comment 8 Yusuke Suzuki 2019-11-15 01:54:15 PST
(In reply to Fujii Hironori from comment #7)
> How about the idea implementing ThreadSpecific in WTF::Thread?
> It'd be easy to ensure deref-ing WTF::Thread in TLS after destructing other
> ThreadSpecific.
> We won't need to defer deref-ing WTF::Thread in TLS.

For Windows, I think it is OK. Windows TLS / FLS has very weird behavior on destructor calls, reviving the TLS entry when destroying the other TLS etc. We are doing weird workaround for now. Maybe, implementing TLS in WTF::Thread would improve stability, and it would be improving performance too (b/c we are using FLS in Windows, and IIRC, it is slow).

For non-Windows (like, Darwin), I think the performance problem needs to be solved.
And in Darwin / Linux pthread TLS, it seems that it behaves as we want. So I think we can just use pthread TLS on Darwin / Linux.
Comment 9 Yusuke Suzuki 2019-11-15 01:55:56 PST
(In reply to Yusuke Suzuki from comment #8)
> 
> For Windows, I think it is OK. Windows TLS / FLS has very weird behavior on
> destructor calls, reviving the TLS entry when destroying the other TLS etc.
> We are doing weird workaround for now. Maybe, implementing TLS in
> WTF::Thread would improve stability, and it would be improving performance
> too (b/c we are using FLS in Windows, and IIRC, it is slow).
> 
> For non-Windows (like, Darwin), I think the performance problem needs to be
> solved.
> And in Darwin / Linux pthread TLS, it seems that it behaves as we want. So I
> think we can just use pthread TLS on Darwin / Linux.

IIRC, we saw some performance problem in our WTF::Lock / JSLock in Windows due to slowness of Windows FLS.
Comment 10 Yusuke Suzuki 2019-11-15 02:00:48 PST
(In reply to Fujii Hironori from comment #6)
> Created attachment 383607 [details]
> Patch to use std::mutex
> 
> I confirmed using std::mutex also seems solving the issue.

This sounds very strange to me. WTF::WordLock is intentionally avoiding dependency to TLS to make it usable when destroying threads. Internally, it only uses stack, atomics, std::mutex, and std::condition_variable. If using std::mutex solved the problem, my guess is that root cause of this problem is not in WordLock, rather in std::condition_variable in Windows, which is used in WordLock.

Given that, can you try replacing std::mutex in WordLock to Mutex (and functions), and replacing std::condition_variable in WordLock to ThreadCondition (and functions)? If it solves the problem, I think this is much better solution.
Comment 11 Yusuke Suzuki 2019-11-15 02:06:30 PST
Now, my guess is that using std::condition_variable / ThreadCondition after destroying some TLS is the problem.
Comment 12 Yusuke Suzuki 2019-11-15 02:08:05 PST
(In reply to Yusuke Suzuki from comment #11)
> Now, my guess is that using std::condition_variable / ThreadCondition after
> destroying some TLS is the problem.

Can we know what implementation in Windows STL std::mutex / std::condition_variable is used? They recently released their STL code: https://github.com/microsoft/STL and it includes three implementations, vista, win7, and other one. If win7 version is used, I think the behavior is completely same between std::mutex and Mutex, std::condition_variable and ThreadCondition.
Comment 13 Yusuke Suzuki 2019-11-15 02:11:51 PST
(In reply to Yusuke Suzuki from comment #12)
> (In reply to Yusuke Suzuki from comment #11)
> > Now, my guess is that using std::condition_variable / ThreadCondition after
> > destroying some TLS is the problem.
> 
> Can we know what implementation in Windows STL std::mutex /
> std::condition_variable is used? They recently released their STL code:
> https://github.com/microsoft/STL and it includes three implementations,
> vista, win7, and other one. If win7 version is used, I think the behavior is
> completely same between std::mutex and Mutex, std::condition_variable and
> ThreadCondition.

But if CONCRT implementation is used, maybe it would be different. And if std::mutex / Mutex, and std::condition_variable / ThreadCondition implementations are different, it would be possible that using std::condition_variable when destroying TLS is the problem. Anyway, if we can fix the problem by replacing std::mutex to Mutex and replacing std::condition_variable to ThreadCondition in WordLock, it would be very nice since we can just continue using WordLock when such a weird edge case happens.

And I think implementing TLS in WTF::Thread for Windows is nice idea.
Comment 14 Fujii Hironori 2019-11-15 02:37:19 PST
Created attachment 383608 [details]
Patch to use Mutex and ThreadCondition in WordLock

Unfortunately, using Mutex and ThreadCondition in WordLock doesn't solve the issue.
Comment 15 Fujii Hironori 2019-11-15 02:38:11 PST
Created attachment 383609 [details]
screenshot of debugger (with the patch to use Mutex and ThreadCondition in WordLock)
Comment 16 Yusuke Suzuki 2019-11-15 02:41:53 PST
(In reply to Fujii Hironori from comment #15)
> Created attachment 383609 [details]
> screenshot of debugger (with the patch to use Mutex and ThreadCondition in
> WordLock)

Interesting. I think, this means that std::condition_variable is the root cause of this problem.
Comment 17 EWS Watchlist 2019-11-15 02:43:31 PST
Comment on attachment 383602 [details]
Patch

Attachment 383602 [details] did not pass win-ews (win):
Output: https://webkit-queues.webkit.org/results/13257920

New failing tests:
animations/no-style-recalc-during-accelerated-animation.html
Comment 18 EWS Watchlist 2019-11-15 02:43:34 PST
Created attachment 383610 [details]
Archive of layout-test-results from ews214 for win-future

The attached test failures were seen while running run-webkit-tests on the win-ews.
Bot: ews214  Port: win-future  Platform: CYGWIN_NT-10.0-17763-3.0.5-338.x86_64-x86_64-64bit
Comment 19 Yusuke Suzuki 2019-11-15 02:46:01 PST
Comment on attachment 383602 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=383602&action=review

r=me

>> Source/WTF/wtf/win/ThreadingWin.cpp:268
>> +static Mutex threadMapMutex;
> 
> What happens if you use std::mutex? I'm guessing this is the issue.

Can you add a comment that describes why we are using Mutex instead of WordLock here?
Comment 20 Fujii Hironori 2019-11-15 03:16:19 PST
It seems using stl_condition_variable_win7.

> msvcp140.dll!Concurrency::details::stl_condition_variable_win7::wait(Concurrency::details::stl_critical_section_interface * lock) Line 180	C++
> msvcp140.dll!do_wait(_Cnd_internal_imp_t * cond, _Mtx_internal_imp_t * mtx, const xtime * target) Line 58	C++
> WTF.dll!std::condition_variable::wait(std::unique_lock<std::mutex> & _Lck) Line 603	C++
> WTF.dll!WTF::WordLock::lockSlow() Line 149	C++
> WTF.dll!WTF::WordLock::lock() Line 62	C++
> WTF.dll!WTF::`anonymous namespace'::enqueue<WTF::`anonymous-namespace'::ThreadData * <lambda>(void)>(const void * address, const WTF::ParkingLot::parkConditionallyImpl::__l2::WTF::`anonymous-namespace'::ThreadData * <lambda>(void) & functor) Line 492	C++
> WTF.dll!WTF::ParkingLot::parkConditionallyImpl(const void * address, const WTF::ScopedLambda<bool __cdecl(void)> & validation, const WTF::ScopedLambda<void __cdecl(void)> & beforeSleep, const WTF::TimeWithDynamicClockType & timeout) Line 578	C++
> WTF.dll!WTF::ParkingLot::parkConditionally<bool <lambda>(void),void <lambda>(void)>(const void * address, const WTF::ParkingLot::compareAndPark::__l2::bool <lambda>(void) & validation, const WTF::ParkingLot::compareAndPark::__l2::void <lambda>(void) & beforeSleep, const WTF::TimeWithDynamicClockType & timeout) Line 82	C++
> WTF.dll!WTF::ParkingLot::compareAndPark<unsigned char,unsigned char>(const WTF::Atomic<unsigned char> * address, unsigned char expected) Line 94	C++
> WTF.dll!WTF::LockAlgorithm<unsigned char,1,2,WTF::EmptyLockHooks<unsigned char>>::lockSlow(WTF::Atomic<unsigned char> & lock) Line 83	C++
> WTF.dll!WTF::Lock::lockSlow() Line 41	C++
> WTF.dll!WTF::Lock::lock() Line 60	C++
> WTF.dll!WTF::Locker<WTF::Lock>::lock() Line 114	C++
> WTF.dll!WTF::Locker<WTF::Lock>::Locker<WTF::Lock>(WTF::Lock & lockable) Line 55	C++
> WTF.dll!WTF::holdLock<WTF::Lock>(WTF::Lock & lock) Line 127	C++
> WTF.dll!WTF::WorkQueue::dispatch(WTF::Function<void __cdecl(void)> && function) Line 104	C++
> WebKit2.dll!IPC::Connection::invokeReadEventHandler() Line 233	C++
> WebKit2.dll!IPC::Connection::open::__l2::<lambda>() Line 253	C++
> WebKit2.dll!WTF::Detail::CallableWrapper<void <lambda>(void),void>::call() Line 52	C++
> WebKit2.dll!WTF::Function<void __cdecl(void)>::operator()() Line 80	C++
> WebKit2.dll!IPC::Connection::EventListener::callback(void * data, unsigned char timerOrWaitFired) Line 338	C++
> [External Code]
Comment 21 Fujii Hironori 2019-11-15 03:19:54 PST
Is it possible to kill a thread which is locking the mutex?
Comment 22 Yusuke Suzuki 2019-11-15 11:56:26 PST
(In reply to Fujii Hironori from comment #21)
> Is it possible to kill a thread which is locking the mutex?

Not sure.
Comment 23 Fujii Hironori 2019-11-15 16:03:36 PST
This is my wild guess:

1. A detached thread is looking the WordLock
2. the main thread is going to exit
3. The detached thread is killed while holding the lock
4. Windows reclaims locked mutexes the killed thread was holding, but the WordLock's atomic is not a OS-managed mutex.
5. ~ThreadExitCallback is executed in the main thread.
Comment 24 Yusuke Suzuki 2019-11-15 16:40:04 PST
(In reply to Fujii Hironori from comment #23)
> This is my wild guess:
> 
> 1. A detached thread is looking the WordLock
> 2. the main thread is going to exit
> 3. The detached thread is killed while holding the lock
> 4. Windows reclaims locked mutexes the killed thread was holding, but the
> WordLock's atomic is not a OS-managed mutex.
> 5. ~ThreadExitCallback is executed in the main thread.

Interesting. Several questions,

1. Can the main thread run after the system kills other threads?
2. Is stucking thread the main thread?

If your guess is true, I think the best way is checking `isMainThread()` status in Thread::destructTLS to avoid execution. Wrap `Thread::destructTLS` with a function, which checks `isMainThread()`. (And rename the current destructTLS to destructTLSImpl etc.).
Comment 25 Yusuke Suzuki 2019-11-15 16:49:19 PST
(In reply to Yusuke Suzuki from comment #24)
> (In reply to Fujii Hironori from comment #23)
> > This is my wild guess:
> > 
> > 1. A detached thread is looking the WordLock
> > 2. the main thread is going to exit
> > 3. The detached thread is killed while holding the lock
> > 4. Windows reclaims locked mutexes the killed thread was holding, but the
> > WordLock's atomic is not a OS-managed mutex.
> > 5. ~ThreadExitCallback is executed in the main thread.
> 
> Interesting. Several questions,
> 
> 1. Can the main thread run after the system kills other threads?
> 2. Is stucking thread the main thread?
> 
> If your guess is true, I think the best way is checking `isMainThread()`
> status in Thread::destructTLS to avoid execution. Wrap `Thread::destructTLS`
> with a function, which checks `isMainThread()`. (And rename the current
> destructTLS to destructTLSImpl etc.).

BTW, I really want to remove MainThread abstraction in WTF and move it to WebCore/PAL since this MainThread is meaningless if we are using JavaScriptCore.framework. I think we should invent some way to get system main thread information from Windows system (like, pthread_main_np()) instead of recording the main thread... Like,

bool isSystemMainThread()
Comment 26 Fujii Hironori 2019-11-17 21:23:46 PST
Created attachment 383730 [details]
a test program to test my hypothesis

I created a test program to test my hypothesis (Comment 23).

> TestWTF.exe --gtest_filter=WTF_Lock.Test1

The test program also stuck with the same callstack.
And, the test program exited promptly if I replace WordLock with Mutex.
Comment 27 Fujii Hironori 2019-11-17 23:54:52 PST
The old-fashioned theory on how processes exit | The Old New Thing
https://devblogs.microsoft.com/oldnewthing/20070502-00/?p=27023

Quick overview of how processes exit on Windows XP | The Old New Thing
https://devblogs.microsoft.com/oldnewthing/20070503-00/?p=27003

During process termination, the gates are now electrified | The Old New Thing
https://devblogs.microsoft.com/oldnewthing/20100122-00/?p=15193

ExitProcess function (processthreadsapi.h) - Win32 apps | Microsoft Docs
https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-exitprocess

Terminating a Process - Win32 apps | Microsoft Docs
https://docs.microsoft.com/en-us/windows/win32/procthread/terminating-a-process

DllMain entry point (Process.h) - Win32 apps | Microsoft Docs
https://docs.microsoft.com/en-us/windows/win32/dlls/dllmain

Here are my understanding: 

* TLS are destructed on DLL_PROCESS_DETACH and DLL_THREAD_DETACH
* ExitProcess terminates all other threads withouth notifying DLL_THREAD_DETACH (TLS of those thread aren't destructed)
* ExitProcess sends DLL_PROCESS_DETACH (TLS of the thread are destructed)
Comment 28 Fujii Hironori 2019-11-18 03:05:13 PST
(In reply to Fujii Hironori from comment #27)
> Here are my understanding: 

This seems wrong. Ignore it.
Comment 29 Fujii Hironori 2019-11-18 03:54:54 PST
Created attachment 383737 [details]
screenshot of debugger

I found a option to show call frames of external module in Visual Studio. This callstack looks better.
Comment 30 Fujii Hironori 2019-11-18 21:47:05 PST
(In reply to Yusuke Suzuki from comment #13)
> And I think implementing TLS in WTF::Thread for Windows is nice idea.

Filed: Bug 204341 – [Win] Implement WTF::ThreadSpecific in WTF::Thread
Comment 31 Fujii Hironori 2020-01-07 22:56:15 PST
r252687 removed threadMapMutex. I don't see any issue now. Closed.
Comment 32 Yusuke Suzuki 2020-01-08 11:23:28 PST
(In reply to Fujii Hironori from comment #31)
> r252687 removed threadMapMutex. I don't see any issue now. Closed.

That’s nice. So we could run JSC tests on Windows, right?
Comment 33 Fujii Hironori 2020-01-08 17:38:13 PST
Oh, I just meant I don't observe any other deadlock-during-shutdown issues being described in the article (Comment 27).
JSC Windows port still has the suspended thread issue (Bug 191285).