[WinCairo][WK2] random crashes by heap corruption
Pavel Feldman reported:
> Unrelatedly, I was trying to build clang/ASan, but failed. We are
> seeing intermittent heap corruption in Win port running under
> HyperV. Running it with app verifier /full mode hides that issue,
> i.e. corruption never occurs. I know that VC asan for 64 bit is
> in the works, but I wonder if anyone was building asan.
> It never happens in normal setup, no matter how much I stress the
> system, only fails in HyperV. So it might have to do with limited
> cores available...
I'm observing this issue only in internal Jenkins jobs to run LayoutTest of WinCairo WK2.
I can't reproduce the crash on my PC by downloading the build artifact from the Jenkins job.
Created attachment 395142[details]
CrashLog_1eec_2020-03-31_23-48-43-898.txt
The internal Jenkins job to run WinCairo WK2 Debug LayoutTests is also showing the c0000374 crashes.
Created attachment 395660[details]
navigator-language-fr-crash-log.txt
I reproduced the c0000374 crash on my PC by limiting the number of CPU to 2 with msconfig.exe
I used WinCairo build artifact downloaded from internal Jenkins.
The crash logs were generated by the following tests.
fast/text/international/system-language/navigator-language/navigator-language-fr.html
storage/indexeddb/modern/idbdatabase-deleteobjectstore-failures.html
Steps to reproduce:
1. Limit the number of CPU to 2 with msconfig.exe
2. Reboot Windows
3. Repeat python ./Tools/Scripts/run-webkit-tests --release --no-new-test-results --no-retry-failures --wincairo -f fast/text/international/system-language/navigator-language
I tested several WinCairo WebKitTestRunner. This crash seems to have happened since the beginning of WinCairo WK2.
r246418: Crashed
r248403: Crashed
r249458: Crashed
r256008: Crashed
r258670: Crashed
r259306: Crashed
The crash can be reproduced without changing the boot parameter.
1. Start PowerShell
2. Open Task Manager
3. Right click at powershell.exe
4. Select "Set Affinity"
5. Check CPU 2 and CPU 3
6. python ./Tools/Scripts/run-webkit-tests --release --no-new-test-results --no-retry-failures --wincairo -f --iterations=10 --child-processes=2 fast/text/international/system-language/navigator-language
However, the reproduction rate is lower than Comment 12.
Created attachment 396013[details]
[Patch] Adding HeapValidate
I added HeapValidate(GetProcessHeap(), 0, nullptr) in TestController::clearIndexedDatabases before and after WKWebsiteDataStoreRemoveAllIndexedDatabases.
The first HeapValidate doesn't crash, but the second crashs.
So, WKWebsiteDataStoreRemoveAllIndexedDatabases seems the culprit.
Surprisingly, stopping calling WKWebsiteDataStoreRemoveAllIndexedDatabases doesn't solve the head corruption crashes.
After stopping calling WKWebsiteDataStoreRemoveAllIndexedDatabases, most crashes happens in WebKit::BackingStore::incorporateUpdate.
https://gist.github.com/fujii/78afed5686d6e48d8b42a2fdf9e6295e
My current hypothesis are:
1. There are several threading issues in WKWebsiteDataStoreRemoveAllIndexedDatabases and WebKit::BackingStore::incorporateUpdate and more.
2. There is a threading issue in thread primitive or fundamental part.
(In reply to Fujii Hironori from comment #15)
> 2. There is a threading issue in thread primitive or fundamental part.
This is unlikely. If so, Web process also should crash.
Created attachment 396132[details]
[Patch] Remove pendingCallbacks of WebsiteDataStore::removeData's CallbackAggregator
This patch seems to fix the heap corruption of WebsiteDataStore::removeData. But, I don't know why.
(In reply to Fujii Hironori from comment #15)
> Created attachment 396013[details]
> [Patch] Adding HeapValidate
>
> I added HeapValidate(GetProcessHeap(), 0, nullptr) in
> TestController::clearIndexedDatabases before and after
> WKWebsiteDataStoreRemoveAllIndexedDatabases.
> The first HeapValidate doesn't crash, but the second crashs.
> So, WKWebsiteDataStoreRemoveAllIndexedDatabases seems the culprit.
>
> Surprisingly, stopping calling WKWebsiteDataStoreRemoveAllIndexedDatabases
> doesn't solve the head corruption crashes.
> After stopping calling WKWebsiteDataStoreRemoveAllIndexedDatabases, most
> crashes happens in WebKit::BackingStore::incorporateUpdate.
> https://gist.github.com/fujii/78afed5686d6e48d8b42a2fdf9e6295e
>
> My current hypothesis are:
>
> 1. There are several threading issues in
> WKWebsiteDataStoreRemoveAllIndexedDatabases and
> WebKit::BackingStore::incorporateUpdate and more.
> 2. There is a threading issue in thread primitive or fundamental part.
It turned out these ideas were wrong.
The crashes happen by running the run loop.
Created attachment 396625[details]
Patch to use mimalloc
https://github.com/microsoft/mimalloc
1. Build mimalloc
2. Copy mimalloc-static.lib to WebKitLibraries/win/lib64
3. Copy mimalloc-override.h and mimalloc.h to mimalloc.hWebKitLibraries/win/include
4. Apply the patch
5. Build WinCairo
I observe no heap corruption crashes by using mimalloc.
This seems the only one reasonable workaround at the moment.
The async ReadFile operation should be canceled by using CancelIo
API in the thread called ReadFile before destructing
IPC::Connection.
However, It can't be possible because Windows port WorkQueue
doesn't ensure the same thread.
(In reply to Fujii Hironori from comment #25)
> The async ReadFile operation should be canceled by using CancelIo
> API in the thread called ReadFile before destructing
> IPC::Connection.
>
> However, It can't be possible because Windows port WorkQueue
> doesn't ensure the same thread.
To fix it, I filed : Bug 210785 – [Win] Use generic WorkQueue instead of WorkQueueWin.cpp
Bug 210785 landed at r260477.
Surprisingly, after r260477, even though CancelIo API is not used yet,
I don't observe the heap corruption crash in the internal Jenkins.
And, it can't be reproducible by using 2 CPU affinity (Comment 14).
However, I don't want to close this bug as WORKSFORME because I
believe the async ReadFile operation should be cancel by using
CancelIo API.
2020-03-31 22:07 PDT, Fujii Hironori
2020-03-31 22:07 PDT, Fujii Hironori
2020-03-31 22:07 PDT, Fujii Hironori
2020-04-01 01:16 PDT, Fujii Hironori
2020-04-01 01:16 PDT, Fujii Hironori
2020-04-01 01:16 PDT, Fujii Hironori
2020-04-07 01:21 PDT, Fujii Hironori
2020-04-07 01:21 PDT, Fujii Hironori
2020-04-09 14:45 PDT, Fujii Hironori
2020-04-10 15:46 PDT, Fujii Hironori
2020-04-12 18:59 PDT, Fujii Hironori
2020-04-12 19:04 PDT, Fujii Hironori
2020-04-15 23:18 PDT, Fujii Hironori
2020-04-16 13:06 PDT, Fujii Hironori
2020-04-16 14:12 PDT, Fujii Hironori
2020-04-19 14:46 PDT, Fujii Hironori
2020-04-19 22:12 PDT, Fujii Hironori
2020-04-22 21:04 PDT, Fujii Hironori