Summary: | Crash in JSC::SlotVisitor::visitChildren | ||
---|---|---|---|
Product: | WebKit | Reporter: | Michael Catanzaro <mcatanzaro> |
Component: | JavaScriptCore | Assignee: | Nobody <webkit-unassigned> |
Status: | NEW --- | ||
Severity: | Normal | CC: | asumu, bugs-noreply, caitp, fpizlo, justin_michaud, kris, mark.lam, mcatanzaro, xan.lopez, ysuzuki |
Priority: | P2 | ||
Version: | WebKit Nightly Build | ||
Hardware: | PC | ||
OS: | Linux | ||
Attachments: |
Description
Michael Catanzaro
2019-08-17 14:39:37 PDT
Hit this again today. Have you seen it when running any of WebKit tests or the whole thing? We're seeing it reproducing quite consistently in our integration (WebKit WPE 2.34.7) although the steps are somewhat awkward for now and it takes a bit of time to hit the crash, so not very useful yet. Working on steps to make it easier to reproduce. Basically it's leaving the thing alone in one of our HTML/JS apps. Any disturbance in concurrency (interaction, logs etc.) make the problem go away. (In reply to Krzysztof Konopko from comment #2) > Have you seen it when running any of WebKit tests or the whole thing? Probably saw it when running Epiphany. OK, thanks for confirming. Will be trying to chase it down. BTW, disabling concurrent GC (RipTide) also makes the problem go away so it might be something with that implementation. Created attachment 463515 [details]
HTML/JS test which reproduces the crash on a AArch64 platform (WPE)
This is a reduced test case which is a result of cutting down a web application bundled with WebPack. Therefore a lot of code does not make sense (as it's mostly WebPack's polyfill) but it still reproduces a crash in GC on a custom AArch64 platform with additional logging enabled in GC (separate attachement). It simulates events that the application was originally responding to.
Having the `window` object used as a global "storage" for extensions seems to be essential here. The crash does not reproduce when run with `jsc` (where `window` is simply replaced with an empty object `{}`).
Created attachment 463516 [details]
A temporary patch which adds/enables extra GC logging
This logging is not essential but helps with reproducing the crash more easily, ie. by subtly changing the timing in GC, it seems to trigger the crash.
Created attachment 463518 [details] core dump with concurrent GC disabled (In reply to Krzysztof Konopko from comment #4) > OK, thanks for confirming. Will be trying to chase it down. BTW, disabling > concurrent GC (RipTide) also makes the problem go away so it might be > something with that implementation. On a custom AArch64 platform I saw a crash with concurrent GC _disabled_. See the attachment. I could not reproduce the crash on RPi3 AArch64 WPE build. Also failed to reproduce on x86_64. When running Valgrind though, there are complaints similar to this one which seems to be in the same area where the crash occurs: ==23== Use of uninitialised value of size 8 ==23== at 0xF5A2FCA: JSC::WriteBarrierBase<JSC::SymbolTable, WTF::RawPtrTraits<JSC::SymbolTable> >::cell() const (WriteBarrier.h:153) ==23== by 0xF5953B8: JSC::WriteBarrierBase<JSC::SymbolTable, WTF::RawPtrTraits<JSC::SymbolTable> >::get() const (WriteBarrier.h:101) ==23== by 0x10A6A995: void JSC::SlotVisitor::append<JSC::SymbolTable, WTF::RawPtrTraits<JSC::SymbolTable> >(JSC::WriteBarrierBase<JSC::SymbolTable, WTF::RawPtrTraits<JSC::SymbolTable> > const&) (SlotVisitorInlines.h:110) ==23== by 0x10A1BCBA: void JSC::JSSymbolTableObject::visitChildrenImpl<JSC::SlotVisitor>(JSC::JSCell*, JSC::SlotVisitor&) (JSSymbolTableObject.cpp:45) ==23== by 0x10A1217E: JSC::JSSymbolTableObject::visitChildren(JSC::JSCell*, JSC::SlotVisitor&) (JSSymbolTableObject.cpp:48) ==23== by 0x109AC493: void JSC::JSLexicalEnvironment::visitChildrenImpl<JSC::SlotVisitor>(JSC::JSCell*, JSC::SlotVisitor&) (JSLexicalEnvironment.cpp:44) ==23== by 0x109A96A6: JSC::JSLexicalEnvironment::visitChildren(JSC::JSCell*, JSC::SlotVisitor&) (JSLexicalEnvironment.cpp:48) ==23== by 0x1021A8C6: JSC::MethodTable::visitChildren(JSC::JSCell*, JSC::SlotVisitor&) const (ClassInfo.h:115) ==23== by 0x1021B518: JSC::SlotVisitor::visitChildren(JSC::JSCell const*) (SlotVisitor.cpp:394) ==23== by 0x10216FD7: JSC::SlotVisitor::drain(WTF::MonotonicTime)::{lambda(JSC::MarkStackArray&)#1}::operator()(JSC::MarkStackArray&) const (SlotVisitor.cpp:504) ==23== by 0x1021A05C: WTF::IterationStatus JSC::SlotVisitor::forEachMarkStack<JSC::SlotVisitor::drain(WTF::MonotonicTime)::{lambda(JSC::MarkStackArray&)#1}>(JSC::SlotVisitor::drain(WTF::MonotonicTime)::{lambda(JSC::MarkStackArray&)#1} const&) (SlotVisitorInlines.h:184) ==23== by 0x102170EA: JSC::SlotVisitor::drain(WTF::MonotonicTime) (SlotVisitor.cpp:494) There are loads of these in GC area. Not sure if they are relevant. Most recent crashes on a custom AArch64 platform (WPE build), still not reproducible elsewhere: Thread 26 "HeapHelper" received signal SIGSEGV, Segmentation fault. [Switching to Thread 9205.9347] JSC::MarkedBlock::aboutToMark (markingVersion=3, this=0x572a9c9698db8000) at Source/JavaScriptCore/heap/MarkedBlock.h:571 571 Dependency dependency = Dependency::loadAndFence(&footer().m_markingVersion, version); (gdb) bt #0 JSC::MarkedBlock::aboutToMark (markingVersion=3, this=0x572a9c9698db8000) at Source/JavaScriptCore/heap/MarkedBlock.h:571 #1 JSC::Heap::testAndSetMarked (rawCell=0x572a9c9698dbada3, markingVersion=3) at Source/JavaScriptCore/heap/HeapInlines.h:86 #2 JSC::SlotVisitor::markAuxiliary (this=this@entry=0x7fb05cf200, base=0x572a9c9698dbada3) at Source/JavaScriptCore/heap/SlotVisitor.cpp:302 #3 0x0000007fb6c39d60 in JSC::JSObject::markAuxiliaryAndVisitOutOfLineProperties<JSC::SlotVisitor> (maxOffset=-1, structure=0x7f8c5fa8b0, butterfly=0x572a9c9698dbadab, visitor=..., this=0x7f27072de0) at /data/builds/yvos-buildroot/yvos-master/sagemcom.diw3930-atk-bolt/yvos-buildroot-build/host/aarch64-buildroot-linux-gnu/sysroot/usr/include/bits/string_fortified.h:34 #4 JSC::JSObject::visitButterflyImpl<JSC::SlotVisitor> (visitor=..., this=0x7f27072de0) at Source/JavaScriptCore/runtime/JSObject.cpp:400 #5 JSC::JSObject::visitButterfly<JSC::SlotVisitor> (visitor=..., this=0x7f27072de0) at Source/JavaScriptCore/runtime/JSObject.cpp:108 #6 JSC::JSObject::visitChildrenImpl<JSC::SlotVisitor> (visitor=..., cell=0x7f27072de0) at Source/JavaScriptCore/runtime/JSObject.cpp:423 #7 JSC::JSObject::visitChildren (cell=cell@entry=0x7f27072de0, visitor=...) at Source/JavaScriptCore/runtime/JSObject.cpp:426 #8 0x0000007fb6c3a43c in JSC::JSInternalFieldObjectImpl<2u>::visitChildrenImpl<JSC::SlotVisitor> (visitor=..., cell=0x7f27072de0) at Source/JavaScriptCore/runtime/JSInternalFieldObjectImplInlines.h:42 #9 JSC::JSInternalFieldObjectImpl<2u>::visitChildren (visitor=..., cell=0x7f27072de0) at Source/JavaScriptCore/runtime/JSInternalFieldObjectImplInlines.h:42 #10 JSC::JSPromise::visitChildrenImpl<JSC::SlotVisitor> (visitor=..., cell=0x7f27072de0) at Source/JavaScriptCore/runtime/JSPromise.cpp:75 #11 JSC::JSPromise::visitChildren (cell=0x7f27072de0, visitor=...) at Source/JavaScriptCore/runtime/JSPromise.cpp:78 #12 0x0000007fb6837af8 in JSC::MethodTable::visitChildren (visitor=..., cell=0x7f27072de0, this=<optimized out>) at Source/JavaScriptCore/runtime/ClassInfo.h:111 #13 JSC::SlotVisitor::visitChildren (cell=0x7f27072de0, this=0x7fb05cf200) at Source/JavaScriptCore/heap/SlotVisitor.cpp:396 #14 JSC::SlotVisitor::<lambda(JSC::MarkStackArray&)>::operator() (__closure=<optimized out>, stack=...) at Source/JavaScriptCore/heap/SlotVisitor.cpp:507 #15 JSC::SlotVisitor::forEachMarkStack<JSC::SlotVisitor::drain(WTF::MonotonicTime)::<lambda(JSC::MarkStackArray&)> > (func=..., this=0x7fb05cf200) at Source/JavaScriptCore/heap/SlotVisitorInlines.h:174 #16 JSC::SlotVisitor::drain (this=this@entry=0x7fb05cf200, timeout=...) at Source/JavaScriptCore/heap/SlotVisitor.cpp:497 #17 0x0000007fb6838440 in JSC::SlotVisitor::drainFromShared (this=this@entry=0x7fb05cf200, sharedDrainMode=sharedDrainMode@entry=JSC::SlotVisitor::HelperDrain, timeout=...) at Source/JavaScriptCore/heap/SlotVisitor.cpp:698 --Type <RET> for more, q to quit, c to continue without paging-- #18 0x0000007fb6807b44 in JSC::Heap::<lambda()>::operator() (__closure=0x7f7c81e0e8) at Source/JavaScriptCore/heap/Heap.cpp:1305 #19 WTF::SharedTaskFunctor<void(), JSC::Heap::runBeginPhase(JSC::GCConductor)::<lambda()> >::run(void) (this=0x7f7c81e0d8) at WTF/Headers/wtf/SharedTask.h:91 #20 0x0000007fb7158944 in WTF::ParallelHelperClient::runTask(WTF::RefPtr<WTF::SharedTask<void ()>, WTF::RawPtrTraits<WTF::SharedTask<void ()> >, WTF::DefaultRefDerefTraits<WTF::SharedTask<void ()> > > const&) (this=0x7f8c700448, task=...) at Source/WTF/wtf/ParallelHelperPool.cpp:110 #21 0x0000007fb7159a0c in WTF::ParallelHelperPool::Thread::work (this=0x7f7c8d2480) at Source/WTF/wtf/ParallelHelperPool.cpp:201 #22 0x0000007fb7138ab4 in WTF::AutomaticThread::<lambda()>::operator() (__closure=0x7f7c823c98) at Source/WTF/wtf/AutomaticThread.cpp:229 #23 WTF::Detail::CallableWrapper<WTF::AutomaticThread::start(const WTF::AbstractLocker&)::<lambda()>, void>::call(void) (this=0x7f7c823c90) at Source/WTF/wtf/Function.h:53 #24 0x0000007fb71615d4 in WTF::Function<void ()>::operator()() const (this=<synthetic pointer>) at Source/WTF/wtf/Function.h:79 #25 WTF::Thread::entryPoint (newThreadContext=0x7f7c81b190) at Source/WTF/wtf/Threading.cpp:187 #26 0x0000007fb71c6b8c in WTF::wtfThreadEntryPoint (context=<optimized out>) at Source/WTF/wtf/posix/ThreadingPOSIX.cpp:241 #27 0x0000007fb3007904 in start_thread (arg=0x7fffffda06) at pthread_create.c:479 #28 0x0000007fb3cd6cac in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78 (gdb) x/i $pc => 0x7fb6836874 <JSC::SlotVisitor::markAuxiliary(void const*)+180>: ldr w0, [x0] (gdb) p/x $x0 $1 = 0x572a9c9698dbbef0 and Core was generated by `.../libexec/wpe-webkit-1.0/WPEWebProcess 10 22'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x0000007fb68367f8 in std::__atomic_base<bool>::load (__m=std::memory_order_relaxed, this=0x221f2b9a494037e9) at .../host/aarch64-buildroot-linux-gnu/include/c++/8.4.0/bits/atomic_base.h:390 390 load(memory_order __m = memory_order_seq_cst) const noexcept [Current thread is 1 (LWP 19959)] (gdb) bt #0 0x0000007fb68367f8 in std::__atomic_base<bool>::load (__m=std::memory_order_relaxed, this=0x221f2b9a494037e9) at .../host/aarch64-buildroot-linux-gnu/include/c++/8.4.0/bits/atomic_base.h:390 #1 std::atomic<bool>::load (__m=std::memory_order_relaxed, this=0x221f2b9a494037e9) at .../host/aarch64-buildroot-linux-gnu/include/c++/8.4.0/atomic:111 #2 WTF::Atomic<bool>::load (order=std::memory_order_relaxed, this=0x221f2b9a494037e9) at WTF/Headers/wtf/Atomics.h:62 #3 JSC::PreciseAllocation::isMarked (this=0x221f2b9a494037d0) at Source/JavaScriptCore/heap/PreciseAllocation.h:87 #4 JSC::PreciseAllocation::testAndSetMarked (this=0x221f2b9a494037d0) at Source/JavaScriptCore/heap/PreciseAllocation.h:133 #5 JSC::Heap::testAndSetMarked (rawCell=0x221f2b9a49403838, markingVersion=2) at Source/JavaScriptCore/heap/HeapInlines.h:84 #6 JSC::SlotVisitor::markAuxiliary (this=this@entry=0x7fb05cf200, base=0x221f2b9a49403838) at Source/JavaScriptCore/heap/SlotVisitor.cpp:302 #7 0x0000007fb6c39c50 in JSC::JSObject::markAuxiliaryAndVisitOutOfLineProperties<JSC::SlotVisitor> (maxOffset=-1, structure=0x7f8c5fa8b0, butterfly=0x221f2b9a49403840, visitor=..., this=0x7f7c0681e0) at .../host/aarch64-buildroot-linux-gnu/sysroot/usr/include/bits/string_fortified.h:34 #8 JSC::JSObject::visitButterflyImpl<JSC::SlotVisitor> (visitor=..., this=0x7f7c0681e0) at Source/JavaScriptCore/runtime/JSObject.cpp:400 #9 JSC::JSObject::visitButterfly<JSC::SlotVisitor> (visitor=..., this=0x7f7c0681e0) at Source/JavaScriptCore/runtime/JSObject.cpp:108 #10 JSC::JSObject::visitChildrenImpl<JSC::SlotVisitor> (visitor=..., cell=0x7f7c0681e0) at Source/JavaScriptCore/runtime/JSObject.cpp:423 #11 JSC::JSObject::visitChildren (cell=cell@entry=0x7f7c0681e0, visitor=...) at Source/JavaScriptCore/runtime/JSObject.cpp:426 #12 0x0000007fb6c3a32c in JSC::JSInternalFieldObjectImpl<2u>::visitChildrenImpl<JSC::SlotVisitor> (visitor=..., cell=0x7f7c0681e0) at Source/JavaScriptCore/runtime/JSInternalFieldObjectImplInlines.h:42 #13 JSC::JSInternalFieldObjectImpl<2u>::visitChildren (visitor=..., cell=0x7f7c0681e0) at Source/JavaScriptCore/runtime/JSInternalFieldObjectImplInlines.h:42 #14 JSC::JSPromise::visitChildrenImpl<JSC::SlotVisitor> (visitor=..., cell=0x7f7c0681e0) at Source/JavaScriptCore/runtime/JSPromise.cpp:75 #15 JSC::JSPromise::visitChildren (cell=0x7f7c0681e0, visitor=...) at Source/JavaScriptCore/runtime/JSPromise.cpp:78 #16 0x0000007fb6837af8 in JSC::MethodTable::visitChildren (visitor=..., cell=0x7f7c0681e0, this=<optimized out>) at Source/JavaScriptCore/runtime/ClassInfo.h:111 #17 JSC::SlotVisitor::visitChildren (cell=0x7f7c0681e0, this=0x7fb05cf200) at Source/JavaScriptCore/heap/SlotVisitor.cpp:396 --Type <RET> for more, q to quit, c to continue without paging-- #18 JSC::SlotVisitor::<lambda(JSC::MarkStackArray&)>::operator() (__closure=<optimized out>, stack=...) at Source/JavaScriptCore/heap/SlotVisitor.cpp:507 #19 JSC::SlotVisitor::forEachMarkStack<JSC::SlotVisitor::drain(WTF::MonotonicTime)::<lambda(JSC::MarkStackArray&)> > (func=..., this=0x7fb05cf200) at Source/JavaScriptCore/heap/SlotVisitorInlines.h:174 #20 JSC::SlotVisitor::drain (this=this@entry=0x7fb05cf200, timeout=...) at Source/JavaScriptCore/heap/SlotVisitor.cpp:497 #21 0x0000007fb6838440 in JSC::SlotVisitor::drainFromShared (this=this@entry=0x7fb05cf200, sharedDrainMode=sharedDrainMode@entry=JSC::SlotVisitor::HelperDrain, timeout=...) at Source/JavaScriptCore/heap/SlotVisitor.cpp:698 #22 0x0000007fb6807b44 in JSC::Heap::<lambda()>::operator() (__closure=0x7f7c82eee0) at Source/JavaScriptCore/heap/Heap.cpp:1305 #23 WTF::SharedTaskFunctor<void(), JSC::Heap::runBeginPhase(JSC::GCConductor)::<lambda()> >::run(void) (this=0x7f7c82eed0) at WTF/Headers/wtf/SharedTask.h:91 #24 0x0000007fb7158834 in WTF::ParallelHelperClient::runTask(WTF::RefPtr<WTF::SharedTask<void ()>, WTF::RawPtrTraits<WTF::SharedTask<void ()> >, WTF::DefaultRefDerefTraits<WTF::SharedTask<void ()> > > const&) (this=0x7f8c700448, task=...) at Source/WTF/wtf/ParallelHelperPool.cpp:110 #25 0x0000007fb71598fc in WTF::ParallelHelperPool::Thread::work (this=0x7f7c8d0480) at Source/WTF/wtf/ParallelHelperPool.cpp:201 #26 0x0000007fb71389a4 in WTF::AutomaticThread::<lambda()>::operator() (__closure=0x7fb05133c8) at Source/WTF/wtf/AutomaticThread.cpp:229 #27 WTF::Detail::CallableWrapper<WTF::AutomaticThread::start(const WTF::AbstractLocker&)::<lambda()>, void>::call(void) (this=0x7fb05133c0) at Source/WTF/wtf/Function.h:53 #28 0x0000007fb71614c4 in WTF::Function<void ()>::operator()() const (this=<synthetic pointer>) at Source/WTF/wtf/Function.h:79 #29 WTF::Thread::entryPoint (newThreadContext=0x7f7c8a2aa0) at Source/WTF/wtf/Threading.cpp:187 #30 0x0000007fb71c6a7c in WTF::wtfThreadEntryPoint (context=<optimized out>) at Source/WTF/wtf/posix/ThreadingPOSIX.cpp:241 #31 0x0000007fb3007904 in start_thread (arg=0x7fffffc866) at pthread_create.c:479 #32 0x0000007fb3cd6cac in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78 Wondering if this one is related: https://bugs.webkit.org/show_bug.cgi?id=180315 Thanks. I tried reproducing this with various GC options on Apple Silicon macOS, but I cannot reproduce this issue. This indicates that this is Linux specific issue. CCing Igalia folks. (In reply to Yusuke Suzuki from comment #11) > This indicates that this is Linux specific issue. Not impossible, but since it's an uncommon race condition, I would expect it to be tough to reproduce on any platform. I see this crash every once in a while, but I never *expect* to see it. It's not common. Tips for debugging a GC related crash (like this one): 1. Does it reproduce with JSC_useGenerationalGC=0? 2. Does it reproduce with JSC_useConcurrentGC=0? These test if you have some sort of missing write barrier issue. 3. Does running with JSC_useZombieMode=1 make it reproduce more easily? Rules out incremental sweeping as a factor. Plus, helps make GC issues manifest sooner, though it may perturb the timing of the run and hide the issue. 4. Does it reproduce with a Debug build? Helps makes things easier to debug. Plus enable a lot more assertions to check invariants. 5. Does running with JSC_verifyGC=1 report any errors? Helps catch potential concurrent GC and generational GC issue and point to potential where the issue is. Note: though rare, may report a false positive. Some thoughts on your specific issue: 6. This appears to reproduce only on your "custom AArch64 platform". Is this "custom AArch64 platform" stable? Have you ruled out silicon or OS kernel bugs? From my past experience in the real world (not theoretical), I've known new CPUs (from a vendor whom I shall not name but is not Apple) to have either silicon or OS kernel configuration bugs that result in concurrency issues where the hardware itself does not enforce proper memory coherency despite the presence of the needed memory barriers. Has this been ruled out yet? 7. If you're running on custom silicon, are you also adding custom code to WebKit e.g. new types of Objects that are JSCells, or new functions that allocate and manipulate JSCells? If so, are you sure you have issued write barriers in all the needed places? One way to test this is to see if your issue still reproduces with the concurrent and generational GC disabled (see (1) and (2) above). If the reproduction stops, the next thing is to turn those back on, and start sprinting write barriers liberally in your code to see if it makes the issue goes away. If it does, gradually remove this sprinkling of write barriers, and see which one re-introduces the crash. If you've isolated it, then audit the code around there to figure out why that write barrier is needed, or not. There are also advanced techniques for debugging GC issues using JSC_verifyHeap=1 that requires writing a lot of custom code carefully: requires knowing what you are doing with GC related code, and understanding the art of bisecting bugs in time (vs in space). It's not a turn key solution for debugging such issues, but if you're the type who can dive in and reason deeply about how the system works, you can use this to help isolate the issue ... assuming it is a software issue. (In reply to Mark Lam from comment #13) > 6. This appears to reproduce only on your "custom AArch64 platform". > > Is this "custom AArch64 platform" stable? > Have you ruled out silicon or OS kernel bugs? It's definitely not an architecture-specific or hardware-specific issue, since I reported this originally, and I use x86_64. I cannot reproduce it, but it definitely happens sometimes. > 7. If you're running on custom silicon, are you also adding custom code to > WebKit e.g. new types of Objects that are JSCells, or new functions that > allocate and manipulate JSCells? I think we can discount this too. (In reply to Mark Lam from comment #13) > Tips for debugging a GC related crash (like this one): > Thanks! Very much appreciated! > 1. Does it reproduce with JSC_useGenerationalGC=0? Doesn't seem so. > 2. Does it reproduce with JSC_useConcurrentGC=0? > Less likely. Initially we had this as a work-around and believed this alleviates the issue, until I recently a crash despite `JSC_useConcurrentGC=0` being set but that was with a full-blown web app. See this comment: https://bugs.webkit.org/show_bug.cgi?id=200863#c7 > These test if you have some sort of missing write barrier issue. > Yup, there seems to be an issue with a barrier. > 3. Does running with JSC_useZombieMode=1 make it reproduce more easily? > > Rules out incremental sweeping as a factor. > Plus, helps make GC issues manifest sooner, though it may perturb the > timing of the run and hide the issue. > The reproducibility seems to be the same, ie. it's stil quite easy to reproduce the crash on "custom AArch64 platform" with the attached example and additionall logging patch. > 4. Does it reproduce with a Debug build? > > Helps makes things easier to debug. > Plus enable a lot more assertions to check invariants. > Yes, although it's more difficult to reproduce, and haven't managed to reproduce it with the simplified example attached. It was reproducible with a bigger web app though and many other things going on. The crash looked the same. I do reproduce it with a release build with debug symbols though using the attached example. Can try again a debug build. > 5. Does running with JSC_verifyGC=1 report any errors? > > Helps catch potential concurrent GC and generational GC issue and point > to potential where the issue is. > Note: though rare, may report a false positive. > Quickly checking it, I don't see any errors, although with the amount of logging enabled I could be missing something. Will take a closer look. > Some thoughts on your specific issue: > 6. This appears to reproduce only on your "custom AArch64 platform". > > Is this "custom AArch64 platform" stable? It's supposed to be, but what platform can be considered stable these days? There's a chance it's a platform issue which I do not rule out. > Have you ruled out silicon or OS kernel bugs? > These are always possible, aren't they? > From my past experience in the real world (not theoretical), I've known > new CPUs (from a vendor whom I shall not name but is not Apple) to have > either silicon or OS kernel configuration bugs that result in concurrency > issues where the hardware itself does not enforce proper memory coherency > despite the presence of the needed memory barriers. Has this been ruled out > yet? > Not ruled out. See this comment though: https://bugs.webkit.org/show_bug.cgi?id=200863#c12 This suggests the crash is not necessarily specific to the "custom AArch64 platform" where I can reproduce it. The original report was for x86_64. > 7. If you're running on custom silicon, are you also adding custom code to > WebKit e.g. new types of Objects that are JSCells, or new functions that > allocate and manipulate JSCells? > No. > If so, are you sure you have issued write barriers in all the needed > places? > Not applicable. > One way to test this is to see if your issue still reproduces with the > concurrent and generational GC disabled (see (1) and (2) above). > (1) seems not (easily) reproducible, (2) is known to still reproduce the crash although not so easily. > If the reproduction stops, the next thing is to turn those back on, and > start sprinting write barriers liberally in your code to see if it makes the > issue goes away. > > If it does, gradually remove this sprinkling of write barriers, and see > which one re-introduces the crash. If you've isolated it, then audit the > code around there to figure out why that write barrier is needed, or not. > Not applicable, we don't have any custom code. In fact, I ruled this out with a pristine WPE release build for the "custom AArch64 platform" and no patches/modifications. We do have a custom UI process (launcher) but the crash is in the Web process. > There are also advanced techniques for debugging GC issues using > JSC_verifyHeap=1 that requires writing a lot of custom code carefully: > requires knowing what you are doing with GC related code, and understanding > the art of bisecting bugs in time (vs in space). It's not a turn key > solution for debugging such issues, but if you're the type who can dive in > and reason deeply about how the system works, you can use this to help > isolate the issue ... assuming it is a software issue. OK, will try to explore this avenue, also look into more detail for any reports from `JSC_verifyGC=1`. And what do you think of Valgrind warnings, as in this comment: https://bugs.webkit.org/show_bug.cgi?id=200863#c8 ? (In reply to Krzysztof Konopko from comment #15) > (In reply to Mark Lam from comment #13) > > 4. Does it reproduce with a Debug build? > > > > Helps makes things easier to debug. > > Plus enable a lot more assertions to check invariants. > > > > Yes, although it's more difficult to reproduce, and haven't managed to > reproduce it with the simplified example attached. It was reproducible with > a bigger web app though and many other things going on. The crash looked > the same. I take this one back. I double-checked our issue tracker and this has not been reproduced on a Debug build. The best we got is a Release build with debug symbols. Certainly in Debug builds a lot of functions are not inlined therefore stack allocation looks different and all the timings looks different (also incurred by more function calls). Personally I suspect this might be related to some object(s) (memory) being allocated on the stack and used after that bit of stack is released. Of course the use of the stack may be intended but with a missing barrier somewhere, the object (memory) goes out of scope while (still_ being processed on another thread. (In reply to Michael Catanzaro from comment #14) > (In reply to Mark Lam from comment #13) > > 6. This appears to reproduce only on your "custom AArch64 platform". > > > > Is this "custom AArch64 platform" stable? > > Have you ruled out silicon or OS kernel bugs? > > It's definitely not an architecture-specific or hardware-specific issue, > since I reported this originally, and I use x86_64. I cannot reproduce it, > but it definitely happens sometimes. This is a faulty assumption. A crash in SlotVisitor::visitChildren can mean any number of things, for example: 1. You have 1 missing write barrier. 2. You have 1000 missing different write barriers. 3. You have some random memory corruption bug. 4. Your compiler toolchain had a bug. If any of these (and some others) go wrong, you can see a crash in SlotVisitor::visitChildren. Since you initially filed this bug against x86_64 in 2019, there have been some GC bugs that have been fixed. There has also been a lot of new code that have been added, which may or may not have introduced new GC bugs. So, the fact that this manifested for you back in 2019 on x86_64 tells you nothing about whether today's manifestation of this crash is due to the same bug. Hence, when it comes to GC type crashes like this, we shouldn't make such assumptions, especially based on a crash manifestation from many years ago. (In reply to Krzysztof Konopko from comment #15) > (In reply to Mark Lam from comment #13) > > Tips for debugging a GC related crash (like this one): > > > > Thanks! Very much appreciated! > > > 1. Does it reproduce with JSC_useGenerationalGC=0? > > Doesn't seem so. This is interesting if this continues to be true. > > 2. Does it reproduce with JSC_useConcurrentGC=0? > > See this comment: https://bugs.webkit.org/show_bug.cgi?id=200863#c7 I saw, but I had to add this case too for the benefit of anyone else seeking to learn about GC debugging by reading these comments. > Yup, there seems to be an issue with a barrier. I think this is a good likely scenario ... or something that has the effect of a missing barrier. > > 3. Does running with JSC_useZombieMode=1 make it reproduce more easily? > > > > Rules out incremental sweeping as a factor. > > Plus, helps make GC issues manifest sooner, though it may perturb the > > timing of the run and hide the issue. > > > > The reproducibility seems to be the same, ie. it's stil quite easy to > reproduce the crash on "custom AArch64 platform" with the attached example > and additionall logging patch. If the reproducibility is the same, then always run with useZombieMode=1 while you're still debugging this. It can only help. > > 4. Does it reproduce with a Debug build? > > > > Helps makes things easier to debug. > > Plus enable a lot more assertions to check invariants. > > > > Yes, although it's more difficult to reproduce, and haven't managed to > reproduce it with the simplified example attached. It was reproducible with > a bigger web app though and many other things going on. The crash looked > the same. > > I do reproduce it with a release build with debug symbols though using the > attached example. Can try again a debug build. A Release build with debug symbols does not add new info. A Debug build with optimizations forced to -O3 will add info. On Mac builds, it's easy to force the build to use -O3 (see `set-webkit-configuration --force-opt=O3`). However, that mechanism to force O3 only works with Xcode builds. You should look into doing the same with your own build system and see it reproduces with the Debug build. If speed (and therefore timing) is the reason it stops reproducing, then forcing O3 should make it easy to reproduce again. Using the Debug build is interesting because as I said earlier, it will "enable a lot more assertions to check invariants". This help catch the bug earlier. > > 5. Does running with JSC_verifyGC=1 report any errors? > > > > Helps catch potential concurrent GC and generational GC issue and point > > to potential where the issue is. > > Note: though rare, may report a false positive. > > > > Quickly checking it, I don't see any errors, although with the amount of > logging enabled I could be missing something. Will take a closer look. You can run with JSC_verifyGC=1 on Release builds too. > > Some thoughts on your specific issue: > > 6. This appears to reproduce only on your "custom AArch64 platform". > > > > Is this "custom AArch64 platform" stable? > > Not ruled out. See this comment though: > https://bugs.webkit.org/show_bug.cgi?id=200863#c12 If your crash the same as #c12? There may still be a source of crashes in the code base, but this manifests rarely. #c12 can be one of those. In your case, are you just seeing the crash once in a blue moon (to the extent that it's not reproducible on demand)? Or can you always reproduce it simply by running some workload for some determined period of time? It sounded like your scenario is the latter, which implies that you have a different bug here. Also see https://bugs.webkit.org/show_bug.cgi?id=200863#c17 regarding possible root causes of this crash. The question to ask is: are you also seeing other types of inexplicable crashes in other parts of the system at about the same rate as the SlotVisitor::visitChildren crash. The answer to this tells you whether there is something else at play in the core platform below WebKit. > > There are also advanced techniques for debugging GC issues using > > JSC_verifyHeap=1 that requires writing a lot of custom code carefully: > > requires knowing what you are doing with GC related code, and understanding > > the art of bisecting bugs in time (vs in space). It's not a turn key > > solution for debugging such issues, but if you're the type who can dive in > > and reason deeply about how the system works, you can use this to help > > isolate the issue ... assuming it is a software issue. > > OK, will try to explore this avenue, also look into more detail for any > reports from `JSC_verifyGC=1`. > > And what do you think of Valgrind warnings, as in this comment: > https://bugs.webkit.org/show_bug.cgi?id=200863#c8 > ? As Justin explained in https://webkit.slack.com/archives/CU5LWFM28/p1668090516378259?thread_ts=1668089239.370399&cid=CU5LWFM28, Valgrind makes assumptions about how the code works. Valgrind is not knowledgeable about how JSC works, and JSC does a lot of advance and tricky algorithms that Valgrind has no way to know about. As a result, you'll just be looking at a lot (possibly, all) false positives. If you want to continue to use it, you're on your own on deciphering whether the reported error is real or one of the many false positives that Valgrind will report. Right now, your data suggests that there may be a missing write barrier. The other possibility is that you have a compiler bug. Are you using gcc? Does replacing it with clang change the rate of reproduction significantly? Anyway, I've suggested / implied action items above to follow up on. You can also try using JSC_verifyHeap=1 if you like, but doing so requires that it is able to detect memory corruption due to the bug. You can always just try running with JSC_verifyHeap=1 as is and see if the current default configuration will already report an error (in the form of a crash). FYI, JSC_verifyGC=1 also reports its error with a crash. (In reply to Mark Lam from comment #18) > As Justin explained in > https://webkit.slack.com/archives/CU5LWFM28/ > p1668090516378259?thread_ts=1668089239.370399&cid=CU5LWFM28, Valgrind makes > assumptions about how the code works. Valgrind is not knowledgeable about > how JSC works, and JSC does a lot of advance and tricky algorithms that > Valgrind has no way to know about. As a result, you'll just be looking at a > lot (possibly, all) false positives. If you want to continue to use it, > you're on your own on deciphering whether the reported error is real or one > of the many false positives that Valgrind will report. > > Right now, your data suggests that there may be a missing write barrier. valgrind's memcheck tool is not prone to false positives, unless you are running leak check. (The leak check likes to complain about one-time leaks that may be intentional. It may also complain if you're doing something seriously weird.) Except for leak check results, I'd be surprised if memcheck discovers anything that's not either a real bug or at least non-ideal behavior that's worth fixing. It's really not normal for valgrind to report memory errors. The most benign result I ever saw turned out to be WebKit IPC writing uninitialized padding bytes to the IPC socket, which was harmless because the WebKit process on the other end of the pipe would just ignore them, but we still fixed that to ensure that running cleanly with valgrind is possible. In this case, looking at comment #8, I see that JSC is reading the value of uninitialized memory. That does not look benign. Also, the uninitalized read in the JSC::WriteBarrier class, which is pretty concerning considering your comment above regarding the possibility that there may be a missing write barrier! :) Based on the line numbers, it's clear that WriteBarrierBase::m_cell is being read by a call to WriteBarrierBase::cell before it is initialized, so the value of the StorageType returned by that call is undefined behavior. I don't know for certain whether it's relevant to this crash, but it is being called by JSC::SlotVisitor, so... seems like a pretty decent chance this is related. At the very least, it's bad. If WriteBarrierBase::m_cell were initialized to a default value, that would avoid the complaints from valgrind. Now I have the following set: JSC_useZombieMode=1 JSC_verifyGC=1 JSC_verifyHeap=1 It's still Release with debug symbols, "custom AArch64 platform", attached example with the logging patch. And it seems to always crash in the same way (although it takes different amount of time between 10-120 seconds): #0 JSC::JSValue::isHeapBigInt (this=0x7fe6de1a90) at Source/JavaScriptCore/runtime/JSCellInlines.h:227 #1 JSC::JSValue::isBigInt (this=0x7fe6de1a90) at Source/JavaScriptCore/runtime/JSCJSValueInlines.h:675 #2 JSC::JSValue::toBigIntOrInt32 (globalObject=0x7fa0e39068, this=0x7fe6de1a90) at Source/JavaScriptCore/runtime/JSCJSValueInlines.h:868 #3 JSC::bitwiseBinaryOp<JSC::jsBitwiseAnd(JSC::JSGlobalObject*, JSC::JSValue, JSC::JSValue)::{lambda(int, int)#1}&, JSC::jsBitwiseAnd(JSC::JSGlobalObject*, JSC::JSValue, JSC::JSValue)::{lambda(JSC::JSGlobalObject*, auto:1, auto:2)#2}&>(JSC::JSGlobalObject*, JSC::JSValue, JSC::JSValu e, JSC::jsBitwiseAnd(JSC::JSGlobalObject*, JSC::JSValue, JSC::JSValue)::{lambda(int, int)#1}&, JSC::jsBitwiseAnd(JSC::JSGlobalObject*, JSC::JSValue, JSC::JSValue)::{lambda(int, int)#1}&, char const*) (errorMessage=<optimized out>, bigIntOp=..., int32Op=..., v2=..., v1=..., globalObject=0x7fa0e39068) at Source/JavaScriptCore/runtime/Operations.h:823 #4 JSC::jsBitwiseAnd (v2=..., v1=..., globalObject=0x7fa0e39068) at Source/JavaScriptCore/runtime/Operations.h:863 #5 JSC::slow_path_bitand (callFrame=0x7fe6de1b80, pc=0x7fa0ed0c9e) at Source/JavaScriptCore/runtime/CommonSlowPaths.cpp:769 #6 0x0000007f80c0d1e8 in ?? () #7 0x0000007f78eeb800 in ?? () (gdb) x/i $pc => 0x7fa79804cc <JSC::slow_path_bitand(JSC::CallFrame*, JSC::Instruction const*)+236>: ldrb w0, [x28, #5] (gdb) p/x $x28 $2 = 0xbadbeef0 (gdb) x/4xw 0x7fe6de1a90 0x7fe6de1a90: 0xbadbeef0 0x00000000 0x79000000 0x0000007f Presumabely this originates from the JIT code and crashes on the _first_ call to `JSC::slow_path_bitand()` which is made sooner or later, depending on the timing.
> valgrind's memcheck tool is not prone to false positives, unless you are
> running leak check.
How does it check? I can't find any details in the documentation. Our write barrier class is intentionally used to read memory that it did not construct itself, so it doesn't seem strange to me that it would complain.
>
> #0 JSC::JSValue::isHeapBigInt (this=0x7fe6de1a90) at
> Source/JavaScriptCore/runtime/JSCellInlines.h:227
> #1 JSC::JSValue::isBigInt (this=0x7fe6de1a90) at
> Source/JavaScriptCore/runtime/JSCJSValueInlines.h:675
> #2 JSC::JSValue::toBigIntOrInt32 (globalObject=0x7fa0e39068,
> this=0x7fe6de1a90) at Source/JavaScriptCore/runtime/JSCJSValueInlines.h:868
> #3 JSC::bitwiseBinaryOp<JSC::jsBitwiseAnd(JSC::JSGlobalObject*,
> JSC::JSValue, JSC::JSValue)::{lambda(int, int)#1}&,
> JSC::jsBitwiseAnd(JSC::JSGlobalObject*, JSC::JSValue,
Your line numbers seem to be pretty off from mine, what build is this?
(In reply to Justin Michaud from comment #21) > How does it check? I can't find any details in the documentation. Our write > barrier class is intentionally used to read memory that it did not construct > itself, so it doesn't seem strange to me that it would complain. The user-friendly documentation is here: https://valgrind.org/docs/manual/mc-manual.html#mc-manual.uninitvals If you are smarter than me and really want to know what it's doing, there are more docs here: https://valgrind.org/docs/manual/mc-manual.html#mc-manual.machine TL;DR WriteBarrierBase::cell is returning a garbage SymbolTable* pointer. (In reply to Justin Michaud from comment #22) > Your line numbers seem to be pretty off from mine, what build is this? Krzysztof reported WPE WebKit 2.34.7. It would be better to test with a newer version, but the code for 2.34.7 is here: https://trac.webkit.org/browser#webkit/releases/WPE%20WebKit/webkit-2.34.7 (Can't wait for this to move to GitHub....) (In reply to Michael Catanzaro from comment #23) > Krzysztof reported WPE WebKit 2.34.7. It would be better to test with a > newer version BTW the latest 2.38.2 is one year newer than 2.34. Really makes sense to try the latest code before reporting bugs; you never know whether it's already fixed. (In reply to Justin Michaud from comment #21) > > valgrind's memcheck tool is not prone to false positives, unless you are > > running leak check. > > How does it check? I can't find any details in the documentation. Our write > barrier class is intentionally used to read memory that it did not construct > itself, so it doesn't seem strange to me that it would complain. Justin, what you claimed here is not entirely accurate. We have conservative scanning of the stack. That's the only place that we would expect to access uninitialized memory. Everywhere else, we would expect initialized fields. In comment #c8, Krzysztof's crash trace seems to imply that JSSymbolTableObject::m_symbolTable is uninitialized. Thanks to Michael's links, I've got a refresher on how Valgrind works. So, if Valgrind is working properly, then this error message is concerning. However, JSSymbolTableObject::m_symbolTable is a WriteBarrier<SymbolTable>, and all WriteBarriers have a default constructor: ``` WriteBarrier() { this->setWithoutWriteBarrier(nullptr); } ``` ... which implies that Valgrind is wrong here. Maybe the line number on Krzysztof's crash trace is just out of date. Regardless, please do build and test with the latest WebKit. Reporting on a super old build which has not picked up all the bug fixes is not helpful. (In reply to Mark Lam from comment #25) > Maybe the line number on Krzysztof's crash trace is just out of date. > Regardless, please do build and test with the latest WebKit. Reporting on a > super old build which has not picked up all the bug fixes is not helpful. Crash call stacks are indeed from WPE 2.34.7. The latest release where we also reproduced it was 2.36.x. Will try to build the latest WebKit (WPE) for our platform and recheck. However, Valgrind complaints are from the latest mainline WebKit run on x86_64 (WPE build). Interestingly, in order to try out the latest WebKit code on our AArch64 platform, I had to switch from GCC 8.x to GCC 9.x. Just by doing that and without changing anything else, I can no longer reproduce the issue in any way, while with GCC 8.x the very same thing makes it fairly easy to reproduce the crash. It's conceivable that GCC could have a bug fixed in 9.x, yet I was not able to reproduce the crash on RPi3 AArch64 after I built the same WPE 2.34.7 with the same GCC 8.x. Also, as indicated in Comment #8, Valgrind warns about using uninitialised values originating on the stack, which comes from the latest WebKit main branch built with GCC 12.x for x86_64. So yeah, could be that GCC 9.x changes the timings but the issue remains buried even deeper, or maybe it's the issue with GCC 8.x. Yet to be found. Also Michael claimed in Comment #12 that he can "see this crash every once in a while", presumably on a more recent (?) WebKit and on x86_64 (?) (In reply to Mark Lam from comment #25) > However, JSSymbolTableObject::m_symbolTable is a WriteBarrier<SymbolTable>, > and all WriteBarriers have a default constructor: > ``` > WriteBarrier() > { > this->setWithoutWriteBarrier(nullptr); > } > ``` > ... which implies that Valgrind is wrong here. That just means that my assumption that m_cell is never initialized was wrong. A bug in WebKit is drastically more likely than a bug in valgrind. So if the m_cell is initially initialized, then it must be overwritten with uninitialized memory from someplace else later on. valgrind should be able to pinpoint where it's coming from, but not by default: you need to use --track-origins flag (which is slow!) to show that. There's a good chance that will give us the remaining info we need to fix this. Hopefully. (In reply to Krzysztof Konopko from comment #27) > Also Michael claimed in Comment #12 that he can "see this crash every once > in a while", presumably on a more recent (?) WebKit and on x86_64 (?) Unfortunately, I don't remember when the last time I saw this crash was. That suggests it's probably not an important top crasher. Honestly, if the crashes stopped after you upgraded GCC, then my primary concern here would be fixing the errors that valgrind is reporting. That seems likely to be the underlying bug here. (In reply to Michael Catanzaro from comment #28) > (In reply to Mark Lam from comment #25) > > However, JSSymbolTableObject::m_symbolTable is a WriteBarrier<SymbolTable>, > > and all WriteBarriers have a default constructor: > > ``` > > WriteBarrier() > > { > > this->setWithoutWriteBarrier(nullptr); > > } > > ``` > > ... which implies that Valgrind is wrong here. > > That just means that my assumption that m_cell is never initialized was > wrong. A bug in WebKit is drastically more likely than a bug in valgrind. So > if the m_cell is initially initialized, then it must be overwritten with > uninitialized memory from someplace else later on. valgrind should be able > to pinpoint where it's coming from, but not by default: you need to use > --track-origins flag (which is slow!) to show that. There's a good chance > that will give us the remaining info we need to fix this. Hopefully. > Well, I was running with `--track-origins` pretty much early on. I just thought I've already presented the results of this option. Apparently not, my bad! Anyway, here's another example I still have lingering in one of the terminals, with the origin pointing to something on the stack: ==116== Use of uninitialised value of size 8 ==116== at 0x313B069: JSC::MarkedBlock::aboutToMarkSlow(unsigned int) (MarkedBlock.cpp:233) ==116== by 0x1F191A1: JSC::MarkedBlock::aboutToMark(unsigned int) (MarkedBlock.h:573) ==116== by 0x315BB5A: JSC::Heap::testAndSetMarked(unsigned int, void const*) (HeapInlines.h:86) ==116== by 0x31579D6: JSC::SlotVisitor::appendJSCellOrAuxiliary(JSC::HeapCell*) (SlotVisitor.cpp:199) ==116== by 0x3157366: JSC::SlotVisitor::append(JSC::ConservativeRoots const&) (SlotVisitor.cpp:134) ==116== by 0x30B7E57: auto JSC::Heap::addCoreConstraints()::{lambda(auto:1&)#2}::operator()<JSC::SlotVisitor>(JSC::SlotVisitor&) (Heap.cpp:2788) ==116== by 0x30B8063: WTF::Detail::CallableWrapper<JSC::Heap::addCoreConstraints()::{lambda(auto:1&)#2}, void, JSC::SlotVisitor&>::call(JSC::SlotVisitor&) (Function.h:53) ==116== by 0x315CDEE: WTF::Function<void (JSC::SlotVisitor&)>::operator()(JSC::SlotVisitor&) const (Function.h:82) ==116== by 0x315C0B8: JSC::MarkingConstraintExecutorPair::execute(JSC::SlotVisitor&) (MarkingConstraintExecutorPair.h:45) ==116== by 0x315CE4A: void JSC::SimpleMarkingConstraint::executeImplImpl<JSC::SlotVisitor>(JSC::SlotVisitor&) (SimpleMarkingConstraint.cpp:48) ==116== by 0x3156D42: JSC::SimpleMarkingConstraint::executeImpl(JSC::SlotVisitor&) (SimpleMarkingConstraint.cpp:52) ==116== by 0x313E948: JSC::MarkingConstraint::execute(JSC::SlotVisitor&) (MarkingConstraint.cpp:58) ==116== Uninitialised value was created by a stack allocation ==116== at 0x32D2817: JSC::Interpreter::executeCall(JSC::JSGlobalObject*, JSC::JSObject*, JSC::CallData const&, JSC::JSValue, JSC::ArgList const&) (Interpreter.cpp:1016) I looked there but honestly couldn't figure out what could go wrong here. This is pretty recent WebKit mainline branch on x86_64. Note that it's a Valgring's limitation when it points at a function in case of the stack origin. Created attachment 463598 [details]
JS-only test
For the avoidance of doubt, attaching a JS-only test which is extracted JS code from the previously attached HTML/JS test. Note that this JS-only test does not reproduce the crash (inaccessible in JS-only env. global `window` object replaced with an empty one), yet it's still useful to run experiments, eg. with Valgrind.
Created attachment 463599 [details]
Valgrind output when running JS-only test with `jsc`
* Copy `test.js` to `./WebKitBuild/WPE/Debug/`
* Enter WebKit WPE Flatpack:
./Tools/Scripts/webkit-flatpak --wpe --debug
* Run the test with Valgrind
JSC_forceRAMSize=1G JSC_logGC=2 JSC_numberOfGCMarkers=2 \
valgrind --read-var-info=yes --track-origins=yes \
--log-file=./WebKitBuild/WPE/Debug/valgrind-jsc.log \
./WebKitBuild/WPE/Debug/bin/jsc ./WebKitBuild/WPE/Debug/test.js
`JSC_forceRAMSize=1G` is unlikely making any difference but I've been using it consistently initially hoping to make it a more representative env. of the platform where the crash was originally reproduced.
Created attachment 463600 [details]
Valgrind output when running HTML/JS test
This is Valgrind output when running the HTML/JS `test.html` attached earlier. This is run on a recent WebKit mainline on x86_64 WPE build.
* Copy `test.html` into `./WebKitBuild/WPE/Debug/`
* Run the following
WEBKIT_MINI_BROWSER_PREFIX="valgrind --trace-children=yes --trace-children-skip=WPENetworkProcess --read-var-info=yes --track-origins=yes --log-file=/app/webkit/WebKitBuild/Debug/valgrind-html.log " \
JSC_forceRAMSize=1G JSC_logGC=2 JSC_numberOfGCMarkers=2 \
./Tools/Scripts/run-minibrowser --wpe --debug -- \
--platform=headless --webprocess-failure=exit \
file:///app/webkit/WebKitBuild/Debug/test.html
Again, `JSC_forceRAMSize=1G` and other flags are unlikely making any difference. They are preserved though to make it more representative of the platform where the crash is reproduced.
(In reply to Krzysztof Konopko from comment #31) > Created attachment 463599 [details] > Valgrind output when running JS-only test with `jsc` Wow, looks like you've just found a lot of problems. :S We should endeavor to keep our valgrind output clean; otherwise, false positives may cause serious confusion when handling bug reports, e.g. see comment 10 in bug #182272 if you have permission to view it, where I incorrectly decided that a bunch of distinct security bugs were all duplicates of each other because the SUPPRESS_ASAN annotation was broken. I notice the majority of these somehow involve JSC::ConservativeRoots. Does this code intentionally make branching decisions based on uninitialized values? Because that is what is currently happening, in a bunch of different places. But it looks like a human has assessed this and decided that it is OK, because look: template<typename MarkHook> SUPPRESS_ASAN void ConservativeRoots::genericAddSpan(void* begin, void* end, MarkHook& markHook) it says SUPPRESS_ASAN here. I think we need to (a) understand why this code is OK, (b) figure out how to avoid triggering valgrind by altering behavior using RUNNING_ON_VALGRIND from bmalloc/valgrind.h. We cannot simply tell valgrind to ignore problems that go via this function like we can with asan. (OK, we *can* by using suppression files, but the design of those is pretty broken, so nobody does.) Instead, we can detect valgrind with RUNNING_ON_VALGRIND, then vary behavior somehow to avoid the normally-intentional use of uninitialized memory. But doing that requires clearly understanding what the code is doing, so I won't be much help there.... Anyway, back to the original bug: ==125== Use of uninitialised value of size 8 ==125== at 0x1EF5EB8: JSC::WriteBarrierBase<JSC::JSGlobalObject, WTF::RawPtrTraits<JSC::JSGlobalObject> >::cell() const (WriteBarrier.h:153) ==125== by 0x1EE9EC6: JSC::WriteBarrierBase<JSC::JSGlobalObject, WTF::RawPtrTraits<JSC::JSGlobalObject> >::get() const (WriteBarrier.h:101) ==125== by 0x24FA88D: void JSC::SlotVisitor::append<JSC::JSGlobalObject, WTF::RawPtrTraits<JSC::JSGlobalObject> >(JSC::WriteBarrierBase<JSC::JSGlobalObject, WTF::RawPtrTraits<JSC::JSGlobalObject> > const&) (SlotVisitorInlines.h:110) ==125== by 0x3ABFAAA: void JSC::Structure::visitChildrenImpl<JSC::SlotVisitor>(JSC::JSCell*, JSC::SlotVisitor&) (Structure.cpp:1275) ==125== by 0x3AB33B0: JSC::Structure::visitChildren(JSC::JSCell*, JSC::SlotVisitor&) (Structure.cpp:1303) ==125== by 0x315BA02: JSC::MethodTable::visitChildren(JSC::JSCell*, JSC::SlotVisitor&) const (ClassInfo.h:115) ==125== by 0x315C654: JSC::SlotVisitor::visitChildren(JSC::JSCell const*) (SlotVisitor.cpp:394) ==125== by 0x31580CB: JSC::SlotVisitor::drain(WTF::MonotonicTime)::{lambda(JSC::MarkStackArray&)#1}::operator()(JSC::MarkStackArray&) const (SlotVisitor.cpp:504) ==125== by 0x315B150: WTF::IterationStatus JSC::SlotVisitor::forEachMarkStack<JSC::SlotVisitor::drain(WTF::MonotonicTime)::{lambda(JSC::MarkStackArray&)#1}>(JSC::SlotVisitor::drain(WTF::MonotonicTime)::{lambda(JSC::MarkStackArray&)#1} const&) (SlotVisitorInlines.h:184) ==125== by 0x31581DE: JSC::SlotVisitor::drain(WTF::MonotonicTime) (SlotVisitor.cpp:494) ==125== by 0x3159373: JSC::SlotVisitor::donateAndDrain(WTF::MonotonicTime) (SlotVisitor.cpp:777) ==125== by 0x3158E73: JSC::SlotVisitor::drainInParallel(WTF::MonotonicTime) (SlotVisitor.cpp:703) ==125== Uninitialised value was created by a stack allocation ==125== at 0x386C00D: JSC::JSGlobalObject::init(JSC::VM&) (JSGlobalObject.cpp:786) I would have hoped that it would point us to the exact line where the uninitialized value is originating from. :( But knowing that it's coming from JSGlobalObject::init is still useful. That's unfortunately a pretty big function, but now we know the bug is probably *somewhere* there. (In reply to Michael Catanzaro from comment #33) > (In reply to Krzysztof Konopko from comment #31) > > Created attachment 463599 [details] > > Valgrind output when running JS-only test with `jsc` > > Wow, looks like you've just found a lot of problems. :S We should endeavor > to keep our valgrind output clean; otherwise, false positives may cause > serious confusion when handling bug reports, e.g. see comment 10 in bug > #182272 if you have permission to view it, where I incorrectly decided that > a bunch of distinct security bugs were all duplicates of each other because > the SUPPRESS_ASAN annotation was broken. > > I notice the majority of these somehow involve JSC::ConservativeRoots. Does > this code intentionally make branching decisions based on uninitialized > values? Because that is what is currently happening, in a bunch of different > places. But it looks like a human has assessed this and decided that it is > OK, because look: > > template<typename MarkHook> > SUPPRESS_ASAN > void ConservativeRoots::genericAddSpan(void* begin, void* end, MarkHook& > markHook) > > it says SUPPRESS_ASAN here. I think we need to (a) understand why this code > is OK, (b) figure out how to avoid triggering valgrind by altering behavior > using RUNNING_ON_VALGRIND from bmalloc/valgrind.h. We cannot simply tell > valgrind to ignore problems that go via this function like we can with asan. > (OK, we *can* by using suppression files, but the design of those is pretty > broken, so nobody does.) Instead, we can detect valgrind with > RUNNING_ON_VALGRIND, then vary behavior somehow to avoid the > normally-intentional use of uninitialized memory. But doing that requires > clearly understanding what the code is doing, so I won't be much help > there.... It is conservative GC stack scanning, so valgrind isn't correctly understanding it. (In reply to Yusuke Suzuki from comment #34) > (In reply to Michael Catanzaro from comment #33) > > (In reply to Krzysztof Konopko from comment #31) > > > Created attachment 463599 [details] > > > Valgrind output when running JS-only test with `jsc` > > > > Wow, looks like you've just found a lot of problems. :S We should endeavor > > to keep our valgrind output clean; otherwise, false positives may cause > > serious confusion when handling bug reports, e.g. see comment 10 in bug > > #182272 if you have permission to view it, where I incorrectly decided that > > a bunch of distinct security bugs were all duplicates of each other because > > the SUPPRESS_ASAN annotation was broken. > > > > I notice the majority of these somehow involve JSC::ConservativeRoots. Does > > this code intentionally make branching decisions based on uninitialized > > values? Because that is what is currently happening, in a bunch of different > > places. But it looks like a human has assessed this and decided that it is > > OK, because look: > > > > template<typename MarkHook> > > SUPPRESS_ASAN > > void ConservativeRoots::genericAddSpan(void* begin, void* end, MarkHook& > > markHook) > > > > it says SUPPRESS_ASAN here. I think we need to (a) understand why this code > > is OK, (b) figure out how to avoid triggering valgrind by altering behavior > > using RUNNING_ON_VALGRIND from bmalloc/valgrind.h. We cannot simply tell > > valgrind to ignore problems that go via this function like we can with asan. > > (OK, we *can* by using suppression files, but the design of those is pretty > > broken, so nobody does.) Instead, we can detect valgrind with > > RUNNING_ON_VALGRIND, then vary behavior somehow to avoid the > > normally-intentional use of uninitialized memory. But doing that requires > > clearly understanding what the code is doing, so I won't be much help > > there.... > > It is conservative GC stack scanning, so valgrind isn't correctly > understanding it. OK, so if it all can be explained as reasonable and Valgdrind warnings are a red herring, then fair enough. Nothing to see here. Well the warning I pasted at the bottom of comment #33 is not a red herring. That's almost certainly the bug that's causing your crash. (Even if the warnings from ConservativeRoots are red herrings, we still need to silence them, because without reading output from valgrind, we won't be able to fix memory corruption bugs. valgrind and asan are the only two games in town for debugging memory corruption, and asan is often not practical because it requires rebuilding so we can't tell users to try it.) (In reply to Michael Catanzaro from comment #36) > Well the warning I pasted at the bottom of comment #33 is not a red herring. > That's almost certainly the bug that's causing your crash. > Oh, missed this one. Indeed, looks like a genuine problem, unless anyone can tell otherwise? Not sure if there's anything else I can do: it's already a Debug (WPE) build of the latest WebKit... (In reply to Krzysztof Konopko from comment #37) > Oh, missed this one. Indeed, looks like a genuine problem, unless anyone > can tell otherwise? It's hard to imagine that it could possibly not be a bug. It's the same as in comment #8 except with JSC::JSGlobalObject instead of JSC::SymbolTable: the code is returning a pointer to a JSC::JSGlobalObject but the pointer is uninitialized memory, so it's garbage. At least, that's what it looks like to me. (In comment #36 I said it's "almost certainly" causing your crash, but that was silly of me: I have no clue what's causing your crash. But clearly the WriteBarrier is guarding a bogus pointer.) As for the rest of the valgrind warnings caused by the conservative GC: this might sound crazy, but I think we should just completely turn off GC when running under valgrind (unless it's possible to somehow avoid reading uninitialized memory, but I guess that's just not how the garbage collector works?) and just accept that will (likely drastically) reduce our ability to check for memory errors. The output isn't useful at all otherwise: I would never have found the WriteBarrierBase problem in that huge wall of warnings had Krzysztof not initially pointed it out separately in comment #8. |