Bug 166960 - Concurrent GC has a bug where we would detect a race but fail to rescan the object
Summary: Concurrent GC has a bug where we would detect a race but fail to rescan the o...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Local Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Saam Barati
URL:
Keywords: InRadar
Depends on:
Blocks: 165909 166944 166989
  Show dependency treegraph
 
Reported: 2017-01-11 19:01 PST by Saam Barati
Modified: 2017-01-12 19:13 PST (History)
13 users (show)

See Also:


Attachments
WIP (8.43 KB, patch)
2017-01-12 14:31 PST, Saam Barati
no flags Details | Formatted Diff | Diff
patch (12.56 KB, patch)
2017-01-12 15:14 PST, Saam Barati
fpizlo: review+
Details | Formatted Diff | Diff
patch for landing (12.28 KB, patch)
2017-01-12 16:09 PST, Saam Barati
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Saam Barati 2017-01-11 19:01:28 PST
I'm still investigating what's going on. I think I might have found it, but I'm about to run tests.
If you put the tests from bug #166707 into a directory, and run them using run-jsc-stress-tests, about a quarter of the run-jsc-stress-tests invocations will hit a crash.
Comment 1 Saam Barati 2017-01-11 19:05:59 PST
Ok, I think there is a place where we may be forgetting to writeBarrier. Testing out locally now.
Comment 2 Radar WebKit Bug Importer 2017-01-11 19:36:56 PST
<rdar://problem/29983526>
Comment 3 Saam Barati 2017-01-11 20:10:09 PST
(In reply to comment #1)
> Ok, I think there is a place where we may be forgetting to writeBarrier.
> Testing out locally now.

Ok, ran tests for over an hour with no crashes, so I think I've found it.
The bug is, we have code like this:

```
ALWAYS_INLINE PropertyOffset JSObject::prepareToPutDirectWithoutTransition(VM& vm, PropertyName propertyName, unsigned attributes, StructureID structureID, Structure* structure)
{
    unsigned oldOutOfLineCapacity = structure->outOfLineCapacity();
    PropertyOffset result;
    structure->addPropertyWithoutTransition(
        vm, propertyName, attributes,
        [&] (const GCSafeConcurrentJSLocker&, PropertyOffset offset, PropertyOffset newLastOffset) {
            unsigned newOutOfLineCapacity = Structure::outOfLineCapacity(newLastOffset);
            if (newOutOfLineCapacity != oldOutOfLineCapacity) {
                Butterfly* butterfly = allocateMoreOutOfLineStorage(vm, oldOutOfLineCapacity, newOutOfLineCapacity);
                nukeStructureAndSetButterfly(vm, structureID, butterfly);
                structure->setLastOffset(newLastOffset);
                WTF::storeStoreFence();
                setStructureIDDirectly(structureID);
            } else
                structure->setLastOffset(newLastOffset);
            result = offset;
        });
    return result;
}
```

However, we're missing a writeBarrier after re-setting the StructureID to be non-nuked.
So the code should be something like:

```
ALWAYS_INLINE PropertyOffset JSObject::prepareToPutDirectWithoutTransition(VM& vm, PropertyName propertyName, unsigned attributes, StructureID structureID, Structure* structure)
{
    unsigned oldOutOfLineCapacity = structure->outOfLineCapacity();
    PropertyOffset result;
    structure->addPropertyWithoutTransition(
        vm, propertyName, attributes,
        [&] (const GCSafeConcurrentJSLocker&, PropertyOffset offset, PropertyOffset newLastOffset) {
            unsigned newOutOfLineCapacity = Structure::outOfLineCapacity(newLastOffset);
            if (newOutOfLineCapacity != oldOutOfLineCapacity) {
                Butterfly* butterfly = allocateMoreOutOfLineStorage(vm, oldOutOfLineCapacity, newOutOfLineCapacity);
                nukeStructureAndSetButterfly(vm, structureID, butterfly);
                structure->setLastOffset(newLastOffset);
                WTF::storeStoreFence();
                setStructureIDDirectly(structureID);
            } else
                structure->setLastOffset(newLastOffset);
            result = offset;
            vm.heap.writeBarrier(this);
        });
    return result;
}
```

Not doing the writeBarrier could cause some weird issues. I believe this would be one:
(Mutator Thread: M, Collector Thread: C, assuming sequential consistency via proper barriers)

M: allocate new butterfly
M: Set nuked structure ID
M: Set butterfly (this does a barrier)
C: Start scanning O
C: load structure ID
C: See it's nuked and bail, relying on the barrier to rescan
Note, we never rescan, so the butterfly that we just allocated could not be kept alive, and could be allocated over!

I think there are probably other bugs, but I haven't worked out all the possibilities.
Comment 4 Filip Pizlo 2017-01-11 22:07:49 PST
It almost seems like setStructureIDDirectly should fire a write barrier by default, and the variant of it that we use in nukeBlahBlah should be a special one that doesn't barrier. Maybe that one can have an extra funky name like setStructureIDDirectlyWithoutWriteBarrier.
Comment 5 Michael Saboff 2017-01-11 22:13:23 PST
This looks like the same issue as my bug <rdar://problem/29779787>.
Comment 6 Saam Barati 2017-01-12 14:31:14 PST
Created attachment 298709 [details]
WIP

RFC. Going to write up a changelog.
Comment 7 Filip Pizlo 2017-01-12 14:35:44 PST
Comment on attachment 298709 [details]
WIP

View in context: https://bugs.webkit.org/attachment.cgi?id=298709&action=review

I think that this is right.

> Source/JavaScriptCore/heap/Heap.cpp:531
> +    // FIXME: maybe debug assert?

Release asserts are better.

> Source/JavaScriptCore/heap/Heap.cpp:2257
> +        "Mms", "Mutator/Race Mark Stack",

The abbreviation codes are meant to be created like this:

- Take first letter of each word on the long version.
- Capitalize the first letter.
- Downcase all other letters.

In this convention, the short name should become "Mrms".

The use of first-letter-capital means that we can print these without delimiters.  For example the GC might print "WrhCbMrms(100)", which means that it evaluated the weak reference harvester (Wrh), CodeBlock (Cb), and Mutator/Race Mark Stack (Mrms) constraints in that order.
Comment 8 Mark Lam 2017-01-12 14:42:58 PST
Comment on attachment 298709 [details]
WIP

View in context: https://bugs.webkit.org/attachment.cgi?id=298709&action=review

> Source/JavaScriptCore/heap/Heap.cpp:2257
> +        "Mms", "Mutator/Race Mark Stack",

How about "Mutator+Race" which is a more accurate description.

> Source/JavaScriptCore/runtime/Structure.cpp:798
> +    vm.heap.writeBarrier(object);

Is this still needed given that SlotVisitor::didRace() now ensures re-scanning on races?
Comment 9 Geoffrey Garen 2017-01-12 14:47:02 PST
Comment on attachment 298709 [details]
WIP

View in context: https://bugs.webkit.org/attachment.cgi?id=298709&action=review

>> Source/JavaScriptCore/runtime/Structure.cpp:798
>> +    vm.heap.writeBarrier(object);
> 
> Is this still needed given that SlotVisitor::didRace() now ensures re-scanning on races?

In general, write barriers are supposed to happen after stores rather than before. This change corrects the ordering.
Comment 10 Mark Lam 2017-01-12 14:49:36 PST
Comment on attachment 298709 [details]
WIP

View in context: https://bugs.webkit.org/attachment.cgi?id=298709&action=review

>>> Source/JavaScriptCore/runtime/Structure.cpp:798
>>> +    vm.heap.writeBarrier(object);
>> 
>> Is this still needed given that SlotVisitor::didRace() now ensures re-scanning on races?
> 
> In general, write barriers are supposed to happen after stores rather than before. This change corrects the ordering.

Yes, but I think the purpose of this writeBarrier is because we're changing the structureID, and may race with the GC.  The GC now automatically takes care this with the raceMarkStack.  Hence, there shouldn't be a need for these manually issue barriers anymore.  That is unless there's another reason we need this here (which is what I'm really asking)?
Comment 11 Saam Barati 2017-01-12 15:14:24 PST
Created attachment 298716 [details]
patch
Comment 12 Filip Pizlo 2017-01-12 15:16:22 PST
Comment on attachment 298716 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=298716&action=review

> Source/JavaScriptCore/heap/SlotVisitor.cpp:705
> +    heap()->addToRaceMarkStack(race.cell());

You should just inline addToRaceMarkStack here and make SlotVisitor a friend of Heap if it isn't already.  I bet it already is.
Comment 13 Mark Lam 2017-01-12 15:19:26 PST
Comment on attachment 298716 [details]
patch

r=me too.
Comment 14 Michael Saboff 2017-01-12 15:28:27 PST
Using the work in progress patch with a debug build I got the crash below on the 8th run.

Test:
DYLD_FRAMEWORK_PATH=WebKitBuild/Debug/ WebKitBuild/Debug/jsc  --useFTLJIT\=false --useFunctionDotArguments\=true --maxPerThreadStackUsage\=1572864 --thresholdForJITAfterWarmUp\=10 --thresholdForJITSoon\=10 --thresholdForOptimizeAfterWarmUp\=20 --thresholdForOptimizeAfterLongWarmUp\=20 --thresholdForOptimizeSoon\=20 --thresholdForFTLOptimizeAfterWarmUp\=20 --thresholdForFTLOptimizeSoon\=20 --maximumEvalCacheableSourceLength\=150000 --useEagerCodeBlockJettisonTiming\=true --collectContinuously\=true --useGenerationalGC\=false JSTests/stress/has-own-property-name-cache-string-keys.js

Backtrace:
ASSERTION FAILED: cell->cellState() == CellState::PossiblyGrey
/Volumes/Data/src/webkit/Source/JavaScriptCore/heap/SlotVisitor.cpp(287) : void JSC::SlotVisitor::appendToMarkStack(ContainerType &, JSC::JSCell *) [ContainerType = JSC::MarkedBlock]
1   0x108e6d71d WTFCrash
2   0x108cafa4a void JSC::SlotVisitor::appendToMarkStack<JSC::MarkedBlock>(JSC::MarkedBlock&, JSC::JSCell*)
3   0x108caf82d void JSC::SlotVisitor::setMarkedAndAppendToMarkStack<JSC::MarkedBlock>(JSC::MarkedBlock&, JSC::JSCell*)
4   0x108caf44a JSC::SlotVisitor::setMarkedAndAppendToMarkStack(JSC::JSCell*)
5   0x108caf4dc JSC::SlotVisitor::appendHidden(JSC::JSValue)
6   0x1088d4c25 void JSC::SlotVisitor::appendHidden<JSC::Unknown>(JSC::WriteBarrierBase<JSC::Unknown> const&)
7   0x108975ca6 JSC::JSSegmentedVariableObject::visitChildren(JSC::JSCell*, JSC::SlotVisitor&)
8   0x108cafe04 JSC::SlotVisitor::visitChildren(JSC::JSCell const*)
9   0x108cb02cb JSC::SlotVisitor::drain(WTF::MonotonicTime)
10  0x108cb0b78 JSC::SlotVisitor::drainFromShared(JSC::SlotVisitor::SharedDrainMode, WTF::MonotonicTime)
11  0x10870900e JSC::Heap::markToFixpoint(double)::$_2::operator()() const
12  0x108708c19 WTF::SharedTaskFunctor<void (), JSC::Heap::markToFixpoint(double)::$_2>::run()
13  0x108ea921d WTF::ParallelHelperClient::runTask(WTF::RefPtr<WTF::SharedTask<void ()> >)
14  0x108ea9f8f WTF::ParallelHelperPool::Thread::work()
15  0x108e7b86d WTF::AutomaticThread::start(WTF::Locker<WTF::LockBase> const&)::$_0::operator()() const
16  0x108e7b62d void std::__1::__invoke_void_return_wrapper<void>::__call<WTF::AutomaticThread::start(WTF::Locker<WTF::LockBase> const&)::$_0&>(WTF::AutomaticThread::start(WTF::Locker<WTF::LockBase> const&)::$_0&&&)
17  0x108e7b3c9 std::__1::__function::__func<WTF::AutomaticThread::start(WTF::Locker<WTF::LockBase> const&)::$_0, std::__1::allocator<WTF::AutomaticThread::start(WTF::Locker<WTF::LockBase> const&)::$_0>, void ()>::operator()()
18  0x10846586a std::__1::function<void ()>::operator()() const
19  0x108edeaa7 WTF::threadEntryPoint(void*)
20  0x108ee0471 WTF::wtfThreadEntryPoint(void*)
21  0x7fffbfc9faab _pthread_body
22  0x7fffbfc9f9f7 _pthread_body
23  0x7fffbfc9f221 thread_start
Segmentation fault: 11
Comment 15 Michael Saboff 2017-01-12 15:33:44 PST
For my crash, the main thread had this backtrace.  All other threads were waiting.

Thread 0:: Dispatch queue: com.apple.main-thread
0   com.apple.JavaScriptCore      	0x0000000108cdb991 JSC::Structure::pin(JSC::VM&, JSC::PropertyTable*) + 1 (Structure.cpp:804)
1   com.apple.JavaScriptCore      	0x0000000107a97622 int JSC::Structure::addPropertyWithoutTransition<JSC::JSObject::prepareToPutDirectWithoutTransition(JSC::VM&, JSC::PropertyName, unsigned int, unsigned int, JSC::Structure*)::'lambda'(JSC::GCSafeConcurrentJSLocker const&, int, int)>(JSC::VM&, JSC::PropertyName, unsigned int, JSC::JSObject::prepareToPutDirectWithoutTransition(JSC::VM&, JSC::PropertyName, unsigned int, unsigned int, JSC::Structure*)::'lambda'(JSC::GCSafeConcurrentJSLocker const&, int, int) const&) + 82 (StructureInlines.h:354)
2   com.apple.JavaScriptCore      	0x0000000107a9606a JSC::JSObject::prepareToPutDirectWithoutTransition(JSC::VM&, JSC::PropertyName, unsigned int, unsigned int, JSC::Structure*) + 138 (JSObjectInlines.h:192)
3   com.apple.JavaScriptCore      	0x00000001080e90c0 bool JSC::JSObject::putDirectInternal<(JSC::JSObject::PutMode)0>(JSC::VM&, JSC::PropertyName, JSC::JSValue, unsigned int, JSC::PutPropertySlot&) + 928 (JSObjectInlines.h:278)
4   com.apple.JavaScriptCore      	0x00000001080e89dc JSC::JSObject::putInline(JSC::JSCell*, JSC::ExecState*, JSC::PropertyName, JSC::JSValue, JSC::PutPropertySlot&) + 924 (JSObjectInlines.h:215)
5   com.apple.JavaScriptCore      	0x000000010893f845 JSC::JSObject::put(JSC::JSCell*, JSC::ExecState*, JSC::PropertyName, JSC::JSValue, JSC::PutPropertySlot&) + 69 (JSObject.cpp:751)
6   com.apple.JavaScriptCore      	0x00000001080e32fe JSC::JSValue::put(JSC::ExecState*, JSC::PropertyName, JSC::JSValue, JSC::PutPropertySlot&) + 206 (JSCJSValueInlines.h:867)
7   com.apple.JavaScriptCore      	0x00000001083caeb7 void JSC::DFG::operationPutByValInternal<false, false>(JSC::ExecState*, long long, long long, long long) + 807 (DFGOperations.cpp:145)
8   com.apple.JavaScriptCore      	0x00000001083cab7b operationPutByValNonStrict + 75 (DFGOperations.cpp:624)
9   ???                           	0x00003def31a03067 0 + 68097539059815
10  com.apple.JavaScriptCore      	0x0000000108a3928e vmEntryToJavaScript + 334
11  com.apple.JavaScriptCore      	0x000000010884d269 JSC::JITCode::execute(JSC::VM*, JSC::ProtoCallFrame*) + 329 (JITCode.cpp:81)
12  com.apple.JavaScriptCore      	0x0000000108805534 JSC::Interpreter::execute(JSC::ProgramExecutable*, JSC::ExecState*, JSC::JSObject*) + 4788 (Interpreter.cpp:871)
13  com.apple.JavaScriptCore      	0x00000001080ebd8d JSC::evaluate(JSC::ExecState*, JSC::SourceCode const&, JSC::JSValue, WTF::NakedPtr<JSC::Exception>&) + 669 (Completion.cpp:110)
14  jsc                           	0x000000010798a9eb runWithScripts(GlobalObject*, WTF::Vector<Script, 0ul, WTF::CrashOnOverflow, 16ul> const&, WTF::String const&, bool, bool, bool) + 2123 (jsc.cpp:2957)
15  jsc                           	0x000000010797c46a runJSC(JSC::VM*, CommandLine) + 1370 (jsc.cpp:3237)
16  jsc                           	0x000000010797afcd jscmain(int, char**) + 157 (jsc.cpp:3292)
17  jsc                           	0x000000010797af1e main + 46 (jsc.cpp:2804)
18  libdyld.dylib                 	0x00007fffbfa88255 start + 1
Comment 16 Saam Barati 2017-01-12 16:09:20 PST
Created attachment 298729 [details]
patch for landing

Discussed with Fil and other how the race could happen over IRC and why that assert is wrong:

[15:36:24]  <&pizlo>	1) mutator fires write barrier on black-unmarked object during full GC
[15:36:33]  <&pizlo>	2) mutator CASes the color of the object from black to white
[15:36:37]  <&pizlo>	3) GC marks the object
[15:36:43]  <&pizlo>	4) GC sets the object's color to grey
[15:36:54]  <&pizlo>	5) mutator realizes that the object is now marked
[15:36:59]  <&pizlo>	6) mutator sets the object's color to black
[15:37:07]  <&pizlo>	7) GC crashes because the color of the object is not grey

Also note, that it could end up being white, since the setState(PossiblyGrey) isn't CAS and can leave it being white in a not sequentially consistent world.
Comment 17 WebKit Commit Bot 2017-01-12 19:13:14 PST
Comment on attachment 298729 [details]
patch for landing

Clearing flags on attachment: 298729

Committed r210694: <http://trac.webkit.org/changeset/210694>
Comment 18 WebKit Commit Bot 2017-01-12 19:13:21 PST
All reviewed patches have been landed.  Closing bug.