Bug 141990

Summary: MachineThreads::Thread clean up has a use after free race condition
Product: WebKit Reporter: Mark Lam <mark.lam>
Component: JavaScriptCoreAssignee: Mark Lam <mark.lam>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, clopez, ddkilzer, fpizlo, ggaren, mhahnenb, mmirman, msaboff, oliver, ossy, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on: 142041    
Bug Blocks:    
Attachments:
Description Flags
patch for rolling out r179753
none
the patch.
msaboff: review+
JSC benchmark results
none
patch for rolling out r180602 and associated revs.
none
Approach 3: synchronize the construction and destruction of MachineThreads.
mark.lam: review-
patch with fix for Filip fpizlo: review+

Description Mark Lam 2015-02-24 15:54:06 PST
Before r179753 (https://bugs.webkit.org/show_bug.cgi?id=141317), there was a race condition that could result in a small window for a use after free scenario.  After r179753, there is a memory leak.  We will roll out r179753 here and implement a fix for the race condition using a different approach.
Comment 1 Mark Lam 2015-02-24 16:04:55 PST
Created attachment 247275 [details]
patch for rolling out r179753
Comment 2 Mark Lam 2015-02-24 16:05:48 PST
r179753 was rolled out in r180591: <http://trac.webkit.org/r180591>.
Comment 3 Radar WebKit Bug Importer 2015-02-24 16:07:20 PST
<rdar://problem/19945725>
Comment 4 Mark Lam 2015-02-24 17:10:09 PST
The code for _pthread_tsd_cleanup_key() looks like this:

    _pthread_tsd_cleanup_key(pthread_t self, pthread_key_t key)
    {
        void (*destructor)(void *);
        if (_pthread_key_get_destructor(key, &destructor)) {
            void **ptr = &self->tsd[key];
            void *value = *ptr;

            // At this point, this thread has cached the destructor and value (which is a MachineThreads*).
            // If the VM gets destructed (along with its MachineThreads registry) by another thread,
            // then this thread will have no way of knowing that the MachineThreads* is now pointing
            // to freed memory.  Calling the destructor below will therefore result in a use after free
            // scenario when it tries to access the MachineThreads' data members.

            if (value) {
                *ptr = NULL;
                if (destructor) {
                   destructor(value);
                }
            }
        }
    }

The solution is simply to change MachineThreads from a per VM thread registry to a process global singleton thread registry i.e. the MachineThreads registry is now immortal and we cannot have a use after free scenario.

The cost of this change is that all VM instances will scan stacks of all threads ever touched by a VM, and not just those that touched this specific VM.  However, stacks tend to be shallow.  Hence, those additional scans will tend to be cheap.  Secondly, it is not common for there to be multiple JSC VMs in use concurrently on multiple threads.  Hence, this cost should rarely manifest in real world applications.
Comment 5 Mark Lam 2015-02-24 17:48:30 PST
Created attachment 247286 [details]
the patch.
Comment 6 Mark Lam 2015-02-24 17:49:39 PST
Created attachment 247287 [details]
JSC benchmark results

A benchmark run shows perf to be neutral with this change.  Results attached.
Comment 7 Michael Saboff 2015-02-24 18:15:11 PST
Comment on attachment 247286 [details]
the patch.

r=me.  Be sure to run the worker tests.
Comment 8 Mark Lam 2015-02-24 18:33:39 PST
My local run of the layout test and JSC regression tests all passed without regressions, and I had landed the patch just before I saw the EWS bot complaints.  I will re-run the layout test with -2 to see if there's an issue.  If so, I'll rollout.

For the record, landed in r180602: <http://trac.webkit.org/r180602>.
Comment 9 Mark Lam 2015-02-24 18:54:49 PST
I don't see any regressions with the wk2 run of the layout tests either.  I'll consider this bug resolved.
Comment 10 Alexey Proskuryakov 2015-02-24 21:58:39 PST
Build fixes: 

http://trac.webkit.org/changeset/180608
http://trac.webkit.org/changeset/180613
http://trac.webkit.org/changeset/180617

And it looks like this broke ASan tests. This is probably a false positive, but we need to find a way to fix this ASAP.

    #0 0x10849737b in wrap_memcpy (/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/6.2.0/lib/darwin/libclang_rt.asan_osx_dynamic.dylib+0x3837b)
    #1 0x112bccb11 in _ZN3JSC14MachineThreads23tryCopyOtherThreadStackEPNS0_6ThreadEPvmPm (/Volumes/Data/slave/yosemite-asan-production-wk2-tests/build/buildToTest/JavaScriptCore.framework/Versions/A/JavaScriptCore+0x9c9b11)
    #2 0x112bcced1 in _ZN3JSC14MachineThreads24tryCopyOtherThreadStacksERN3WTF6LockerINS1_5MutexEEEPvmPm (/Volumes/Data/slave/yosemite-asan-production-wk2-tests/build/buildToTest/JavaScriptCore.framework/Versions/A/JavaScriptCore+0x9c9ed1)
    #3 0x112bcd1c5 in _ZN3JSC14MachineThreads23gatherConservativeRootsERNS_17ConservativeRootsERNS_17JITStubRoutineSetERNS_12CodeBlockSetEPvRA37_i (/Volumes/Data/slave/yosemite-asan-production-wk2-tests/build/buildToTest/JavaScriptCore.framework/Versions/A/JavaScriptCore+0x9ca1c5)
    #4 0x112960236 in _ZN3JSC4Heap9markRootsEd (/Volumes/Data/slave/yosemite-asan-production-wk2-tests/build/buildToTest/JavaScriptCore.framework/Versions/A/JavaScriptCore+0x75d236)
    #5 0x112962185 in _ZN3JSC4Heap7collectENS_13HeapOperationE (/Volumes/Data/slave/yosemite-asan-production-wk2-tests/build/buildToTest/JavaScriptCore.framework/Versions/A/JavaScriptCore+0x75f185)
Comment 11 Mark Lam 2015-02-24 22:38:39 PST
(In reply to comment #10)
> And it looks like this broke ASan tests. This is probably a false positive,
> but we need to find a way to fix this ASAP.

I’m looking into it.
Comment 12 David Kilzer (:ddkilzer) 2015-02-25 07:29:25 PST
Is ASan just complaining about use of overlapping memcpy?  That is safe to ignore on Darwin platforms because the undefined behavior is well-defined to do the same thing as memmove.

However, this will affect other platforms where overlapping memcpy behavior is NOT well-defined (perhaps Windows and Linux), so we should still fix this.
Comment 13 David Kilzer (:ddkilzer) 2015-02-25 07:31:51 PST
(In reply to comment #12)
> Is ASan just complaining about use of overlapping memcpy?  That is safe to
> ignore on Darwin platforms because the undefined behavior is well-defined to
> do the same thing as memmove.
> 
> However, this will affect other platforms where overlapping memcpy behavior
> is NOT well-defined (perhaps Windows and Linux), so we should still fix this.

Nope, that's not the issue.  Perhaps the ASan test bots are not using the blacklist?

  Tools/asan/webkit-asan-ignore.txt

This particular issue was supposedly fixed here:

ASan does not like JSC::MachineThreads::tryCopyOtherThreadStack
https://bugs.webkit.org/show_bug.cgi?id=141672
Comment 14 Mark Lam 2015-02-25 08:04:20 PST
Strange, I can't reproduce this ASan issue with a local build at all.
Comment 15 Alexey Proskuryakov 2015-02-25 10:26:42 PST
The blacklist would only help if the violation was detected in tryCopyOtherThreadStack, but it is detected in memcpy. As far as I can tell, there is simply no way to use memcpy on potentially poisoned regions, like the whole stack.
Comment 16 Mark Lam 2015-02-25 16:34:22 PST
Alexey is right that ASan was not happy with JSC memcpy'ing the stack to be GC scanned.  Strictly speaking, this is the issue reported in https://webkit.org/b/141672.  I've fixed it by introducing our own asanUnsafeMemcpy() to be used in tryCopyOtherThreadStack(), and telling ASan to ignore asanUnsafeMemcpy().

The fix was landed in r180649: <http://trac.webkit.org/r180649> for https://webkit.org/b/141672.

There is no further issue.  Closing this bug.
Comment 17 Carlos Alberto Lopez Perez 2015-02-26 04:31:05 PST
(In reply to comment #8)
> For the record, landed in r180602: <http://trac.webkit.org/r180602>.

(In reply to comment #12)
> However, this will affect other platforms where overlapping memcpy behavior
> is NOT well-defined (perhaps Windows and Linux), so we should still fix this.

This seems to have cause a regression on both GTK+ and EFL. Check bug 142041
Comment 18 Mark Lam 2015-02-26 08:53:16 PST
(In reply to comment #17)
> (In reply to comment #8)
> > For the record, landed in r180602: <http://trac.webkit.org/r180602>.
> 
> (In reply to comment #12)
> > However, this will affect other platforms where overlapping memcpy behavior
> > is NOT well-defined (perhaps Windows and Linux), so we should still fix this.
> 
> This seems to have cause a regression on both GTK+ and EFL. Check bug 142041

There may be a regression, but it should be about overlapping memory.  That is because tryCopyOtherThreadStack() doesn’t memcpy to overlapping memory.
Comment 19 Mark Lam 2015-02-26 09:05:36 PST
(In reply to comment #18)
> There may be a regression, but it should be about overlapping memory.  That
> is because tryCopyOtherThreadStack() doesn’t memcpy to overlapping memory.

typo: … it should *not* be about overlapping memory.
Comment 20 Filip Pizlo 2015-02-26 10:05:04 PST
Am I understanding right that after this fix, every GC will scan every thread ever known to any VM?

I realize that workers suck and all, but this seems like we're punishing them a bit too much.  Moreover, JSC benchmark results aren't going to tell us much here; to my knowledge none of our benchmarks exercise workers.  We could run the *real* pdfjs (not the Octane benchmark but the actual app/framework), which uses workers, to see what the effects of this are.
Comment 21 Mark Lam 2015-02-26 10:54:03 PST
Created attachment 247432 [details]
patch for rolling out r180602 and associated revs.
Comment 22 Mark Lam 2015-02-26 11:38:46 PST
Re-opening to use a different solution that won’t add any perf impact when worker threads are present.
Comment 23 Mark Lam 2015-02-26 11:45:17 PST
r180602 (and associated revs) rolled out in r180690: <http://trac.webkit.org/r180690>.
Comment 24 Mark Lam 2015-02-26 14:47:34 PST
Created attachment 247448 [details]
Approach 3: synchronize the construction and destruction of MachineThreads.
Comment 25 Mark Lam 2015-02-26 14:55:12 PST
Comment on attachment 247448 [details]
Approach 3: synchronize the construction and destruction of MachineThreads.

Filip pointed out there's an ABA problem.  Will fix.
Comment 26 Mark Lam 2015-02-26 15:18:01 PST
Created attachment 247451 [details]
patch with fix for Filip
Comment 27 Mark Lam 2015-02-26 17:27:52 PST
EWS bots are happy build for all ports.  A local run of the fast/workers tests on a GTK build does not show the regressions from the previous implementation.

Landed in r180716: <http://trac.webkit.org/r180716>.
Comment 28 Csaba Osztrogonác 2015-02-27 03:37:16 PST
(In reply to comment #27)
> EWS bots are happy build for all ports.  A local run of the fast/workers
> tests on a GTK build does not show the regressions from the previous
> implementation.
> 
> Landed in r180716: <http://trac.webkit.org/r180716>.

Thanks for the fix, everything works fine on EFL and GTK too.