Bug 27077 - Workers + garbage collector: weird crashes
Summary: Workers + garbage collector: weird crashes
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P1 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-07-08 05:23 PDT by Zoltan Herczeg
Modified: 2009-07-21 22:31 PDT (History)
5 users (show)

See Also:


Attachments
The source code which causes the crash (174.36 KB, application/octet-stream)
2009-07-08 05:26 PDT, Zoltan Herczeg
no flags Details
The whole test (289.21 KB, application/octet-stream)
2009-07-13 06:46 PDT, Zoltan Herczeg
no flags Details
Patch for op_method_check (unlinkCallers) (3.47 KB, patch)
2009-07-16 03:22 PDT, Zoltan Herczeg
oliver: review+
Details | Formatted Diff | Diff
Cache the structure of the prototype (3.88 KB, patch)
2009-07-21 00:01 PDT, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
layout test for op_method_call bug (3.67 KB, patch)
2009-07-21 06:43 PDT, Zoltan Herczeg
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Zoltan Herczeg 2009-07-08 05:23:35 PDT
I want to run SunSpider in multiple threads to explore the benefits of multi core systems. To accomplish this task I decided to put the whole SunSpider into a Worker thread. Unfortunately, I got weird crashes here:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1278313552 (LWP 1285)]
0xb733c000 in JSC::Heap::markConservatively () from /home/hzoli/Webkit-ARM/WebKit-arm/WebKitBuild/Release/lib/libQtWebKit.so.4

I think it is related to the Garbage Collector. The crashes are random events, sometimes the script run successfully, but it is always in function markConservatively.
Comment 1 Zoltan Herczeg 2009-07-08 05:26:05 PDT
Created attachment 32449 [details]
The source code which causes the crash
Comment 2 Zoltan Herczeg 2009-07-09 04:46:54 PDT
I belive I found the multithread problem:

Collector.cpp:
static inline void* currentThreadStackBase()

There are some static variables here:
    static void* stackBase = 0;
    static size_t stackSize = 0;
    static pthread_t stackThread;

When these values are written by different threads in the same time, it cause crash.

I put a "static Mutex mutex;" to solve this problem, but perhaps a thread local static variable would the the best solution. Do WebKit support thread local variables?

After your response about the best solution I will file a patch.
Comment 3 Alexey Proskuryakov 2009-07-10 22:15:16 PDT
Why are these variables even static? Perhaps they can be changed to normal ones.
Comment 4 Zoltan Herczeg 2009-07-11 00:35:07 PDT
(In reply to comment #3)
> Why are these variables even static? Perhaps they can be changed to normal
> ones.

I presume they are intended to use for caching. If the thread does not changed, it can use the cached value in the static variable. Since webkit is designed to be single threaded, the detecting of stack base becomes faster.

Unfortunately, there are still other crashes with V8, and WindScorpion (SunSpider is ok now). Corrupted double-linked list detected by malloc, or infinite loops happen frequently. How the Workers were tested in the past? Were they added by Google without testing them on JavaScriptCore?

I hope I can find a way to make JSC thread-safe, but I would appreciate any help from you (I can attach the other benchmarks as well).
Comment 5 Alexey Proskuryakov 2009-07-11 08:33:38 PDT
> I presume they are intended to use for caching.

Yes - I was wondering if that's premature optimization, perhaps its effect is negligible.

> How the Workers were tested in the past? Were
> they added by Google without testing them on JavaScriptCore?

Workers were added by Apple, and there are some tests in fast/workers. The crash we're talking about here is due to platform-specific code that's not executed on Mac or Windows, which is why it wasn't caught earlier.
Comment 6 Zoltan Herczeg 2009-07-13 06:46:47 PDT
Created attachment 32658 [details]
The whole test

Contains all 3 tests for WebKit.
Comment 7 Zoltan Herczeg 2009-07-13 06:59:35 PDT
> Yes - I was wondering if that's premature optimization, perhaps its effect is
> negligible.

Ok, then I will file a patch which removes those static variables.

The next bug happens on Mac as well. Run v8 suite (1 worker, 2 tabs). It stops at v8-earley-boyer (usually).

What I know now:
 - With command line jsc, the benchmark does not throw any exception.
 - With full WebKit engine, a notAFunctionError is raised (because the cell points to an object).
 - when the error is raised, a toString() method is called for the object, which raises a notAFunctionError again. This happens 32 times, when something stops the process, and the evaluate() returns with an exception.

Call stack [32 times]:

#0  0xb72f641d in JSC::Interpreter::execute ()
#1  0xb734f495 in JSC::JSFunction::call ()
#2  0xb732eefa in JSC::call ()
#3  0xb7355b50 in JSC::JSObject::defaultValue ()
#4  0xb7265b2f in JSC::JSObject::toPrimitive ()
#5  0xb7353c2e in JSC::JSObject::toString ()
#6  0xb72f391e in JSC::createErrorMessage ()
#7  0xb72f4538 in JSC::createNotAFunctionError ()
#8  0xb72d02c0 in cti_op_call_NotJSFunction ()
#9  0xb597cddf in ?? ()
#10 0x00000000 in ?? ()

It seems the toString() method of that object is jitted as well. And the code is large and complex.

If you have any idea, please tell me.
Comment 8 Zoltan Herczeg 2009-07-14 06:54:11 PDT
I am still struggling with my second bug.

This is what I know now:

In v8-earley-boyer.js there is a function called sc_toDisplayString(). This function is marked by a JSWrapperObject as seen in the call stack below:

#1  0xb7385ddd in JSC::JSFunction::mark ()
#2  0xb738a795 in JSC::JSObject::mark ()            \ - the mark() is virtual
#3  0xb738eac5 in JSC::JSWrapperObject::mark ()     / - this is one object
#4  0xb738a795 in JSC::JSObject::mark ()
#5  0xb738a795 in JSC::JSObject::mark ()            \ - the mark() is virtual
#6  0xb72b4a02 in JSC::JSGlobalObject::mark ()      / - this is one object
#7  0xb73f5d8f in WebCore::JSDOMGlobalObject::mark ()
#8  0xb78f3ddd in WebCore::JSWorkerContext::mark ()
#9  0xb7385dba in JSC::JSFunction::mark ()
#10 0xb7366656 in JSC::Heap::markConservatively ()
#11 0xb7366b4f in JSC::Heap::markCurrentThreadConservativelyInternal ()
#12 0xb7366bde in JSC::Heap::markCurrentThreadConservatively ()
#13 0xb7366c01 in JSC::Heap::markStackObjectsConservatively ()
#14 0xb7366c59 in JSC::Heap::collect ()
#15 0xb736709b in JSC::Heap::allocate ()
#16 0xb7309150 in cti_op_construct_JSConstruct ()

Somehow, when I run the test in a worker thread, the 3rd object in the propertyStorage() of JSWrapperObject() - which is the JSFunction representation of toDisplayString() method - get a new value. However, its old value is cached by the JIT, and causes exception, since the cell space of the old value is reused by some new JSObject. This bug looks platform independent. Anyway, the garbage collector works perfectly, since the JSFunction is not referenced by other objects.

My qestions:

- As far as I know, the existing functions are never replaced by new ones, otherwise pointer caching becomes impossible. What happens here?

- Is it true, that only one JSGlobalObject created for all threads and this global object is accessed without mutexes?

Thanks,
Zoltan
Comment 9 Alexey Proskuryakov 2009-07-14 08:41:55 PDT
(In reply to comment #8)
> - Is it true, that only one JSGlobalObject created for all threads and this
> global object is accessed without mutexes?

No, all worker's JS objects (including the global object) are allocated in a separate JS heap.
Comment 10 Oliver Hunt 2009-07-15 05:45:17 PDT
This effects all jitted platforms, and appears to be caused by op_method_check.
Comment 11 Zoltan Herczeg 2009-07-15 06:05:02 PDT
(In reply to comment #10)
> This effects all jitted platforms, and appears to be caused by op_method_check.

I try to give a more detailed description:

In the attached tgz, there is a V8.js. The test T_v8_earley_boyer() fails with an exception. This is due to the cached JSFunction value "return o.sc_toDisplayString();" refers to an object, which has already been collected by GC. It only throws an exception, since the JSCell is reused by some JSObject.

I was a wrong track for a long time, but Oiver helped me to find the real solution: the "emitOpcode(op_method_check);" should be commented out in BytecodeGenerator.cpp, and it is working now.

Run the test:
extract the attached gzip file.
set var benchmarkSelector to 1 in index.html
run a browser: ./QtLauncher index.html

You can specify index.html multiple times, and can increase numOfWorkers in index.html as well.

See rhe results:
Numbers will appear after the test names. However, it stops in v8-early-boyer.
Note: it is not necessary happens all the time. Run it multiple times.

Actually it works with Standalone jsc for me, but you need to change postMessage()-s to print() in V8.js.
Comment 12 Zoltan Herczeg 2009-07-16 03:22:06 PDT
Created attachment 32851 [details]
Patch for op_method_check (unlinkCallers)
Comment 13 Oliver Hunt 2009-07-16 04:11:10 PDT
Comment on attachment 32851 [details]
Patch for op_method_check (unlinkCallers)

r=me, this seems to be the correct fix according to my logic.
Comment 14 Oliver Hunt 2009-07-16 04:18:26 PDT
Committed r45969
Comment 15 Gavin Barraclough 2009-07-16 21:00:16 PDT
Sorry, but reverted in r46004, unfortunately this does not appear to be a valid fix.

This change will cause the method check optimization to be invalidated upon the first JSFunction referencing a codeblock being GC'ed.  This is unnecessarily conservative, and does not appear to be addressing the real cause of the bug.  (This patch may well just be masking the issue by changing when GC occurs.)

Consider the following code fragment:

a = { b:(function() {}) };
function c() { return (function(d) { d.b(); }); }
e = c();
for (i=0; i<5; ++i) e(a);
f = c();
f = null;
gc();
e(a);

'a' is an object with a method 'b'.
'c' is a function that produces a closure, that will make a method call to 'b' its argument.
'e' is a function created by 'c';
'e' is called repeatedly, which makes a method call to 'b' on its argument, which is 'a'.  This will cause the call in the closure to be optimized to call the function that is the property 'a.b'.
A new JSFunction referencing the same closure is created and assigned to 'f', then overwritten.
The gc runs, which will reap the initial value of 'f', and when the JSFunction's destructor is run unlinkCallers will be called on the closure's CodeBlock.
With the (now reverted) patch, this will de-optimize the method call so that it no longer is optimized to call 'a.b'.
The final call 'e(a)'; will again call the method 'a.b()' (via the closure), however this time the method call will not be optimized.

The de-optimization described above is unnecessary and undesired.  The optimized method check within the closure is still perfectly valid.
Comment 16 Zoltan Herczeg 2009-07-16 23:46:06 PDT
Hi Gavin,

I am happy to hear about you again. I thought you are on Holiday. I have tracked the bug (with help of Oliver) and this is what happen:

part of sc_toDisplayString(o) byte code:

[  95] op_method_check
[  96] get_by_id         r12, r-9, sc_toDisplayString(@id1)
[ 104] mov               r13, r-9
[ 107] call              r12, r12, 1, 22

The op_method_check caches a JSFunction value, which is used by the op_call later. However, the check in op_method_check only depends on the structure. The old JSFunction itself is freed by gc. However, a new JSFunction is created, which has the _same_ structure info (and I think this is expected), so op_method_check gives back a wrong cached value.

Probably I need to create a small example which crashes. I try to do it now.

Oliver also mentioned, that this conservative fix might kill the benefit of op_method_check, but look, an optimization which sometimes crashes cannot be called as an optimization. We wouldn't want a browser which can be killed by malicious code.

By the way, could you take a look at the ARM port :)
Comment 17 Zoltan Herczeg 2009-07-17 06:41:30 PDT
Ok, I have contined to work on this bug, again.

This is what I found so far:

The structure, which is checked by op_method_check, is JSGlobalData->stringStructure. v8-earley-boyer adds two methods to it:

String.prototype.sc_toDisplayString = function() { ... }
String.prototype.sc_toWriteString = function() { ... }

Both time StringObject::put is called. The prototype object is changed by:
setStructure(Structure::despecifyFunctionTransition(m_structure, propertyName));
in JSObject.h:445

Structure::despecifyFunctionTransition (Structure.cpp:495) allocates memory for the prototype by a create(...) function. Unfortunately, the previously freed memory is reused by malloc.

By pointers:

String.prototype was 0x0830f740. After String.prototype.sc_toDisplayString = function() { ... }, it is changed to 0x08449578. Followed by String.prototype.sc_toWriteString = function() { ... }, it was set to 0x0830f740, because malloc reused the memory space. Of course the pointer "meaning" is changed, but its physical value is not.

Gavin, you were right (again). The bug was related to prototype caching, since prototypes are also memory objects, which can be reused after free().

What should I do now?
Comment 18 Zoltan Herczeg 2009-07-17 12:08:19 PDT
I had an idea. Can we protect the prototype with a ref() / deref() pair as the structure is protected?
Comment 19 Oliver Hunt 2009-07-18 17:50:19 PDT
(In reply to comment #18)
> I had an idea. Can we protect the prototype with a ref() / deref() pair as the
> structure is protected?

The prototype should not be changed  concurrently with execution as all js in a single context group runs in a single thread.
Comment 20 Zoltan Herczeg 2009-07-20 02:52:49 PDT
Hi Gavin,

my idea is following:

JIT::patchMethodCallProto() (JITPropertyAccess.cpp) stores 4 pointers in the code:

- structure
- proto
- proto->structure()
- callee (the cached value)

However, only structure is protected by a ref() call. I think we should also protect the proto->structure() as well. Probably this would be enough to eliminate the allocation problem. We had a little chat with Oliver today, and he said I should ask you about your opinion about this solution. (And we also wondered whether the "proto" should be protected)
Comment 21 Gavin Barraclough 2009-07-20 13:42:24 PDT
Gah, yes! - that's definitely a bug, we should be ref'ing/deref'ing the prototype structure in the same way we do the structure.

There should be no need to explicitly protect the prototype object (it's existence is implied by the structure check matching).  If the prototype object has been freed & GC'ed then no objects can exist with this structure, so the structure check must fail.  We cannot get a false positive from the structure check through the structure being freed and a new structure happening to be allocated in the same location, since the codeblock preserved the structure.

In the case of the callee, we also don't need protect the object, but for slightly subtler reasons.  The prototype structure implies the prototype has a property of a given name with a specific value.  It would be possible for the callee to be GC'ed, and a new object to be allocated in its place.  For this to happen all objects of the prototype stucture must have also been destroyed (or their structures must have changed).  The new object allocated in place of the callee could be added to an object with the same name, follow the same structure transition, and reach the same prototype structure as had previously been reached.  However in such a set of events an the correct result for op method check to produce is still callee (albeit a new value of callee that just happens to be at the same location).
Comment 22 Zoltan Herczeg 2009-07-21 00:01:34 PDT
Created attachment 33149 [details]
Cache the structure of the prototype
Comment 23 Zoltan Herczeg 2009-07-21 06:43:14 PDT
Created attachment 33174 [details]
layout test for op_method_call bug
Comment 24 Gavin Barraclough 2009-07-21 21:04:22 PDT
Committed with fix to make the layout test pass.

Sending        JavaScriptCore/ChangeLog
Sending        JavaScriptCore/bytecode/CodeBlock.cpp
Sending        JavaScriptCore/bytecode/CodeBlock.h
Sending        JavaScriptCore/jit/JITPropertyAccess.cpp
Sending        LayoutTests/ChangeLog
Adding         LayoutTests/fast/js/method-check-expected.txt
Adding         LayoutTests/fast/js/method-check.html
Adding         LayoutTests/fast/js/resources/method-check.js
Transmitting file data ........
Committed revision 46210.
Comment 25 Sam Weinig 2009-07-21 21:23:08 PDT
There is a way to force gc from a layout test, GCController.collect().  This should usually be accompanied by a loop to force gc if the method is not available (such as when called from the browser it self).  See fast/js/activation-object-function-lifetime.html for an example.