Bug 85076 - ARM JIT causes segmentation fault on javascript-heavy pages
Status: UNCONFIRMED
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: 528+ (Nightly build)
Hardware: Other Linux
Importance: P2 Normal
Assigned To: Nobody
Reported: 2012-04-27 10:08 PDT by Daniel Drake
Modified: 2015-02-23 11:23 PST

Attachments
Attempted gdb diagnostics (5.27 KB, text/plain)
2012-04-27 10:08 PDT, Daniel Drake

Description Daniel Drake 2012-04-27 10:08:19 PDT
Created attachment 139221
Attempted gdb diagnostics

OLPC is moving from xulrunner to webkit. This is working great on our x86 laptops, but not on our latest ("XO-1.75") ARMv7 laptop.

We are running Fedora 17 with webkitgtk-1.8.1 (GTK3).

On the ARM platform, loading a javascript-heavy webpage causes a crash. Reproduced in Epiphany and OLPC's own "Browse activity" for the Sugar desktop. Reproduces very easily - loading gmail or Google Docs will cause an instant crash most of the time.

Unfortunately gdb is not helpful with the crash. With all relevant debuginfo packages installed:

(gdb) bt
#0  0x00000024 in ?? ()
#1  0x49f0eaf4 in ?? ()
#2  0x49f0eaf4 in ?? ()

The crash can't be reproduced with an identical configuration on x86.

WebKit was built with these options:

WebKit was configured with the following options:
Build configuration:
 Enable debugging (slow)                                  : no
 Compile with debug symbols (slow)                        : no
 Enable debug features (slow)                             : no
 Enable GCC build optimization                            : yes
 Code coverage support                                    : no
 Unicode backend                                          : icu
 Font backend                                             : freetype
 Optimized memory allocator                               : yes
 Accelerated Compositing                                  : no
Features:
 WebGL                                                    : yes
 Blob support                                             : yes
 DOM mutation observer support                            : no
 DeviceOrientation support                                : no
 Directory upload                                         : no
 Fast Mobile Scrolling                                    : no
 JIT compilation                                          : yes
 Filters support                                          : yes
 Geolocation support                                      : yes
 JavaScript debugger/profiler support                     : yes
 Gamepad support                                          : no
 MathML support                                           : yes
 Media source                                             : no
 Media statistics                                         : no
 MHTML support                                            : no
 HTML5 channel messaging support                          : yes
 HTML5 meter element support                              : yes
 HTML5 microdata support                                  : no
 Page Visibility API support                              : no
 HTML5 progress element support                           : yes
 HTML5 client-side session and persistent storage support : yes
 SQL client-side database storage support                 : yes
 HTML5 datagrid support                                   : no
 HTML5 data transfer items support                        : no
 HTML5 FileSystem API support                             : no
 Quota API support                                        : no
 HTML5 sandboxed iframe support                           : yes
 HTML5 video element support                              : yes
 HTML5 track element support                              : no
 Fullscreen API support                                   : yes
 Media stream support                                     : no
 Icon database support                                    : yes
 Image resizer support                                    : no
 Link prefetch support                                    : no
 Opcode stats                                             : no
 Shadow DOM support                                       : yes
 SharedWorkers support                                    : yes
 Color input support                                      : no
 Speech input support                                     : no
 SVG support                                              : yes
 SVG fonts support                                        : yes
 Web Audio support                                        : no
 Web Sockets support                                      : yes
 Web Timing support                                       : no
 Web Workers support                                      : yes
 XSLT support                                             : yes
 Spellcheck support                                       : yes
 Animation API                                            : no
 RequestAnimationFrame support                            : yes
 Touch Icon Loading support                               : no
 Register Protocol Handler support                        : no
 WebKit2 support                                          : no
 WebKit2 plugin process                                   : no
GTK+ configuration:
 GTK+ version                                             : 3.0
 GDK target                                               : x11
 Hildon UI extensions                                     : no
 GStreamer version                                        : 0.10
 Introspection support                                    : yes
 Generate documentation                                   : no
Comment 1 Daniel Drake 2012-04-30 08:55:03 PDT
Recompiling webkit with --disable-jit "solves" the issue.

So it seems to be a bug in the ARM JIT. This would also explain why gdb can't tell which library this code is coming from.
Comment 2 Zoltan Herczeg 2012-04-30 12:04:33 PDT
This is very interesting from your log:

pc             0x24	0x24

I see you disassembled the content of the link register. Could you disassemble a bit further back? For example:

x/i $lr-32, $lr+4

Anyway a reduced test case would also be helpful.
Comment 3 Daniel Drake 2012-04-30 12:14:38 PDT
Thanks for looking at this, Zoltan.

(gdb) x/i $lr-32, $lr+4
   0x49f0eaf8:	mov	r2, lr
(gdb) x/12i $lr-32
   0x49f0ead4:	blx	r8
   0x49f0ead8:	b	0x49f0d0d0
   0x49f0eadc:	mov	r0, sp
   0x49f0eae0:	str	r4, [sp, #3118288]	; 0x60
   0x49f0eae4:	ldr	r3, [pc, #33757136]	; 0x49f0ed3c
   0x49f0eae8:	str	r4, [r3]
   0x49f0eaec:	ldr	r8, [pc, #33757136]	; 0x49f0ed40
   0x49f0eaf0:	blx	r8
   0x49f0eaf4:	b	0x49f0b164
   0x49f0eaf8:	mov	r2, lr
   0x49f0eafc:	str	r2, [r4, #-3118288]
   0x49f0eb00:	ldr	r8, [pc, #33757136]	; 0x49f0ed48

Finding a less complex webpage that reliably reproduces this is difficult. On other sites we're finding that it crashes, but not always. I'll keep an eye open though.
Comment 4 Daniel Drake 2012-04-30 12:27:02 PDT
Core dump of the above crash: http://dev.laptop.org/~dsd/20120430/webkit85076.core.bz2
Comment 5 Zoltan Herczeg 2012-04-30 13:47:49 PDT
>    0x49f0eaec:    ldr    r8, [pc, #33757136]    ; 0x49f0ed40
>    0x49f0eaf0:    blx    r8

This should be the culprit. Could you check address 0x49f0ed40? (= $pc + #33757136)

I suspect it will be 0x24
Comment 6 Daniel Drake 2012-04-30 14:00:58 PDT
(gdb) x 0x49f0ed40
0x49f0ed40:	0x41d5d15c

Is that what you're looking for?
Comment 7 Daniel Drake 2012-04-30 14:03:07 PDT
Guessing here, but maybe this is also interesting:

(gdb) x/10i 0x41d5d15c
   0x41d5d15c <_ZN3JSC4Heap9markRootsEb+1536>:	eor	r9, r9, r9, lsl #12
   0x41d5d160 <_ZN3JSC4Heap9markRootsEb+1540>:	eor	r9, r9, r9, lsr #7
   0x41d5d164 <_ZN3JSC4Heap9markRootsEb+1544>:	eor	r9, r9, r9, lsl #2
   0x41d5d168 <_ZN3JSC4Heap9markRootsEb+1548>:	eor	r9, r9, r9, lsr #20
   0x41d5d16c <_ZN3JSC4Heap9markRootsEb+1552>:	orr	r9, r9, #1
   0x41d5d170 <_ZN3JSC4Heap9markRootsEb+1556>:	
    b	0x41d5d17c <_ZN3JSC4Heap9markRootsEb+1568>
   0x41d5d174 <_ZN3JSC4Heap9markRootsEb+1560>:	cmp	r1, #0
   0x41d5d178 <_ZN3JSC4Heap9markRootsEb+1564>:	
    beq	0x41d5d1dc <_ZN3JSC4Heap9markRootsEb+1664>
   0x41d5d17c <_ZN3JSC4Heap9markRootsEb+1568>:	cmp	r2, #0
   0x41d5d180 <_ZN3JSC4Heap9markRootsEb+1572>:	moveq	r2, r9
Comment 8 Zoltan Herczeg 2012-04-30 14:20:41 PDT
> Is that what you're looking for?

Yeah, if the constants are not changed. I mean pc+#33757136 can be different if you rerun the program.

0x49f0eaec:    ldr    r8, [pc, #33757136]    ; 0x49f0ed40

Anyway, this is clearly rubbish, not a valid function:

   0x41d5d15c <_ZN3JSC4Heap9markRootsEb+1536>:    eor    r9, r9, r9, lsl #12
   0x41d5d160 <_ZN3JSC4Heap9markRootsEb+1540>:    eor    r9, r9, r9, lsr #7

This is clearly a fallback path:

   0x49f0eadc:    mov    r0, sp
   0x49f0eae0:    str    r4, [sp, #3118288]    ; 0x60
   0x49f0eae4:    ldr    r3, [pc, #33757136]    ; 0x49f0ed3c
   0x49f0eae8:    str    r4, [r3]
   0x49f0eaec:    ldr    r8, [pc, #33757136]    ; 0x49f0ed40
   0x49f0eaf0:    blx    r8
   0x49f0eaf4:    b    0x49f0b164

The question is what pc+#33757136 should contain in the correct case. Btw, does webkitgtk-1.8.1 contain the latest trunk? I mean, this might already have been fixed...

Ah, an idea! Instead of x/i, use x/x, and then x/x the resulting number again. I mean, let pc+#33757136 be 0x49f0ed40. Type x/x 0x49f0ed40 and it will print a number. Then x/x that number, and tell me what it is.
Comment 9 Daniel Drake 2012-04-30 14:27:04 PDT
I'm working from the same core dump so nothing should change.

Yes, I agree it looks strange that it is jumping right into the middle of a function.

(gdb) x/x 0x49f0ed40
0x49f0ed40:	0x41d5d15c
(gdb) x/x 0x41d5d15c
0x41d5d15c <_ZN3JSC4Heap9markRootsEb+1536>:	0xe0299609

I'm not in a good position to test webkit trunk at the moment. I will try to build it on Wednesday.

In the meantime, please let me know if you have any other ideas.
Comment 10 Zoltan Herczeg 2012-04-30 15:29:45 PDT
> Yes, I agree it looks strange that it is jumping right into the middle of a function.

Unlikely. I think this is simply the closest symbol gdb can find. 1536 is just too big.

Could you check the other constants? These are fallback functions, following each other one-by-one:

   0x49f0ead4:    blx    r8
   0x49f0ead8:    b    0x49f0d0d0
--- fallback
   0x49f0eadc:    mov    r0, sp
   0x49f0eae0:    str    r4, [sp, #3118288]    ; 0x60
   0x49f0eae4:    ldr    r3, [pc, #33757136]    ; 0x49f0ed3c
   0x49f0eae8:    str    r4, [r3]
   0x49f0eaec:    ldr    r8, [pc, #33757136]    ; 0x49f0ed40
   0x49f0eaf0:    blx    r8
   0x49f0eaf4:    b    0x49f0b164
--- fallback
   0x49f0eaf8:    mov    r2, lr
   0x49f0eafc:    str    r2, [r4, #-3118288]
   0x49f0eb00:    ldr    r8, [pc, #33757136]    ; 0x49f0ed48

They all contain a sequence like this:
   0x49f0eaec:    ldr    r8, [pc, #33757136]    ; 0x49f0ed40
   0x49f0eaf0:    blx    r8

Could you check whether their constants point to valid functions? That would tell us whether this is the only exception or whether something is totally messed up in the constant pool.
Comment 11 Daniel Drake 2012-04-30 16:49:01 PDT
Sorry, think I've wasted a bit of your time.
It looks like I had installed a different webkit build since the crash, and this was affecting the gdb output.

Putting the right build back (the one from which the core was captured), I get different output.

So, stepping back a bit.
lr is still 0x49f0eaf4

The preceding instructions:

   0x49f0ead0:	ldr	r8, [pc, #26091512]	; 0x49f0ed34
   0x49f0ead4:	blx	r8
   0x49f0ead8:	b	0x49f0d0d0
   0x49f0eadc:	mov	r0, sp
   0x49f0eae0:	str	r4, [sp, #3118288]	; 0x60
   0x49f0eae4:	ldr	r3, [pc, #26091512]	; 0x49f0ed3c
   0x49f0eae8:	str	r4, [r3]
   0x49f0eaec:	ldr	r8, [pc, #26091512]	; 0x49f0ed40
   0x49f0eaf0:	blx	r8
   0x49f0eaf4:	b	0x49f0b164

So, value of 0x49f0ed40

(gdb) x/x 0x49f0ed40
0x49f0ed40:	0x41d5d15c

Nothing new until now. But let's look at that code with the right library in place:

   0x41d5d15c <cti_op_get_by_id_proto_fail+8>:	
    ldr	lr, [sp, #3118288]	; 0x40
   0x41d5d160 <cti_op_get_by_id_proto_fail+12>:	mov	pc, lr
   0x41d5d164 <cti_op_get_by_id_array_fail>:	
    str	lr, [sp, #3118288]	; 0x40
   0x41d5d168 <cti_op_get_by_id_array_fail+4>:	bl	0x41cae2e8

This looks suspicious. Does it tell you anything?



Just to compare, the previous fallback condition is:
   0x49f0ead0:	ldr	r8, [pc, #26091512]	; 0x49f0ed34
   0x49f0ead4:	blx	r8

(gdb) x/x 0x49f0ed34
0x49f0ed34:	0x41d5d1ac
(gdb) x/4i 0x41d5d1ac
   0x41d5d1ac <cti_op_del_by_id+8>:	ldr	lr, [sp, #3118288]	; 0x40
   0x41d5d1b0 <cti_op_del_by_id+12>:	mov	pc, lr
   0x41d5d1b4 <cti_op_mul>:	str	lr, [sp, #3118288]	; 0x40
   0x41d5d1b8 <cti_op_mul+4>:	bl	0x41caf998
Comment 12 Zoltan Herczeg 2012-05-01 01:10:45 PDT
No problem. This is entirely different now.

> Nothing new until now. But let's look at that code with the right library in place:
> 
>    0x41d5d15c <cti_op_get_by_id_proto_fail+8>:    
>     ldr    lr, [sp, #3118288]    ; 0x40
>    0x41d5d160 <cti_op_get_by_id_proto_fail+12>:    mov    pc, lr
>    0x41d5d164 <cti_op_get_by_id_array_fail>:    
>     str    lr, [sp, #3118288]    ; 0x40
>    0x41d5d168 <cti_op_get_by_id_array_fail+4>:    bl    0x41cae2e8
> 
> This looks suspicious. Does it tell you anything?

Yeah it is really suspicious. The sequence should look like this:

str    lr, [sp, ...]
bl     ...
ldr    lr, [sp, ...]
mov    pc, lr

Generated by:

#define DEFINE_STUB_FUNCTION(rtype, op) \
    extern "C" { \
        rtype JITStubThunked_##op(STUB_ARGS_DECLARATION); \
    }; \
    asm ( \
        ".globl " SYMBOL_STRING(cti_##op) "\n" \
        SYMBOL_STRING(cti_##op) ":" "\n" \
        "str lr, [sp, #" STRINGIZE_VALUE_OF(THUNK_RETURN_ADDRESS_OFFSET) "]" "\n" \
        "bl " SYMBOL_STRING(JITStubThunked_##op) "\n" \
        "ldr lr, [sp, #" STRINGIZE_VALUE_OF(THUNK_RETURN_ADDRESS_OFFSET) "]" "\n" \
        "mov pc, lr" "\n" \
        ); \
    rtype JITStubThunked_##op(STUB_ARGS_DECLARATION)

and

#define THUNK_RETURN_ADDRESS_OFFSET      0x38

(so #3118288 looks far too big to me)

In other words, something added 8 to the offset of these so-called "stubs". The same is true of the second function. The question is why... Perhaps a very simple web page with simple JS that calls fallbacks, like the following, could also reveal this error:

<script>
var a = {}; a["a"]=5;
</script>
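
A minimal sketch of the layout argument above, assuming each stub emitted by DEFINE_STUB_FUNCTION is exactly four 4-byte ARM instructions (which matches the disassembly in comment 11). The table and the function name `landing_instruction` are illustrative only and are not part of WebKit.

```python
# Each cti_* stub is assumed to be four 4-byte ARM instructions, in this order.
STUB_LAYOUT = [
    "str lr, [sp, #THUNK_RETURN_ADDRESS_OFFSET]",  # +0: save return address
    "bl  JITStubThunked_<op>",                     # +4: call the C++ function
    "ldr lr, [sp, #THUNK_RETURN_ADDRESS_OFFSET]",  # +8: reload return address
    "mov pc, lr",                                  # +12: return into JIT code
]

def landing_instruction(stub_start, target):
    """Return the stub instruction that a call to `target` would execute first."""
    offset = target - stub_start
    if offset < 0 or offset >= 4 * len(STUB_LAYOUT) or offset % 4:
        raise ValueError("target is not an instruction inside this stub")
    return STUB_LAYOUT[offset // 4]

# Addresses from comment 11: the constant pool held 0x41d5d15c, which gdb
# resolved to cti_op_get_by_id_proto_fail+8.
stub_start = 0x41d5d15c - 8
print(landing_instruction(stub_start, 0x41d5d15c))
```

If a call really did enter at +8, it would skip both the save and the C++ call and branch to whatever happens to be in the return-address slot. That is only what the layout implies, not a confirmed diagnosis.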
Comment 13 Daniel Drake 2012-05-01 15:33:23 PDT
Working from home today, with a different laptop.
So I'm not using the same trace as earlier. Let's start over with a new crash.

(gdb) bt
#0  0x000013e4 in ?? ()
#1  0x499fe5dc in ?? ()
#2  0x499fe5dc in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info registers
r0             0x4996e240	1234625088
r1             0xfffffffb	4294967291
r2             0x4996e240	1234625088
r3             0xfffffffb	4294967291
r4             0x47626688	1197631112
r5             0x1d2	466
r6             0x47626588	1197630856
r7             0x5357c57c	1398261116
r8             0x41d9f3e4	1104802788
r9             0x47626570	1197630832
r10            0x45a67400	1168536576
r11            0x41f55058	1106595928
r12            0x4c1d0ab0	1276971696
sp             0xbe8aed78	0xbe8aed78
lr             0x499fe5dc	1235215836
pc             0x13e4	0x13e4
cpsr           0x600f0010	1611595792

Check around the LR area again:
(gdb) x/12i $lr-32
   0x499fe5bc:	mov	r0, sp
   0x499fe5c0:	str	r4, [sp, #3118288]	; 0x60
   0x499fe5c4:	mov	r8, #408	; 0x198
   0x499fe5c8:	str	r8, [r4, #-3118288]	; 0x2c
   0x499fe5cc:	ldr	r3, [pc, #12638680]	; 0x499fea40
   0x499fe5d0:	str	r4, [r3]
   0x499fe5d4:	ldr	r8, [pc, #12638680]	; 0x499fea44
   0x499fe5d8:	blx	r8
   0x499fe5dc:	str	r0, [r4, #3118288]	; 0x70
   0x499fe5e0:	str	r1, [r4, #3118288]	; 0x74
   0x499fe5e4:	b	0x499fc264
   0x499fe5e8:	b	0x499fe618

Looking carefully at this instruction:
   0x499fe5d4:	ldr	r8, [pc, #12638680]	; 0x499fea44

Let's try this calculation by hand. In ARM mode, PC reads as 8 bytes ahead of the current instruction, so pc=0x499fe5d4 + 8.
Then we add 12638680, and we read from that memory location.

0x499fe5d4 + 8 + 12638680 = 0x4a60bfb4 so:

(gdb) x/x 0x4a60bfb4
0x4a60bfb4:	0x00000000
Hmm, unlikely.

But gdb's annotation said 0x499fea44.

After asking around a bit I've been told that the number inside the square brackets is not to be taken literally. It includes flags and other things. However, the address in the annotation can be trusted.
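
A rough sketch of that arithmetic, assuming ARM (not Thumb) mode, where reading pc yields the instruction's address plus 8. The helper names are made up for illustration. It reproduces the hand calculation above and then works backwards from gdb's annotation to the offset the instruction would actually need.

```python
ARM_PC_BIAS = 8  # in ARM mode, reading pc gives the instruction address + 8

def ldr_literal_target(insn_addr, imm):
    """Address read by `ldr rX, [pc, #imm]` located at insn_addr."""
    return (insn_addr + ARM_PC_BIAS + imm) & 0xFFFFFFFF

def implied_offset(insn_addr, annotated_target):
    """Immediate that would make the load hit gdb's annotated address."""
    return annotated_target - (insn_addr + ARM_PC_BIAS)

insn = 0x499fe5d4

# Taking the printed immediate literally, as in the hand calculation.
print(hex(ldr_literal_target(insn, 12638680)))   # 0x4a60bfb4

# Working backwards from the annotation "; 0x499fea44".
print(hex(implied_offset(insn, 0x499fea44)))     # 0x468
```

An offset of 0x468 fits in the 12-bit immediate field of an ARM `ldr`, whereas 12638680 does not, which supports trusting the annotation over the printed number.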

So, let's check the memory at that address.

(gdb) x/x 0x499fea44
0x499fea44:	0x41d9f3e4
(gdb) x/4i 0x41d9f3e4
   0x41d9f3e4 <cti_op_resolve_global>:	str	lr, [sp, #3118288]	; 0x40
   0x41d9f3e8 <cti_op_resolve_global+4>:	bl	0x41cf4aac
   0x41d9f3ec <cti_op_resolve_global+8>:	
    ldr	lr, [sp, #3118288]	; 0x40
   0x41d9f3f0 <cti_op_resolve_global+12>:	mov	pc, lr

And let's also recall that r8 was loaded with this address before we branched. Checking back to the original register dump, r8=0x41d9f3e4, which is the address of cti_op_resolve_global. Things are making some sense.

The offset of 3118288 in the str/ldr lr lines is huge, of course. Again, I think we have to ignore it. I think the offset actually being used is 0x40, as shown by the comment.

Now let's think about the value of lr. It got set as 0x499fe5dc because of the "blx r8" that we followed earlier. Now almost immediately inside cti_op_resolve_global we call "bl", which will change the value of lr. However, lr does *not* reflect the return location for the "bl 0x41cf4aac" call. This means either:
 1. We crashed before executing the bl inside cti_op_resolve_global (seems impossible), or
 2. We executed the bl inside cti_op_resolve_global, and then restored lr, and returned. (seems likely)

So let's go back to the code pasted at the top of this comment (around 0x499fe5dc), since that's where we're returning to.

   0x499fe5dc:	str	r0, [r4, #3118288]	; 0x70
   0x499fe5e0:	str	r1, [r4, #3118288]	; 0x74

Let's see if we executed those instructions:
r4=0x47626688
r4 + 0x70 = 0x476266f8

(gdb) x/x 0x476266f8
0x476266f8:	0x4996e240

That matches the value of r0.

Looking at the "str r1":

(gdb) x/x 0x476266fc
0x476266fc:	0xfffffffb

That matches the value of r1.

So it seems like we have returned and executed these 2 instructions at least. Next is:

   0x499fe5e4:	b	0x499fc264

Let's look:

(gdb) x/12i 0x499fc264
   0x499fc264:	mov	r2, #1
   0x499fc268:	mvn	r7, #0
   0x499fc26c:	ldr	r0, [r4, #3118288]	; 0x50
   0x499fc270:	ldr	r1, [r4, #3118288]	; 0x54
   0x499fc274:	cmn	r7, #1
   0x499fc278:	bne	0x499fe5e8
   0x499fc27c:	cmn	r1, #5
   0x499fc280:	bne	0x499fe5e8
   0x499fc284:	ldr	r8, [r0]
   0x499fc288:	ldr	r3, [pc, #28956432]	; 0x499fca44
   0x499fc28c:	cmp	r8, r3
   0x499fc290:	bne	0x499fe5ec

This code looks odd.
Seems to set r7 to a fixed value and then compare its value against 1?
Looking at register values and memory I'm having trouble convincing myself that this code has run, but it might have.
Anyway, out of time for today unfortunately.
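
For reference, a small sketch of what `mvn` and `cmn` do, modelled on 32-bit unsigned arithmetic. `cmn rN, #imm` sets the flags from rN + imm, so it is effectively a comparison against -imm rather than against imm. The helper names are illustrative, and 0xfffffffb is simply the value r1 holds in the register dump above.

```python
MASK = 0xFFFFFFFF  # registers are 32 bits wide

def mvn(imm):
    """mvn rD, #imm: bitwise NOT of the immediate."""
    return ~imm & MASK

def cmn_sets_z(reg, imm):
    """cmn rN, #imm: the Z flag is set when rN + imm wraps to zero."""
    return (reg + imm) & MASK == 0

r7 = mvn(0)                          # mvn r7, #0

print(hex(r7))                       # 0xffffffff, i.e. -1
print(cmn_sets_z(r7, 1))             # True: the first bne is never taken
print(cmn_sets_z(0xfffffffb, 5))     # True: 0xfffffffb is -5
```

So the first `cmn`/`bne` pair always falls through, and the second falls through only when r1 is 0xfffffffb. Since r1 is reloaded from [r4, #0x54] just before that check, the value in the dump is at best consistent with this code having run and does not prove it.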


I tried the test webpage that you provided. It doesn't trigger the crash.
Also after a few runs I haven't managed to reproduce the problem where the stub offset is off by 8. Maybe that one was a bad dump.
Comment 14 Hojong Han 2012-05-01 19:37:54 PDT
I don't know exactly what's going on here, but in my experience this kind of crash, related to lr and pc values, can occur when the cache flush was not run over the requested range.
Comment 15 Zoltan Herczeg 2012-05-02 01:39:23 PDT
> Now let's think about the value of lr. It got set as 0x499fe5dc because of the "blx r8" that we followed earlier. Now almost immediately inside cti_op_resolve_global we call "bl", which will change the value of lr. However, lr does *not* reflect the return location for the "bl 0x41cf4aac" call. This means either:
>  1. We crashed before executing the bl inside cti_op_resolve_global (seems impossible), or
>  2. We executed the bl inside cti_op_resolve_global, and then restored lr, and returned. (seems likely)

This makes sense, but there is a third option. The purpose of such stub code is to allow returning to anywhere in JIT code, which is mainly used by exception handlers. The return address is stored on the stack (like on x86) and can be changed (like a buffer overflow attack, but intentional here), so the C++ function can return anywhere, including a catch handler.

So we have two new options:
1) A wrong handler was set
2) Something overwrites the return address

Would be good to know if an exception occurs just before the return...

Perhaps the following code also crashes:

try {
  var a = "a";
  a++;
} catch(e) { }
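
To make the mechanism concrete, here is a toy model in Python (not WebKit code) of a stub that saves lr in a stack slot, calls a C++ helper, then reloads lr from that slot and jumps there. The slot name, the addresses and the helper functions are all invented for illustration.

```python
RETURN_SLOT = "return_address"  # stands in for [sp, #THUNK_RETURN_ADDRESS_OFFSET]

def run_stub(stack, lr, helper):
    """Model of: str lr,[sp,#off]; bl helper; ldr lr,[sp,#off]; mov pc,lr."""
    stack[RETURN_SLOT] = lr          # str lr, [sp, #off]
    helper(stack)                    # bl JITStubThunked_<op>
    return stack[RETURN_SLOT]        # ldr lr, [sp, #off]; mov pc, lr

def normal_helper(stack):
    pass                             # leaves the saved return address alone

def throwing_helper(stack):
    stack[RETURN_SLOT] = 0x2000      # redirect to a hypothetical catch handler

def clobbering_helper(stack):
    stack[RETURN_SLOT] = 0xdead      # slot overwritten with an unrelated value

caller = 0x1000
print(hex(run_stub({}, caller, normal_helper)))      # 0x1000
print(hex(run_stub({}, caller, throwing_helper)))    # 0x2000
print(hex(run_stub({}, caller, clobbering_helper)))  # 0xdead
```

The model only shows why a wrong handler or an overwritten slot would both send execution somewhere unexpected. It says nothing about which of the two, if either, happened in this crash.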
Comment 16 Daniel Drake 2012-05-02 09:49:09 PDT
(In reply to comment #15)
> Would be good to know if an exception occurs just before the return...

How can I check this?

> Perhaps the following code also crashes:
> 
> try {
>   var a = "a";
>   a++;
> } catch(e) { }

No crash, unfortunately.

Just FYI, I have a feeling that finding a simplistic test case will be difficult. Sometimes when the crash happens, I go back to the same page and it loads just fine without crashing. gmail seems to cause the crash every time, but sometimes it takes a good few seconds longer than normal before the crash happens.

Also, when I run epiphany under gdb, the crash is very hard to reproduce, even on gmail. (That's why I've been mostly working with core dumps.)
Comment 17 Daniel Drake 2012-05-14 13:53:17 PDT
This has also been reproduced on a trimslice (also running as armv7hl). We'll disable the ARM JIT in Fedora for the time being in order to avoid this crash.

If anyone with the right experience is interested in working on this issue, we can ship hardware. Send me an email if interested.
Comment 18 Simon Schampijer 2012-08-09 05:34:11 PDT
The workaround we were using to disable the ARM JIT in Fedora does not work anymore because the Heap code now requires JIT to be enabled (building 1.9.5). So the situation is a bit chicken and egg now. Has there been any progress on the original issue?