Bug 85076 - ARM JIT causes segmentation fault on javascript-heavy pages
Status: UNCONFIRMED
Product: WebKit
Component: JavaScriptCore
Version: 528+ (Nightly build)
Platform: Other Linux
Importance: P2 Normal
Reported: 2012-04-27 10:08 PST
Modified: 2013-01-14 12:38 PST


Attachments
Attempted gdb diagnostics (5.27 KB, text/plain)
2012-04-27 10:08 PST, Daniel Drake



Description From 2012-04-27 10:08:19 PST
Created an attachment (id=139221)
Attempted gdb diagnostics

OLPC is moving from xulrunner to webkit. This is working great on our x86 laptops, but not on our latest ("XO-1.75") ARMv7 laptop.

We are running Fedora 17 with webkitgtk-1.8.1 (GTK3).

On the ARM platform, loading a JavaScript-heavy webpage causes a crash. Reproduced in Epiphany and in OLPC's own "Browse activity" for the Sugar desktop. It reproduces very easily: loading Gmail or Google Docs causes an instant crash most of the time.

Unfortunately gdb is not helpful with the crash. With all relevant debuginfo packages installed:

(gdb) bt
#0  0x00000024 in ?? ()
#1  0x49f0eaf4 in ?? ()
#2  0x49f0eaf4 in ?? ()

The crash can't be reproduced on an identical configuration on x86.

WebKit was built with these options:

WebKit was configured with the following options:
Build configuration:
 Enable debugging (slow)                                  : no
 Compile with debug symbols (slow)                        : no
 Enable debug features (slow)                             : no
 Enable GCC build optimization                            : yes
 Code coverage support                                    : no
 Unicode backend                                          : icu
 Font backend                                             : freetype
 Optimized memory allocator                               : yes
 Accelerated Compositing                                  : no
Features:
 WebGL                                                    : yes
 Blob support                                             : yes
 DOM mutation observer support                            : no
 DeviceOrientation support                                : no
 Directory upload                                         : no
 Fast Mobile Scrolling                                    : no
 JIT compilation                                          : yes
 Filters support                                          : yes
 Geolocation support                                      : yes
 JavaScript debugger/profiler support                     : yes
 Gamepad support                                          : no
 MathML support                                           : yes
 Media source                                             : no
 Media statistics                                         : no
 MHTML support                                            : no
 HTML5 channel messaging support                          : yes
 HTML5 meter element support                              : yes
 HTML5 microdata support                                  : no
 Page Visibility API support                              : no
 HTML5 progress element support                           : yes
 HTML5 client-side session and persistent storage support : yes
 SQL client-side database storage support                 : yes
 HTML5 datagrid support                                   : no
 HTML5 data transfer items support                        : no
 HTML5 FileSystem API support                             : no
 Quota API support                                        : no
 HTML5 sandboxed iframe support                           : yes
 HTML5 video element support                              : yes
 HTML5 track element support                              : no
 Fullscreen API support                                   : yes
 Media stream support                                     : no
 Icon database support                                    : yes
 Image resizer support                                    : no
 Link prefetch support                                    : no
 Opcode stats                                             : no
 Shadow DOM support                                       : yes
 SharedWorkers support                                    : yes
 Color input support                                      : no
 Speech input support                                     : no
 SVG support                                              : yes
 SVG fonts support                                        : yes
 Web Audio support                                        : no
 Web Sockets support                                      : yes
 Web Timing support                                       : no
 Web Workers support                                      : yes
 XSLT support                                             : yes
 Spellcheck support                                       : yes
 Animation API                                            : no
 RequestAnimationFrame support                            : yes
 Touch Icon Loading support                               : no
 Register Protocol Handler support                        : no
 WebKit2 support                                          : no
 WebKit2 plugin process                                   : no
GTK+ configuration:
 GTK+ version                                             : 3.0
 GDK target                                               : x11
 Hildon UI extensions                                     : no
 GStreamer version                                        : 0.10
 Introspection support                                    : yes
 Generate documentation                                   : no
------- Comment #1 From 2012-04-30 08:55:03 PST -------
Recompiling webkit with --disable-jit "solves" the issue.

So it seems to be a bug in the ARM JIT. This would also explain why gdb can't tell which library this code is coming from.
------- Comment #2 From 2012-04-30 12:04:33 PST -------
This is very interesting from your log:

pc             0x24    0x24

I see you disassembled the content of the link register. Could you disassemble a bit further back? For example:

x/i $lr-32, $lr+4

Anyway a reduced test case would also be helpful.
------- Comment #3 From 2012-04-30 12:14:38 PST -------
Thanks for looking at this, Zoltan.

(gdb) x/i $lr-32, $lr+4
   0x49f0eaf8:    mov    r2, lr
(gdb) x/12i $lr-32
   0x49f0ead4:    blx    r8
   0x49f0ead8:    b    0x49f0d0d0
   0x49f0eadc:    mov    r0, sp
   0x49f0eae0:    str    r4, [sp, #3118288]    ; 0x60
   0x49f0eae4:    ldr    r3, [pc, #33757136]    ; 0x49f0ed3c
   0x49f0eae8:    str    r4, [r3]
   0x49f0eaec:    ldr    r8, [pc, #33757136]    ; 0x49f0ed40
   0x49f0eaf0:    blx    r8
   0x49f0eaf4:    b    0x49f0b164
   0x49f0eaf8:    mov    r2, lr
   0x49f0eafc:    str    r2, [r4, #-3118288]
   0x49f0eb00:    ldr    r8, [pc, #33757136]    ; 0x49f0ed48

Finding a less complex webpage that reliably reproduces this is difficult. On other sites we're finding that it crashes, but not always. I'll keep an eye open though.
------- Comment #4 From 2012-04-30 12:27:02 PST -------
Core dump of the above crash: http://dev.laptop.org/~dsd/20120430/webkit85076.core.bz2
------- Comment #5 From 2012-04-30 13:47:49 PST -------
>    0x49f0eaec:    ldr    r8, [pc, #33757136]    ; 0x49f0ed40
>    0x49f0eaf0:    blx    r8

This should be the culprit. Could you check address 0x49f0ed40? (= $pc + #33757136)

I suspect it will be 0x24
------- Comment #6 From 2012-04-30 14:00:58 PST -------
(gdb) x 0x49f0ed40
0x49f0ed40:    0x41d5d15c

Is that what you're looking for?
------- Comment #7 From 2012-04-30 14:03:07 PST -------
Guessing here, but maybe this is also interesting:

(gdb) x/10i 0x41d5d15c
   0x41d5d15c <_ZN3JSC4Heap9markRootsEb+1536>:    eor    r9, r9, r9, lsl #12
   0x41d5d160 <_ZN3JSC4Heap9markRootsEb+1540>:    eor    r9, r9, r9, lsr #7
   0x41d5d164 <_ZN3JSC4Heap9markRootsEb+1544>:    eor    r9, r9, r9, lsl #2
   0x41d5d168 <_ZN3JSC4Heap9markRootsEb+1548>:    eor    r9, r9, r9, lsr #20
   0x41d5d16c <_ZN3JSC4Heap9markRootsEb+1552>:    orr    r9, r9, #1
   0x41d5d170 <_ZN3JSC4Heap9markRootsEb+1556>:    
    b    0x41d5d17c <_ZN3JSC4Heap9markRootsEb+1568>
   0x41d5d174 <_ZN3JSC4Heap9markRootsEb+1560>:    cmp    r1, #0
   0x41d5d178 <_ZN3JSC4Heap9markRootsEb+1564>:    
    beq    0x41d5d1dc <_ZN3JSC4Heap9markRootsEb+1664>
   0x41d5d17c <_ZN3JSC4Heap9markRootsEb+1568>:    cmp    r2, #0
   0x41d5d180 <_ZN3JSC4Heap9markRootsEb+1572>:    moveq    r2, r9
------- Comment #8 From 2012-04-30 14:20:41 PST -------
> Is that what you're looking for?

Yeah, if the constants are not changed. I mean pc+#33757136 can be different if you rerun the program.

0x49f0eaec:    ldr    r8, [pc, #33757136]    ; 0x49f0ed40

Anyway, this is clearly rubbish, not a valid function:

   0x41d5d15c <_ZN3JSC4Heap9markRootsEb+1536>:    eor    r9, r9, r9, lsl #12
   0x41d5d160 <_ZN3JSC4Heap9markRootsEb+1540>:    eor    r9, r9, r9, lsr #7

This is clearly a fallback path:

   0x49f0eadc:    mov    r0, sp
   0x49f0eae0:    str    r4, [sp, #3118288]    ; 0x60
   0x49f0eae4:    ldr    r3, [pc, #33757136]    ; 0x49f0ed3c
   0x49f0eae8:    str    r4, [r3]
   0x49f0eaec:    ldr    r8, [pc, #33757136]    ; 0x49f0ed40
   0x49f0eaf0:    blx    r8
   0x49f0eaf4:    b    0x49f0b164

The question is what pc+#33757136 should contain in the correct case. Btw, does webkitgtk-1.8.1 contain the latest trunk? I mean this might already have been fixed...

Ah, an idea! Instead of x/i, use x/x, and then x/x the resulting number again. I mean, let pc+#33757136 be 0x49f0ed40. Type x/x 0x49f0ed40 and it will print a number. x/x that number again, and tell me what it is.
------- Comment #9 From 2012-04-30 14:27:04 PST -------
I'm working from the same core dump so nothing should change.

Yes, I agree it looks strange that it is jumping right into the middle of a function.

(gdb) x/x 0x49f0ed40
0x49f0ed40:    0x41d5d15c
(gdb) x/x 0x41d5d15c
0x41d5d15c <_ZN3JSC4Heap9markRootsEb+1536>:    0xe0299609

I'm not in a good position to test webkit trunk at the moment. I will try to build it on Wednesday.

In the meantime, please let me know if you have any other ideas.
------- Comment #10 From 2012-04-30 15:29:45 PST -------
> Yes, I agree it looks strange that it is jumping right into the middle of a function.

Unlikely. I think this is simply the closest symbol gdb can find. 1536 is just too big.

Could you check the other constants? These are fallback functions, following each other one-by-one:

   0x49f0ead4:    blx    r8
   0x49f0ead8:    b    0x49f0d0d0
--- fallback
   0x49f0eadc:    mov    r0, sp
   0x49f0eae0:    str    r4, [sp, #3118288]    ; 0x60
   0x49f0eae4:    ldr    r3, [pc, #33757136]    ; 0x49f0ed3c
   0x49f0eae8:    str    r4, [r3]
   0x49f0eaec:    ldr    r8, [pc, #33757136]    ; 0x49f0ed40
   0x49f0eaf0:    blx    r8
   0x49f0eaf4:    b    0x49f0b164
--- fallback
   0x49f0eaf8:    mov    r2, lr
   0x49f0eafc:    str    r2, [r4, #-3118288]
   0x49f0eb00:    ldr    r8, [pc, #33757136]    ; 0x49f0ed48

They all contain a sequence like this:
   0x49f0eaec:    ldr    r8, [pc, #33757136]    ; 0x49f0ed40
   0x49f0eaf0:    blx    r8

Could you check whether their constants point to valid functions? That would tell us whether this is the only exception or whether something is totally messed up in the constant pool.
------- Comment #11 From 2012-04-30 16:49:01 PST -------
Sorry, think I've wasted a bit of your time.
It looks like I had installed a different webkit build since the crash, and this was affecting the gdb output.

Putting the right build back (the one from which the core was captured), I get different output.

So, stepping back a bit.
lr is still 0x49f0eaf4

The preceding instructions:

   0x49f0ead0:    ldr    r8, [pc, #26091512]    ; 0x49f0ed34
   0x49f0ead4:    blx    r8
   0x49f0ead8:    b    0x49f0d0d0
   0x49f0eadc:    mov    r0, sp
   0x49f0eae0:    str    r4, [sp, #3118288]    ; 0x60
   0x49f0eae4:    ldr    r3, [pc, #26091512]    ; 0x49f0ed3c
   0x49f0eae8:    str    r4, [r3]
   0x49f0eaec:    ldr    r8, [pc, #26091512]    ; 0x49f0ed40
   0x49f0eaf0:    blx    r8
   0x49f0eaf4:    b    0x49f0b164

So, value of 0x49f0ed40

(gdb) x/x 0x49f0ed40
0x49f0ed40:    0x41d5d15c

Nothing new until now. But let's look at that code with the right library in place:

   0x41d5d15c <cti_op_get_by_id_proto_fail+8>:    
    ldr    lr, [sp, #3118288]    ; 0x40
   0x41d5d160 <cti_op_get_by_id_proto_fail+12>:    mov    pc, lr
   0x41d5d164 <cti_op_get_by_id_array_fail>:    
    str    lr, [sp, #3118288]    ; 0x40
   0x41d5d168 <cti_op_get_by_id_array_fail+4>:    bl    0x41cae2e8

This looks suspicious. Does it tell you anything?



Just to compare, the previous fallback condition is:
   0x49f0ead0:    ldr    r8, [pc, #26091512]    ; 0x49f0ed34
   0x49f0ead4:    blx    r8

(gdb) x/x 0x49f0ed34
0x49f0ed34:    0x41d5d1ac
(gdb) x/4i 0x41d5d1ac
   0x41d5d1ac <cti_op_del_by_id+8>:    ldr    lr, [sp, #3118288]    ; 0x40
   0x41d5d1b0 <cti_op_del_by_id+12>:    mov    pc, lr
   0x41d5d1b4 <cti_op_mul>:    str    lr, [sp, #3118288]    ; 0x40
   0x41d5d1b8 <cti_op_mul+4>:    bl    0x41caf998
------- Comment #12 From 2012-05-01 01:10:45 PST -------
No problem. This is entirely different now.

> Nothing new until now. But let's look at that code with the right library in place:
> 
>    0x41d5d15c <cti_op_get_by_id_proto_fail+8>:    
>     ldr    lr, [sp, #3118288]    ; 0x40
>    0x41d5d160 <cti_op_get_by_id_proto_fail+12>:    mov    pc, lr
>    0x41d5d164 <cti_op_get_by_id_array_fail>:    
>     str    lr, [sp, #3118288]    ; 0x40
>    0x41d5d168 <cti_op_get_by_id_array_fail+4>:    bl    0x41cae2e8
> 
> This looks suspicious. Does it tell you anything?

Yeah it is really suspicious. The sequence should look like this:

str    lr, [sp, ...]
bl     ...
ldr    lr, [sp, ...]
mov    pc, lr

Generated by:

#define DEFINE_STUB_FUNCTION(rtype, op) \
    extern "C" { \
        rtype JITStubThunked_##op(STUB_ARGS_DECLARATION); \
    }; \
    asm ( \
        ".globl " SYMBOL_STRING(cti_##op) "\n" \
        SYMBOL_STRING(cti_##op) ":" "\n" \
        "str lr, [sp, #" STRINGIZE_VALUE_OF(THUNK_RETURN_ADDRESS_OFFSET) "]" "\n" \
        "bl " SYMBOL_STRING(JITStubThunked_##op) "\n" \
        "ldr lr, [sp, #" STRINGIZE_VALUE_OF(THUNK_RETURN_ADDRESS_OFFSET) "]" "\n" \
        "mov pc, lr" "\n" \
        ); \
    rtype JITStubThunked_##op(STUB_ARGS_DECLARATION)

and

#define THUNK_RETURN_ADDRESS_OFFSET      0x38

(so #3118288 looks way too big to me)

In other words, something added 8 to the offset of these so-called "stubs". The same goes for the second function. The question is why... Perhaps a very simple web page with simple JS that calls fallbacks, like the following, could also reveal this error:

<script>
var a = {}; a["a"]=5;
</script>
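
As a side check on the "+8" observation, the two constant-pool values seen so far can be placed within the four-instruction stub layout. This is only a sketch: the 4-byte instruction size assumes ARM (not Thumb) mode, and the stub start addresses are inferred from gdb's "<symbol+8>" annotations in comment #11, not read from the binary. The helper name slot_in_stub is made up for illustration.

```python
# Sketch only: where do the two constant-pool targets land inside a stub?
# Assumes 4-byte ARM instructions and the 4-instruction stub emitted by
# DEFINE_STUB_FUNCTION (str lr / bl / ldr lr / mov pc, lr).
STUB_INSTRUCTIONS = ["str lr, [sp, ...]", "bl ...", "ldr lr, [sp, ...]", "mov pc, lr"]
INSN_SIZE = 4


def slot_in_stub(target, stub_start):
    """Return (byte offset, instruction) a branch target lands on inside a stub."""
    offset = target - stub_start
    if not 0 <= offset < len(STUB_INSTRUCTIONS) * INSN_SIZE:
        raise ValueError("target is outside the stub")
    return offset, STUB_INSTRUCTIONS[offset // INSN_SIZE]


# (branch target from the constant pool, stub start inferred from "+8")
targets = {
    "cti_op_get_by_id_proto_fail": (0x41d5d15c, 0x41d5d15c - 8),
    "cti_op_del_by_id": (0x41d5d1ac, 0x41d5d1ac - 8),
}
for name, (target, start) in targets.items():
    offset, insn = slot_in_stub(target, start)
    print(f"{name}+{offset}: enters at '{insn}'")
```

Under these assumptions both targets land on the "ldr lr" instruction, so a call through either pointer would skip the "str lr" and the "bl" entirely.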
------- Comment #13 From 2012-05-01 15:33:23 PST -------
Working from home today, with a different laptop.
So I'm not using the same trace as earlier. Let's start over with a new crash.

(gdb) bt
#0  0x000013e4 in ?? ()
#1  0x499fe5dc in ?? ()
#2  0x499fe5dc in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info registers
r0             0x4996e240    1234625088
r1             0xfffffffb    4294967291
r2             0x4996e240    1234625088
r3             0xfffffffb    4294967291
r4             0x47626688    1197631112
r5             0x1d2    466
r6             0x47626588    1197630856
r7             0x5357c57c    1398261116
r8             0x41d9f3e4    1104802788
r9             0x47626570    1197630832
r10            0x45a67400    1168536576
r11            0x41f55058    1106595928
r12            0x4c1d0ab0    1276971696
sp             0xbe8aed78    0xbe8aed78
lr             0x499fe5dc    1235215836
pc             0x13e4    0x13e4
cpsr           0x600f0010    1611595792

Check around the LR area again:
(gdb) x/12i $lr-32
   0x499fe5bc:    mov    r0, sp
   0x499fe5c0:    str    r4, [sp, #3118288]    ; 0x60
   0x499fe5c4:    mov    r8, #408    ; 0x198
   0x499fe5c8:    str    r8, [r4, #-3118288]    ; 0x2c
   0x499fe5cc:    ldr    r3, [pc, #12638680]    ; 0x499fea40
   0x499fe5d0:    str    r4, [r3]
   0x499fe5d4:    ldr    r8, [pc, #12638680]    ; 0x499fea44
   0x499fe5d8:    blx    r8
   0x499fe5dc:    str    r0, [r4, #3118288]    ; 0x70
   0x499fe5e0:    str    r1, [r4, #3118288]    ; 0x74
   0x499fe5e4:    b    0x499fc264
   0x499fe5e8:    b    0x499fe618

Looking carefully at this instruction:
   0x499fe5d4:    ldr    r8, [pc, #12638680]    ; 0x499fea44

Let's try this calculation by hand. PC is always 8 bytes ahead of the current instruction, so pc=0x499fe5d4 + 8.
Then we add 12638680, and we read from that memory location.

0x499fe5d4 + 8 + 12638680 = 0x4a60bfb4 so:

(gdb) x/x 0x4a60bfb4
0x4a60bfb4:    0x00000000
Hmm, unlikely.

But gdb's annotation said 0x499fea44.

After asking around a bit I've been told that the number inside the square brackets is not to be taken literally. It includes flags and other things. However, the address in the annotation can be trusted.
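
The mismatch can be checked with a little arithmetic. The sketch below assumes ARM-mode PC-relative addressing (PC reads as the instruction address plus 8) and trusts gdb's annotated address; the helper names are made up for illustration.

```python
# Sketch: compare the naive reading of "ldr r8, [pc, #12638680]" with
# the address gdb printed in its annotation (; 0x499fea44).
ARM_PC_BIAS = 8  # in ARM mode, PC reads as current instruction + 8


def literal_address(insn_addr, offset):
    """Address read by an ARM-mode 'ldr rX, [pc, #offset]'."""
    return insn_addr + ARM_PC_BIAS + offset


def implied_offset(insn_addr, annotated_addr):
    """Offset that would make the load hit gdb's annotated address."""
    return annotated_addr - (insn_addr + ARM_PC_BIAS)


insn_addr = 0x499fe5d4
naive = literal_address(insn_addr, 12638680)
real_offset = implied_offset(insn_addr, 0x499fea44)

print(hex(naive))        # 0x4a60bfb4, the address that held 0x00000000
print(hex(real_offset))  # 0x468
```

An offset of 0x468 fits in the 12-bit immediate of this instruction form, whereas 12638680 does not, which is consistent with the printed decimal being a disassembler artifact.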

So, let's check the memory at that address.

(gdb) x/x 0x499fea44
0x499fea44:    0x41d9f3e4
(gdb) x/4i 0x41d9f3e4
   0x41d9f3e4 <cti_op_resolve_global>:    str    lr, [sp, #3118288]    ; 0x40
   0x41d9f3e8 <cti_op_resolve_global+4>:    bl    0x41cf4aac
   0x41d9f3ec <cti_op_resolve_global+8>:    
    ldr    lr, [sp, #3118288]    ; 0x40
   0x41d9f3f0 <cti_op_resolve_global+12>:    mov    pc, lr

And let's also recall that r8 was loaded with this address before we branched. Checking back to the original register dump, r8=0x41d9f3e4, which is the address of cti_op_resolve_global. Things are making some sense.

The offset in the str/ldr lr lines of 3118288 is huge of course. Again I think we have to ignore it. I think the offset being used is 0x40, as shown by the comment.

Now let's think about the value of lr. It got set as 0x499fe5dc because of the "blx r8" that we followed earlier. Now almost immediately inside cti_op_resolve_global we call "bl", which will change the value of lr. However, lr does *not* reflect the return location for the "bl 0x41cf4aac" call. This means either:
 1. We crashed before executing the bl inside cti_op_resolve_global (seems impossible), or
 2. We executed the bl inside cti_op_resolve_global, and then restored lr, and returned. (seems likely)

So let's go back to the code pasted at the top of this comment (around 0x499fe5dc), since that's where we're returning to.

   0x499fe5dc:    str    r0, [r4, #3118288]    ; 0x70
   0x499fe5e0:    str    r1, [r4, #3118288]    ; 0x74

Let's see if we executed those instructions:
r4=0x47626688
r4 + 0x70 = 0x476266f8

(gdb) x/x 0x476266f8
0x476266f8:    0x4996e240

That matches the value of r0.

Looking at the "str r1":

(gdb) x/x 0x476266fc
0x476266fc:    0xfffffffb

That matches the value of r1.
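
The same two checks can be written down as a small script. This is just a sketch over values copied from the register dump and the x/x output above; it uses the offsets from gdb's annotations (0x70 and 0x74), not the huge decimal in the brackets.

```python
# Sketch: check that the two stores after the call are visible in memory.
# All values are copied from the register dump and x/x output above.
regs = {"r0": 0x4996e240, "r1": 0xfffffffb, "r4": 0x47626688}
memory = {0x476266f8: 0x4996e240, 0x476266fc: 0xfffffffb}

# (source register, offset from r4) for the stores at 0x499fe5dc and 0x499fe5e0
stores = [("r0", 0x70), ("r1", 0x74)]

for reg, offset in stores:
    addr = regs["r4"] + offset
    ok = memory.get(addr) == regs[reg]
    print(f"str {reg}, [r4, #{offset:#x}] -> {addr:#x}: {'matches' if ok else 'differs'}")
```

A match is only consistent with the stores having run; the same values could also have been left in memory by an earlier pass through this code.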

So it seems like we have returned and executed these 2 instructions at least. Next is:

   0x499fe5e4:    b    0x499fc264

Let's look:

(gdb) x/12i 0x499fc264
   0x499fc264:    mov    r2, #1
   0x499fc268:    mvn    r7, #0
   0x499fc26c:    ldr    r0, [r4, #3118288]    ; 0x50
   0x499fc270:    ldr    r1, [r4, #3118288]    ; 0x54
   0x499fc274:    cmn    r7, #1
   0x499fc278:    bne    0x499fe5e8
   0x499fc27c:    cmn    r1, #5
   0x499fc280:    bne    0x499fe5e8
   0x499fc284:    ldr    r8, [r0]
   0x499fc288:    ldr    r3, [pc, #28956432]    ; 0x499fca44
   0x499fc28c:    cmp    r8, r3
   0x499fc290:    bne    0x499fe5ec

This code looks odd.
Seems to set r7 to a fixed value and then compare its value against 1?
Looking at register values and memory I'm having trouble convincing myself that this code has run, but it might have.
Anyway, out of time for today unfortunately.
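
One possible reading of the odd-looking sequence: "mvn r7, #0" loads 0xffffffff, and cmn is compare-negative, so "cmn r7, #1" tests r7 against -1, not 1. The sketch below models just those two instructions with 32-bit arithmetic. What the JIT intends by these constants (for example, type tags) is a guess and is not confirmed anywhere in this report.

```python
# Sketch of the flag logic in the code at 0x499fc264 (32-bit ARM semantics).
MASK = 0xFFFFFFFF


def mvn(imm):
    """MVN rd, #imm: rd = bitwise NOT of imm."""
    return ~imm & MASK


def cmn_sets_z(reg, imm):
    """CMN reg, #imm sets Z when reg + imm == 0 (mod 2**32), i.e. reg == -imm."""
    return ((reg + imm) & MASK) == 0


r7 = mvn(0)                       # mvn r7, #0 -> 0xffffffff (-1)
print(hex(r7))                    # 0xffffffff
print(cmn_sets_z(r7, 1))          # True: "bne 0x499fe5e8" falls through
print(cmn_sets_z(0xfffffffb, 5))  # True: cmn r1, #5 passes when r1 == 0xfffffffb (-5)
```

Note that 0xfffffffb is also the value seen in r1 in the register dump above, so the second check would pass if r1 were reloaded with that value.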


I tried the test webpage that you provided. It doesn't trigger the crash.
Also after a few runs I haven't managed to reproduce the problem where the stub offset is off by 8. Maybe that one was a bad dump.
------- Comment #14 From 2012-05-01 19:37:54 PST -------
I don't know exactly what's going on here, but in my experience this kind of crash, related to lr and pc values, can occur when the cache flush was not run on the requested range.
------- Comment #15 From 2012-05-02 01:39:23 PST -------
> Now let's think about the value of lr. It got set as 0x499fe5dc because of the "blx r8" that we followed earlier. Now almost immediately inside cti_op_resolve_global we call "bl", which will change the value of lr. However, lr does *not* reflect the return location for the "bl 0x41cf4aac" call. This means either:
>  1. We crashed before executing the bl inside cti_op_resolve_global (seems impossible), or
>  2. We executed the bl inside cti_op_resolve_global, and then restored lr, and returned. (seems likely)

This makes sense, and there is a third option. The purpose of such stub code is to allow returning to anywhere in the JIT code, which is mainly used by exception handlers. The return address is stored on the stack (as on x86) and can be changed (like a buffer overflow attack, but intentional here), so the C++ function can return anywhere, including a catch handler.

So we have two new options:
1) A wrong handler was set
2) Something overwrites the return value

Would be good to know if an exception occurs just before the return...
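
To make the mechanism concrete, here is a toy model of such a stub. It is a sketch, not WebKit code: the stack is a plain dict, the slot offset is taken from the gdb annotations (the macro quoted in comment #12 uses 0x38), and every address other than 0x499fe5dc is hypothetical.

```python
# Toy model of a cti_* stub: the return address lives in a stack slot that
# the C++ function is allowed to rewrite (e.g. to redirect into a handler).
RETURN_SLOT = 0x40  # slot offset as shown in the gdb annotations


def run_stub(lr, cxx_function):
    """Return the address the stub finally jumps to."""
    stack = {}
    stack[RETURN_SLOT] = lr   # str lr, [sp, #slot]
    cxx_function(stack)       # bl  JITStubThunked_<op>
    lr = stack[RETURN_SLOT]   # ldr lr, [sp, #slot]
    return lr                 # mov pc, lr


def normal_call(stack):
    pass  # leaves the return slot untouched


def throwing_call(stack):
    stack[RETURN_SLOT] = 0x49a00000  # hypothetical catch-handler address


def corrupting_call(stack):
    stack[RETURN_SLOT] = 0xdeadbeef  # hypothetical bogus value


print(hex(run_stub(0x499fe5dc, normal_call)))      # 0x499fe5dc
print(hex(run_stub(0x499fe5dc, throwing_call)))    # 0x49a00000
print(hex(run_stub(0x499fe5dc, corrupting_call)))  # 0xdeadbeef
```

The second and third cases correspond to the two options above: a handler address written on purpose, or the slot overwritten by something else.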

Perhaps the following code also crashes:

try {
  var a = "a";
  a++;
} catch(e) { }
------- Comment #16 From 2012-05-02 09:49:09 PST -------
(In reply to comment #15)
> Would be good to know if an exception occurs just before the return...

How can I check this?

> Perhaps the following code also crashes:
> 
> try {
>   var a = "a";
>   a++;
> } catch(e) { }

No crash, unfortunately.

Just FYI, I have a feeling that finding a simplistic test case will be difficult. Sometimes when the crash happens, I go back to the same page and it loads just fine without crashing. gmail seems to cause the crash every time, but sometimes it takes a good few seconds longer than normal before the crash happens.

Also, when I run epiphany under gdb, the crash is very hard to reproduce, even on gmail. (That's why I've been mostly working with core dumps.)
------- Comment #17 From 2012-05-14 13:53:17 PST -------
This has also been reproduced on a trimslice (also running as armv7hl). We'll disable the ARM JIT in Fedora for the time being in order to avoid this crash.

If anyone with the right experience is interested in working on this issue, we can ship hardware. Send me an email if interested.
------- Comment #18 From 2012-08-09 05:34:11 PST -------
The workaround we were using to disable the ARM JIT in Fedora does not work anymore because the Heap code now requires JIT to be enabled (building 1.9.5). So the situation is a bit chicken and egg now. Has there been any progress on the original issue?