Bug 115705

Summary: Implement a probe mechanism for JIT generated code
Product: WebKit
Reporter: Mark Lam <mark.lam>
Component: JavaScriptCore
Assignee: Mark Lam <mark.lam>
Status: RESOLVED FIXED
Severity: Normal
CC: fpizlo, ggaren, mhahnenberg, msaboff, oliver
Priority: P2
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified

Mark Lam
Reported 2013-05-06 20:29:01 PDT
It would be useful if we could embed a call to a C function (almost) anywhere in JIT generated code so that we can:
1. do printf debugging from JIT generated code
2. set breakpoints in JIT generated code using gdb

The probe should save all CPU state (mainly registers) on entry and restore it on exit. Apart from taking up some space, the insertion of the probe should effectively be invisible to the JIT generated code around it. That said, the requirement of non-interference only applies to the probe mechanism itself. The ProbeFunction that is called back by the probe can do whatever it wants as needed. We only try to guarantee that the probe mechanism itself does not introduce any unexpected side effects.

Note: The current implementation only supports X86 and X86_64. Support for other CPUs can be added later. Also, the current probe mechanism only preserves general purpose registers and floating point registers. It does not preserve condition code / status / FLAGS registers yet. As a result, the user should not insert probes between JIT instructions that have a dependency on condition codes. For the most part, this limitation does not prevent the probe mechanism from being useful in many ways.

Here's an example of a dump of an X86_64 ProbeContext captured from a probe inserted in the DFG Phantom node:

ProbeContext 0x7fff52f3d8e0 {
  probeFunction: 0x10d0d8df0
  arg1: 0x10dd4821b 4526998043
  arg2: 0x0 0
  jitStackFrame: 0x7fff52f3da20
  cpu: {
    eax: 0x00000001130ad850 4614445136
    ecx: 0x0000000112f7fd70 4613209456
    edx: 0x0000000000000002 2
    ebx: 0x000000000000000a 10
    esp: 0x00007fff52f3da20 140734585100832
    ebp: 0x00007fff52f3dac0 140734585100992
    esi: 0x0000000000000004 4
    edi: 0x000000000000010b 267
    r8: 0x00007fff52f3d224 140734585098788
    r9: 0x00007fff52f3d300 140734585099008
    r10: 0x000000006aef1ba1 1794055073
    r11: 0x0000000000000000 0
    r12: 0x0000000000000200 512
    r13: 0x0000000112ad0128 4608295208
    r14: 0xffff000000000000 -281474976710656
    r15: 0xffff000000000002 -281474976710654
    eip: 0x00004fca4be01873 87730274965619
    xmm0: 00000000 00000000 408f4000 447a0000 0 1000
    xmm1: 00000000 00000000 408f4000 00000000 0 1000
    xmm2: 00000000 00000000 40919c08 7ec61929 0 1127.01
    xmm3: 41d00000 00000000 3ff00007 b803b946 1.07374e+09 1.00001
    xmm4: 00000000 00000000 40380628 cbd1244a 0 24.0241
    xmm5: 81d8b849 368b4800 007f87ab 4481d0be -9.22806e-300 2.80626e-306
    xmm6: 87ab4481 e0b94900 8b4d0000 7f87ab44 -1.00809e-271 -3.09024e-254
    xmm7: 00007f87 ab4481e8 ba49098b 4d00007f 6.92782e-310 -6.3203e-28
  }
}

Patch coming soon.
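[Editor's sketch] To make the intended usage concrete, here is a minimal sketch (not code from the patch) of a printf-style probe. The probe(function, arg1, arg2) signature and the ProbeContext field names are taken from the patch as quoted later in this bug; the assumption that a ProbeFunction receives a ProbeContext*, and the include/namespace details, are mine.

    #include "MacroAssembler.h" // JSC header; assumed include path for this sketch
    #include <cstdio>

    using JSC::MacroAssembler;

    // Hypothetical probe callback: dumps one register and the first argument.
    // The probe trampoline saved all CPU state into 'context' before calling us
    // and restores it after we return, so the surrounding JIT code is unaffected.
    static void dumpEaxProbe(MacroAssembler::ProbeContext* context)
    {
        const char* tag = static_cast<const char*>(context->arg1);
        // C-style cast keeps the sketch agnostic to whether the register field
        // is declared as an integer or a pointer type in the patch.
        printf("%s: eax = 0x%llx\n", tag, (unsigned long long)context->cpu.eax);
    }

    // Somewhere in a JIT code generator (sketch):
    static void emitWithProbe(MacroAssembler& jit)
    {
        // ... emit some instructions ...
        // Insert the probe; apart from code size (and the condition-code caveat
        // noted above), it is invisible to the surrounding JIT code.
        jit.probe(dumpEaxProbe, const_cast<char*>("after add"), 0);
        // ... emit more instructions ...
    }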
Attachments
the patch. (42.26 KB, patch)
2013-05-06 21:03 PDT, Mark Lam
msaboff: review-
new and improved after addressing Michael's feedback. (34.13 KB, patch)
2013-05-07 18:15 PDT, Mark Lam
msaboff: review+
revised patch. (38.19 KB, patch)
2013-05-15 15:55 PDT, Mark Lam
ggaren: review+
Mark Lam
Comment 1 2013-05-06 21:03:36 PDT
Created attachment 200869 [details] the patch.
Michael Saboff
Comment 2 2013-05-07 09:34:16 PDT
Comment on attachment 200869 [details]
the patch.

View in context: https://bugs.webkit.org/attachment.cgi?id=200869&action=review

I think that the space allocated for saving registers in the cpu struct is overly complex. I suggest that you directly lay out the registers using properly aligned C/C++ types and let the compiler do the work. For example, the xmm registers can have the type __m128, which will be aligned. Further, I would then use offsetof() to get the offset within the cpu register save structure instead of just using it in a COMPILE_ASSERT, i.e. PROBE_CPU_EAX_OFFSET becomes offsetof(struct ASMProbeContext, cpu.eax). This should eliminate much of the FOR_EACH_ macros and the COMPILE_ASSERTs.

> Source/JavaScriptCore/assembler/ASMProbeContext.h:63
> +    struct {
> +        #define DECLARE_REGISTER(_type, _regName) \
> +            _type _regName;
> +        FOR_EACH_CPU_REGISTER(DECLARE_REGISTER)
> +        #undef DECLARE_REGISTER
> +    } cpu;

I'd prefer it if there was a CPUState struct defined in each MacroAssembler and then used here.

> Source/JavaScriptCore/assembler/MacroAssembler.cpp:107
> +    // Pack the space to allocate so that the stack pointer will remain
> +    // 32 byte aligned.
> +    const int packedCPUContextSize = WTF_PACK(sizeof(ASMProbeContext), 5);

Why do we need 32 byte alignment?

> Source/JavaScriptCore/assembler/MacroAssembler.cpp:115
> +    // Save the original sp as the farme pointer:

Typo: "frame"

> Source/JavaScriptCore/assembler/X86Assembler.h:81
> +    struct XMMRegisterValue {
> +        uint32_t u0;
> +        uint32_t u1;
> +        uint32_t u2;
> +        uint32_t u3;
> +    };

Use __m128 instead.

> Source/JavaScriptCore/assembler/X86Assembler.h:88
> +    // We pad some space after the GPRegs storage because we want the FPRegs
> +    // storage to start on a 16 byte (128 bit) alignment.
> +    #define FOR_EACH_CPU_REGISTER(V) \
> +        FOR_EACH_CPU_GPREGISTER(V) \
> +        FOR_EACH_CPU_PADDING(V) \
> +        FOR_EACH_CPU_FPREGISTER(V)

Use compiler types that are already padded to eliminate explicit padding.

> Source/JavaScriptCore/jit/JITStubs.cpp:95
> +#define PROBE_PROBE_FUNCTION_OFFSET (0 * PTR_SIZE)
> +#define PROBE_ARG1_OFFSET (1 * PTR_SIZE)
> +#define PROBE_ARG2_OFFSET (2 * PTR_SIZE)
> +#define PROBE_JIT_STACK_FRAME_OFFSET (3 * PTR_SIZE)

Use offsetof() for these.

> Source/JavaScriptCore/jit/JITStubs.cpp:104
> +#define PROBE_CPU_EAX_OFFSET (4 * PTR_SIZE)
> +#define PROBE_CPU_ECX_OFFSET (5 * PTR_SIZE)
> +#define PROBE_CPU_EDX_OFFSET (6 * PTR_SIZE)
> +#define PROBE_CPU_EBX_OFFSET (7 * PTR_SIZE)
> +#define PROBE_CPU_ESP_OFFSET (8 * PTR_SIZE)
> +#define PROBE_CPU_EBP_OFFSET (9 * PTR_SIZE)
> +#define PROBE_CPU_ESI_OFFSET (10 * PTR_SIZE)
> +#define PROBE_CPU_EDI_OFFSET (11 * PTR_SIZE)

Ditto.

> Source/JavaScriptCore/jit/JITStubs.cpp:122
> +#define PROBE_CPU_R8_OFFSET (12 * PTR_SIZE)
> +#define PROBE_CPU_R9_OFFSET (13 * PTR_SIZE)
> +#define PROBE_CPU_R10_OFFSET (14 * PTR_SIZE)
> +#define PROBE_CPU_R11_OFFSET (15 * PTR_SIZE)
> +#define PROBE_CPU_R12_OFFSET (16 * PTR_SIZE)
> +#define PROBE_CPU_R13_OFFSET (17 * PTR_SIZE)
> +#define PROBE_CPU_R14_OFFSET (18 * PTR_SIZE)
> +#define PROBE_CPU_R15_OFFSET (19 * PTR_SIZE)
> +#define PROBE_CPU_EIP_OFFSET (20 * PTR_SIZE)
> +#define PROBE_FIRST_XMM_OFFSET (22 * PTR_SIZE) // After padding.

Ditto.

> Source/JavaScriptCore/jit/JITStubs.cpp:134
> +#define PROBE_CPU_XMM0_OFFSET (PROBE_FIRST_XMM_OFFSET + (0 * XMM_SIZE))
> +#define PROBE_CPU_XMM1_OFFSET (PROBE_FIRST_XMM_OFFSET + (1 * XMM_SIZE))
> +#define PROBE_CPU_XMM2_OFFSET (PROBE_FIRST_XMM_OFFSET + (2 * XMM_SIZE))
> +#define PROBE_CPU_XMM3_OFFSET (PROBE_FIRST_XMM_OFFSET + (3 * XMM_SIZE))
> +#define PROBE_CPU_XMM4_OFFSET (PROBE_FIRST_XMM_OFFSET + (4 * XMM_SIZE))
> +#define PROBE_CPU_XMM5_OFFSET (PROBE_FIRST_XMM_OFFSET + (5 * XMM_SIZE))
> +#define PROBE_CPU_XMM6_OFFSET (PROBE_FIRST_XMM_OFFSET + (6 * XMM_SIZE))
> +#define PROBE_CPU_XMM7_OFFSET (PROBE_FIRST_XMM_OFFSET + (7 * XMM_SIZE))

Ditto.

> Source/WTF/wtf/Alignment.h:55
> +    // That means maskBots is 0s followed by N 1s (as expected). We can use

Typo: *maskBits*
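[Editor's sketch] As an illustration of the layout Michael is suggesting (hypothetical names, not the code from the patch): a register-save struct built from naturally aligned C/C++ types lets the compiler insert the padding and lets offsetof() compute every slot offset. The member names simply mirror the registers in the dump above; r8-r15 are omitted for brevity.

    #include <xmmintrin.h> // __m128
    #include <cstddef>     // offsetof
    #include <cstdint>

    // Sketch only: naturally aligned types, no hand-written padding.
    struct CPUState {
        uintptr_t eax, ecx, edx, ebx, esp, ebp, esi, edi; // r8-r15 omitted here
        uintptr_t eip;
        // __m128 is 16-byte aligned, so the compiler pads before xmm0 as needed.
        __m128 xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
    };

    struct ProbeContextSketch {
        void* probeFunction;
        void* arg1;
        void* arg2;
        void* jitStackFrame;
        CPUState cpu;
    };

    // The compiler now knows every offset; no hand-counted (N * PTR_SIZE) constants.
    static const size_t eaxOffset = offsetof(ProbeContextSketch, cpu.eax);
    static const size_t xmm0Offset = offsetof(ProbeContextSketch, cpu.xmm0);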
Mark Lam
Comment 3 2013-05-07 18:04:43 PDT
(In reply to comment #2)
> (From update of attachment 200869 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=200869&action=review
>
> I think that the space allocated for saving registers in the cpu struct is overly complex. I suggest that you directly lay out the registers using properly aligned C/C++ types and let the compiler do the work. For example, the xmm registers can have the type __m128, which will be aligned. Further, I would then use offsetof() to get the offset within the cpu register save structure instead of just using it in a COMPILE_ASSERT, i.e. PROBE_CPU_EAX_OFFSET becomes offsetof(struct ASMProbeContext, cpu.eax). This should eliminate much of the FOR_EACH_ macros and the COMPILE_ASSERTs.

Done. But as discussed offline, I can't define the PROBE_..._OFFSET #defines as offsetof() expressions. The C preprocessor does not evaluate the expression into a number before pasting it into the inline asm as a string. Hence, we'll have to stick with the COMPILE_ASSERTs to verify that the PROBE_..._OFFSETs are valid.

> > Source/JavaScriptCore/assembler/ASMProbeContext.h:63
> > +    struct {
> > +        #define DECLARE_REGISTER(_type, _regName) \
> > +            _type _regName;
> > +        FOR_EACH_CPU_REGISTER(DECLARE_REGISTER)
> > +        #undef DECLARE_REGISTER
> > +    } cpu;
>
> I'd prefer it if there was a CPUState struct defined in each MacroAssembler and then used here.

Done. I also moved the ProbeContext back into the MacroAssembler as an inner struct. I went back and forth between these 2 implementations, but I think having it as an inner class more clearly communicates its relationship to the MacroAssembler::probe() function.

> > Source/JavaScriptCore/assembler/MacroAssembler.cpp:107
> > +    // Pack the space to allocate so that the stack pointer will remain
> > +    // 32 byte aligned.
> > +    const int packedCPUContextSize = WTF_PACK(sizeof(ASMProbeContext), 5);
>
> Why do we need 32 byte alignment?

The AMD64 ABI spec says that if anyone passes __m256 values on the stack, then the stack needs to be 32 byte aligned. Otherwise, it needs to be 16 byte aligned. I'm being conservative. Accordingly, I adjusted a stack bump in ctiMasmProbeTrampoline to keep this 32 byte alignment.

> > Source/JavaScriptCore/assembler/MacroAssembler.cpp:115
> > +    // Save the original sp as the farme pointer:
>
> Typo: "frame"

Fixed.

> > Source/WTF/wtf/Alignment.h:55
> > +    // That means maskBots is 0s followed by N 1s (as expected). We can use
>
> Typo: *maskBits*

Fixed.

New patch coming soon.
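[Editor's sketch] A minimal sketch of the workaround Mark describes, assuming a 64-bit build and restating a trimmed version of the hypothetical layout from the earlier sketch (names are illustrative, not the exact code in the patch): the numeric offsets are still written out by hand so they can be stringized into the inline asm, and compile-time asserts keep them honest against offsetof().

    #include <cstddef> // offsetof
    #include <cstdint>

    // Hypothetical layout, same shape as before (only GP registers shown).
    struct CPUStateLayout {
        uintptr_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    };
    struct ProbeContextLayout {
        void* probeFunction;
        void* arg1;
        void* arg2;
        void* jitStackFrame;
        CPUStateLayout cpu;
    };

    // The inline asm in the trampoline needs a plain numeric expression the
    // preprocessor can paste into a string, so the offset is hand-written...
    #define PTR_SIZE 8 // sketch assumes X86_64
    #define PROBE_CPU_EAX_OFFSET (4 * PTR_SIZE)

    // ...and a compile-time assert checks it against the real struct layout.
    // (The patch uses WTF's COMPILE_ASSERT; static_assert keeps this sketch standalone.)
    static_assert(PROBE_CPU_EAX_OFFSET == offsetof(ProbeContextLayout, cpu.eax),
                  "PROBE_CPU_EAX_OFFSET must match the ProbeContext layout");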
Mark Lam
Comment 4 2013-05-07 18:15:06 PDT
Created attachment 201009 [details]
new and improved after addressing Michael's feedback.

Also made the ProbeContext::dump() format a little prettier. Here's a sample of the new dump:

ProbeContext 0x7fff531db8e0 {
  probeFunction: 0x10cc62320
  arg1: 0x10daa921b 4524249627
  arg2: 0x0 0
  jitStackFrame: 0x7fff531dba20
  cpu: {
    eax: 0x0000000112e8d850 4612216912
    ecx: 0x0000000112d5fd70 4610981232
    edx: 0x0000000000000002 2
    ebx: 0x000000000000000a 10
    esp: 0x00007fff531dba20 140734587845152
    ebp: 0x00007fff531dbac0 140734587845312
    esi: 0x0000000000000004 4
    edi: 0x000000000000010b 267
    r8: 0x00007fff531db224 140734587843108
    r9: 0x00007fff531db300 140734587843328
    r10: 0x000000006aef1ba1 1794055073
    r11: 0x0000000000000000 0
    r12: 0x0000000000000200 512
    r13: 0x00000001128b0128 4606066984
    r14: 0xffff000000000000 -281474976710656
    r15: 0xffff000000000002 -281474976710654
    eip: 0x0000368a34401873 59967210002547
    xmm0: 0x0000000000000000 0x408f4000447a0000 0 1000
    xmm1: 0x0000000000000000 0x408f400000000000 0 1000
    xmm2: 0x0000000000000000 0x40919c087ec61929 0 1127.01
    xmm3: 0x41d0000000000000 0x3ff00007b803b946 1.07374e+09 1.00001
    xmm4: 0x0000000000000000 0x40380628cbd1244a 0 24.0241
    xmm5: 0xeea8b849368b4800 0x007f9e5964eea0be -1.14375e+225 2.81414e-306
    xmm6: 0x9e5964eeb0b94900 0x8b4d00007f9e5964 -1.76392e-162 -3.09024e-254
    xmm7: 0x00007f9e5964eeb8 0xba49098b4d00007f 6.93263e-310 -6.3203e-28
  }
}
Michael Saboff
Comment 5 2013-05-08 14:19:36 PDT
Comment on attachment 201009 [details]
new and improved after addressing Michael's feedback.

View in context: https://bugs.webkit.org/attachment.cgi?id=201009&action=review

r+ with comments.

> Source/JavaScriptCore/ChangeLog:16
> +    the duration that the ProbeFunction is executing. It will be popped of

Typo at end of line: *off*

> Source/JavaScriptCore/ChangeLog:25
> +    This changeset only implements the probe mechanism for X86, and X86_64.

Eliminate the comma after X86.

> Source/JavaScriptCore/jit/JITStubs.cpp:243
> +    "popl %eax" "\n"

Instead of popping and then pushing later, why don't we just read the return pc? Or is there a reason that we want to clear the pc off the stack?

> Source/JavaScriptCore/jit/JITStubs.cpp:554
> +    "popq %rax" "\n"

Same question as above.

> Source/WTF/wtf/Alignment.h:47
> +// WTF_PACK(0, 3); // yields 0.
> +// WTF_PACK(5, 3); // yields 8.
> +// WTF_PACK(8, 3); // yields 8.
> +// WTF_PACK(23, 3); // yields 24 i.e. 3 x 8.
> +// WTF_PACK(24, 3); // yields 24.
> +//
> +// // Pack to 16 bytes i.e. 4 bits of alignment.
> +// WTF_PACK(0, 4); // yields 0.
> +// WTF_PACK(5, 4); // yields 16.
> +// WTF_PACK(8, 4); // yields 16.
> +// WTF_PACK(23, 4); // yields 32 i.e. 2 x 16.
> +// WTF_PACK(24, 4); // yields 32.

Do we need all of these examples?

> Source/WTF/wtf/Alignment.h:96
> +    // Note that in binary, alignSize is always a 1 followed by N 0s.
> +    // That means maskBits is 0s followed by N 1s (as expected). We can use
> +    // this mask to mask off the low bits of a given size to produce the
> +    // nearest multiple of alignSize below that size:
> +    //
> +    //     roundDown(size) { return size & maskBits; }
> +    //
> +    // If roundDown(size) yields the aligned size below size, then
> +    // roundDown(size + alignSize) should yield the aligned size above size.
> +    // We can use this to implement our roundUp() functionality except that
> +    // it doesn't work in one case:
> +    //
> +    // If size was already sligned to begin with, then
> +    // roundDown(size + alignSize) will produce (size + alignSize). For
> +    // roundUp(size), we want it to produce size instead.
> +    //
> +    // To remedy this, instead of adding alignSize, we add (alignSize - 1).
> +    // We'll call this value maxPadSize. Coincidently, this is the same value
> +    // as maskBits. Hence,
> +    //
> +    // If size was already aligned to begin with, then
> +    // roundDown(size + maxPadSize) will produce size as expected.
> +    //
> +    // If size is not aligned, then size > roundDown(size).
> +    // Let deltaSize = size - roundDown(size). Hence,
> +    //     roundDown(size + maxPadSize)
> +    //     ==> roundDown(roundDown(size) + deltaSize + maxPadSize)
> +    //     ==> roundDown(size) + roundDown(deltaSize + maxPadSize)
> +    //
> +    // But since maxPadSize = alignSize - 1, then
> +    // ((deltaSize + maxPadsize) > alignSize) since
> +    // deltaSize > 1 by definition.
> +    //
> +    // Let remainderSize = (deltaSize + maxPadsize) - alignSize.
> +    // Note: We're guaranteed that remainderSize < alignSize.
> +    //
> +    // Hence, (alignSize + remainderSize) == (deltaSize + maxPadsize).
> +    // So, continuing the breakdown of roundDown(size + maxPadSize) ...
> +    //
> +    //     ==> roundDown(size) + roundDown(alignSize + remainderSize)
> +    //     ==> roundDown(size) + roundDown(alignSize) + roundDown(remainderSize)
> +    //     ==> roundDown(size) + alignSize + 0
> +    //     ==> roundUp(size) when size if not aligned.

Seems like a long comment. Rounding up to the next power of 2 shouldn't require anywhere near this many comments.
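[Editor's sketch] For reference, the mask-based rounding that the long comment above is describing boils down to a couple of lines. This sketch uses hypothetical names (it is not the WTF_PACK code itself) and assumes alignSize is a power of two.

    #include <cassert>
    #include <cstddef>

    // Round size up to the nearest multiple of alignSize (a power of two).
    inline size_t roundUpToAlignment(size_t size, size_t alignSize)
    {
        assert(alignSize && !(alignSize & (alignSize - 1))); // power of two
        size_t maskBits = alignSize - 1;      // low N bits set
        return (size + maskBits) & ~maskBits; // add max padding, then clear low bits
    }

    // Examples matching the quoted WTF_PACK table (8-byte alignment = 3 bits):
    //   roundUpToAlignment(5, 8)  == 8
    //   roundUpToAlignment(23, 8) == 24
    //   roundUpToAlignment(24, 8) == 24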
Mark Lam
Comment 6 2013-05-15 15:22:12 PDT
I have new offline feedback that will improve the patch. I will upload a new patch shortly.

(In reply to comment #5)
> > Source/JavaScriptCore/ChangeLog:16
> > +    the duration that the ProbeFunction is executing. It will be popped of
>
> Typo at end of line: *off*

Will fix.

> > Source/JavaScriptCore/ChangeLog:25
> > +    This changeset only implements the probe mechanism for X86, and X86_64.
>
> Eliminate the comma after X86.

Will fix.

> > Source/JavaScriptCore/jit/JITStubs.cpp:243
> > +    "popl %eax" "\n"
>
> Instead of popping and then pushing later, why don't we just read the return pc? Or is there a reason that we want to clear the pc off the stack?

2 reasons:
1. This simplifies how I can get at the ProbeContext pointer (which is the esp after I pop the return address).
2. The user probe function may choose to change the value of the registers in the ProbeContext's CPUState for debugging / testing purposes. This includes the return address. For that reason, I will need to push the potentially new return address anyway. So, I might as well pop it at the start of the trampoline.

> > Source/WTF/wtf/Alignment.h:47

WTF_PACK() (and wtfPack()) is not needed. Will switch to using the pre-existing roundUpToMultipleOf() function in StdLibExtras.h. So, these changes will be backed out.
Mark Lam
Comment 7 2013-05-15 15:55:33 PDT
Created attachment 201892 [details] revised patch.
Geoffrey Garen
Comment 8 2013-05-15 17:32:18 PDT
Comment on attachment 201892 [details]
revised patch.

View in context: https://bugs.webkit.org/attachment.cgi?id=201892&action=review

r=me

> Source/JavaScriptCore/assembler/MacroAssembler.cpp:108
> +// Specifcally, the saved rsp/esp will point to the stack position after we pop
> +// the ProbeContext frame. The saved rip/eip will point to the address of the

A little clearer as "...stack position before we push...". The value after we pop is not known until the probe runs.

> Source/JavaScriptCore/assembler/MacroAssembler.cpp:111
> +void MacroAssembler::probe(MacroAssembler::ProbeFunction function, void* arg1, void* arg2)

This should move to MacroAssemblerX86Common. It's not so nice to put tons of #ifdefs into the shared MacroAssembler.

> Source/JavaScriptCore/assembler/MacroAssembler.cpp:114
> +    #define ProbeContextField(field) Address(esp, offsetof(ProbeContext, field))

The coding style guidelines say this should start with lower case.

> Source/JavaScriptCore/assembler/MacroAssembler.cpp:116
> +    // The X86_64 ABI specifies that the worse case STACK alignment requirement

Typo: should be "stack".

> Source/JavaScriptCore/assembler/MacroAssembler.cpp:153
> +NO_RETURN_DUE_TO_ASSERT
> +void MacroAssembler::ProbeContext::dumpCPURegisters(const char* indentation)
> +{
> +    UNUSED_PARAM(indentation);
> +    ASSERT_NOT_REACHED();
> +}
> +
> +NO_RETURN_DUE_TO_ASSERT
> +void MacroAssembler::probe(MacroAssembler::ProbeFunction function, void* arg1, void* arg2)
> +{
> +    UNUSED_PARAM(function);
> +    UNUSED_PARAM(arg1);
> +    UNUSED_PARAM(arg2);
> +    ASSERT_NOT_REACHED();
> +}

I don't think these stubs add anything. Platforms that don't support probing won't build when probing is enabled.

> Source/JavaScriptCore/assembler/X86Assembler.h:36
> +#include <xmmintrin.h>

Let's #if this to avoid build problems.

> Source/JavaScriptCore/jit/JITStubsX86.h:142
> +    // function may have intentionally changed this values for debugging or

Typo: should be "these values".

> Source/JavaScriptCore/jit/JITStubsX86.h:143
> +    // "Restore" the register values for returning. Note: the user probe
> +    // function may have intentionally changed this values for debugging or
> +    // testing purposes.

Better to talk about our own API, rather than what other code might do. Something like: "To enable probes to modify register state, we copy all registers out of the ProbeContext before returning."

> Source/JavaScriptCore/jit/JITStubsX86.h:167
> +    // Restore the return address for the ret below:
> +    "pushl " STRINGIZE_VALUE_OF(PROBE_CPU_EIP_OFFSET) "(%ebp)" "\n"
> +
> +    // Everything's restored. Lastly, restore the %ebp, and return:
> +    "movl " STRINGIZE_VALUE_OF(PROBE_CPU_EBP_OFFSET) "(%ebp), %ebp" "\n"
> +    "ret" "\n"

These seem like "what" comments.

> Source/JavaScriptCore/jit/JITStubsX86Common.h:41
> +// The following are offsets for MacroAssembler::ProbeContext fields accessed
> +// the ctiMasmProbeTrampoline stub.

Typo: should be "accessed by the".

> Source/JavaScriptCore/jit/JITStubsX86_64.h:136
> +    "popq %rax" "\n"
> +    "movq %rax, " STRINGIZE_VALUE_OF(PROBE_CPU_EIP_OFFSET) "(%rsp)" "\n"
> +
> +    "movl %rbx, " STRINGIZE_VALUE_OF(PROBE_CPU_EBX_OFFSET) "(%rsp)" "\n"
> +    "movl %rsp, %rbp" "\n" // Save the ProbeContext*.
> +
> +    "movq %rcx, " STRINGIZE_VALUE_OF(PROBE_CPU_ECX_OFFSET) "(%rbp)" "\n"
> +    "movq %rdx, " STRINGIZE_VALUE_OF(PROBE_CPU_EDX_OFFSET) "(%rbp)" "\n"
> +    "movq %rbx, " STRINGIZE_VALUE_OF(PROBE_CPU_EBX_OFFSET) "(%rbp)" "\n"
> +    "movq %rsi, " STRINGIZE_VALUE_OF(PROBE_CPU_ESI_OFFSET) "(%rbp)" "\n"
> +    "movq %rdi, " STRINGIZE_VALUE_OF(PROBE_CPU_EDI_OFFSET) "(%rbp)" "\n"

This stuff is super hard to read. I think it would be better, in future, for us to move away from this kind of assembly, and more toward the technique used in stringLengthTrampolineGenerator() and similar bits of assembly.
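[Editor's sketch] The STRINGIZE_VALUE_OF pattern in the quoted trampoline is the same two-level stringizing trick that forced Mark to keep hand-written numeric offsets (see comment 3): the preprocessor expands the macro argument first, then pastes it into the asm string as text. The offset value below is hypothetical; the macros mirror the ones in WTF but are redefined here so the sketch stands alone.

    #include <cstdio>

    // Two-level stringizing: the outer macro forces expansion of the argument
    // before the inner macro turns it into a string literal.
    #define STRINGIZE(exp) #exp
    #define STRINGIZE_VALUE_OF(exp) STRINGIZE(exp)

    // Must expand to a plain arithmetic expression of literals; offsetof()
    // would not be evaluated by the preprocessor, so it cannot be used here.
    #define PROBE_CPU_EIP_OFFSET (20 * 8)

    int main()
    {
        // String-literal concatenation then yields the operand the assembler sees.
        // Prints: pushq (20 * 8)(%rbp)
        puts("pushq " STRINGIZE_VALUE_OF(PROBE_CPU_EIP_OFFSET) "(%rbp)");
        return 0;
    }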
Mark Lam
Comment 9 2013-05-16 10:55:49 PDT
Thanks for the review. Geoff's feedback has been applied. The patch landed in r150186: <http://trac.webkit.org/changeset/150186>.