Bug 24986

Summary: [multi-patch] ARM JIT port
Product: WebKit Reporter: Gabor Loki <loki>
Component: JavaScriptCoreAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Enhancement CC: akiss, barraclough, ddkilzer, emacemac7, gyuyoung, hausmann, ismail, kbalazs, kenneth, laszlo.gombos, mh+webkit, mike, mrowe, oliver, ossy, sriram.neelakandan, tonikitoo, yael, yong.li.webkit, zherczeg, zoltan, zuh
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: All   
Attachments:
Description Flags
comparison between interpreter and MacroAssembler based ARM JIT
none
comparison between MacroAssembler based and from the scratch ARM JIT
none
comparison between interpreter and MacroAssembler based ARM JIT+YARR
none
Updated version of generic ARM JIT port
barraclough: review-
Move Segmented vector to wtf sub-directory
none
Only move Segmented vector to wtf sub-directory
none
Add an Iterator to SegmentedVector
none
Reorganize ARM architecture specific macros
none
Remove unnecessary references to AssemblerBuffer
none
Constant pool for AssemblerBuffer
none
Add YARR support for generic ARM platforms
none
Constant pool for AssemblerBuffer (v2)
barraclough: review-
Add YARR support for generic ARM platforms (v2)
barraclough: review-
Constant pool for AssemblerBuffer (v3)
none
Add YARR support for generic ARM platforms (v3)
none
Add generic ARM-JIT support using ARMAssembler
none
Floating point support for ARM. Looks like we double the number of patches each day :)
none
Add generic ARM-JIT support using ARMAssembler v2
none
Floating point support for ARM v2
none
Add optimize call support for ARM JIT
none
Add property access support for ARM JIT
none
Add generic ARM-JIT support using ARMAssembler v3
none
Floating point support for ARM v3
none
Add optimize call support for ARM JIT v3
barraclough: review-
Add property access support for ARM JIT v3
barraclough: review-
a
none
Add optimize call and property access support for ARM JIT
none
Add optimize call and property access support for ARM JIT v2 barraclough: review+

Description Gabor Loki 2009-04-01 06:55:58 PDT
The ARM JIT port can be downloaded from:
 - git clone git://code.staikos.net/webkit && git checkout loki/arm-port
 - or http://code.staikos.net/cgi-bin/gitweb.cgi?p=webkit;a=shortlog;h=loki/arm-port

Any feedback is welcome!
Comment 1 Oliver Hunt 2009-04-01 13:02:13 PDT
a number of the new files appear to have taken large chunks of logic from existing jit files (just replacing a few x86 asm statements with arm equivalents), so should include appropriate copyright info from the original files.  The patch also does not appear to use the macro assembler in any way which we see as problematic.

There also appears to be a large number of unnecessary function renames that do nothing but cloud the patch itself.
Comment 2 Zoltan Herczeg 2009-04-01 14:07:19 PDT
The implementation is started at early of January, and the repository has been reorganized many times since that (there was a merge, which took 3 days, because many things were moved and renamed). We decided to focus on the functionality first, and will do the renamings later.

From now on, we can discuss the syntax. We think the two jit ports have many things in common, but the architecture specific optimizations are different. ARMCompiler and AssemblerBuffer have similar purpose, except the ArmCompiler does some post processing, that is why we decided to put under the jit directory (and AssemberBuffer was not exists in that time)

We don't see any problem with the copyright information as long as it compatible with bsd or lgpl (our supporters don't like gpl for some reasons).

By the way, do you like our work? :)
Comment 3 Oliver Hunt 2009-04-01 14:10:42 PDT
(In reply to comment #2)
> We don't see any problem with the copyright information as long as it
> compatible with bsd or lgpl (our supporters don't like gpl for some reasons).
The specific issue is that you appear to have copy/pasted logic from the existing jit files so should be including the appropriate copyrights from those files, eg. Apple and Cameron

Comment 4 Akos Kiss 2009-04-02 03:10:03 PDT
(In reply to comment #3)

Hi Oliver,

Now, I try to present our view on the issue. There were two main modules of JSC we had to modify, the assembler and the JIT.


To the assembler, we added two new files:

ARMAssembler.h: This contains only C++ functions that deal with the ARM instruction set. This code is ours for sure.

MacroAsseblerARM.h: This contains a simple class wrapping pointers. Here, we could not reuse the AbstractMacroAssembler for ARM since it is quite x86-focused.


In the jit module, there are 7 new files:

CommonJIT.h: This file contains platform-specific macros that are required because of the differences in the x86 and ARM JITs.

ARMCompiler{.h,.cpp}: Basically, this is where the generated machine code instructions are stored. In your implementation, this functionality is implemented in the AssemblerBuffer. However, we go further than simply storing the instructions. We maintain a constant pool and post-optimize branches. Moreover, because of the pecularities of ARM, labels and branches are built and stored in a different manner than done for x86. The x86 jit has no element of similar functionality.

ARMJIT{.h,.cpp}: This converts byte code to machine instructions. The high-level approach is similar to that used in x86, but this is defined/fixed by the interface we had to implement. We did not want to diverge from that. However, on the low-level, the logic of the implementation is ARM-specific. The x86 JIT was not reusable at that level. Where we do resemble that code is the name of the functions.

ARMJITStubFunctions.cpp: This is our own idea. x86 JIT contains nothing alike.

ARMJITPropertyAccess.cpp: This is the place where you might be right. This code resembles the most to the origial JITPropertyAccess. This contains the least ARM-specific optimizations. Here, we are ready to add Apple copyrights.


Additionally, we added an extra file into wtf as well:

WTF/Tape.h: This is an auxiliary data structure. Idea and implementation is ours alone.


Basically, I do not think that implementing an interface implies copyrights on the implementation for the owner of the interface, do you? So, now you see our perspective. May I ask you to point our where you do not agree with our explanations above?


Regards,
Akos, Gabor, and Zoltan

Comment 5 Oliver Hunt 2009-04-02 14:27:21 PDT
(In reply to comment #4)
The issue isn't implementing an interface, it's things like the privateCompileGetByIdChainList, etc methods which are duplicates of the existing macroassembler versions (which you should be using) where the only change is replacing the macroassembler instructions with arm instructions, or ARMJIT.h which is clearly a modified copy of the x86 version.

However these are really minor issues that we shouldn't tunnel vision on, the more significant barrier to being included in trunk is that you should be using the macroassembler, rather than just duplicating large amounts of logic for the ARM.  Effectively you're making a completely distinct jit so any new opcodes will have to be written twice: once for ARM and once for every other port.
Comment 6 Gavin Barraclough 2009-04-02 15:26:59 PDT
Hi Akos, Gabor, Zoltan,

Congratulations on getting this working!, great job.

As Oliver says (and as I'm sure is clear from the changes that have been going into the repository over the last months) we have been abstracting out the major architectural differences in the JIT through use of the MacroAssembler interface with a very clear direction of allowing the core of the JIT to remain as common code shared across architectures.  Our main concern with having the functionality of the JIT duplicated (and effectively forked) will be that engine changes we make will be prone to break the ARM JIT, which we will not be in a position to maintain.  As such, if the JIT is to be forked it would seem to make most sense for it to exist in a fork of WebKit, so we are not continually breaking you?

However, if you are interested in landing this into trunk, from the limited chance I've had to look at your code I'd say your port looks structurally similar enough to the existing JIT that it should be quite possible to merge the two back together.  I see no obvious barriers to moving your code over to using a MacroAssembler interface matching that of the existing JIT.  If then the names of a few internal functions are brought into line with each other (accessing virtual registers, etc), then the two pieces of code should start looking rather similar.  I'd also suggest that porting WREC may help here, since the code generation is very simple, and almost entirely abstracted through the MacroAssembler interface.  Once you have implemented the MacroAssembler class (which should be a fairly trivial wrapper on your ArmAssembler) there are only three very short calling-convention functions in WREC that should need any modification.

thanks & good luck,
G.
Comment 7 Gabor Loki 2009-04-03 05:38:19 PDT
It would be good for us as well if the MacroAssembler were the only module which had to be rewritten, but unfortunately that is not the case here.

We have some barriers which have to be solved to generate machine code on ARM:

1. Constant pool:
On ARM, only those constants can be fit into an instruction which can be produced from a 8 bits long number rotated twice by a 4 bit unsigned integer. For example 257 (0x101) does not fit. This implies that the constant pool is needed. There are two possible solutions. (1) Leave the constant on CallFrame, or (2) insert them into the machine code. The first solution is not satisfactory, because on ARM the space is limited (12 bit) to represent an offset from the base. If the offset does not fit, more instructions are needed to access the address. In the second solution, we have to put extra code which jumps through the constant pool. This case is still better than the first one, because it generates much less machine code. Additionally we have to generate other constants as well, like addresses of stub and CTI functions, far jumps, etc. So, those extra constants have to be stored in the constant pool too (not on CallFrame).

2. Branch optimization:
There is an option to use faster and more predictable branches. Initially, we
generate all branches with PC relative load which uses one constant address and one machine instruction. After the JIT code generation is finished, we can replace the PC relative load with a simple branch instruction if the address fits in the instruction. This replace saves space and times as well. All the above implies that the compilation should have a second phase to do the replacing.

3. Exception handling:
On ARM, we cannot follow the strategy of the exception handling used on x86. We
cannot regenerate identical instructions sequences when an exception happens, because relative calls and constants in the pool are not necessarily the same (as described above, see items 1 and 2). The return address manipulation is solved through stub functions.

4. Conditional field:
All ARM instructions are conditionally executed. So, we can avoid short jumps
(cmp and jmp combo). See a really small example at 'op_construct_verify'. On ARM, we can combine the two jumps using the conditional field.

5. Dynamically generated stub functions:
We eliminated all inline assembly codes from the source code and replaced them with dynamically generated JIT codes. We generate the following code snipets dynamically: ABI switch (trampoline in your terminology), return address manipulation for exception handling, and soft math instructions. The most interesting part is the stub functions for return address manipulation. Because of exception handling, we cannot address some CTI functions directly. For those functions, stub functions are generated where we can solve the return address manipulation.

6. Bigger patterns worsen performance:
At last, if we generate machine code through MacroAssembler we will miss more target specific optimizations. Using a general function of MacroAssembler to generate an operation can often generate useless code. This is similar issue like at super-instructions.

I would be the happiest person if we could use MacroAssembler as you said, but I think it is hard task (and it would lead to performance degradation on ARM). Additionally, if we would request an interface change necessary for ARM that would cause performance regression on x86, it would be rejected. :-) However, I am open to make our points of view converge. The truth must lie somewhere inbetween our approaches.

Feel free to comment or ask!

Regards,
Gabor
Comment 8 Zoltan Herczeg 2009-04-03 12:55:07 PDT
> Congratulations on getting this working!, great job.

Thank you for your encouraging words!

I probably mentioned some time ago, that we have already tried to put nanojit (Tamarin jit engine) under JavaScriptCore. That does exactly the same thing as your MacroAssembler: it has a LIR (low-level intermediate language), and generates machine code from that language. At first sight it looks great, only we have to generate lir, and nanojit will do the rest of the work. Unfortunately, the results were disappointing. Why? Because that common low-level language totally misses platform specific optimizations. Even x86 and x86-64 has major differences. ARM is something totally different, because the instruction size is constant: 1 word (4 bytes).

We must sacrifice either the easly maintanable code or great performance. We chose the second one, we want a fast jit. We hope we can maintain the arm port in the future, and we can implement all the neccessary changes, new byte-codes, etc. But we see your points as well. We had actually talked a lot about them.
Comment 9 Mike Hommey 2009-04-03 13:35:40 PDT
One small question: which ARM ABI does this code target ?
Comment 10 Zoltan Herczeg 2009-04-03 21:46:14 PDT
(In reply to comment #9)
> One small question: which ARM ABI does this code target ?

Which is used by gcc, its major version number is 2

r0-r3: temporary registers
  - parameter passing
  - r0-r1: return value
r4-r12: preserved or not used
r13: stack pointer
r14: link register

stack is aligned to 8 bytes

see http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042c/IHI0042C_aapcs.pdf
Comment 11 Gabor Loki 2009-04-24 03:40:14 PDT
We have added WREC support for ARM.

See at loki/arm-port branch on staikos.


The current results are the following:
Performance:
  SunSpider: 1.2x as fast
  V8: 1.6x as fast
  WindScorpion: 1.05 as fast

Binary:
  Shared 'jsc' file has grown by 14%.

Memory:
  RSS has grown by 3-4%.

The results were measured on a N810 device with shared Qt-4.5.0 library.
Comment 12 Gabor Loki 2009-04-24 07:04:48 PDT
(In reply to comment #11)
> Performance:
>   SunSpider: 1.2x as fast
>   V8: 1.6x as fast
>   WindScorpion: 1.05 as fast

We have noticed if '-falign-function=32' option is passed to the compiler, the port produces better performance results.

SunSpider: 1.4x fast
V8: 1.65x as fast
WindScorpion: 1.06x as fast

Comment 13 Csaba Osztrogonác 2009-05-07 01:35:20 PDT
There was a short discussion earlier why we do not use MacroAssembler for
generating JIT code. Although our position did not change &#8211; i.e., we still
think that JIT has to be as fast as possible &#8211; we have implemented WREC for
ARM using MacroAssembler. You can check the code at:
http://code.staikos.net/cgi-bin/gitweb.cgi?p=webkit;a=shortlog;h=ossy/arm-port-MacroAssember_WREC 

Here is the result. All values are compared to our ARM port, where WREC generates hand-optimized JIT code.

Performance:
SunSpider is almost the same (~1.01x as slow).
RexExpDNA is 1.05x as slow.

MacroAssembler-based WREC generates 20% bigger JIT code. 
'JSC' binary size is 2% bigger.
RSS is almost the same (~ 0.5% bigger).

Well, the numbers say that the MacroAssembler-based WREC is both slower and
bigger than the hand-optimized one. The results definitely come from the
pattern-driven machine code generation. Here is an example (from
'generatePatternCharacterPair'):

using MacroAssembler:
	...
	int pair = ch1 | (ch2 << 16);
	failures.append(branch32(NotEqual, character, Imm32(pair)));
	...
	mov	r0, #97 ; 0x61
	orr	r0, r0, #6750208
	cmp	r2, r0
	bne	0x426be0bc

hand-optimized:
	...
	if ((ch1 >> 8) || (ch2 >> 8)) {
		...
	} else {
		m_assembler.sub_r(tmpReg, character, m_assembler.imm(ch2 <<16));
		m_assembler.cmp_r(tmpReg, m_assembler.imm(ch1));
	}
	failures.append(m_compiler.emitBranch(&m_assembler, ARMAssembler::NE));
	...
	sub	r0, r2, #6750208
	cmp	r0, #97
	bne	0x426be0bc

This is a typical example how hard is to optimize inside a pattern. While
MacroAssembler works with an 'Imm32' data that hides an optimization
opportunity, the hand-optimized version can use the fast case (because initial 
conditions are well known at that point). Well, this optimization could be implemented in ' branch32' (splitting the input, checking the upper bytes, etc.), but this kind of 'Imm32' data occurs mostly in a RegExp. So, in other cases this would lead to JIT generation overhead.

Implementing MacroAssembler-based WREC was not a hard task because WREC uses
very simple JIT constructions. In contrast, the SquirrelFish bytecode uses more
diversified JIT constructions (like calls, loops, math operators, etc.) . 
Although in WREC there is very little space for optimizing the machine code, 
the performance gain of the hand-written JIT is about 5%. What would be then the difference between a MacroAssembler-based and the hand-optimized version of the
JavaScript JIT?

We see that the uniform JIT is pretty and well maintainable, but is this a good
reason not to go for a 1, 2, 5, or 10% performance gain?

Regards,
Gabor and Csaba
Comment 14 Gavin Barraclough 2009-05-07 04:20:27 PDT
Since you have the MacroAssembler up and running, you may find it interesting to give Yarr (our new regex engine) a try (this should largely just be a question of porting the function entry/exit routines, which are fairly similar to those in WREC).

Yarr is still work-in-progress, but last time I measured I believe it was considerably more than 2x faster then WREC at running the larger v8 regular expression benchmark.  We intend to expand the code generation to more completely cover the full regex language, which we expect to further extend the gain of using Yarr over WREC (which must fall back to PCRE for expresions that are not supported).  For your information, WREC is currently unused in tip-of-tree, and the code is likely to be removed fairly soon.  The advantage using the MacroAssembler interface if that such developments in JSC should be able to become available for ARM with a fairly low overhead to get them up and running.  (Alternatively you are of course welcome to port Yarr's code generation to hand generated assembler, though I shan't make any promises at this point as to how long it'll be until it is ripped out and replaced by yet another regular expression engine! ;-) ).

The interface to the MacroAssembler is certainly not set in stone, and where there are limitations that degrade the ability to generate code we would be happy to look at how we can better accommodate code generation on all architectures.  Looking at the example you've given, I'd suggest that considering the ranges of immediates supported on Arm, having the MacroAssembler perform some examination of Imm32 values to attempt to generate the optimal branch form (from the hand coded example) where it is possible to do so could be hugely beneficial.  Though it is clearly more work to implement such a mechanism than it is to simply code the assembler for the idiom in one specific place, there is also a greater benefit to be reaped from the invested effort.  For example, any time an integer or pointer in being embedded into the JIT code, and its bit pattern happened to have the required property it would also benefit from this optimization.

cheers,
G.
Comment 15 Gabor Loki 2009-05-22 01:48:29 PDT
We have decided to implement an ARM port based on MacroAssember as you requested. Now, we have released an initial revision of it at staikos (http://code.staikos.net/cgi-bin/gitweb.cgi?p=webkit;a=shortlog;h=loki/ma-arm-port). The patch is still incomplete - Yarr and FPU are not yet supported -, but finally all regression tests are successfully passed. The patch is based on r43818 (May 18).

We will post the performance progression later as well.
Please, take a look at the patch, and give us any feedback about it.

Regards,
Gabor, Zoltan
Comment 16 Gabor Loki 2009-06-03 03:12:05 PDT
Created attachment 30897 [details]
comparison between interpreter and MacroAssembler based ARM JIT

Here is the result of MacroAssembler based ARM JIT. Measured on N810.
Comment 17 Gabor Loki 2009-06-03 03:12:11 PDT
Created attachment 30898 [details]
comparison between MacroAssembler based and from the scratch ARM JIT

Here is the comparison between MacroAssembler based and from the scratch ARM JIT. Measured on N810.
Comment 18 Gabor Loki 2009-06-03 03:13:56 PDT
Finally, we have rebased both ARM JIT port to r43818 and measured the results on N810. See the attached results above.

Now, we are interested in your opinion about both ARM JIT. What do you think of them?
Comment 19 Gyuyoung Kim 2009-06-05 19:33:55 PDT
Can I ask a question? I would like to run the arm-port. However, I don't know how to compile only SFX on maemo scratchbox. I wonder how do you compile SFX.
Comment 20 Gabor Loki 2009-06-06 00:03:26 PDT
> I don't know how to compile only SFX on maemo scratchbox.
1.set up a scratchbox environment (I recommend scratchbox2)
2.build Qt (4.5.x) (if it is necessary) and set QTDIR
3.get ARM port from staikos
4.if you want to build only a command line 'jsc' tool, remove all SUBDIRS entries
  except the 'JavaScriptCore/jsc.pro' from WebKit.pro
5.build WebKit with the standard 'build-webkit &#8211;qt' script 
Comment 21 Csaba Osztrogonác 2009-06-10 01:43:26 PDT
Created attachment 31125 [details]
comparison between interpreter and MacroAssembler based ARM JIT+YARR

YARR support implemented in out ARM port based on MacroAssembler. You can find it on Staikos: (http://code.staikos.net/cgi-bin/gitweb.cgi?p=webkit;a=shortlog;h=loki/ma-arm-port).

SunsSpider results are:
1.34x as fast (trunk-r43818 -> ARM JIT with MacroAssembler)
1.48x as fast (trunk-r43818 -> ARM JIT with MacroAssembler + YARR)

No significant change on V8 and WindScorpion.

br,
Csaba
Comment 22 Gabor Loki 2009-06-11 02:47:29 PDT
Created attachment 31155 [details]
Updated version of generic ARM JIT port

We are eagerly await your review.
Comment 23 Oliver Hunt 2009-06-11 09:58:08 PDT
Comment on attachment 31155 [details]
Updated version of generic ARM JIT port

I'm not sure why you have made the changes to AssemblerBuffer.h -- they're not needed for the v7 assembler.  I think those changes make the patch larger than they need to be which masks the actual meaningful changes.

The changes to codeblock seems worrisome as well -- exception information is very large (in the order of megs per page) which is why we create the information lazily -- while i realise you are making unique constants in the codestream is it not possible to generate the same *sized* instructions?  Regenerating the code to create exception information merely requires the offsets be the same, not that the code be absolutely identical.  If you can make such a guarantee then you should be able to reduce memory usage, and further reduce the number of changes needed to non-arm code (in general we are trying to move to less ifdefs rather than more).

I'm not really happy with the changes to JITPropertyAccess as neither x86, x86-64, nor ARMv7 required these changes, why do we need them in <v7 ARM?

I'm not entirely sure what Tape is -- it looks like a less generic, but iterable version of SegmentedVector, is this a correct assessment?  Can you not just make SegmentedVector be iterable?

I'm going to hold off r- for now (pending your response to these questions) but i'll need gavin to review the assembler proper
Comment 24 Zoltan Herczeg 2009-06-11 11:15:06 PDT
Hi!

> (From update of attachment 31155 [details] [review])
> I'm not sure why you have made the changes to AssemblerBuffer.h -- they're not
> needed for the v7 assembler.  I think those changes make the patch larger than
> they need to be which masks the actual meaningful changes.

The answer is: Constant pool. Since ARM is a RISC CPU, immediates are limited to rotated 8 bit values. Generating 4 instructions for each load immediate values is a waste of memory space and slower than a pc relative load. I fear you will also need a constant pool later if you would like to extend the 16M limit of your ARMv7. You can read more about it here:
http://webkit.sed.hu/node/17

> The changes to codeblock seems worrisome as well -- exception information is
> very large (in the order of megs per page) which is why we create the
> information lazily -- while i realise you are making unique constants in the
> codestream is it not possible to generate the same *sized* instructions? 

We always keep the m_callReturnIndexVector after the code is generated, its size is not that big. Otherwise, we didn't modify the lazy exception information generation.

> Regenerating the code to create exception information merely requires the
> offsets be the same, not that the code be absolutely identical.  If you can
> make such a guarantee then you should be able to reduce memory usage, and
> further reduce the number of changes needed to non-arm code (in general we are
> trying to move to less ifdefs rather than more).

Load immediates (pointers) always cause trouble. Depending on where they are in the memory, the generated code looks different. And I am totally agree about ifdefs.

> I'm not really happy with the changes to JITPropertyAccess as neither x86,
> x86-64, nor ARMv7 required these changes, why do we need them in <v7 ARM?

Constant pools again. We must ensure that a constant pool will not break the instruction list. Otherwise the hard coded offsets become useless. Alternatively, we could kill those constants and put labels everywhere when an offset needs to be stored. The extra memory cost is minimal, and the whole thing looks much portable.

> I'm not entirely sure what Tape is -- it looks like a less generic, but
> iterable version of SegmentedVector, is this a correct assessment?  Can you not
> just make SegmentedVector be iterable?

Why SegmentedVector is not in the /wtf directory if it is intended for generic use? Perhaps the advantages of the two can be merged together. You could also replace several Vectors with Tape. It is faster since it doesn't need to reallocate the segments all the time.

> I'm going to hold off r- for now (pending your response to these questions) but
> i'll need gavin to review the assembler proper

I hope I could answer for your questions. This is actually the fastest version we could squeeze from macro assembler.
Comment 25 Oliver Hunt 2009-06-11 11:20:42 PDT
> Why SegmentedVector is not in the /wtf directory if it is intended for generic
> use? Perhaps the advantages of the two can be merged together. You could also
> replace several Vectors with Tape. It is faster since it doesn't need to
> reallocate the segments all the time.

I'd consider it an error that SegmentedVector is not in wtf -- and it sounds perfectly reasonable that segmentedvector and tape are merged
Comment 26 Gavin Barraclough 2009-06-12 20:59:51 PDT
We really like to try to keep the sizes of individual changes down, and this patch is really doing two separate things – enabling ARM JIT support on YARR, and enabling the JS JIT on ARM (in the case of the latter we have also landed this in stages - first landing the JIT with optimizations disabled, and then separate patches to enable the optimizations).  We would much prefer to land these changes separately, since it both makes it much easier to properly review the code, and more importantly if we ever discover a bug or performance problem has been introduced, then it is a nightmare if we end up tracking it back to a huge changeset.

So I'd really like to see a first patch that just enables YARR_JIT on ARM – we can follow up immediately after with the rest of the JS changes.  This means basically reverting everything in the JIT, interpreter, bytecode directories for now, and raising a second patch as soon as this has landed.


Looking at the changes to the AssemblerBuffer changes, I think the restructuring here isn't really working yet – which I think is somewhat reflected by the fact that you (presumably) needed to have a mix of ARMAssembler and AssemblerBufferARM methods within AssemblerBufferARM.cpp (and keeping the functions that need inspect the generated code closely tied in with the assembler does make sense to me).  We will want a constant pool available on all platforms, but we will want a single, generic implementation of the core infrastructure, shared on all architectures.  I'm not going to suggest that you need to refactor that out before you can land this patch since that may be a significant task, but we will need to do this soon since other platforms will likely need constant pools, and we don't want duplicated implementations.

I'd suggest that the best thing to do right now to prevent forking the AssemblerBuffer (which is not ultimately what we want to do) without causing you too much work, would be to simply look at temporarily moving the constant-pool related functionality back up into ARMAssembler (perhaps move the new fields added to AssemblerBufferARM into a new ARMConstantPool class for now, rather than adding the functionality into the existing class).  We can then as a separate change work out how we design a common constant pool mechanism across all platforms without having to hold this patch up.  What are your thoughts?

Also in the AssemblerBuffer, we really want to keep the cache flushing code all in one place, which is in ExecutableAllocator::cacheFlush.  We also need the repatch methods to use the ExecutableAllocator::makeWritable / ExecutableAllocator::MakeExecutable / the ExecutableAllocator::MakeWritable class for cache-flushing new code and repatching, as the other assemblers do.  This will mean that the ARM JIT will work with the ASSEMBLER_WX_EXCLUSIVE mechanism on platforms where this is required or desired for security reasons.  Also we intend to modify the way in which we repatch code so that we can coalesce protection changes and cache flushes, so it is important that the assemblers all use the same interfaces so we can keep the changes in sync.


The #defines are also going to need a little cleaning up.  We shouldn't be defining macros beginning with __ - we treat these as being reserved for compilers to define.  You could just rename __ARM_ARCH__ to ARM_ARCH_VERSION to fix this, though I'd suggest if could be nice to define a macro like HAS_ARM_ARCH(N) to perform the check too.  PLATFORM(ARM) is also defined on ARM-v7 builds, so any code specific to the v6 & earlier ARM JIT will need to check something like (PLATFORM(ARM) && ! PLATFORM(ARM_V7)) or (PLATFORM(ARM) && (ARM_ARCH_VERSION < 7)), or (PLATFORM(ARM) && !HAS_ARM_ARCH(7)) – or maybe have a new PLATFORM(ARM_PRE_v7) defined in Platform.h.  I'm open to suggestions here.


I'm a little concerned at having bkpt be a nop on pre-ARM-v5 platforms.  If the instruction doesn't exist, I think you really should remove it from the ARMAssembler interface (exactly as you have for clz_r).  I'd suggest that on pre-v5 ARMs we really want to do something to stop execution from simply ploughing on through breakpoint()s.  A simple suggestion would be to implement MacroAssemblerARM::breakpoint() to plant a store to address 0 (I don't know ARM memory maps, but most platforms will leave this page unmapped to catch errors).  Alternatively you could plant a supervisor mode instruction which should cause a trap, or just infinite loop.


You may want to think about modifying the behavior of MacroAssemblerARM::ret().  You've adhered to the same behavior as x86, but I'm afraid we haven't been as well behaved on ARMv7, and you may want to consider copying.  It is clearly not ideal to have the behavior of MacroAssemblerARM::ret() differ between platforms, but it is also clearly not ideal to require all other platforms to mimic x86 behavior, and to push the return value on the stack.  We've certainly not come up with the best and final solution here yet – the best solution may be to say that function entry and exit simply can't be described well by the MacroAssembler, and to remove ret() from the intergface & call m_assembler.ret() directly.  Another solution for some cases may be to provide a higher level interface to describe function entry and exit to the MacroAssembler.


Finally there are a bunch of little code-style issues.  We do stick to full names for variable naming, so cpool should be constantPool, CPool should be ConstantPool.  We prefer inline methods to macro where possible, so ARM_COND_FIELD should be an inline method armConditionField.  In 'emitInst' there is an extra space between the ASSERT and open paren.  In Platform.h there should not be a space between the # and the define/undef.  We don't prefix constants with a 'k' – I'm not sure we have an entirely consistent style on constants, but for values like this I'd say we tend to stick to CAPS_AND_UNDERBARS (weinig/bdash/darin, I don't know if you want to chip in here, you normally point out my style errors :-) ).

I'm going to r- for now, but don't worry - this is not as harsh as it sounds.
This is a perfectly normal stage in the inclusion of a set of changes of this size.
Comment 27 Gavin Barraclough 2009-06-14 17:07:30 PDT
In my eagerness to respond quickly to this patch I may have been a little hasty, I'm not sure whether it will be sensible to move forwards without first addressing the design issues in the constant pools.

The real problem here is that by design the Assembler interface is intended to exactly match the targeted platform's instruction set. The MacroAssembler is where the constant pool should live.  This interface is designed to provide a higher level API to programmers developing the JIT, providing a common interface to machine code generation across architectures, masking the differences in capabilities of platforms and breaking down operations into multiple machine instructions where there is not a one-to-one mapping available.

This really should probably be implemented and landed independently of (and in advance of if required by) a new port of the JIT.
Comment 28 Zoltan Herczeg 2009-06-15 03:18:21 PDT
Created attachment 31285 [details]
Move Segmented vector to wtf sub-directory

First small patch: Tape and Segmented vector are merged together and moved to wtf sub-directory. I hope everybody will like this merged code.
Comment 29 Kenneth Rohde Christiansen 2009-06-15 08:13:54 PDT
I would do the move in one commit and the "extend it with pre-increment iterator support".

Right now it is very difficult seeing what you extended.
Comment 30 Gyuyoung Kim 2009-06-15 23:38:17 PDT
(In reply to comment #20)
> > I don't know how to compile only SFX on maemo scratchbox.
> 1.set up a scratchbox environment (I recommend scratchbox2)
> 2.build Qt (4.5.x) (if it is necessary) and set QTDIR
> 3.get ARM port from staikos
> 4.if you want to build only a command line 'jsc' tool, remove all SUBDIRS
> entries
>   except the 'JavaScriptCore/jsc.pro' from WebKit.pro
> 5.build WebKit with the standard 'build-webkit &#8211;qt' script 

Thank you for your advice. However, I couldn't only build SFX using your instruction. I don't know the reason. Anyway, I was only able to build SFX after modifying GNUmakefile.am.

Can I ask one more question? When I executed SFX shell which includes your ARM port on my ARM target, the SFX came to crash with "Illegal instruction" message.(But, the SFX shell was executed on N810 well.) I got a backtrace from coredump.

(gdb) bt
#0  clear_cache (begin=0x40c41000 "", end=0x40c41014 "")
    at JavaScriptCore/jit/ARMCompiler.h:256
#1  0x0014e654 in JSC::ARMCompiler::compile (this=0xbeabf6b0, 
    executableAllocator=@0x246b58, codeBlock=0x0)
    at JavaScriptCore/jit/ARMCompiler.cpp:332
#2  0x0014570c in JSC::JIT::detectStorePCDelta ()
    at JavaScriptCore/jit/ARMJIT.cpp:326
#3  0x001458f4 in JIT (this=0xbeac0150, globalData=0x2473e8, codeBlock=0x0)
    at JavaScriptCore/jit/ARMJIT.cpp:65
#4  0x0013a278 in JSC::JIT::compileCTIMachineTrampolines (globalData=0x2473e8, 
    executablePool=0x247860, ctiArrayLengthTrampoline=0x247864, 
    ctiStringLengthTrampoline=0x247868, ctiVirtualCallPreLink=0x24786c, 
    ctiVirtualCallLink=0x247870, ctiVirtualCall=0x247874)
    at JavaScriptCore/jit/ARMJIT.h:161
#5  0x001320c8 in JITStubs (this=0x247860, globalData=0x2473e8)
    at JavaScriptCore/jit/JITStubs.cpp:83
#6  0x0001d074 in JSGlobalData (this=0x2473e8, isShared=false, 
    vptrSet=@0xbeac0c80) at JavaScriptCore/runtime/JSGlobalData.cpp:141
#7  0x0001d8d0 in JSC::JSGlobalData::create (isShared=false)
    at JavaScriptCore/runtime/JSGlobalData.cpp:191
#8  0x000137f8 in main (argc=1, argv=0xbeac0e14) at JavaScriptCore/jsc.cpp:314

My ARM target has 32-bit Marvell PXA310 624 MHz, which is ARM v5 architecture, AFAIK.
Do you know why the SFX comes to crash? If I need to create new bug for this issue, please let me know.


I hope that this arm port will be merged to webkit main trunk. :)

Thank you.
Comment 31 Zoltan Herczeg 2009-06-16 00:12:02 PDT
Created attachment 31336 [details]
Only move Segmented vector to wtf sub-directory
Comment 32 Zoltan Herczeg 2009-06-16 00:12:40 PDT
Created attachment 31337 [details]
Add an Iterator to SegmentedVector
Comment 33 Gabor Loki 2009-06-16 00:22:59 PDT
(In reply to comment #30)
> Thank you for your advice. However, I couldn't only build SFX using your
> instruction. I don't know the reason. Anyway, I was only able to build SFX
> after modifying GNUmakefile.am.

I wrote that 'guide' for Qt build not for Gtk.

> on my ARM target, the SFX came to crash with "Illegal instruction"
> message.(But, the SFX shell was executed on N810 well.)

First of all, at JavaScriptCore/jit/ARMCompiler.h:256
there is no asm instruction. What does your last 'git log'
say?

If your (cross)GCC version is greater than 3.4.6, the
compilation produces a call for '__clear_cache' built-in
function. What is your GCC version?

> I hope that this arm port will be merged to webkit main trunk. :)

You should try MacroAssembler based ARM JIT. The variant which
you use probably never be landed.
Comment 34 Gabor Loki 2009-06-16 00:44:43 PDT
Gavin, first of all, thanks for your comment. As you can see, we have started splitting the patch into separate smaller changes. We have added the first patch here in this bug report. Is it good solution for you to place every patch here in this bug report? Or should we create separate bug reports for each changes which blocks this bug entry?

The first - and I think is the most important - design issue is the constant pool. Correct me if I am wrong, but I feel that you does not catch the meaning of our constant pool design. At AssemblerBufferARM the constant pool contains addresses which could be changed during runtime. So practically the constant pool does not contain only constant values. Its name comes from GCC terminology. Its behavior looks like more an address pool rather than a constant pool such as
in RegisterFile.

The usage of constant pool is implicated by the ARM machine code itself. If a MacroAssembler asks to do something with an Address, the target architecture will have to place the address in a machine instruction or in a register in order to be able to work with them. On a generic ARM platform every 32 bits value can be produced by maximum 4 arithmetic instructions. This solution is not rewarding in every situation (see more at http://webkit.sed.hu/node/17). So, sometimes we
have to use a PC relative load in order to access a data. A group of values which are accessed by PC relative loads is a constant pool (in our view). The limitation of the constant pool is determined by the addressable distance of a PC relative load. In practice this means that we are collecting the constants during instruction appending until the distance of the first constant and its usage reach the maximum distance. After collecting the constant pool have to be flushed and separated from executable code. This kind of mechanism is not necessary for every architecture. I hope it was clear, and sorry if you have known it already.

We can abstract constant pool if you like, but there a design issue which have to discussed: where should the constant pool implementation should live.

1.- CP should live within a separate AssemberBuffer.
There will be an AssemblerBuffer without CP feature, and a other one (AssemblerBufferWithCP) with CP. This solution is very simple. Each architecture can decide which solution is suitable for them. The problem is here that the functionality of AssemblerBuffer is repeated in AssemblerBufferWithCP as well (code duplication, copy&paste code, etc.)

2.- CP should live in {ARCH}Assembler.
Each architecture can implement its own constant pool. I think this is a bad solution, because there will be many code duplication (more than in case 1.), and the assembler have to be tightly coupled with AssemblerBuffer.

3.- CP should live in MacroAssembler{ARCH}
IMHO, there is no need to know anything about CP in this abstraction level. In order to be able to decide when to flush, the MacroAssembler have to be tightly coupled with AssemblerBuffer as well. This case is inconceivable, because the MacroAssembler does not know anything about how many bytes are written to the buffer when a function of the Assembler is called.

4.- CP should live in a separate file (derived from AssemblerBuffer)
This solution is very similar to the case 1). CP lives in a separate file (AssemblerBufferWithCP) where the constant pool interface have been implemented, and this class is derived from AssemblerBuffer. In this case the pubic interface of AssemblerBufferWithCP calls AssemblerBuffer interface considering the flushing.

What do you think about them?
Comment 35 Oliver Hunt 2009-06-16 00:48:04 PDT
Comment on attachment 31337 [details]
Add an Iterator to SegmentedVector

r=me, i can't land these two patches right now because of the state of my tree -- i may do later if no one else has
Comment 36 Oliver Hunt 2009-06-16 00:50:42 PDT
(In reply to comment #34)
> Gavin, first of all, thanks for your comment. As you can see, we have started
> splitting the patch into separate smaller changes. We have added the first
> patch here in this bug report. Is it good solution for you to place every patch
> here in this bug report? Or should we create separate bug reports for each
> changes which blocks this bug entry?

In general we like things like the segmentedvector move and update to be in a separate bug as it makes it easier to do the r+->land->close pattern, but that's not a hard rule and we have far worse violations of that concept in bugs.w.o at the moment anyway :D
Comment 37 Gavin Barraclough 2009-06-17 00:28:56 PDT
Zoltan,

There are three reasons that currently envisage we may require a constant pool on other platforms:

(1) For immediate values that cannot be represented within instruction operands.  For example, floating-point and vector values on x86 (& x86-64, ARMv7).
(2) For values that can be (and currently are being) encoded in the instruction stream, but that might be more efficient if accessed via a constant pool.  E.g. on x86-64 it *might* be more efficient to place some pointer immediates in a constant pool (these are currently loaded via second move immediate instruction), and on arm-v7 32-bit immediate values are loaded with a 2 instruction sequence, again in some cases it may make sense to switch this to a data load.
(3) Moving values that are patched out of the instruction stream will remove the need to i-cache flush on modification, and by organizing constant-pool into page sized units we may be able to avoid changing protections when running with NX support.  This may be useful at some point in the future.

> 3.- CP should live in MacroAssembler{ARCH}
> IMHO, there is no need to know anything about CP in this abstraction level.

This way.  This is exactly the kind of thing that the macro assembler is intended to abstract. :-)  The idea of the assembler classes is that they very simply provide an interface to encode the machine instructions that are available on the target processor.  We do not want to intermingle the instruction formatting mechanism with an instruction selection policy – the latter should be layered upon the former.  The assembler is, of course, in many ways on its own a little unhelpful, due to the limitations of the instruction set - be it the inability to load an arbitrary 32-bit immediate value on ARM, or the inability to shift by any register on x86, etc, etc.  The MacroAssembler wraps the architecture specific assembler in a common higher level interface, providing the facilities as necessary to support its full API, and providing a richer programming interface.  It is already the case the the MacroAssembler must handle situations in which an immediate operand to an instruction may not fit within the range of values supported on a given platform, and in these situations it will load the value into a register.  Currently it's only option for doing so is to use immediate moves, but having the option of utilizing a constant pool also available here would be entirely complementary to existing mechanisms.

The parts of the implementation that can be shared across platforms (such as data-structures mapping the constant pool loads to immediate values to be added to the constant pool) may well fit nicely into the AbstractMacroAssembler layer to make them available across platforms.

> In
> order to be able to decide when to flush, the MacroAssembler have to be tightly
> coupled with AssemblerBuffer as well.

I'm not sure I would use the term tightly-coupled.  The AbstractMacroAssembler already has a size() method to check how much has currently been written to the buffer, it should just be able to use this to check how much code has been planted since a previous point.  Adding additional calls to check used space will not be a problem if necessary.

> This case is inconceivable, because the
> MacroAssembler does not know anything about how many bytes are written to the
> buffer when a function of the Assembler is called.

Sure it does!  There should be a one-to-one mapping between calls to the Assembler and instructions planted, and as such the assembler can assume that the number of bytes written by an assembler call will be whatever the maximum instruction width is for that platform (4, in the case of ARM).
Comment 38 Zoltan Herczeg 2009-06-17 02:13:36 PDT
On ARM, every immediate values are limited. Constant values are 8 bit values shifted by an even number, memory adresses must be at maximum 4095 bytes away from a base register, the relative offset of branches are limited to +-32 Mbytes and so on. We do not store addresses in the constant pool by convenience, we do this because there is no other (reasonable) way. An independent constant pool which is not embedded into the code is limited to 8 kbytes because of the addressing limitations. Even if they are independent, we still need a way to load their base offset into a register, and we need to prepare multiple constant pools. External constant pools require too much efforts, and this is not the ARM way. ARM constants pools are embedded into the code, and protected by a branch instruction from execution.

What we store in the constant pool
 - immediates, if they cannot be encoded as at most two instructions
 - branch/call targets, since we don't know the target is within +-32Mbytes in advance. Although it might be unusual to generate bigger JIT code than 32 MBytes, it would be a security hole to do it any other way.
 - addresses which will be patched later. Same reason as before: we don't know the immediate will fit into data processing instructions.

If we want to know the size of an instruction in advance, we must always generate the worst case scenario. I.e: let's assume there is a macro assembler instruction which can be encoded as 1 (90%), 2 (7%) or 3(3%) instructions. If we insist to keep the code size fixed, we have to generate 3 instructions all the time, because of that 3%. From this viewpoint, x86 CISC looks more fixed-size than ARM.

From assembler viewpoint we have 3 kinds of instruction:
  - normal instruction
  - instruction with constant (if the constant is already in the pool, we can reuse its address)
  - instruction with unique constant (branches and other values patched later)

Something has to maintain state of the constant pool, and push it when necessary.
  - It can be done on MacroAssembler level, but in that case MacroAssembler must know which type of instruction (see above) is given.
  - It can be done on formatter level, in that case macro assembler knows nothing about the constant pool.

I doubt it is possible to maintain the constant pool without information about the instruction types.

I know the current assembler model is not prepared for variable length instructions, but it is essential on RISC architectures, where the generated code size depend on both the arguments and the current environment (state of the constant pool). On ARM, even a simple "mov(reg, reg)" may take 4 from 4096+4 bytes. The only barrier I saw to support variable length instruction is those patchOffsets used by property accesses. Are they really that important? Can we have a define to enable variable length instructions and put extra labels during code generation? In this case we could keep the advantages of both approach.

Besides, Gabor wrote the post you answered :)
Comment 39 Gyuyoung Kim 2009-06-17 19:25:56 PDT
(In reply to comment #33)

Sorry, I replied you lately.

> First of all, at JavaScriptCore/jit/ARMCompiler.h:256
> there is no asm instruction. What does your last 'git log'
> say?
I had used the branch of loki/arm-port with "initial version of ARM port" log.

> If your (cross)GCC version is greater than 3.4.6, the
> compilation produces a call for '__clear_cache' built-in
> function. What is your GCC version?
I am using the maemo chinook scratchbox. GCC version of chinook is 3.4.4.
Fortunately, I succeeded to run arm-port using below source code,
  
         const int syscall = 0xf0002;
             __asm __volatile (
                    "mov     r0, %0\n"
                    "mov     r1, %1\n"
                    "mov     r7, %2\n"
                    "mov     r2, #0x0\n"
                    "swi     0x00000000\n"
                : 
                :   "r" (begin), "r" (end), "r" (syscall)
                :   "r0", "r1", "r7"
        );

> > I hope that this arm port will be merged to webkit main trunk. :)
> You should try MacroAssembler based ARM JIT. The variant which
> you use probably never be landed.

Yes, I start to use MacroAssembler based ARM JIT on the branch of loki/ma-arm-port.
Thank you.
Comment 40 Mark Rowe (bdash) 2009-06-17 19:29:58 PDT
This bug isn't a good place to discuss how to build from the related git branch.  It only obfuscates the discussion about making progress on the JIT.  If you need assistance in building the git branch, I would suggest taking that discussion to email.
Comment 41 Gabor Loki 2009-06-19 02:20:00 PDT
Created attachment 31537 [details]
Reorganize ARM architecture specific macros

This patch adds PLATFORM_ARM_ARCH macro which can be used to determine the minimum required ARM architecture version.
Comment 42 Gavin Barraclough 2009-06-19 11:30:48 PDT
Comment on attachment 31537 [details]
Reorganize ARM architecture specific macros

Looks good, armv7 is presently has a build problem in ToT, I will land this later today when I can test on that platform.
Comment 43 Gavin Barraclough 2009-06-19 15:07:43 PDT
Hi Zoltan, Gabor,

> Besides, Gabor wrote the post you answered :)

Bah, sorry guys, easy to get confused between posts on a large bug like this. :o)

Okay, so to be clear, I do completely understand what a constant pool is, why one is required on arm (and on other platforms, too) and the fact that for certain operations it is necessary to generate multiple machine instructions.  And functionally, the way you have implemented this all looks great.  The place that I think your patch is somewhat at conflict with the design of the JIT is just in the layering, and I think that if we don't bring the interfaces into line with each other then we will find it more difficult to share code across platforms, and to avoid changes we make from breaking the arm port of the JIT as we go forwards.

I've not done a good job anywhere of trying to describe the layers in the architecture, so let me do a better job now.  The abstract code generation layer (MacroAssembler interface down) is layered like a traditional compiler.  In a traditional compiler, it is common to have an assembler layer completely independent of the compiler (often a separate application).  The compiler takes a source code file, compiles it, and produces an output file of assembly code.  The interface to the assembler is largely made up of a set of assembly code mnemonics from which there is typically a one-to-one mapping to machine instructions (additionally there are assembler directives).

Let's look at an example:

arith.c: (compiled will gcc 4.0 -O3 -march=armv6)

     int sub(int x)
    {
        return x - 12345678;
    }

     int mul(int x)
    {
         return x * 12345678;
    }

arith.s:  (With some .globl & .align directives stripped out to make it more readable.)

    _sub:
        sub	r0, r0, #12320768
        sub	r0, r0, #24832
        sub	r0, r0, #78
        bx	lr

    _mul:
        ldr	r3, L5
        mul	r0, r0, r3
        bx	lr

    L5:
        .long	12345678

In the case of the subtract, since the immediate does not fit within an immediate field of a subtract instruction, the C compiler has emitted multiple subtract instructions.  In the case of the multiply, the compiler has chosen to store the constant in a constant pool (labelled L5), and has planted a load instruction to load the constant from the pool.  These are two different instruction selection strategies, and a compiler may choose (as it has here) to use a mixture of both.  The constant-pool is not a mechanism implemented within the assembler, but instead is a one of a number of techniques available to the compiler during instruction selection when generating code for which a single assembly operation is not available.  The assembler is not aware of the structure or semantics of the constant pool - it only sees a PC relative load instruction.

Layering the compiler on top of an assembler in this fashion provides a number of benefits.  For the compiler developer, layering the compiler on the assembler separates the instruction selection from the minutiae of machine instruction encoding.  For clients of the compiler providing a well defined language for machine instruction generation is useful if the compiler provides facilities to bypass the higher level language, and directly emit a specific sequence of machine instructions (commonly though use of inline 'asm' statements in C code, and in our JIT through direct use of 'm_assembler').

The assembler interface within the JIT is designed to closely mimic that of the assembler layer in a traditional compiler, providing an interface largely made up of the set of assembler mnemonics of the host architecture, with a one-to-one mapping from calls to these functions, to machine instructions being planted.  All instruction selection is performed within the MacroAssembler layer.  The MacroAssembler is a very simple compiler, mapping from one (generic) assembly like language to another (concrete, machine-specific) assembly language.  We refer to it as a MacroAssembler since the capabilities are presently very limited, simply remapping mnemonic names (e.g. addPtr -> addq_rr), and expanding single MacroAssembler calls to multiple assembler operations (facilities typically within the capabilities of traditional macro-assembler languages).  However we certainly expect the sophistication of the MacroAssembler layer to increase as necessary – and we tend to be conservative in our naming, bear in mind that we initially labelled the entire JIT as just a context-threaded-interpreter, a much simpler form of dynamic code generation engine (and we still have the acronym 'cti' scattered through the code as a reminder!).


The key difference from the ARM port as it currently stands, and the description above of the design of the layering in the JIT (and in a traditional compiler stack) is that the constant pool mechanism is below the assembler interface.  It seems that correcting this so that it falls within the MacroAssembler layer (from where we can look at how we can generalize to code to share implementation where possible and minimize unnecessary duplication when we introduce constant pools on other platforms) should only require a relatively minor restructuring of your code – it should not require you to change the code your JIT generates, and it should not significantly change the implementation of the constant pool, just shuffle a few methods between files, and change the classes upon which a few variables are declared.

Obviously how you fix this is entirely up to you, but here would be one way to do it, which is basically just a bit of copying and pasting, and a bunch of renames:

(1) Move the new methods and variables implementing the constant pool that you have added to AssemblerBuffer up into ARMAssembler, so that AssemblerBuffer is just a simple data container.
(2) Create a new class called ARMInstructionEncoder that contains an member variable that is an AssemblerBuffer, and passes all AssemblerBuffer methods through.  Replace the AssemblerBuffer member variable in ARMAssembler with a member of type ARMInstructionEncoder.
(3) Move all the methods from ARMAssembler that format up machine instructions onto ARMInstructionEncoder.  ARMInstructionEncoder should end up with one method for every mnemonic in the ARM instruction set, each of which emits one machine instruction.
(4) Rename the class ARMAssembler to MacroAssemblerARMInternals.
(5) Rename the class ARMInstructionEncoder to ARMAssembler.
(6) Move the class MacroAssemblerARMInternals into the file MacroAssemblerARM.h.

This would put the constant-pool implementation and instruction selection on the right side of the assembler interface.  Of course you may want to achieve this refactoring through a completely different route than the one I've described. The important thing is the result, this is where we really need to get the code to so that we can look at refactoring out the constant pool so that it can be shared across platforms, and so that we can bring the assembler interfaces in line with each other so that changes we make do not break the ARM JIT.

cheers,
G.
Comment 44 Gavin Barraclough 2009-06-19 18:07:37 PDT
Macro changes landed in r44886.
Comment 45 Zoltan Herczeg 2009-06-19 22:34:40 PDT
(In reply to comment #44)
> Macro changes landed in r44886.

Could you commit the other two patches as well? 
Comment 46 David Kilzer (:ddkilzer) 2009-06-20 05:32:01 PDT
Comment on attachment 31537 [details]
Reorganize ARM architecture specific macros

(In reply to comment #44)
> Macro changes landed in r44886.

Clearing review+ flag since this patch landed.

http://trac.webkit.org/changeset/44886
Comment 47 David Kilzer (:ddkilzer) 2009-06-20 06:16:40 PDT
Comment on attachment 31336 [details]
Only move Segmented vector to wtf sub-directory

Clearing review+ flag after landing:

$ git svn dcommit
Committing to http://svn.webkit.org/repository/webkit/trunk ...
        R       JavaScriptCore/bytecompiler/SegmentedVector.h => JavaScriptCore/wtf/SegmentedVector.h
        M       JavaScriptCore/ChangeLog
        M       JavaScriptCore/GNUmakefile.am
        M       JavaScriptCore/JavaScriptCore.order
        M       JavaScriptCore/JavaScriptCore.vcproj/JavaScriptCore/JavaScriptCore.vcproj
        M       JavaScriptCore/JavaScriptCore.vcproj/WTF/WTF.vcproj
        M       JavaScriptCore/JavaScriptCore.xcodeproj/project.pbxproj
        M       JavaScriptCore/bytecompiler/BytecodeGenerator.h
        M       JavaScriptCore/parser/Lexer.h
Committed r44897

http://trac.webkit.org/changeset/44897
Comment 48 David Kilzer (:ddkilzer) 2009-06-20 06:28:38 PDT
Comment on attachment 31337 [details]
Add an Iterator to SegmentedVector

Clearing review+ flag after landing:

$ git svn dcommit
Committing to http://svn.webkit.org/repository/webkit/trunk ...
        M       JavaScriptCore/ChangeLog
        M       JavaScriptCore/wtf/SegmentedVector.h
Committed r44899

http://trac.webkit.org/changeset/44899
Comment 49 Zoltan Herczeg 2009-06-21 23:44:01 PDT
Thank you for the landings.
Comment 50 Zoltan Herczeg 2009-06-22 01:26:48 PDT
Hi Gavin,

it seems I misunderstood the concept of MacroAssembler. I thougt it is a pseudo assembly language, which resembles the syntax of x86 machines, and its backends translate the instructions of this virtual assembler to machine code.

I try to summarize your new arm layout concept:

   AssemblerBuffer
        (no changes)
        |
   ARMAssembler
        (format and push all possible ARM instructions, or push a 32 bit word)
        |
   MacroAssemblerARM
        (contains "MacroAssemblerARMInternals" = "original ARMAssembler" - 
                                                 "current ARMAssembler"
             Hence, it contains utility functions)
        (MacroAssemblerARM supports constant pool handling)

Our main issue with this approach is that we need to check the state of the constant pool manually before every instruction: 
   CheckConstPool(sizeOfNextInstruction); ARMAssembler.mov_r();
   CheckConstPool(sizeOfNextInstruction); ARMAssembler.branch();
   ...
We cannot avoid this manual check, since ARMAssembler is not able to call CheckConstPool(), since it is defined in MacroAssemblerARM level. ARMAssembler cannot include MacroAssemblerARM, since MacroAssemblerARM already includes ARMAssembler, and adding the first relation would cause a circular reference. Although we can do this manual check before every instruction, it would not be a nice coding stlye (and may cause crashes if we forget to put it before every individual instruction).

We are still thinking about keeping the constant pool in somewhat lower level:

   AssemblerBuffer
        (no changes)
        |
   AssemblerBufferWithConstantPool
        (constant pool handling)
        |
   ARMAssembler
        (same as abouve)
        |
   MacroAssemblerARM
        (contains MacroAssemblerARMInternals = same as above)

AssemblerBufferWithConstantPool is a machine independent constant pool template, which can push both instructions and constants. Of coures, this template class is optional depending on the architecture. 3 template arguments are enough to specialze this class:
   - the max size of the constant pool
   - the size of the barrier jump (protect the constant pool from execution)
   - a callback(?) function which inserts the jump and does machine dependent post processing
In this way we can avoid the mentioned manual checks.

Which one you choose? We can implement either one.

> The constant-pool is not a mechanism implemented within
> the assembler, but instead is a one of a number of techniques available to the
> compiler during instruction selection when generating code for which a single
> assembly operation is not available.  The assembler is not aware of the
> structure or semantics of the constant pool - it only sees a PC relative load
> instruction.

Although the compilers tend to keep the constant pool generation for themselves, most assemblers support constant pools with help of "ldr reg, =value" pseudo instruction and ".ltorg" assembler directive.

Regards,
Zoltan
Comment 51 Zoltan Herczeg 2009-06-24 05:57:21 PDT
Any thoughts?

By the way, we have started to work on the generalized constant pool (2nd method), and it seems we can make it quite platform independent.
Comment 52 Gavin Barraclough 2009-06-25 02:44:27 PDT
Hi Zoltan, terribly sorry for the slow response, busy week.

It's really great to hear that that you can generalize your constant pool, and I can buy your argument for keeping it at a lower level.  All sounds good to me!

As a quick comment on the MacroAssembler syntax, the goal is not necessarily to stick to x86 syntax, the idea is really that it is generally often much easier to break down richer semantic operations into a sequence of instructions than it is to spot sequences of more primitive atoms from a more minimalist interface that can be combined to form complex operations (which is not to say that peephole optimizations that can coalesce instructions would not be a potentially useful thing to have, just that to do so well is hard, and that we don't want to pass up on an easy opportunity to provide richer semantic information that a MacroAssembler implementation might be able to make use of).

The syntax of the MacroAssembler is of course very heavily influenced by the fact that the JIT was first developed on x86, and the fact that the majority of browsing takes place on x86 platforms would likely seem to be a good reason for it to continue to have some bearing on the design.  Also, since the goal of the interface is to be semantically rich (but within reason – we want to stick to something that is relatively straightforward to generate on all platforms), and since x86 is of the richer instruction sets we're likely to port the JIT to, it would seem likely to me that there is a good chance it will continue to look fairly close to x86 in some aspects.

That said, we are certainly open to reasonable changes that expand the interface to allow us to take advantage of other architectures.  An obvious example on ARM would be conditional execution, and I would imagine that we should be able to find ways that we could allow the MacroAssemblerARM to take advantage of that without it being overly burdensome for other architectures.  For example, we could add method to plant a short forwards branch, which could be linked like a regular Jump/Label on architectures other than ARM.  On ARM you could omit to plant the branch and inversely predicate instructions until the branch was linked.  We could potentially make it an interface requirement that within such a construct certain operations (other branches, 'set' operations, etc) could not be used, if this were required to make the pattern useful.

At some point we will likely also want to switch the interface for arithmetic operations from two operand to three operand, to better support architectures that can make use of this.  I believe ARMv6 is mostly 2 operand like x86? - in these cases the MacroAssemblers will need be able to generate a extra move where the register to contain the result is not the same as one of the operands.  This may also open up more code generation options on ARM, for example where an immediate operand to an add must be loaded from a constant pool, and the result register is not the same as the operand register, it may be more optimal to load directly into the result register and then add onto that value.

Hope this all makes sense & helps clarify our thinking!

G.
Comment 53 Gabor Loki 2009-06-29 06:00:16 PDT
Created attachment 32007 [details]
Remove unnecessary references to AssemblerBuffer
Comment 54 Gabor Loki 2009-06-29 06:03:16 PDT
Created attachment 32008 [details]
Constant pool for AssemblerBuffer

This patch adds a platform independent constant pool handling mechanism.
The AssemblerBufferWithCP contains a brief summary of itself as well.
Comment 55 Gabor Loki 2009-06-29 06:30:20 PDT
Created attachment 32009 [details]
Add YARR support for generic ARM platforms

This patch adds YARR support for generic ARM platforms (disabled by default).

The ARMAssembler contains more functionality than minimum required in YARR. This could help the reviewer to get a deeper view of the constant pool mechanism as well.
Comment 56 Sam Weinig 2009-06-29 17:39:57 PDT
Comment on attachment 32007 [details]
Remove unnecessary references to AssemblerBuffer

Tabs in the ChangeLog, but otherwise fine. r=me.
Comment 57 David Kilzer (:ddkilzer) 2009-06-30 02:56:22 PDT
Comment on attachment 32007 [details]
Remove unnecessary references to AssemblerBuffer

Clearing the review+ flag after landing:

$ git svn dcommit
Committing to http://svn.webkit.org/repository/webkit/trunk ...
        M       JavaScriptCore/ChangeLog
        M       JavaScriptCore/interpreter/Interpreter.cpp
        M       JavaScriptCore/interpreter/Interpreter.h
Committed r45370

http://trac.webkit.org/changeset/45370
Comment 58 Gabor Loki 2009-06-30 03:22:28 PDT
Created attachment 32046 [details]
Constant pool for AssemblerBuffer (v2)

This patch adds a platform independent constant pool handling mechanism.
The AssemblerBufferWithCP contains a brief summary of itself as well.

(In this version an alignment bug have been fixed)
Comment 59 Gabor Loki 2009-06-30 03:24:04 PDT
Created attachment 32047 [details]
Add YARR support for generic ARM platforms (v2)
Comment 60 Zoltan Herczeg 2009-06-30 06:55:50 PDT
Hi Gavin,

we have successfully implemented the platform independent constant pool, only some template parameters and platform dependent callbacks are required. We can put 32 or 64 bit constants into the pool which is enough to hold any int, pointer or double constant. It makes the implementation much easier than variable constant size, and it is enough for most purposes. The pools are aligned to 64 bit. The machine instructions are not needed to be aligned. We hope this implementation is suitable for any generic uses.

We also posted the yarr patch, but we feel you may find it too big. Unfortunately, we have no idea how to break it into smaller, working parts.

We have not yet posted, but we have a jit enabled, all optimizations disabled patch as well. Furthermore, we have implemented the floating point instructions, and ENABLE_JIT_OPTIMIZE_ARITHMETIC can now be turned on (another patch). We decided to wait for your opinion about the constant pool patch before we post the other patches.

I know you are busy since you usually do the coding and reviewing in JavaScriptCore alone, so if we can help you in any way just ask us.
Comment 61 Zoltan Herczeg 2009-07-01 02:14:19 PDT
We were wondering about renaming PLATFORM_ARM_ARCH(7) to PLATFORM(THUMB2) - similar to PLATFORM(X86) and PLATFORM(X86_64). It would clarify things when the generic ARM will be landed, because others might not know the difference between the two imlementation. We can send a simple patch for you if you like this idea.

Our next step is property caching. Because of the constant pool we had issues with fixed instruction offsets. Since the pool can be flushed after any instruction, the instruction offsets are not known in advance. We had implemented a check before, which flush the constant pool if the pool might break a tightly coupled instruction sequence. Only the length of the sequence and the number of constans are required for this check. However, this check is an extra line in the source code and not looks portable at all. To make it portable, we decided to use an enum and an inline function in JIT.h.

enum {
  getByIdFastCase,
  getByIdSlowCase,
  ...
} InstructionSequenceTypes;

inline JIT::ensureSpaceFor(InstructionSequenceTypes type)
{
#if PLATFORM(ARM)
    switch (type) {
    case getByIdFastCase: m_assembler.ensureSpace(48 /* Length */, 4/* NumberOfConstants */);
    ...
    }
#else
    UNUSED_PARAM(type);
#endif
}

We can put ensureSpaceFor(getByIdFastCase) calls into the compiler.
Comment 62 Gavin Barraclough 2009-07-01 15:24:52 PDT
Comment on attachment 32046 [details]
Constant pool for AssemblerBuffer (v2)

> Constant pool for AssemblerBuffer (v2)

This basically looks fine as is, r- for a handful of small fixes, but no big issues at all.

The comment on the class states that the constant pool can store 4 or 8 bytes values, however the code currently appears to be very 32-bit specific, only handling entries of size (sizeof(uint32_t)) - so unless I'm missing something the comment is currently a little misleading.  It may be good to fix this to say that the pool can hold 32-bit values, and that we intend to extend the class to be able to hold 64-bit values in the pool too, at some point in the future.

When you align the pool, I'd suggest you may want to consider replacing the magic number '0x0badf00d' with a value that will be interpreted as a breakpoint, or illegal instruction on ARM (just as an extra guard against an erroneously linked branch jumping into the padding.  To make this effective you would also have to reverse the order in which it checks & plants the alignment (byte, then short, then int) in order that the int value is 32-bit aligned.  On x86 we use HLTs to pad (we use these since they're illegal in user space, cause a trap, and as such clearly indicate an error case), and we'd probably want to do the same in the constant pool too.  At minimum I'd suggest you should move these values into the Assembler (AssemblerType::padForAlign8, 16, 32) so that these values can be provided in a platform-specific fashion (we also try to avoid "magic numbers" inline in the code, and prefer defining constant values or declaring static const variables – making this change would fix this problem too).  I think it could be even nicer if you just called AssemblerType::align(int) to do the alignment ... except I imagine that could be problematic (since you would end up recursing, at Assembler::align tried to plant a byte, caused a recursive flush, called Assembler::align, tried to plant a byte ... etc.).  You could of course add a new specific method to Assembler to alignConstantPool(int), and then provide methods to put the pad values directly to the assembler buffer.  But then you have a danger that these methods could be misused.  I'd probably ignore this last suggestion myself.  I'd probably just add a set of constants to Assembler (AssemblerType::padForAlign8, 16, 32).

> if (10 * m_numConsts > 6 * maxPoolSize / sizeof(uint32_t))
I think this check is prone to overflow.  It is probably not a practical concern right now, since m_pool is allocated with a single malloc, so a huge value for maxPoolSize would not be sensible anyway.  However in order to support large constant pools on x86[_64], the obvious thing to do will probably be to replace these arrays with some form of vector, and when we do so this test would become potentially unsafe.  'maxPoolSize' is defined as a signed int, so if (maxPoolSize > 0x15555555) then (6 * maxPoolSize) will overflow to a negative walue, and the test will always pass.  Due to the first multiply (* 10) there are also boundary condition problems that could cause the left had side of this compare to overflow, too.  If you enforce that maxPoolSize is no larger than 0x15555555 then the limit (6 * maxPoolSize) could be 0x7FFFFFFE, however as m_numConsts rolls over from 0x0CCCCCCC -> 0x0CCCCCCD (m_numConsts * 10) goes from 0x7FFFFFF8 -> 0x80000002, which, again interpreted as signed values are both less than 0x7FFFFFFE, so the limit being exceeded will not be detected.  I'd suggest adding an "ASSERT(maxPoolSize < 0x10000000);" above this line to make sure we stay well away from any potential overflow issues here would be a good idea.

Finally, a quick find & replace code style issue – it is not WebKit coding style to use abbreviations, we use full words.  AssemblerBufferWithCP is certainly against style, placeConstPoolBarrier etc really should probably be placeConstantPoolBarrier - but I leave it at your discretion as to exactly how you change these names to bring these names in line with coding style.

But other than these little style issues, the mechanism looks great, and yes, looks like it should be fairly straightforward to take this and roll it out to the other platforms when we hit a point that we need to add constant-pool support, which is really good news.
Comment 63 Gavin Barraclough 2009-07-01 17:06:04 PDT
Comment on attachment 32047 [details]
Add YARR support for generic ARM platforms (v2)

This all looks really great, couple of small comments.

In the MacroAssembler, we normally try to only use UNUSED_PARAM if a parameter is only not use on some (ifdef'ed) code paths, so we'd normally write:

void mulDouble(Address src, FPRegisterID dest)
{
    UNUSED_PARAM(src);
    UNUSED_PARAM(dest);
    ASSERT_NOT_REACHED();
}

as:

void mulDouble(Address, FPRegisterID)
{
    ASSERT_NOT_REACHED();
}

But if you prefer to leave this unchanged, this would be okay.

In Platform.h:

 546 #elif PLATFORM(ARM)
 547     /* Under development, temporarily disabled. */
 548     #define ENABLE_JIT 0
 549     #define ENABLE_JIT_OPTIMIZE_NATIVE_CALL 0

This should really be added as a part of the JIT patch, let's revert this for now (it really doesn't make a lot of sense without the rest of the JIT changes :o) ).  Also...

 596  || (!defined(ENABLE_YARR_JIT) && PLATFORM(ARM) && 0) \

We will only want to enable YARR & the ARM JIT on operating systems it has been tested on.  What OS are you testing on?  Linux?  GTK or KDE?  Or other?  Also, if Linux, this may also be the wrong place to be enabling them.  On Linux I believe the platform specific settings are configured in configure.ac (GTK?) and JavaScriptCore/JavaScriptCore.pri (KDE?) – and I believe this is where YARR is enabled on x86-Linux.  (Personally, I find the fact that we configure these things in different places on different OSes a little odd, but given that we do, it certainly seems that it would be more correct to keep the way we do things on ARM-Linux in line with the way we work on x86-Linux - again, assuming that it is Linux you're targeting).

The rest of the patch all looks great.  r-ing while the Platform.h questions are answered, but otherwise this all looks good to land.
Comment 64 Gavin Barraclough 2009-07-01 17:50:36 PDT
Hi Zoltan, hi Gabor,

Sorry for the delay in getting these reviewed, and yes, things are as busy as ever!  You'll see I've reviewed the patches, both are r-s, but they're both 99.9% of the way there now, just a few very small issues to clear up.

> I know you are busy since you usually do the coding and reviewing in
> JavaScriptCore alone, so if we can help you in any way just ask us.

I should probably make sure that the rest of the team get the credit due to them – you may well have seen me committing a bit more to the JIT in ToT over the last few weeks, but the other guys are still very busy working on the nitro-extreme branch, and fixing bugs across JSC.

> We also posted the yarr patch, but we feel you may find it too big.
> Unfortunately, we have no idea how to break it into smaller, working parts.

No, you're absolutely right here, this wasn't too big at all.  While clearly not a small patch, I don't think you could have sensibly broken this down any further, this patch was absolutely fine as it was.

> Our next step is property caching. Because of the constant pool we had issues
> with fixed instruction offsets. Since the pool can be flushed after any
> instruction, the instruction offsets are not known in advance. We had
> implemented a check before, ...
>
> We can put ensureSpaceFor(getByIdFastCase) calls into the compiler.

Yes, totally!  I was going to suggest something like this too, to make the code a little less fragile.

I'd suggest you may want to consider having a pre & post method, say, beginUninterruptedCodeSequence(getByIdFastCaseCodeSize) , endUninterruptedCodeSequence(getByIdFastCaseCodeSize), (hmmm, not sure these are good function names - not meaning to not use your function name here, just that 'ensureSpaceFor' doesn't read right with pre- & post- prefix added).

This way, on #ifndef NDEBUG builds you could record (to a member variable on the JIT) a label() of where the block of code begins, and in the 'post' method you could ASSERT that the difference between a the label() recorded in the 'pre' method, and a label() here, is equal to the argument passed in - just to help catch cases where someone changes the code, but does not update the constant (and in non-debug builds the 'post' method would do nothing).

Just a thought, in case it is helpful.

cheers,
G.

Comment 65 Gabor Loki 2009-07-02 03:04:48 PDT
Created attachment 32175 [details]
Constant pool for AssemblerBuffer (v3)

Reformatted as suggested in comment #62.
Comment 66 Gabor Loki 2009-07-02 03:08:54 PDT
Created attachment 32176 [details]
Add YARR support for generic ARM platforms (v3)

Reformatted as suggested in comment #63.
Comment 67 Zoltan Herczeg 2009-07-02 04:08:17 PDT
Hi Gavin,

thank you for the review!

The constant pool is designed to hold both 32 and 64 bit values. However, since we don't need 64 bit values yet, we decided to omit their helper functions for now. Of course it is clear for us that supporting constants which size are longer than the instructions, require special handling. We realized a simple greedy algorithm is enough for this issue. The size of a new constant is assigned to m_lastConstDelta, and decreased by the size of the following instructions until it reaches zero. When a new constant is pushed to the pool and m_lastConstDelta is non-zero, we use its value to decrease the maximum allowed difference (also a counter, called m_maxDistance).

For aligning, we chose the padForAlign solution. The other one looks too complicated (at least for us).

Regarding the "if (10 * m_numConsts > 6 * maxPoolSize / sizeof(uint32_t))" expression, we divided to constants by 2. However, we still think these multiplies are not an issue, since we don't plan to support pools bigger than 809 Mbytes. Actually, we feel the size of the pool should not be bigger than 16 Kbytes. Since maxPoolSize is a template argument, a compiler assert to ensure this maximum allowed size would be a good idea.

We have renamed all functions with abbreviations.

It seems gcc yields several warnings without the UNUSED_PARAM directives. We hope these unused params will not be in the mainline too long, since we have already implemented them and post the patch soon into bugzilla. We feel split the code here is good for a reviewer viewpoint, but we can combine the two patches if you wish.

We have also removed all our auto-enabling code in "Platform.h". So all ports can decide whether they want to enable ARM-JIT or not.

We have considered several methods for uninterrupted code sequences. We also thought about the OpenGL way (begin / endUninterruptedCodeSequence). However, since we need an extra buffer in this case, we feel the code will become too complex (and slow). We chose the ensureSpaceFor(getByIdFastCase) method, since we already have to calculate the patch offsets, so calculating the total size is only one extra step further (adding the last instruction size to the last patch offset). And the resulting code is fast as well, since an inlined switch with a constant argument is automatically evaluated by gcc (we have tested it on a small program).

Thank you again.
Zoltan
Comment 68 Zoltan Herczeg 2009-07-02 05:28:09 PDT
Created attachment 32181 [details]
Add generic ARM-JIT support using ARMAssembler
Comment 69 Zoltan Herczeg 2009-07-02 05:51:51 PDT
Created attachment 32182 [details]
Floating point support for ARM. Looks like we double the number of patches each day :)
Comment 70 Gavin Barraclough 2009-07-17 14:19:48 PDT
Comment on attachment 32175 [details]
Constant pool for AssemblerBuffer (v3)

Landed in r46057.
Comment 71 Gavin Barraclough 2009-07-17 14:58:40 PDT
Comment on attachment 32176 [details]
Add YARR support for generic ARM platforms (v3)

Landed in r46059.

Sorry for the delay in getting back to these, will start looking at the new patches at the start of next week.

cheers,
G.
Comment 72 Zoltan Herczeg 2009-07-17 22:17:18 PDT
(In reply to comment #70)
> (From update of attachment 32175 [details])
> Landed in r46057.

Many, many, many thanks to you Gavin!
Comment 73 Gabor Loki 2009-07-20 03:18:29 PDT
Comment on attachment 32175 [details]
Constant pool for AssemblerBuffer (v3)

Clearing the review+ flag after landing (r46057).
Comment 74 Gabor Loki 2009-07-20 03:19:36 PDT
Comment on attachment 32176 [details]
Add YARR support for generic ARM platforms (v3)

Clearing the review+ flag after landing (r46059).
Comment 75 Gabor Loki 2009-07-24 00:26:40 PDT
Gavin,
could you review the last two patches?

BTW, we have two more patches - which are not posted yet - for a fully functional ARM JIT. :-)
Comment 76 Ismail Donmez 2009-07-24 00:43:02 PDT
Just wondering; will the ARM JIT work on ARMv4I types CPUs?
Comment 77 Gabor Loki 2009-07-24 00:54:12 PDT
> Just wondering; will the ARM JIT work on ARMv4I types CPUs?

The short answer is: yes.
If you like, test it on your hw and send us your feedback.

Currently the trunk supports YARR on ARM if you set ENABLE_YARR=1 ENABLE_YARR_JIT=1 for the compilation.
Comment 78 Ismail Donmez 2009-07-24 00:56:40 PDT
(In reply to comment #77)
> The short answer is: yes.
> If you like, test it on your hw and send us your feedback.
> 
> Currently the trunk supports YARR on ARM if you set ENABLE_YARR=1
> ENABLE_YARR_JIT=1 for the compilation.

Thanks for the answer. I am using it via Qt so I'll have to wait for Qt guys to pick it up.
Comment 79 Zoltan Herczeg 2009-07-28 03:22:36 PDT
Created attachment 33609 [details]
Add generic ARM-JIT support using ARMAssembler v2
Comment 80 Zoltan Herczeg 2009-07-28 03:23:25 PDT
Created attachment 33610 [details]
Floating point support for ARM v2
Comment 81 Gabor Loki 2009-07-28 03:26:07 PDT
Created attachment 33611 [details]
Add optimize call support for ARM JIT

This patch adds support for ENABLE_JIT_OPTIMIZE_CALL and ENABLE_JIT_OPTIMIZE_NATIVE_CALL optimizations.
Comment 82 Gabor Loki 2009-07-28 03:27:47 PDT
Created attachment 33612 [details]
Add property access support for ARM JIT

This patch adds support for ENABLE_JIT_OPTIMIZE_PROPERTY_ACCESS and ENABLE_JIT_OPTIMIZE_METHOD_CALLS optimizations.
Comment 83 Gavin Barraclough 2009-07-29 21:01:45 PDT
Comment on attachment 33609 [details]
Add generic ARM-JIT support using ARMAssembler v2

(1) The CodeBlock changes.

Having divergent behaviour outside of the JIT is really undesirable – while these differences look fairly harmless right now, as we changes the way the JIT works over time, they could become problematic.  This approch also seems suboptimal on ARM platforms, since ARM is typically used in embedded products with limited memory availability, and carrying around all bytecode from all functions is a large memory burden (particularly on complex sites using large JS libraries).  I'd suggest that there is a much better solution, that isolates the issue to the MacroAssembler, and will overall significantly reduce the memory footprint of the JIT on ARM.

I'm assuming that you're running into a problem that we also ran into on ARMv7, that sometimes a pointer value will be representable as an instruction immediate?  On ARMv7, when a ImmPtr is converted to an Imm32 we set a flag in the Imm32 indicating that it is wrapping a pointer value - Imm32::m_isPointer.  You could just change the #if wrapping this to PLATFORM(ARM).  In fact, having one extra field passed in Imm32 is highly unlikely to seriously impact JIT performance (most methods using this are inline to the extra value should just fold away in the compiler), so with a quick performance check you could probably just remove the #if altogether.

In MacroAssemblerARMv7, when emitting an instruction with an Imm32 operand, we force all immediate operands for which this flag is set to be generated in a manner that could represent a full 32-bit value (i.e., in your case, that would mean generating the value into the constant pool).  Whilst this clearly this means occasionally adopting a slightly less optimal representation for a few pointer values scattered in the code, when balanced against the memory saving of not retaining the bytecode it seems to be a clear huge win.


(2) JIT generating the trampolines.

Again, I'm really not keen on this, specifically because it differs from the way other platforms work, and having randomly differing implementations makes the JIT unnecessarily complex for a new developer to start working on (and JITs tend not to be the simplest pieces of code under the best of circumstances).  I actually much prefer your way of doing things here - I'd much prefer if we could JIT generate these trampolines on all platforms.  However we tested this on x86, and when we did we measured a repeatable performance regression (I'm presuming this is because it changes a well predicted relative call instruction to a memory-indirected poorly predicted call).  So my question would be, is there any significant benefit to ARM from this?  If there is not a clear benefit from doing so, then introducing unnecessary inconsistencies would be highly undesirable.


> +#if PLATFORM(ARM) && !PLATFORM_ARM_ARCH(7)
> +    Jump jumpToCtiVMThrowTrampoline = jump();
> +#else
> +    move(ImmPtr(reinterpret_cast<void*>(ctiVMThrowTrampoline)), regT2);
>      restoreReturnAddressBeforeReturn(regT2);
>      ret();
> -    
> +#endif

The JIT is currently making a jump to the ctiVMThrowTrampoline in a really weird way here, but it doesn't seem like a great idea to only fix this on ARM – and it having this code unnecessarily platform specific would be undesirable.  We really should just switch this to a jump on all platforms. 


>              return JSValue::decode(ctiTrampoline(
>  #if PLATFORM(X86_64)
>                  0, 0, 0, 0, 0, 0,
> +#elif PLATFORM(ARM)
> +                // For both ARM and ARMv7
> +                0, 0, 0, 0,
>  #endif
>                  m_ref.m_code.executableAddress(), registerFile, callFrame, exception, Profiler::enabledProfilerReference(), globalData));

It looks from the comment like this was a deliberate change to ARMv7 – if so (and I apologize if I'm misinterpreting here), please don't make speculative changes to platforms you aren't testing (a bug report would of course be hugely appreciated if you think you spot a bug), this 'fix' will actually break ARMv7.  On ARMv7 the trampoline accepts arguments in registers, spilling those that need be preserved to the stack.  This both seems somewhat cleaner (rather than relying on the 'magic' behavior of passing extra arguments), and also allows us to chose not to spill those arguments that we do not need to preserve (e.g. the entry code pointer).  We have a long standing bug open to change x64-64 to work this way too, and I'd suggest you may want to consider the same for ARM - though that is purely optional, just switching to 'PLATFORM(ARM) && !PLATFORM_ARM_ARCH(7)' would be a valid fix here.


> +#if PLATFORM(ARM)
> +    typedef EncodedJSValue (*ctiTrampolineFunctionPtr)(void*, void*, void*, void*,
> +            void* code, RegisterFile*, CallFrame*, JSValue* exception, Profiler**, JSGlobalData*);
> +    extern ctiTrampolineFunctionPtr ctiTrampoline;
> +    extern void* ctiVMThrowTrampoline;
> +    extern void* ctiOpThrowNotCaught;
> +#else

This should be 'PLATFORM(ARM) && !PLATFORM_ARM_ARCH(7)'


> -#if PLATFORM(ARM)
> +#if PLATFORM_ARM_ARCH(7)
>  #define ENABLE_ASSEMBLER_WX_EXCLUSIVE 1
>  #else
>  #define ENABLE_ASSEMBLER_WX_EXCLUSIVE 0

This was probably really on the wrong #ifdef already, so if we're going to change it, then it should probably really go to #if PLATFORM(IPHONE).
Comment 84 Gavin Barraclough 2009-07-30 22:40:18 PDT
FYI the bug I mentioned to change the calling convention ro the x86-64 trampoline was https://bugs.webkit.org/show_bug.cgi?id=22910 , it turned out I had a patch sat in my inbox from another engineer that fixed it, so I've landed this in r 46621.

G.
Comment 85 Zoltan Herczeg 2009-08-05 04:42:34 PDT
Created attachment 34127 [details]
Add generic ARM-JIT support using ARMAssembler v3
Comment 86 Zoltan Herczeg 2009-08-05 04:43:15 PDT
Created attachment 34128 [details]
Floating point support for ARM v3
Comment 87 Gabor Loki 2009-08-05 04:44:39 PDT
Created attachment 34129 [details]
Add optimize call support for ARM JIT v3
Comment 88 Gabor Loki 2009-08-05 04:45:35 PDT
Created attachment 34130 [details]
Add property access support for ARM JIT v3
Comment 89 Gabor Loki 2009-08-05 04:56:24 PDT
Gavin,
we did your suggestions about trampolines and calling convention. The changes do not introduce any significant performance progression or regression either. I hope it is fine for you as well.

BTW, the current code works with JSVALUE32 (as ARMv7).
Comment 90 Gavin Barraclough 2009-08-05 12:56:52 PDT
Comment on attachment 34127 [details]
Add generic ARM-JIT support using ARMAssembler v3

Zoltan, did I see you get commit rights to svn?
Leaving this for you to land, if I'm wrong & you need this landing for you then ping me & will be happy to do so.

G.
Comment 91 Gabor Loki 2009-08-05 13:21:41 PDT
(In reply to comment #90)
> Zoltan, did I see you get commit rights to svn?

Not yet. You mistake him for Zoltan Horvath.
Comment 92 Gavin Barraclough 2009-08-05 13:25:07 PDT
cheers Gabor, Akos also pointer out my mistake, I'll get this landed for you guys later today.
Comment 93 Gavin Barraclough 2009-08-05 16:18:18 PDT
Comment on attachment 34129 [details]
Add optimize call support for ARM JIT v3

(1) ensureSpaceFor

The basic design of this I'm happy with, the idea of enumerating the sequences seems nice and clean, but I do have two small concerns.  First is the method name - I think it could be clearer, and more explicit.  The method name makes sense in the context of ARM, where you have a constant pool, but for developers on other platforms it is unclear as to the purpose of the call.I'd suggest a better name would include something like UninterruptedInstructionSequence, which would be more descriptive of *why* this is being called.   Also, I'm uncomfortable with the use of magic constants that are unchecked (note that all the 'patchOffset' constants in the JIT are guarded by ASSERTs).  If someone changes the code generation in a way that could break on ARM, why should have an automatic way of detecting this.  I'd suggest that you can fix this by having 'begin' and 'end' methods in the MacroAssemberARM, and #ifndef NDEBUG check that the values passed are correct.

I'd suggest something like this (with MacroAssemblerARM::beginUninterruptedInstructionSequence replacing MacroAssemblerARM::ensureSpace, and MacroAssemblerARM::endUninterruptedInstructionSequence called at the end of the sequence as appropriate - both method names just suggestions, I'm open to improvements), would both be more explicitly named, and would provide checking of at least one of the magic constants:

    void beginUninterruptedInstructionSequence(int insnSpace, int constSpace)
    {
#ifndef NDEBUG
        m_uninterruptedInstructionsBegin = label();
        m_uninterruptedInstructionSpace = insnSpace;
#endif
        m_assembler.ensureSpace(insnSpace, constSpace);
    }

    void endUninterruptedInstructionSequence()
    {
        ASSERT(differenceBetween(m_uninterruptedInstructionsBegin, label()) == m_uninterruptedInstructionSpace);
    }

#ifndef NDEBUG
    Label m_uninterruptedInstructionsBegin;
    int m_uninterruptedInstructionSpace;
#endif

If you could also preserve and test 'constSpace' in a similar fashion (could you record the number of entries in the pool in 'begin', and again check the difference in 'end'?), then that would be even better.

(2) enableLatePatch

This seems to be tied to an assembler specific internal optimization, I don't think that explicitly exposing this in common code in this fashion is the right API here.  The purpose of this flag appears to be to force calls to be generated using a constant pool entry, rather than as a relative branch.  A simple short-term solution would seem to replace the 'm_latePatch' bit with a flag indicating whether the JmpSrc is a jump or a call, and to only plant relative branches for jumps.  With this change the calls to enableLatePatch() should be unnecessary - all calls can be repatched.

(As a note on a longer term direction, we expect at some point to be able to move the Label, Jump, Call, etc classes out of AbstractMacroAssembler and into their own header, then push these types down into the assemblers so that each assembler does not need define its own JmpSrc & JmpDst types - so in the long run I think we'd expect to be able to distinguish between jumps & calls anyway).

This suggested change is based on the assumption that the ability to link branches without a constant pool entry is an optimization that only really benefits jumps, not calls.  I'd be resistant to introduce any additional API unless there is a demonstrable benefit here.  If there is a measurable performance benefit from doing so, then I'd suggest the right place to add API would be in LinkBuffer::link(Call, FunctionPtr).  Ultimately this is not about the call instruction itself being different, but the way that the link will be performed being different.  I'd suggest adding a third argument to this method - ", bool isFinal = false" (I've inverted the logic from 'enable' here - I'd suggest the default behaviour should be to allow the call to be relinked, and if we do need to do so we should explicitly specify cases where optimization can take place).  For calls that are *not* currently being marked as 'enableLatePatch', you could instead pass isFinal as true when linking, which should result in exactly the same code generation from the ARM JIT as you are presently producing.  But I'd like to see performance numbers to back up any API change, since I do expect that the benefit is really coming from planting relative branches for jumps rather than calls.
Comment 94 Gavin Barraclough 2009-08-05 16:23:39 PDT
Comment on attachment 34130 [details]
Add property access support for ARM JIT v3

This patch is dependent on the call patch that I've just reviewed & asked for changes.  I have no objections to this patch and expect this patch to be a simple r+ as soon as the call patch gets okayed.

But since this is blocked on the other patch, and since I think you're going to need a new version of this to update some behaviour here (re the comments on the call patch), I'm going to r- for now to take this out of the review queue.

G.
Comment 95 Gavin Barraclough 2009-08-05 22:23:19 PDT
Sending        JavaScriptCore/ChangeLog
Sending        JavaScriptCore/assembler/ARMAssembler.cpp
Sending        JavaScriptCore/assembler/AbstractMacroAssembler.h
Sending        JavaScriptCore/assembler/MacroAssemblerARM.h
Sending        JavaScriptCore/bytecode/CodeBlock.h
Sending        JavaScriptCore/jit/JIT.h
Sending        JavaScriptCore/jit/JITInlineMethods.h
Sending        JavaScriptCore/jit/JITOpcodes.cpp
Sending        JavaScriptCore/jit/JITStubs.cpp
Sending        JavaScriptCore/jit/JITStubs.h
Sending        JavaScriptCore/wtf/Platform.h
Transmitting file data ...........
Committed revision 46831.
Comment 96 Gavin Barraclough 2009-08-05 22:33:40 PDT
Sending        JavaScriptCore/ChangeLog
Sending        JavaScriptCore/assembler/ARMAssembler.cpp
Sending        JavaScriptCore/assembler/ARMAssembler.h
Sending        JavaScriptCore/assembler/MacroAssemblerARM.h
Sending        JavaScriptCore/jit/JIT.h
Transmitting file data .....
Committed revision 46832.
Comment 97 Gabor Loki 2009-08-11 07:59:28 PDT
Created attachment 34558 [details]
a
Comment 98 Gabor Loki 2009-08-11 08:22:20 PDT
Created attachment 34560 [details]
Add optimize call and property access support for ARM JIT

I have done the requested changes from comment #93.

- The 'latePatch' method is removed.
    It looks like that forcing the address of calls on constant pool does not hurt much. Wash on SunSpider (slightly slower but not statistically significant).

- The {BEGIN,END}_UNINTERRUPTED_SEQUENCE macros have been introduced for tightly coupled sequences.
    I try to keep the code as platform independent as possible.

- The optimize call and property access patches are combined.
Comment 99 Kenneth Rohde Christiansen 2009-08-11 09:52:12 PDT
Aint the two r+ patches already committed? If so, please clear the flag.
Comment 100 Gavin Barraclough 2009-08-11 23:30:58 PDT
Comment on attachment 34560 [details]
Add optimize call and property access support for ARM JIT

> #define WTF_USE_CONSTANT_POOL 1

Gah, I'm really sorry, but I'm going to have to pick you up on an annoying technicality.

We should not be defining WTF_USE_ macros outside of Platform.h (there is another example of this being done, but it's a bug.  you'll see that ENABLE_, WTF_PLATFORM_, etc are only defined in Platform.h).  The problem generally with defining these values in other headers is that the macros allow the value no to be defined, which could lead to unexpected behaviour.  Consider, for example, if '#define ENABLE_JIT 1' was defined in JIT.h, rather than Platform.h.  For any .cpp files that include JIT.h the expression 'ENABLE(JIT)' will be true, but for any files that happen not to include JIT.h this will evaluate to false.

Whilst there is no bug in the way you've used this, or any clear way that it could lead to a bug in this case, I think we want to keep the design rule that the USE() etc marco are only used with values defined in Platform.h.

The way that you've structured the macros (defining the value in AssemblerBufferWithConstantPool.h) seems perfectly sensible, so rather than trying to move this up into Platform.h I'd suggest you may just want to consider a quick rename & stop using the USE macro in this case - just switch '#define WTF_USE_CONSTANT_POOL 1' to something like '#define ASSEMBLER_HAS_CONSTANT_POOL 1' and replace 'USE(CONSTANT_POOL)' with 'defined(ASSEMBLER_HAS_CONSTANT_POOL)'.

I'm going to call this an r+-with-changes, since everything else looks good.  If you update a new patch with the macro name fixed one way or another I'll get it landed for you.

BTW, I'm guessing this may be your intention already, but let's make this the last change on this über-bug - from now on I think all future patch should get a bug of their own, as per normal process.

cheers,
G.
Comment 101 Gabor Loki 2009-08-12 00:25:49 PDT
Created attachment 34641 [details]
Add optimize call and property access support for ARM JIT v2

I have did the requested changes from comment #100 :-), and marked the attachment 34560 [details] with obsolete flag which was r+-with-changes.
Comment 102 Eric Seidel (no email) 2009-08-12 15:15:44 PDT
Comment on attachment 34560 [details]
Add optimize call and property access support for ARM JIT

Clearing review flag since a new patch has been posted.
Comment 103 Gavin Barraclough 2009-08-12 22:58:46 PDT
Sending        JavaScriptCore/ChangeLog
Sending        JavaScriptCore/assembler/ARMAssembler.cpp
Sending        JavaScriptCore/assembler/ARMAssembler.h
Sending        JavaScriptCore/assembler/ARMv7Assembler.h
Sending        JavaScriptCore/assembler/AbstractMacroAssembler.h
Sending        JavaScriptCore/assembler/AssemblerBufferWithConstantPool.h
Sending        JavaScriptCore/assembler/MacroAssemblerARM.h
Sending        JavaScriptCore/assembler/X86Assembler.h
Sending        JavaScriptCore/jit/JIT.h
Sending        JavaScriptCore/jit/JITCall.cpp
Sending        JavaScriptCore/jit/JITInlineMethods.h
Sending        JavaScriptCore/jit/JITPropertyAccess.cpp
Transmitting file data ............
Committed revision 47186.