Bug 172765 - [meta][JSC][ARMv6] Support ARMv6 architecture in LLInt and JIT layers
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Caio Lima
Depends on: 172767 167411 172766 172768 172972
Blocks: 108645
Reported: 2017-05-31 12:03 PDT by Caio Lima
Modified: 2017-09-05 07:43 PDT

Attachments
CLoop vs JIT enabled benchmarks (90.84 KB, text/plain)
2017-07-13 07:44 PDT, Caio Lima
CLoop vs JIT + IC enabled benchmarks (98.05 KB, text/plain)
2017-08-23 07:29 PDT, Caio Lima

Description Caio Lima 2017-05-31 12:03:55 PDT
Currently the ARMv6 architecture isn't properly supported in JSC, even though there is partial support in some places. However, there are ARMv6-based devices that could be important to support, such as the first-generation Raspberry Pi and the Raspberry Pi Zero. There is some support for ARMv6 in Igalia's WPE downstream branch[1], and the work to make it at least compilable isn't huge.

I'm going to use this bug as a meta-bug to track the work to be done on ARMv6 support. The plan is to first submit patches to make JSC compile for the architecture, and then send fixes for test failures and other architecture-specific issues and improvements.

The major goal is to make ARMv6 green on all tests.
Comment 1 Alex Christensen 2017-06-14 10:36:08 PDT
Note: Even though Raspberry Pi 3 has an arm64 CPU, Raspbian is distributed as an armv6 binary to support the lower-end Raspberry Pi models with the same image.
Comment 2 Csaba Osztrogonác 2017-06-20 02:15:20 PDT
If you are going to fix all of ARMv6 (and that means ARMv7 with the traditional ARM instruction set too), you should check bug 159408 and its dependencies as well.

The ARM (traditional) backend has been really slow and buggy since about a year ago (r202214), when we had to disable generating ICs for math functions because of this bug.

Maybe the best fix would be to remove the constant pool completely from the traditional ARM assembler, to be able to make ICs work again.
Comment 3 Caio Lima 2017-07-13 07:44:05 PDT
Created attachment 315345 [details]
CLoop vs JIT enabled benchmarks

These results compare the CLoop build (baseline) with the JIT-enabled build (changes) on the JSC benchmarks.
Comment 4 Carlos Alberto Lopez Perez 2017-07-13 07:54:11 PDT
(In reply to Alex Christensen from comment #1)
> Note: Even though Raspberry Pi 3 has an arm64 CPU, Raspbian is distributed
> as an armv6 binary to support the lower-end Raspberry Pi models with the
> same image.

Right.

So this work will make it possible to enable the JSC JIT on modern WebKit2GTK+ on Raspbian.
Comment 5 Filip Pizlo 2017-07-13 08:59:17 PDT
This perf data says that you’re probably just as well off running the cloop on this platform.  V8Spider is a super important test and a regression there is not OK. 

There are a bunch of CPU platforms that would love to have WebKit. Only three CPU platforms currently get serious perf love (armv7, arm64, and x86-64), and of those the 64-bit ones get particular attention. The best way to serve both the most-invested-in platforms and the minor ones is to have a really great cloop. You could optimize the cloop to beat the JIT on armv6 in less time than it would take you to fix and optimize the JIT. If you do it right, you'll beat the JIT on all minor platforms at once (mips, etc).

Also, remember that we actively optimize the JIT based on data we gather on big powerful machines (compared to an armv6). So our tuning will in many cases be good for us and bad for you. Our JIT is not meant for small embedded devices. It's meant for a 64-bit box with lots of RAM and multiple cores. So, if you want peak perf on armv6, your best bet is to optimize the cloop.
Comment 6 Carlos Alberto Lopez Perez 2017-07-13 09:44:33 PDT
(In reply to Filip Pizlo from comment #5)
> This perf data says that you’re probably just as well off running the cloop
> on this platform.  V8Spider is a super important test and a regression there
> is not OK. 
> 
> There are a bunch of CPU platforms that would love to have WebKit. Only
> three CPU platforms currently get serious perf love (armv7, arm64, and
> x86-64), and of those the 64-but ones get particular attention. The best way
> to serve both the most-invested-in platforms and the minor ones is to have a
> really great cloop. You could optimize cloop to beat the JIT on armv6 in
> less time than it would take you to fix and optimize the JIT. If you do it
> right, you’ll beat the JIT on all minor platforms at once (mips, etc). 
> 
> Also, remember that we actively optimize the JIT based on data we gather on
> big powerful machines (compared to a armv6). So our tuning will in many
> cases be good for us and bad for you. Our JIT is not meant for small
> embedded devices. It’s meant for a 64-bit box with lots of ram and multiple
> cores. So, if you want peak perf on armv6, your best bet is to optimize the
> cloop.

I have a very different reading of the data that has been uploaded.

* Number of tests that regressed with the JIT on ARMv6:
$ curl -sL 'https://bug-172765-attachments.webkit.org/attachment.cgi?id=315345'|grep -P "definitely ([0-9])+\.[0-9]+x slower"|wc -l
82

* Number of tests that improved with the JIT on ARMv6:
$ curl -sL 'https://bug-172765-attachments.webkit.org/attachment.cgi?id=315345'|grep -P "definitely ([0-9])+\.[0-9]+x faster"|wc -l
542

So the JIT on ARMv6 improves 542 tests and regresses 82. That is, it improves performance on 87% of the tests (542 of 624).


* How much was the regression for the tests that regressed?
$ curl -sL 'https://bug-172765-attachments.webkit.org/attachment.cgi?id=315345' | grep -Po "definitely ([0-9])+\.[0-9]+x slower"|awk -F'definitely ' '{print $2}'|cut -d\. -f1|sort | uniq -c | sort -n -r
     43 1
     15 2
     10 3
      3 6
      3 4
      2 12
      1 9
      1 8
      1 7
      1 5
      1 19
      1 13

Of the tests that regressed, 43 were below 2x, 15 between 2x and 3x, and 10 between 3x and 4x.


* How much was the improvement for the tests that improved?
$ curl -sL 'https://bug-172765-attachments.webkit.org/attachment.cgi?id=315345' | grep -Po "definitely ([0-9])+\.[0-9]+x faster"|awk -F'definitely ' '{print $2}'|cut -d\. -f1|sort | uniq -c | sort -n -r
    115 1
     89 2
     51 3
     34 4
     21 7
     20 6
     19 8
     19 5
     19 13
     13 10
     10 20
      9 15
      9 11
      7 14
      6 9
      5 21
      5 19
      5 17
      5 16
      5 12
      4 24
      4 18
      3 77
      3 70
      3 44
      3 42
      3 38
      3 34
      3 33
      3 29
      3 26
      2 94
      2 83
      2 69
      2 49
      2 45
      2 41
      2 35
      2 32
      2 31
      2 25
      2 22
      2 141
      2 103
      1 90
      1 75
      1 72
      1 71
      1 68
      1 63
      1 59
      1 47
      1 37
      1 23
      1 160
      1 155
      1 105
      1 101
      1 100

The first column is the number of tests and the second the improvement factor (truncated to an integer). Lots of tests show improvements beyond 20x, going as far as a 160x performance improvement.


So, honestly, if this doesn't look like a good enough performance improvement to you, I don't know what would :\
Comment 7 Michael Catanzaro 2017-07-13 10:40:18 PDT
Well I don't understand which tests most accurately reflect real-world websites, but I bet Filip does. If he thinks V8Spider is most important, we should probably listen.

Occasionally, we discover a long way into a project that things have been going in the wrong direction. We need to make sure we are making the right choice as to when it's time to keep trying and when it's time to accept things did not work out and start over. I'm very surprised that Filip thinks we can get better performance out of cloop, but if so, it would be inadvisable to look the other way.
Comment 8 Michael Catanzaro 2017-07-13 10:42:13 PDT
(In reply to Filip Pizlo from comment #5)
> Also, remember that we actively optimize the JIT based on data we gather on
> big powerful machines (compared to a armv6). So our tuning will in many
> cases be good for us and bad for you. Our JIT is not meant for small
> embedded devices.

I suppose iOS devices are far more powerful than typical armv6 devices?
Comment 9 Filip Pizlo 2017-07-13 10:48:15 PDT
(In reply to Carlos Alberto Lopez Perez from comment #6)
> [comment #6's analysis of attachment 315345 snipped]
>
> So, honestly, if this doesn't look like a good enough performance improvement
> to you, I don't know what would :\

I'm happy to answer your question:

[pizlo@wrathchild OpenSource] Tools/Scripts/run-jsc-benchmarks NoJIT:JSC_useJIT=false:/System/Library/Frameworks/JavaScriptCore.framework/Resources/jsc JIT:JSC_useJIT=true:/System/Library/Frameworks/JavaScriptCore.framework/Resources/jsc --sunspider --v8-spider --outer 6
Warning: could not identify checkout location for NoJIT
Warning: could not identify checkout location for JIT
448/448                                            -e 
Generating benchmark report at /Volumes/Data/primary/OpenSource/NoJIT_JIT_SunSpiderV8Spider_wrathchild_20170713_1043_report.txt
And raw data at /Volumes/Data/primary/OpenSource/NoJIT_JIT_SunSpiderV8Spider_wrathchild_20170713_1043.json

Benchmark report for SunSpider and V8Spider on wrathchild (MacBookPro13,2).

VMs tested:
"NoJIT" at /System/Library/Frameworks/JavaScriptCore.framework/Versions/A/Resources/jsc
    export JSC_useJIT=false
"JIT" at /System/Library/Frameworks/JavaScriptCore.framework/Versions/A/Resources/jsc
    export JSC_useJIT=true

Collected 6 samples per benchmark/VM, with 6 VM invocations per benchmark. Emitted a call to gc() between
sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific
preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95%
confidence intervals in milliseconds.

                                       NoJIT                      JIT                                        
SunSpider:
   3d-cube                        11.5154+-0.2823     ^      5.6214+-0.4731        ^ definitely 2.0485x faster
   3d-morph                       16.4709+-0.4416     ^      5.6433+-0.1673        ^ definitely 2.9187x faster
   3d-raytrace                    17.0282+-0.5761     ^      5.5649+-0.9222        ^ definitely 3.0599x faster
   access-binary-trees             8.0210+-0.1176     ^      2.9373+-0.5648        ^ definitely 2.7308x faster
   access-fannkuch                25.9462+-0.8402     ^      6.2921+-0.7729        ^ definitely 4.1236x faster
   access-nbody                   13.3165+-0.1877     ^      2.9873+-0.1235        ^ definitely 4.4577x faster
   access-nsieve                   7.0340+-0.1611     ^      2.7871+-0.4898        ^ definitely 2.5238x faster
   bitops-3bit-bits-in-byte        9.9086+-0.4578     ^      1.5604+-0.0842        ^ definitely 6.3499x faster
   bitops-bits-in-byte            19.7679+-0.5447     ^      3.2862+-0.1380        ^ definitely 6.0155x faster
   bitops-bitwise-and             14.0890+-0.1835     ^      2.6123+-0.1488        ^ definitely 5.3934x faster
   bitops-nsieve-bits             20.9683+-0.3707     ^      5.6011+-3.0928        ^ definitely 3.7436x faster
   controlflow-recursive          10.1180+-1.0903     ^      3.4333+-0.4442        ^ definitely 2.9470x faster
   crypto-aes                     11.6860+-0.3691     ^      4.8807+-0.2735        ^ definitely 2.3943x faster
   crypto-md5                      9.6230+-0.3215     ^      3.1081+-0.2014        ^ definitely 3.0961x faster
   crypto-sha1                     8.9427+-0.1939     ^      3.4079+-0.1464        ^ definitely 2.6241x faster
   date-format-tofte              14.6384+-0.1859     ^      8.2467+-1.1542        ^ definitely 1.7751x faster
   date-format-xparb              13.9489+-0.1422     ^      5.7903+-0.0897        ^ definitely 2.4090x faster
   math-cordic                    18.4724+-0.3624     ^      3.6552+-0.4528        ^ definitely 5.0538x faster
   math-partial-sums              12.5854+-1.0010     ^      4.4906+-0.1942        ^ definitely 2.8026x faster
   math-spectral-norm             13.9168+-0.4676     ^      2.3823+-0.1053        ^ definitely 5.8417x faster
   regexp-dna                      7.5146+-0.0543            7.5068+-0.0960        
   string-base64                  11.7726+-0.2485     ^      4.7688+-0.2088        ^ definitely 2.4687x faster
   string-fasta                   25.2803+-1.5132     ^      6.3643+-0.3358        ^ definitely 3.9722x faster
   string-tagcloud                14.3337+-0.0660     ^      9.2762+-0.1280        ^ definitely 1.5452x faster
   string-unpack-code             26.0402+-2.8718           24.2419+-3.8663          might be 1.0742x faster
   string-validate-input          11.0649+-0.2889     ^      4.8250+-0.2475        ^ definitely 2.2932x faster

   <arithmetic>                   14.3848+-0.1585     ^      5.4335+-0.2673        ^ definitely 2.6474x faster

                                       NoJIT                      JIT                                        
V8Spider:
   crypto                        462.8188+-5.3305     ^     44.0877+-1.3795        ^ definitely 10.4977x faster
   deltablue                     933.2314+-6.7242     ^     75.2298+-7.2097        ^ definitely 12.4051x faster
   earley-boyer                  346.5754+-13.8079    ^     41.8177+-0.8726        ^ definitely 8.2878x faster
   raytrace                      185.5504+-0.8525     ^     34.8239+-8.6922        ^ definitely 5.3283x faster
   regexp                        103.3735+-2.0679     ^     57.9933+-0.6233        ^ definitely 1.7825x faster
   richards                     1092.5397+-33.5531    ^     50.9009+-1.7511        ^ definitely 21.4641x faster
   splay                         154.8283+-2.7336     ^     37.1096+-1.1089        ^ definitely 4.1722x faster

   <geometric>                   336.1627+-2.1021     ^     47.1635+-2.1802        ^ definitely 7.1276x faster

                                       NoJIT                      JIT                                        
Geomean of preferred means:
   <scaled-result>                69.5372+-0.3621     ^     16.0006+-0.4935        ^ definitely 4.3459x faster

I only ran those two benchmarks because I wanted to give you a quick answer.

As you can see, the expected speed-up is at least 2x, and on V8Spider it should be even higher.  There should not be *any* regressions.

Please view a regression from the JIT on any of these tests the same way that you would view a test failure.  These benchmarks represent a best-case scenario for the JIT.  If you can't get speed-ups from the JIT on all of these tests, then probably, the JIT will experience bad performance pathologies in real life and people will disable the JIT whenever they can because the interpreter will be faster for them.  Viewed under that lens, the ARMv6 JIT appears to be badly broken and is not anywhere near the shape it would need to be in for it to be enabled.
Comment 10 Caio Lima 2017-07-13 11:49:31 PDT
(In reply to Filip Pizlo from comment #9) 
> I only ran those two benchmarks because I wanted to give you a quick answer.
> 
> As you can see, the expected speed-up is at least 2x, and on V8Spider it
> should be even higher.  There should not be *any* regressions.
> 
> Please view a regression from the JIT on any of these tests the same way
> that you would view a test failure.  These benchmarks represent a best-case
> scenario for the JIT.  If you can't get speed-ups from the JIT on all of
> these tests, then probably, the JIT will experience bad performance
> pathologies in real life and people will disable the JIT whenever they can
> because the interpreter will be faster for them.  Viewed under that lens,
> the ARMv6 JIT appears to be badly broken and is not anywhere near the shape
> it would need to be in for it to be enabled.

That's a fair point. Now I need to follow the path that Filip mentioned on the mailing list. I will investigate whether there is a way to get perf improvements on the regressing tests, and if that isn't possible, we can focus on making the CLoop good.
Comment 11 Csaba Osztrogonác 2017-07-14 03:38:40 PDT
I think the disabled ICs are responsible for this huge performance regression
on the V8Spider benchmark, see https://bugs.webkit.org/show_bug.cgi?id=172765#c2

We had to disable inline caching completely as a workaround because of the
mentioned IC refactoring patches. The problem is that the jump instruction
and its target address can be 50-200 bytes apart, depending on when the
constant pool is flushed, but the IC generator logic needs the exact jump
size. I think the proper fix would be to completely get rid of the constant
pool, which is only used by the old ARM traditional assembler. But unfortunately
nobody has been interested in fixing this serious issue for about a year. (Me neither, and I don't and won't have time to work on it in the future.)
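To make the constraint concrete, here is a minimal C++ sketch of the difference between a fixed-shape patchable jump and a literal-pool-based one. All names and the byte layout are hypothetical simplifications, not JSC's actual MacroAssembler API:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Hypothetical fixed-shape patchable jump: the emitted sequence always
    // occupies the same number of bytes, so IC code can later rewrite the
    // target in place without re-measuring anything.
    constexpr std::size_t kPatchableJumpSize = 1 + sizeof(void*); // opcode + target

    struct PatchableJump {
        uint8_t* location; // start of the emitted jump sequence

        void retarget(void* newTarget) const
        {
            // Valid only because the sequence has a known, fixed layout.
            std::memcpy(location + 1, &newTarget, sizeof(newTarget));
        }
    };

    // With the traditional ARM assembler, the branch is instead something
    // like "ldr pc, [pc, #offset]": the 32-bit target lives in a constant
    // pool that the assembler flushes lazily. The distance between the
    // instruction and its constant therefore varies from site to site (the
    // 50-200 bytes mentioned above), so there is no fixed jump size for the
    // IC generator to assume, and in-place repatching like the above
    // becomes unreliable.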
Comment 12 Carlos Alberto Lopez Perez 2017-07-14 04:46:19 PDT
(In reply to Michael Catanzaro from comment #7)
> Well I don't understand which tests most accurately reflect real-world
> websites, but I bet Filip does. If he thinks V8Spider is most important, we
> should probably listen.
> 
> Occasionally, we discover a long way into a project that things have been
> going in the wrong direction. We need to make sure we are making the right
> choice as to when it's time to keep trying and when it's time to accept
> things did not work out and start over. I'm very surprised that Filip thinks
> we can get better performance out of cloop, but if so, it would be
> inadvisable to look the other way.

I also don't know which tests accurately reflect real-world websites, but I do understand that something that improves performance on 87% of the whole JSC performance test suite is a neat gain in performance. You don't need to be an expert in compilers to understand this.

Otherwise please just drop all those tests that are meaningless, and keep only V8Spider.

I don't mean to say that there isn't any problem with the current ARMv6 JIT support. My point is that, given these benchmark results, I don't buy that the JIT is not a performance improvement over the CLoop for ARMv6.
Comment 13 Saam Barati 2017-07-14 10:07:33 PDT
(In reply to Csaba Osztrogonác from comment #11)
> [quote of comment #11 snipped]

It seems like this should be part of the necessary work to land ARMv6 support.
Comment 14 Filip Pizlo 2017-07-14 11:36:13 PDT
(In reply to Carlos Alberto Lopez Perez from comment #12)
> (In reply to Michael Catanzaro from comment #7)
> > [quote of comment #7 snipped]
> 
> I also don't know which tests accurately reflect real-world websites, but I
> do understand that something that improves performance on 87% of the whole
> JSC performance test suite is a neat gain in performance. You don't need to
> be an expert in compilers to understand this.

This is a test suite.  If you aren't achieving speed-ups on all of them, then something is broken.  Think of regressions as test failures.

I never look at the performance of the whole suite.  We've never used that to justify changes.  We typically require each suite to be progressed for something to count as an optimization.  Rarely, we make trade-offs between suites, but only if the regressed suite is regressed only a bit (like single-digit percentages, not double digits like here).

> 
> Otherwise please just drop all those tests that are meaningless, and keep
> only V8Spider.

We use a broad spectrum of tests.  Test suites are better if they have more tests.  We aren't going to drop tests because we don't want to lose test coverage.

V8Spider is one of our inline caching tests.  Inline caching is fundamental to dynamic language performance.  The cloop has inline caching.  It's better to have an interpreter with inline caching than a compiler without inline caching.

SunSpider and other benchmarks measure other optimization functionality, which is just as important.

> 
> I don't want to mean here that there isn't any problem with the current
> ARMv6 JIT support. My point is that (given this benchmark results) I don't
> buy that JIT is not a performance improvement over cloop for ARMv6.

Based on how we interpret benchmark data, the ARMv6 JIT perf data tells us that the cloop is better for perf on that CPU.

We require the JIT to be a *big* speed-up before we enable it, because it also has costs that these benchmarks don't capture.  On some benchmarks, like JSBench, the JIT is usually a slight regression.  So, for us to enable the JIT, we require the JIT to be a huge progression on benchmarks that it's designed to improve, like V8Spider.

Real-world programs behave like a mix of JSBench, SunSpider, V8Spider, and our other tests.  Enabling the JIT will slow down the JSBench-like part of a real app (that's the run-once start-up code).  For this to make sense, we want to know that the speed-up on the other parts is really big.
Comment 15 Filip Pizlo 2017-07-14 11:37:06 PDT
(In reply to Csaba Osztrogonác from comment #11)
> [quote of comment #11 snipped]

This is a great argument for removing the ARMv6 backend from the tree.

I'm not opposed to ARMv6.  But if you come back with a patch to reland it, I want that patch to get the same kind of JIT qualification that we do for the other platforms on which we claim that WebKit has a JIT.
Comment 16 Caio Lima 2017-07-14 11:56:36 PDT
(In reply to Csaba Osztrogonác from comment #11)
> [quote of comment #11 snipped]

I remember you pointed this out some time ago. I think it's time to revisit it and see if I can fix that issue. I agree that it's a requirement for re-enabling the JIT on ARMv6. Since I'm investigating the cause of the V8Spider regression, this is the first thing I will look at.

(In reply to Filip Pizlo from comment #15)
> 
> This is a great argument for removing the ARMv6 backend from the tree.
> 
> I'm not opposed to ARMv6.  But if you come back with a patch to reland it, I
> want that patch to get the same kind of JIT qualification that we do for the
> other platforms on which we claim that WebKit has a JIT.

ICs are one of the most important optimizations for dynamic languages, and I doubt that, even if we can compile JS efficiently for these architectures, it will beat an interpreter that is capable of ICs. Last time I touched the IC code for GetByIdWithThis, the improvement was ~11x. Given the other improvements we got, IMO it's worth investing some of my time into seeing if I can get past the regressions.
Comment 17 Filip Pizlo 2017-07-14 11:58:15 PDT
(In reply to Michael Catanzaro from comment #7)
> [quote of comment #7 snipped]

Let me be a bit more open about what I mean:

- We know that some JSC open source clients (people who fork the project, but keep in touch with us at conferences and such) disable the JIT on embedded platforms.  They have data to say it benefits their JS apps.  I don't want to name names.  The common thread is: if you're on a CPU we don't optimize for then the JIT isn't a speed-up in practice.

- We have stopped tuning for the case where the concurrent JIT is disabled, so our work can largely ignore the execution time of the JIT.  I think this means that we always regress how long it takes to JIT, which translates into crappy JIT performance on 32-bit platforms.  We also don't do our usual testing on 32-bit, so while the rest of WK on 32-bit platforms is probably quite stable, our JIT probably has lots of terrible bugs.  We probably want to disable the 32-bit JITs eventually just for stability of 32-bit.

- We know that the interpreter can be optimized by adding better inline cache support.  Inline caches make JS faster, and the LLInt has only very basic inline caches.  The old C++ interpreter had better inline caches, and we've seen data to suggest that for clients that disable the JIT, the cloop is a regression over the old JIT.  We believe that's because the cloop has less sophisticated inline caches.  One measurement seemed to suggest that this is worth ~20%, but I'm not sure - could be more or less.  Specifically:

    -> The old interpreter and the JIT can do polymorphic inline caches, where there are multiple cases that we can cascade through.

    -> The old interpreter and the JIT can do structure-based prototype handling, for when watchpoints don't work.

    -> The JIT can do inline caching on exotic DOM objects because it groks the impure property watchpoint.  The LLInt doesn't know how to do this.  It's easy to fix.

- We know that the interpreter's inline caches benefit both clients that disable the JIT and clients that have the JIT, because even if you have the JIT, your run-once code will run through the interpreter.  Thus, inline cache optimizations will benefit both embedded clients on weird CPUs and us.  That means that we can share optimizations with each other.

- We want to add getter/setter inlining to our inline caches.  If we extended this to the LLInt, it would be an enormous speed-up for modern JS code.  Modern JS code uses getters/setters everywhere.  I'm happy to guide someone else through this project.  I don't know when/if any of the regular JSC hackers will get to this.  Inlining getters/setters is a 10x speed-up on getter/setter-heavy code in the JIT, so that's a big deal.  It would probably be a smaller speed-up in the LLInt, so let's say 2x conservatively.  The JIT can do this today, but it's part of the machinery that's disabled on ARMv6: you currently need both inline caches and the DFG to do it.

- There's probably 20% performance or more available just from tuning:

    -> Find commonly executed slow paths and write assembly for them.

    -> Find fast paths that aren't useful and remove them.

    -> Optimize C++ code that the interpreter often relies on, like runtime functions that are more relevant to the interpreter than the JIT because the JIT inlines them.

    -> Do a better job of avoiding overhead that is there to support type inference if the JIT is disabled.  If you don't have a JIT, you don't need type inference.

These are things that we can all do together to benefit all platforms that aren't the primary targets of optimization (like x86-64 and ARM64).  It will benefit not just ARMv6 but also MIPS, RISC-V, and lots of other stuff.  Ideally, we'll be able to make a high-speed interpreter-based JSC that is totally CPU-agnostic.

If we devote the same amount of effort to improving the ARMv6 JIT, then I don't know if it would be as fast as what I propose above.  Considering the current rate of progress, it seems like it would take a long time for the ARMv6 JIT to be a real speed-up.  On the other hand, there's a good chance that the cloop optimizations will give 20% (ICs) + 20% (more tuning) plus another 2x for getter/setter code.

So, I think that the cloop optimization path seems like a better bet for producing real speed-ups on ARMv6.
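As a rough illustration of the inline-cache cascade described above, here is a minimal C++ sketch of a polymorphic property-load cache. The object model and all names are hypothetical simplifications, not JSC's actual Structure/IC implementation:

    #include <array>
    #include <cstdint>

    // Simplified object model: a structure ID determines where each
    // property lives in the object's storage.
    struct JSObject {
        uint32_t structureID;
        uint64_t storage[8]; // property slots
    };

    // One monomorphic case: "if the object has this structure, the
    // property is at this offset."
    struct CacheCase {
        uint32_t structureID;
        uint8_t offset;
    };

    struct PolymorphicGetCache {
        std::array<CacheCase, 4> cases {};
        uint8_t count { 0 };

        // Fast path: cascade through the cached cases.
        bool tryGet(const JSObject& object, uint64_t& result)
        {
            for (uint8_t i = 0; i < count; ++i) {
                if (cases[i].structureID == object.structureID) {
                    result = object.storage[cases[i].offset];
                    return true;
                }
            }
            return false; // miss: do the full slow lookup, then record() it
        }

        // The slow path calls this after a full lookup so future hits are fast.
        void record(uint32_t structureID, uint8_t offset)
        {
            if (count < cases.size())
                cases[count++] = { structureID, offset };
        }
    };

Repeat property accesses that hit one of the cached cases skip the full lookup entirely, which is the effect an IC-capable interpreter gets even without a JIT.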
Comment 18 Caio Lima 2017-08-23 07:29:38 PDT
Created attachment 318867 [details]
CLoop vs JIT + IC enabled benchmarks

Hi guys. Sorry for the delay here.

After our last discussion, I put some effort into enabling ICs for ARMv6 in the JIT layers, and I have finally collected some results.

Now we have regressions in just 2 SunSpider tests (they aren't regressing in LongSpider) and 3 Octane tests (gbemu, typescript, and box2d). I also see regressions in microbenchmarks, mainly the observable-side-effects and set/map tests.

So, with these new results, I would like to restart the discussion about whether the fixes I've made to re-enable the JIT on ARMv6 are good enough to be accepted into mainline.