Bug 172765 - [meta][JSC][ARMv6] Support ARMv6 architecture in LLInt and JIT layers
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Caio Lima
Depends on: 172767 167411 172766 172768 172972
Blocks: 108645
Reported: 2017-05-31 12:03 PDT by Caio Lima
Modified: 2017-09-05 07:43 PDT

Attachments
CLoop vs JIT enabled benchmarks (90.84 KB, text/plain)
2017-07-13 07:44 PDT, Caio Lima
CLoop vs JIT + IC enabled benchmarks (98.05 KB, text/plain)
2017-08-23 07:29 PDT, Caio Lima

Description Caio Lima 2017-05-31 12:03:55 PDT
Currently the ARMv6 architecture isn't properly supported in JSC, even though there is partial support in some places. However, there are ARMv6-based devices that could be important to support, such as the first-generation Raspberry Pi and the Raspberry Pi Zero. There is some support for ARMv6 in Igalia's WPE downstream branch[1], and the work to make it at least compilable isn't huge.

I'm going to use this bug as a meta-bug to track the work to be done on ARMv6 support. The plan is to first submit patches to make JSC compile for the architecture, and then send fixes for test failures and other architecture-specific issues and improvements.

The major goal is to make ARMv6 green on all tests.
Comment 1 Alex Christensen 2017-06-14 10:36:08 PDT
Note: Even though Raspberry Pi 3 has an arm64 CPU, Raspbian is distributed as an armv6 binary to support the lower-end Raspberry Pi models with the same image.
Comment 2 Csaba Osztrogonác 2017-06-20 02:15:20 PDT
If you are going to fix all of ARMv6 (and that means ARMv7 with the traditional ARM instruction set too), you should check bug 159408 and its dependencies as well.

The ARM (traditional) backend has been really slow and buggy since about a year ago (r202214), when we had to disable generating ICs for math functions because of this bug.

Maybe the best fix would be to remove the constant pool completely from the traditional ARM assembler, to be able to make ICs work again.
Comment 3 Caio Lima 2017-07-13 07:44:05 PDT
Created attachment 315345 [details]
CLoop vs JIT enabled benchmarks

These results compare the CLoop build (baseline) with the JIT-enabled build (changes) on the JSC benchmarks.
Comment 4 Carlos Alberto Lopez Perez 2017-07-13 07:54:11 PDT
(In reply to Alex Christensen from comment #1)
> Note: Even though Raspberry Pi 3 has an arm64 CPU, Raspbian is distributed
> as an armv6 binary to support the lower-end Raspberry Pi models with the
> same image.

Right.

So this work will make it possible to enable the JSC JIT on modern WebKit2GTK+ on Raspbian.
Comment 5 Filip Pizlo 2017-07-13 08:59:17 PDT
This perf data says that you’re probably just as well off running the cloop on this platform.  V8Spider is a super important test and a regression there is not OK. 

There are a bunch of CPU platforms that would love to have WebKit. Only three CPU platforms currently get serious perf love (armv7, arm64, and x86-64), and of those the 64-bit ones get particular attention. The best way to serve both the most-invested-in platforms and the minor ones is to have a really great cloop. You could optimize the cloop to beat the JIT on armv6 in less time than it would take you to fix and optimize the JIT. If you do it right, you'll beat the JIT on all minor platforms at once (mips, etc).

Also, remember that we actively optimize the JIT based on data we gather on big powerful machines (compared to an armv6). So our tuning will in many cases be good for us and bad for you. Our JIT is not meant for small embedded devices. It's meant for a 64-bit box with lots of RAM and multiple cores. So, if you want peak perf on armv6, your best bet is to optimize the cloop.
Comment 6 Carlos Alberto Lopez Perez 2017-07-13 09:44:33 PDT
(In reply to Filip Pizlo from comment #5)
> This perf data says that you’re probably just as well off running the cloop
> on this platform.  V8Spider is a super important test and a regression there
> is not OK. 
> 
> There are a bunch of CPU platforms that would love to have WebKit. Only
> three CPU platforms currently get serious perf love (armv7, arm64, and
> x86-64), and of those the 64-but ones get particular attention. The best way
> to serve both the most-invested-in platforms and the minor ones is to have a
> really great cloop. You could optimize cloop to beat the JIT on armv6 in
> less time than it would take you to fix and optimize the JIT. If you do it
> right, you’ll beat the JIT on all minor platforms at once (mips, etc). 
> 
> Also, remember that we actively optimize the JIT based on data we gather on
> big powerful machines (compared to a armv6). So our tuning will in many
> cases be good for us and bad for you. Our JIT is not meant for small
> embedded devices. It’s meant for a 64-bit box with lots of ram and multiple
> cores. So, if you want peak perf on armv6, your best bet is to optimize the
> cloop.

I have a very different reading of the data that has been uploaded.

* Number of tests that regressed with the JIT on ARMv6:
$ curl -sL 'https://bug-172765-attachments.webkit.org/attachment.cgi?id=315345'|grep -P "definitely ([0-9])+\.[0-9]+x slower"|wc -l
82

* Number of tests that improved with the JIT on ARMv6:
$ curl -sL 'https://bug-172765-attachments.webkit.org/attachment.cgi?id=315345'|grep -P "definitely ([0-9])+\.[0-9]+x faster"|wc -l
542

So the JIT on ARMv6 improves 542 tests and regresses 82. That is, it improves performance on 87% of the tests (542 of 624).


* How much was the regression for the tests that regressed?
$ curl -sL 'https://bug-172765-attachments.webkit.org/attachment.cgi?id=315345' | grep -Po "definitely ([0-9])+\.[0-9]+x slower"|awk -F'definitely ' '{print $2}'|cut -d\. -f1|sort | uniq -c | sort -n -r
     43 1
     15 2
     10 3
      3 6
      3 4
      2 12
      1 9
      1 8
      1 7
      1 5
      1 19
      1 13

Of the tests that regressed, 43 were below 2x, 15 between 2x and 3x, and 10 between 3x and 4x.


* How much was the improvement for the tests that improved?
$ curl -sL 'https://bug-172765-attachments.webkit.org/attachment.cgi?id=315345' | grep -Po "definitely ([0-9])+\.[0-9]+x faster"|awk -F'definitely ' '{print $2}'|cut -d\. -f1|sort | uniq -c | sort -n -r
    115 1
     89 2
     51 3
     34 4
     21 7
     20 6
     19 8
     19 5
     19 13
     13 10
     10 20
      9 15
      9 11
      7 14
      6 9
      5 21
      5 19
      5 17
      5 16
      5 12
      4 24
      4 18
      3 77
      3 70
      3 44
      3 42
      3 38
      3 34
      3 33
      3 29
      3 26
      2 94
      2 83
      2 69
      2 49
      2 45
      2 41
      2 35
      2 32
      2 31
      2 25
      2 22
      2 141
      2 103
      1 90
      1 75
      1 72
      1 71
      1 68
      1 63
      1 59
      1 47
      1 37
      1 23
      1 160
      1 155
      1 105
      1 101
      1 100

The first column is the number of tests and the second the improvement factor (truncated to an integer). Lots of tests show improvements beyond 20x, going as far as a 160x performance improvement.


So, honestly, if this doesn't look like a good enough performance improvement to you, I don't know what would :\
Comment 7 Michael Catanzaro 2017-07-13 10:40:18 PDT
Well I don't understand which tests most accurately reflect real-world websites, but I bet Filip does. If he thinks V8Spider is most important, we should probably listen.

Occasionally, we discover a long way into a project that things have been going in the wrong direction. We need to make sure we are making the right choice as to when it's time to keep trying and when it's time to accept things did not work out and start over. I'm very surprised that Filip thinks we can get better performance out of cloop, but if so, it would be inadvisable to look the other way.
Comment 8 Michael Catanzaro 2017-07-13 10:42:13 PDT
(In reply to Filip Pizlo from comment #5)
> Also, remember that we actively optimize the JIT based on data we gather on
> big powerful machines (compared to a armv6). So our tuning will in many
> cases be good for us and bad for you. Our JIT is not meant for small
> embedded devices.

I suppose iOS devices are far more powerful than typical armv6 devices?
Comment 9 Filip Pizlo 2017-07-13 10:48:15 PDT
(In reply to Carlos Alberto Lopez Perez from comment #6)
> [comment #6's analysis of attachment 315345 snipped]
>
> So, honestly, if this doesn't look like a good enough performance improvement
> to you, I don't know what would :\

I'm happy to answer your question:

[pizlo@wrathchild OpenSource] Tools/Scripts/run-jsc-benchmarks NoJIT:JSC_useJIT=false:/System/Library/Frameworks/JavaScriptCore.framework/Resources/jsc JIT:JSC_useJIT=true:/System/Library/Frameworks/JavaScriptCore.framework/Resources/jsc --sunspider --v8-spider --outer 6
Warning: could not identify checkout location for NoJIT
Warning: could not identify checkout location for JIT
448/448                                            -e 
Generating benchmark report at /Volumes/Data/primary/OpenSource/NoJIT_JIT_SunSpiderV8Spider_wrathchild_20170713_1043_report.txt
And raw data at /Volumes/Data/primary/OpenSource/NoJIT_JIT_SunSpiderV8Spider_wrathchild_20170713_1043.json

Benchmark report for SunSpider and V8Spider on wrathchild (MacBookPro13,2).

VMs tested:
"NoJIT" at /System/Library/Frameworks/JavaScriptCore.framework/Versions/A/Resources/jsc
    export JSC_useJIT=false
"JIT" at /System/Library/Frameworks/JavaScriptCore.framework/Versions/A/Resources/jsc
    export JSC_useJIT=true

Collected 6 samples per benchmark/VM, with 6 VM invocations per benchmark. Emitted a call to gc() between
sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific
preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95%
confidence intervals in milliseconds.

                                       NoJIT                      JIT                                        
SunSpider:
   3d-cube                        11.5154+-0.2823     ^      5.6214+-0.4731        ^ definitely 2.0485x faster
   3d-morph                       16.4709+-0.4416     ^      5.6433+-0.1673        ^ definitely 2.9187x faster
   3d-raytrace                    17.0282+-0.5761     ^      5.5649+-0.9222        ^ definitely 3.0599x faster
   access-binary-trees             8.0210+-0.1176     ^      2.9373+-0.5648        ^ definitely 2.7308x faster
   access-fannkuch                25.9462+-0.8402     ^      6.2921+-0.7729        ^ definitely 4.1236x faster
   access-nbody                   13.3165+-0.1877     ^      2.9873+-0.1235        ^ definitely 4.4577x faster
   access-nsieve                   7.0340+-0.1611     ^      2.7871+-0.4898        ^ definitely 2.5238x faster
   bitops-3bit-bits-in-byte        9.9086+-0.4578     ^      1.5604+-0.0842        ^ definitely 6.3499x faster
   bitops-bits-in-byte            19.7679+-0.5447     ^      3.2862+-0.1380        ^ definitely 6.0155x faster
   bitops-bitwise-and             14.0890+-0.1835     ^      2.6123+-0.1488        ^ definitely 5.3934x faster
   bitops-nsieve-bits             20.9683+-0.3707     ^      5.6011+-3.0928        ^ definitely 3.7436x faster
   controlflow-recursive          10.1180+-1.0903     ^      3.4333+-0.4442        ^ definitely 2.9470x faster
   crypto-aes                     11.6860+-0.3691     ^      4.8807+-0.2735        ^ definitely 2.3943x faster
   crypto-md5                      9.6230+-0.3215     ^      3.1081+-0.2014        ^ definitely 3.0961x faster
   crypto-sha1                     8.9427+-0.1939     ^      3.4079+-0.1464        ^ definitely 2.6241x faster
   date-format-tofte              14.6384+-0.1859     ^      8.2467+-1.1542        ^ definitely 1.7751x faster
   date-format-xparb              13.9489+-0.1422     ^      5.7903+-0.0897        ^ definitely 2.4090x faster
   math-cordic                    18.4724+-0.3624     ^      3.6552+-0.4528        ^ definitely 5.0538x faster
   math-partial-sums              12.5854+-1.0010     ^      4.4906+-0.1942        ^ definitely 2.8026x faster
   math-spectral-norm             13.9168+-0.4676     ^      2.3823+-0.1053        ^ definitely 5.8417x faster
   regexp-dna                      7.5146+-0.0543            7.5068+-0.0960        
   string-base64                  11.7726+-0.2485     ^      4.7688+-0.2088        ^ definitely 2.4687x faster
   string-fasta                   25.2803+-1.5132     ^      6.3643+-0.3358        ^ definitely 3.9722x faster
   string-tagcloud                14.3337+-0.0660     ^      9.2762+-0.1280        ^ definitely 1.5452x faster
   string-unpack-code             26.0402+-2.8718           24.2419+-3.8663          might be 1.0742x faster
   string-validate-input          11.0649+-0.2889     ^      4.8250+-0.2475        ^ definitely 2.2932x faster

   <arithmetic>                   14.3848+-0.1585     ^      5.4335+-0.2673        ^ definitely 2.6474x faster

                                       NoJIT                      JIT                                        
V8Spider:
   crypto                        462.8188+-5.3305     ^     44.0877+-1.3795        ^ definitely 10.4977x faster
   deltablue                     933.2314+-6.7242     ^     75.2298+-7.2097        ^ definitely 12.4051x faster
   earley-boyer                  346.5754+-13.8079    ^     41.8177+-0.8726        ^ definitely 8.2878x faster
   raytrace                      185.5504+-0.8525     ^     34.8239+-8.6922        ^ definitely 5.3283x faster
   regexp                        103.3735+-2.0679     ^     57.9933+-0.6233        ^ definitely 1.7825x faster
   richards                     1092.5397+-33.5531    ^     50.9009+-1.7511        ^ definitely 21.4641x faster
   splay                         154.8283+-2.7336     ^     37.1096+-1.1089        ^ definitely 4.1722x faster

   <geometric>                   336.1627+-2.1021     ^     47.1635+-2.1802        ^ definitely 7.1276x faster

                                       NoJIT                      JIT                                        
Geomean of preferred means:
   <scaled-result>                69.5372+-0.3621     ^     16.0006+-0.4935        ^ definitely 4.3459x faster

I only ran those two benchmarks because I wanted to give you a quick answer.

As you can see, the expected speed-up is at least 2x, and on V8Spider it should be even higher.  There should not be *any* regressions.

Please view a regression from the JIT on any of these tests the same way that you would view a test failure.  These benchmarks represent a best-case scenario for the JIT.  If you can't get speed-ups from the JIT on all of these tests, then probably, the JIT will experience bad performance pathologies in real life and people will disable the JIT whenever they can because the interpreter will be faster for them.  Viewed under that lens, the ARMv6 JIT appears to be badly broken and is not anywhere near the shape it would need to be in for it to be enabled.
Comment 10 Caio Lima 2017-07-13 11:49:31 PDT
(In reply to Filip Pizlo from comment #9) 
> I only ran those two benchmarks because I wanted to give you a quick answer.
> 
> As you can see, the expected speed-up is at least 2x, and on V8Spider it
> should be even higher.  There should not be *any* regressions.
> 
> Please view a regression from the JIT on any of these tests the same way
> that you would view a test failure.  These benchmarks represent a best-case
> scenario for the JIT.  If you can't get speed-ups from the JIT on all of
> these tests, then probably, the JIT will experience bad performance
> pathologies in real life and people will disable the JIT whenever they can
> because the interpreter will be faster for them.  Viewed under that lens,
> the ARMv6 JIT appears to be badly broken and is not anywhere near the shape
> it would need to be in for it to be enabled.

That's a fair point. Now I need to follow the path that Filip mentioned on the mailing list. I will investigate whether there is a way to get perf improvements on the regressing tests, and if that isn't possible, we can focus on making the CLoop good.
Comment 11 Csaba Osztrogonác 2017-07-14 03:38:40 PDT
I think the disabled ICs are responsible for this huge performance regression
on the V8Spider benchmark, see https://bugs.webkit.org/show_bug.cgi?id=172765#c2

We had to disable inline caching completely as a workaround because of the
mentioned IC refactoring patches. The problem is that the jump instruction
and its target address can be 50-200 bytes apart, depending on when the
constant pool is flushed, but the IC generator logic needs the exact jump
size. I think the proper fix would be to completely get rid of the constant
pool, which is only used by the old ARM traditional assembler. But unfortunately
nobody has been interested in fixing this serious issue for about a year. (Me neither, and I don't and won't have time to work on it in the future.)
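To make the constraint concrete, here is a minimal C++ sketch of the difference between a fixed-shape patchable jump and a literal-pool-based one. All names and the byte layout are hypothetical simplifications, not JSC's actual MacroAssembler API:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Hypothetical fixed-shape patchable jump: the emitted sequence always
    // occupies the same number of bytes, so IC code can later rewrite the
    // target in place without re-measuring anything.
    constexpr std::size_t kPatchableJumpSize = 1 + sizeof(void*); // opcode + target

    struct PatchableJump {
        uint8_t* location; // start of the emitted jump sequence

        void retarget(void* newTarget) const
        {
            // Valid only because the sequence has a known, fixed layout.
            std::memcpy(location + 1, &newTarget, sizeof(newTarget));
        }
    };

    // With the traditional ARM assembler, the branch is instead something
    // like "ldr pc, [pc, #offset]": the 32-bit target lives in a constant
    // pool that the assembler flushes lazily. The distance between the
    // instruction and its constant therefore varies from site to site (the
    // 50-200 bytes mentioned above), so there is no fixed jump size for the
    // IC generator to assume, and in-place repatching like the above
    // becomes unreliable.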
Comment 12 Carlos Alberto Lopez Perez 2017-07-14 04:46:19 PDT
(In reply to Michael Catanzaro from comment #7)
> Well I don't understand which tests most accurately reflect real-world
> websites, but I bet Filip does. If he thinks V8Spider is most important, we
> should probably listen.
> 
> Occasionally, we discover a long way into a project that things have been
> going in the wrong direction. We need to make sure we are making the right
> choice as to when it's time to keep trying and when it's time to accept
> things did not work out and start over. I'm very surprised that Filip thinks
> we can get better performance out of cloop, but if so, it would be
> inadvisable to look the other way.

I also don't know which tests accurately reflect real-world websites, but I do understand that something that improves performance on 87% of the whole JSC performance test suite is a neat gain in performance. You don't need to be an expert in compilers to understand this.

Otherwise please just drop all those tests that are meaningless, and keep only V8Spider.

I don't mean to say that there isn't any problem with the current ARMv6 JIT support. My point is that, given these benchmark results, I don't buy that the JIT is not a performance improvement over the CLoop for ARMv6.
Comment 13 Saam Barati 2017-07-14 10:07:33 PDT
(In reply to Csaba Osztrogonác from comment #11)
> [quote of comment #11 snipped]

It seems like this should be part of the necessary work to land ARMv6 support.
Comment 14 Filip Pizlo 2017-07-14 11:36:13 PDT
(In reply to Carlos Alberto Lopez Perez from comment #12)
> (In reply to Michael Catanzaro from comment #7)
> > [quote of comment #7 snipped]
> 
> I also don't know which tests accurately reflect real-world websites, but I
> do understand that something that improves performance on 87% of the whole
> JSC performance test suite is a neat gain in performance. You don't need to
> be an expert in compilers to understand this.

This is a test suite.  If you aren't achieving speed-ups on all of them, then something is broken.  Think of regressions as test failures.

I never look at the performance of the whole suite.  We've never used that to justify changes.  We typically require each suite to be progressed for something to count as an optimization.  Rarely, we make trade-offs between suites, but only if the regressed suite is regressed only a bit (like single-digit percentages, not double digits like here).

> 
> Otherwise please just drop all those tests that are meaningless, and keep
> only V8Spider.

We use a broad spectrum of tests.  Test suites are better if they have more tests.  We aren't going to drop tests because we don't want to lose test coverage.

V8Spider is one of our inline caching tests.  Inline caching is fundamental to dynamic language performance.  The cloop has inline caching.  It's better to have an interpreter with inline caching than a compiler without inline caching.

SunSpider and other benchmarks measure other optimization functionality, which is just as important.

> 
> I don't want to mean here that there isn't any problem with the current
> ARMv6 JIT support. My point is that (given this benchmark results) I don't
> buy that JIT is not a performance improvement over cloop for ARMv6.

Based on how we interpret benchmark data, the ARMv6 JIT perf data tells us that the cloop is better for perf on that CPU.

We require the JIT to be a *big* speed-up before we enable it, because it also has costs that these benchmarks don't capture.  On some benchmarks, like JSBench, the JIT is usually a slight regression.  So, for us to enable the JIT, we require the JIT to be a huge progression on benchmarks that it's designed to improve, like V8Spider.

Real-world programs behave like a mix of JSBench, SunSpider, V8Spider, and our other tests.  Enabling the JIT will slow down the JSBench-like part of a real app (that's the run-once start-up code).  For this to make sense, we want to know that the speed-up on the other parts is really big.
Comment 15 Filip Pizlo 2017-07-14 11:37:06 PDT
(In reply to Csaba Osztrogonác from comment #11)
> [quote of comment #11 snipped]

This is a great argument for removing the ARMv6 backend from the tree.

I'm not opposed to ARMv6.  But if you come back with a patch to reland it, I want that patch to get the same kind of JIT qualification that we do for the other platforms on which we claim that WebKit has a JIT.
Comment 16 Caio Lima 2017-07-14 11:56:36 PDT
(In reply to Csaba Osztrogonác from comment #11)
> [quote of comment #11 snipped]

I remember you pointed this out some time ago. I think it's time to revisit it and see if I can fix that issue. I agree that it's a requirement for re-enabling the JIT on ARMv6. Since I'm investigating the cause of the V8Spider regression, this is the first thing I will look at.

(In reply to Filip Pizlo from comment #15)
> 
> This is a great argument for removing the ARMv6 backend from the tree.
> 
> I'm not opposed to ARMv6.  But if you come back with a patch to reland it, I
> want that patch to get the same kind of JIT qualification that we do for the
> other platforms on which we claim that WebKit has a JIT.

ICs are one of the most important optimizations for dynamic languages, and I doubt that, even if we can compile JS efficiently for these architectures, it will beat an interpreter that is capable of ICs. Last time I touched the IC code for GetByIdWithThis, the improvement was ~11x. Given the other improvements we got, IMO it's worth investing some of my time into seeing if I can get past the regressions.
Comment 17 Filip Pizlo 2017-07-14 11:58:15 PDT
(In reply to Michael Catanzaro from comment #7)
> [quote of comment #7 snipped]

Let me be a bit more open about what I mean:

- We know that some JSC open source clients (people who fork the project, but keep in touch with us at conferences and such) disable the JIT on embedded platforms.  They have data to say it benefits their JS apps.  I don't want to name names.  The common thread is: if you're on a CPU we don't optimize for then the JIT isn't a speed-up in practice.

- We have stopped tuning for the case where the concurrent JIT is disabled, so our work can largely ignore the execution time of the JIT.  I think this means that we always regress how long it takes to JIT, which translates into crappy JIT performance on 32-bit platforms.  We also don't do our usual testing on 32-bit, so while the rest of WK on 32-bit platforms is probably quite stable, our JIT probably has lots of terrible bugs.  We probably want to disable the 32-bit JITs eventually just for stability of 32-bit.

- We know that the interpreter can be optimized by adding better inline cache support.  Inline caches make JS faster, and the LLInt has only very basic inline caches.  The old C++ interpreter had better inline caches, and we've seen data to suggest that for clients that disable the JIT, the cloop is a regression over the old JIT.  We believe that's because the cloop has less sophisticated inline caches.  One measurement seemed to suggest that this is worth ~20%, but I'm not sure - could be more or less.  Specifically:

    -> The old interpreter and the JIT can do polymorphic inline caches, where there are multiple cases that we can cascade through.

    -> The old interpreter and the JIT can do structure-based prototype handling, for when watchpoints don't work.

    -> The JIT can do inline caching on exotic DOM objects because it groks the impure property watchpoint.  The LLInt doesn't know how to do this.  It's easy to fix.

- We know that the interpreter's inline caches benefit both clients that disable the JIT and clients that have the JIT, because even if you have the JIT, your run-once code will run through the interpreter.  Thus, inline cache optimizations will benefit both embedded clients on weird CPUs and us.  That means that we can share optimizations with each other.

- We want to add getter/setter inlining to our inline caches.  If we extended this to the LLInt, it would be an enormous speed-up for modern JS code.  Modern JS code uses getters/setters everywhere.  I'm happy to guide someone else through this project.  I don't know when/if any of the regular JSC hackers will get to this.  Inlining getters/setters is a 10x speed-up on getter/setter-heavy code in the JIT, so that's a big deal.  It would probably be a smaller speed-up in the LLInt, so let's say 2x conservatively.  The JIT can do this today, but it's part of the machinery that's disabled on ARMv6: you currently need both inline caches and the DFG to do it.

- There's probably 20% performance or more available just from tuning:

    -> Find commonly executed slow paths and write assembly for them.

    -> Find fast paths that aren't useful and remove them.

    -> Optimize C++ code that the interpreter often relies on, like runtime functions that are more relevant to the interpreter than the JIT because the JIT inlines them.

    -> Do a better job of avoiding overhead that is there to support type inference if the JIT is disabled.  If you don't have a JIT, you don't need type inference.

These are things that we can all do together to benefit all platforms that aren't the primary targets of optimization (like x86-64 and ARM64).  It will benefit not just ARMv6 but also MIPS, RISC-V, and lots of other stuff.  Ideally, we'll be able to make a high-speed interpreter-based JSC that is totally CPU-agnostic.

If we devote the same amount of effort to improving the ARMv6 JIT, then I don't know if it would be as fast as what I propose above.  Considering the current rate of progress, it seems like it would take a long time for the ARMv6 JIT to be a real speed-up.  On the other hand, there's a good chance that the cloop optimizations will give 20% (ICs) + 20% (more tuning) plus another 2x for getter/setter code.

So, I think that the cloop optimization path seems like a better bet for producing real speed-ups on ARMv6.
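As a rough illustration of the inline-cache cascade described above, here is a minimal C++ sketch of a polymorphic property-load cache. The object model and all names are hypothetical simplifications, not JSC's actual Structure/IC implementation:

    #include <array>
    #include <cstdint>

    // Simplified object model: a structure ID determines where each
    // property lives in the object's storage.
    struct JSObject {
        uint32_t structureID;
        uint64_t storage[8]; // property slots
    };

    // One monomorphic case: "if the object has this structure, the
    // property is at this offset."
    struct CacheCase {
        uint32_t structureID;
        uint8_t offset;
    };

    struct PolymorphicGetCache {
        std::array<CacheCase, 4> cases {};
        uint8_t count { 0 };

        // Fast path: cascade through the cached cases.
        bool tryGet(const JSObject& object, uint64_t& result)
        {
            for (uint8_t i = 0; i < count; ++i) {
                if (cases[i].structureID == object.structureID) {
                    result = object.storage[cases[i].offset];
                    return true;
                }
            }
            return false; // miss: do the full slow lookup, then record() it
        }

        // The slow path calls this after a full lookup so future hits are fast.
        void record(uint32_t structureID, uint8_t offset)
        {
            if (count < cases.size())
                cases[count++] = { structureID, offset };
        }
    };

Repeat property accesses that hit one of the cached cases skip the full lookup entirely, which is the effect an IC-capable interpreter gets even without a JIT.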
Comment 18 Caio Lima 2017-08-23 07:29:38 PDT
Created attachment 318867 [details]
CLoop vs JIT + IC enabled benchmarks

Hi guys. Sorry for the delay here.

After our last discussion, I put some effort into enabling ICs for ARMv6 in the JIT layers, and I have finally collected some results.

Now we have regressions in just 2 SunSpider tests (they aren't regressing in LongSpider) and 3 Octane tests (gbemu, typescript, and box2d). I also see regressions in microbenchmarks, mainly the observable-side-effects and set/map tests.

So, with these new results, I would like to restart the discussion about whether the fixes I've made to re-enable the JIT on ARMv6 are good enough to be accepted into mainline.