Bug 158619 - [JSC] Inline JSC::toInt32 to improve kraken
Summary: [JSC] Inline JSC::toInt32 to improve kraken
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-06-10 10:00 PDT by Yusuke Suzuki
Modified: 2016-06-13 09:51 PDT (History)
5 users (show)

See Also:


Attachments
Patch (13.29 KB, patch)
2016-06-10 21:13 PDT, Yusuke Suzuki
mark.lam: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Yusuke Suzuki 2016-06-10 10:00:33 PDT
Linux perf tool analysis on several kraken benchmarks shows that JSC::toInt32 is frequently called.
But, the function JSC::toInt32 itself is small one. It's worth inlining.
Comment 1 Yusuke Suzuki 2016-06-10 10:10:24 PDT
Seems good. I'll check Octane, SunSpider, ASMBench and upload the patch.

Benchmark report for Kraken on hanayamata.

VMs tested:
"baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf-master/Release/bin/jsc
"patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf/Release/bin/jsc

Collected 100 samples per benchmark/VM, with 100 VM invocations per benchmark. Emitted a call to gc() between
sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                           baseline                  patched                                      

ai-astar                                96.084+-0.548             96.015+-0.523         
audio-beat-detection                    48.621+-1.138             48.357+-1.020         
audio-dft                              123.803+-0.373            123.549+-0.109         
audio-fft                               37.399+-0.050             37.381+-0.039         
audio-oscillator                        53.682+-0.071      ?      53.718+-0.082         ?
imaging-darkroom                        87.673+-0.063      ?      87.724+-0.060         ?
imaging-desaturate                      56.400+-0.064             56.333+-0.041         
imaging-gaussian-blur                   79.772+-1.849      ?      79.936+-1.394         ?
json-parse-financial                    44.649+-0.208      ?      44.997+-0.302         ?
json-stringify-tinderbox                27.110+-0.056             26.812+-0.428           might be 1.0111x faster
stanford-crypto-aes                     43.989+-0.157      ^      43.001+-0.078         ^ definitely 1.0230x faster
stanford-crypto-ccm                     44.338+-0.688      ?      44.588+-0.686         ?
stanford-crypto-pbkdf2                 108.420+-1.044      ^     106.231+-0.437         ^ definitely 1.0206x faster
stanford-crypto-sha256-iterative        35.538+-0.077      ^      35.314+-0.037         ^ definitely 1.0063x faster

<arithmetic>                            63.391+-0.168             63.140+-0.140           might be 1.0040x faster
Comment 2 Yusuke Suzuki 2016-06-10 21:13:33 PDT
Created attachment 281075 [details]
Patch
Comment 3 Yusuke Suzuki 2016-06-10 21:16:47 PDT
Performance results. mandreel seems noise. To ensure that, we retake the mandreel 100 times, and the result is the following.

Benchmark report for Octane on hanayamata.

VMs tested:
"baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf-master/Release/bin/jsc
"patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf/Release/bin/jsc

Collected 100 samples per benchmark/VM, with 100 VM invocations per benchmark. Emitted a call to
gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used
the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark
execution times with 95% confidence intervals in milliseconds.

                            baseline                  patched                                      

mandreel       x2      49.08879+-0.28178         48.86048+-0.10656       

<geometric>            49.08879+-0.28178         48.86048+-0.10656         might be 1.0047x faster




Benchmark report for SunSpider, Octane, Kraken, and AsmBench on hanayamata.

VMs tested:
"baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf-master/Release/bin/jsc
"patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf/Release/bin/jsc

Collected 30 samples per benchmark/VM, with 30 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to
get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                                                 baseline                  patched
SunSpider:
   3d-cube                                    6.1564+-0.2165            6.0395+-0.0198          might be 1.0194x faster
   3d-morph                                  25.7981+-0.1100     ?     25.9357+-0.1825        ?
   3d-raytrace                                7.4267+-0.5293            6.8834+-0.2116          might be 1.0789x faster
   access-binary-trees                        2.3713+-0.0980     ?      2.4013+-0.1151        ? might be 1.0127x slower
   access-fannkuch                            6.9854+-0.3202            6.7167+-0.2763          might be 1.0400x faster
   access-nbody                               2.8520+-0.0103     ?      2.9024+-0.0567        ? might be 1.0177x slower
   access-nsieve                              3.2326+-0.0544            3.2281+-0.0483
   bitops-3bit-bits-in-byte                   1.2229+-0.0422     ?      1.2335+-0.0501        ?
   bitops-bits-in-byte                        2.8118+-0.1010            2.7639+-0.0888          might be 1.0173x faster
   bitops-bitwise-and                         2.0064+-0.0246            1.9841+-0.0082          might be 1.0112x faster
   bitops-nsieve-bits                         3.5306+-0.2168            3.3985+-0.1914          might be 1.0389x faster
   controlflow-recursive                      2.8202+-0.0975     ?      2.8671+-0.1323        ? might be 1.0166x slower
   crypto-aes                                 5.2740+-0.1032     ?      5.3000+-0.2027        ?
   crypto-md5                                 2.7913+-0.1429            2.7811+-0.1261
   crypto-sha1                                2.5246+-0.0264     !      2.6164+-0.0640        ! definitely 1.0364x slower
   date-format-tofte                         10.3640+-0.1394     ?     10.4484+-0.1300        ?
   date-format-xparb                          6.0335+-0.1035     ?      6.0831+-0.1376        ?
   math-cordic                                3.0874+-0.0865     ?      3.1688+-0.1189        ? might be 1.0264x slower
   math-partial-sums                         10.5272+-0.0396     ?     10.5978+-0.1326        ?
   math-spectral-norm                         2.1891+-0.0521     ?      2.2340+-0.0519        ? might be 1.0205x slower
   regexp-dna                                 7.3567+-0.0223     ?      7.3860+-0.0869        ?
   string-base64                              3.9486+-0.0283     ?      4.0317+-0.0774        ? might be 1.0211x slower
   string-fasta                               6.6015+-0.2154            6.4747+-0.1874          might be 1.0196x faster
   string-tagcloud                            9.7379+-0.0710            9.6021+-0.1486          might be 1.0141x faster
   string-unpack-code                        19.7744+-0.0478     ?     19.8322+-0.2298        ?
   string-validate-input                      4.2980+-0.1012     ?      4.3578+-0.1089        ? might be 1.0139x slower

   <arithmetic>                               6.2201+-0.0308            6.2026+-0.0274          might be 1.0028x faster

                                                 baseline                  patched
Octane:
   encrypt                                   0.17962+-0.00241    ?     0.18033+-0.00120       ?
   decrypt                                   2.98068+-0.01640    ?     3.00958+-0.04567       ?
   deltablue                        x2       0.15236+-0.00144          0.15025+-0.00117         might be 1.0141x faster
   earley                                    0.33712+-0.00074    ?     0.33722+-0.00079       ?
   boyer                                     5.32422+-0.00564    ?     5.32558+-0.00841       ?
   navier-stokes                    x2       4.78881+-0.00693    ?     4.79320+-0.01528       ?
   raytrace                         x2       0.93551+-0.00362          0.93233+-0.00309
   richards                         x2       0.09723+-0.00092    ?     0.09801+-0.00167       ?
   splay                            x2       0.39334+-0.00123    ?     0.39364+-0.00119       ?
   regexp                           x2      18.71392+-0.06857         18.65126+-0.05944
   pdfjs                            x2      42.39686+-0.35091         42.31642+-0.30435
   mandreel                         x2      48.68664+-0.15582    !    49.25694+-0.29339       ! definitely 1.0117x slower
   gbemu                            x2      39.31304+-1.31894         38.79220+-1.50068         might be 1.0134x faster
   closure                                   0.60327+-0.00395          0.60178+-0.00309
   jquery                                    8.02553+-0.02883    ?     8.08426+-0.14237       ?
   box2d                            x2      14.63042+-0.09236    ?    14.69277+-0.19818       ?
   zlib                             x2     368.45610+-5.24313        364.03090+-2.28827         might be 1.0122x faster
   typescript                       x2     810.87181+-4.19509        810.07914+-4.30712

   <geometric>                               5.90654+-0.01651          5.90027+-0.01901         might be 1.0011x faster

                                                 baseline                  patched
Kraken:
   ai-astar                                   95.966+-1.109      ?      96.090+-0.925         ?
   audio-beat-detection                       47.492+-1.701             46.657+-1.232           might be 1.0179x faster
   audio-dft                                 123.721+-0.402      ?     124.373+-0.666         ?
   audio-fft                                  37.322+-0.036             37.276+-0.015
   audio-oscillator                           53.789+-0.285             53.722+-0.157
   imaging-darkroom                           88.159+-0.917             87.730+-0.094
   imaging-desaturate                         56.332+-0.064      ?      56.345+-0.104         ?
   imaging-gaussian-blur                      80.524+-2.687             79.091+-2.555           might be 1.0181x faster
   json-parse-financial                       44.610+-0.441      ?      44.791+-0.325         ?
   json-stringify-tinderbox                   27.087+-0.035      ^      26.609+-0.144         ^ definitely 1.0180x faster
   stanford-crypto-aes                        43.669+-0.210      ^      42.862+-0.115         ^ definitely 1.0188x faster
   stanford-crypto-ccm                        45.213+-1.424             44.490+-1.290           might be 1.0162x faster
   stanford-crypto-pbkdf2                    107.665+-0.581      ^     106.229+-0.807         ^ definitely 1.0135x faster
   stanford-crypto-sha256-iterative           35.470+-0.112             35.351+-0.078

   <arithmetic>                               63.359+-0.230             62.972+-0.281           might be 1.0061x faster

                                                 baseline                  patched
AsmBench:
   towers.c                                 280.7607+-10.5500         276.5852+-2.2406          might be 1.0151x faster
   n-body.c                                 930.6878+-6.5519          924.9891+-6.5493
   float-mm.c                               718.3752+-4.3448     ?    723.1201+-5.6769        ?
   container.cpp                           3029.7668+-37.4858    ?   3048.7741+-25.0556       ?
   quicksort.c                              439.4318+-1.6382          439.2427+-1.8137
   gcc-loops.cpp                           4163.1676+-18.3908    ?   4175.9266+-52.8059       ?
   bigfib.cpp                               456.7496+-6.6840     ?    457.0188+-5.5533        ?
   hash-map                                 153.3622+-1.5466     ?    153.5558+-1.7462        ?
   dry.c                                    478.1047+-12.4209         477.8930+-9.9328

   <geometric>                              688.1226+-3.0102          688.0901+-2.4242          might be 1.0000x faster

                                                 baseline                  patched
Geomean of preferred means:
   <scaled-result>                           35.5738+-0.0610           35.4851+-0.0812          might be 1.0025x faster
Comment 4 Yusuke Suzuki 2016-06-10 21:33:39 PDT
Comment on attachment 281075 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=281075&action=review

> Source/JavaScriptCore/runtime/JSCJSValueInlines.h:1062
> +ALWAYS_INLINE int32_t toInt32(double number)

It may be good to change this ALWAYS_INLINE to inline to rely on the compiler's decision. I'll check this.
Comment 5 Yusuke Suzuki 2016-06-10 21:57:48 PDT
Comment on attachment 281075 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=281075&action=review

>> Source/JavaScriptCore/runtime/JSCJSValueInlines.h:1062
>> +ALWAYS_INLINE int32_t toInt32(double number)
> 
> It may be good to change this ALWAYS_INLINE to inline to rely on the compiler's decision. I'll check this.

Performance results recommend us to use ALWAYS_INLINE for this toInt32 function. So use ALWAYS_INLINE. 

Benchmark report for Kraken on hanayamata.

VMs tested:
"baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf-master/Release/bin/jsc
"patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf/Release/bin/jsc
"always" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf2/Release/bin/jsc

Collected 100 samples per benchmark/VM, with 100 VM invocations per benchmark. Emitted a call to gc() between sample measurements.
Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level
timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                                 baseline                  patched                    always                always v. baseline    

stanford-crypto-pbkdf2       107.863+-0.376      ^     106.828+-0.409            106.087+-0.450         ^ definitely 1.0167x faster

<arithmetic>                 107.863+-0.376      ^     106.828+-0.409            106.087+-0.450         ^ definitely 1.0167x faster
Comment 6 Mark Lam 2016-06-10 22:06:46 PDT
Comment on attachment 281075 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=281075&action=review

r=me with fix.

>>> Source/JavaScriptCore/runtime/JSCJSValueInlines.h:1062
>>> +ALWAYS_INLINE int32_t toInt32(double number)
>> 
>> It may be good to change this ALWAYS_INLINE to inline to rely on the compiler's decision. I'll check this.
> 
> Performance results recommend us to use ALWAYS_INLINE for this toInt32 function. So use ALWAYS_INLINE. 
> 
> Benchmark report for Kraken on hanayamata.
> 
> VMs tested:
> "baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf-master/Release/bin/jsc
> "patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf/Release/bin/jsc
> "always" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf2/Release/bin/jsc
> 
> Collected 100 samples per benchmark/VM, with 100 VM invocations per benchmark. Emitted a call to gc() between sample measurements.
> Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level
> timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.
> 
>                                  baseline                  patched                    always                always v. baseline    
> 
> stanford-crypto-pbkdf2       107.863+-0.376      ^     106.828+-0.409            106.087+-0.450         ^ definitely 1.0167x faster
> 
> <arithmetic>                 107.863+-0.376      ^     106.828+-0.409            106.087+-0.450         ^ definitely 1.0167x faster

I know that this function came from JSCJSValue.cpp, but I think a better place to put it is in runtime/MathCommon.h since it doesn't have anything to do with JSValues.
Comment 7 Yusuke Suzuki 2016-06-10 22:10:02 PDT
Comment on attachment 281075 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=281075&action=review

>>>> Source/JavaScriptCore/runtime/JSCJSValueInlines.h:1062
>>>> +ALWAYS_INLINE int32_t toInt32(double number)
>>> 
>>> It may be good to change this ALWAYS_INLINE to inline to rely on the compiler's decision. I'll check this.
>> 
>> Performance results recommend us to use ALWAYS_INLINE for this toInt32 function. So use ALWAYS_INLINE. 
>> 
>> Benchmark report for Kraken on hanayamata.
>> 
>> VMs tested:
>> "baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf-master/Release/bin/jsc
>> "patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf/Release/bin/jsc
>> "always" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf2/Release/bin/jsc
>> 
>> Collected 100 samples per benchmark/VM, with 100 VM invocations per benchmark. Emitted a call to gc() between sample measurements.
>> Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level
>> timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.
>> 
>>                                  baseline                  patched                    always                always v. baseline    
>> 
>> stanford-crypto-pbkdf2       107.863+-0.376      ^     106.828+-0.409            106.087+-0.450         ^ definitely 1.0167x faster
>> 
>> <arithmetic>                 107.863+-0.376      ^     106.828+-0.409            106.087+-0.450         ^ definitely 1.0167x faster
> 
> I know that this function came from JSCJSValue.cpp, but I think a better place to put it is in runtime/MathCommon.h since it doesn't have anything to do with JSValues.

Right! I'll move this to MathCommon.h (And that is also the better place to put operationToInt32 since operationMathPow resides there).
Comment 8 Yusuke Suzuki 2016-06-10 22:29:10 PDT
Comment on attachment 281075 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=281075&action=review

>>>>> Source/JavaScriptCore/runtime/JSCJSValueInlines.h:1062
>>>>> +ALWAYS_INLINE int32_t toInt32(double number)
>>>> 
>>>> It may be good to change this ALWAYS_INLINE to inline to rely on the compiler's decision. I'll check this.
>>> 
>>> Performance results recommend us to use ALWAYS_INLINE for this toInt32 function. So use ALWAYS_INLINE. 
>>> 
>>> Benchmark report for Kraken on hanayamata.
>>> 
>>> VMs tested:
>>> "baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf-master/Release/bin/jsc
>>> "patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf/Release/bin/jsc
>>> "always" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/perf2/Release/bin/jsc
>>> 
>>> Collected 100 samples per benchmark/VM, with 100 VM invocations per benchmark. Emitted a call to gc() between sample measurements.
>>> Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level
>>> timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.
>>> 
>>>                                  baseline                  patched                    always                always v. baseline    
>>> 
>>> stanford-crypto-pbkdf2       107.863+-0.376      ^     106.828+-0.409            106.087+-0.450         ^ definitely 1.0167x faster
>>> 
>>> <arithmetic>                 107.863+-0.376      ^     106.828+-0.409            106.087+-0.450         ^ definitely 1.0167x faster
>> 
>> I know that this function came from JSCJSValue.cpp, but I think a better place to put it is in runtime/MathCommon.h since it doesn't have anything to do with JSValues.
> 
> Right! I'll move this to MathCommon.h (And that is also the better place to put operationToInt32 since operationMathPow resides there).

Looking into MathCommon, it includes several headers to gain some define flags. But I think these flags should be moved to WTF/Compiler.h.
I'll create a new patch later.
Comment 9 Yusuke Suzuki 2016-06-10 22:30:20 PDT
Committed r201964: <http://trac.webkit.org/changeset/201964>
Comment 10 Yusuke Suzuki 2016-06-10 23:00:20 PDT
Committed r201966: <http://trac.webkit.org/changeset/201966>
Comment 11 Darin Adler 2016-06-11 12:35:02 PDT
Comment on attachment 281075 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=281075&action=review

> Source/JavaScriptCore/runtime/JSCJSValueInlines.h:1057
> +// of the resulting bit-pattern (as such this metod is also called to implement

Typo here "metod".

> Source/JavaScriptCore/runtime/JSCJSValueInlines.h:1071
> +    // Note this case handles 0, -0, and all infinte, NaN, & denormal value.

Typo here "infinte"
Comment 12 Yusuke Suzuki 2016-06-13 09:47:28 PDT
Comment on attachment 281075 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=281075&action=review

Thanks! I'll land the follow up patch that fixes pointed typos.

>> Source/JavaScriptCore/runtime/JSCJSValueInlines.h:1057
>> +// of the resulting bit-pattern (as such this metod is also called to implement
> 
> Typo here "metod".

Thanks! Fixed in the follow up patch.

>> Source/JavaScriptCore/runtime/JSCJSValueInlines.h:1071
>> +    // Note this case handles 0, -0, and all infinte, NaN, & denormal value.
> 
> Typo here "infinte"

Fixed.
Comment 13 Yusuke Suzuki 2016-06-13 09:51:05 PDT
Committed r201993: <http://trac.webkit.org/changeset/201993>