Bug 98694 - [Performance] Speed-up DOM tree traversal on ARM platforms
Summary: [Performance] Speed-up DOM tree traversal on ARM platforms
Status: UNCONFIRMED
Alias: None
Product: WebKit
Classification: Unclassified
Component: CSS
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Enhancement
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-10-08 15:32 PDT by Kulanthaivel Palanichamy
Modified: 2014-02-05 11:08 PST (History)
8 users

See Also:


Attachments
Proposed patch (7.49 KB, patch)
2012-10-08 17:59 PDT, Kulanthaivel Palanichamy
Proposed patch (10.39 KB, patch)
2012-10-08 18:06 PDT, Kulanthaivel Palanichamy
benjamin: review-
Design doc (46.74 KB, application/pdf)
2012-10-11 16:49 PDT, Kulanthaivel Palanichamy
Performance tests results on x86 (303.34 KB, text/html)
2012-10-11 16:54 PDT, Kulanthaivel Palanichamy
Performance tests results on ARM (52.28 KB, image/png)
2012-10-11 16:59 PDT, Kulanthaivel Palanichamy
Proposed patch (16.58 KB, patch)
2012-10-16 16:37 PDT, Kulanthaivel Palanichamy
Proposed patch (15.92 KB, patch)
2012-10-19 11:56 PDT, Kulanthaivel Palanichamy
Proposed patch (15.89 KB, patch)
2012-11-01 14:45 PDT, Kulanthaivel Palanichamy

Description Kulanthaivel Palanichamy 2012-10-08 15:32:49 PDT
DOM tree traversal seems to be slow on ARM platforms, probably due to their limited CPU cache sizes and high cache latencies.
Comment 1 Kulanthaivel Palanichamy 2012-10-08 17:59:54 PDT
Created attachment 167660 [details]
Proposed patch

I'm aware that the community doesn't prefer platform-specific code in platform-independent parts of WebCore,
but considering the performance gain we get out of this change, it seemed worth a try. Please let me know if you have any
suggestions for doing it in a cleaner way.

This particular patch gives a ~16% improvement in the Dromaeo CSS Selector Tests score on a Krait S4, dual-core @ 1.5 GHz.

http://dromaeo.com/?id=181274,181299

181299 - Stock WebKit
181274 - With prefetch optimization

Currently this patch will be enabled only on ARM platforms with the GCC or Clang compilers. The built-in prefetch has no effect on x86 platforms.
Comment 2 WebKit Review Bot 2012-10-08 18:01:36 PDT
Attachment 167660 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WTF/wtf/Compiler.h', u'Source/WTF/w..." exit_code: 1
Source/WebCore/dom/Node.h:905:  Place brace on its own line for function definitions.  [whitespace/braces] [4]
Source/WebCore/dom/Node.h:917:  Place brace on its own line for function definitions.  [whitespace/braces] [4]
Total errors found: 2 in 7 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 3 Kulanthaivel Palanichamy 2012-10-08 18:06:39 PDT
Created attachment 167661 [details]
Proposed patch
Comment 4 Benjamin Poulain 2012-10-08 18:13:01 PDT
> I'm aware that the community doesn't prefer platform-specific code in platform-independent parts of WebCore,
> but considering the performance gain we get out of this change, it seemed worth a try. Please let me know if you have any
> suggestions for doing it in a cleaner way.

This is not platform-dependent code (in the sense of a WebKit port); this is CPU-specific code. Such a change is perfectly acceptable in WebCore.
 
> This particular patch gives a ~16% improvement in the Dromaeo CSS Selector Tests score on a Krait S4, dual-core @ 1.5 GHz.
> 
> http://dromaeo.com/?id=181274,181299
> 
> 181299 - Stock WebKit
> 181274 - With prefetch optimization

Please explain how you get from 181299 VS 181274 to 16% improvement.
Comment 5 Benjamin Poulain 2012-10-08 18:16:18 PDT
Comment on attachment 167661 [details]
Proposed patch

We need a much more refined performance claim to accept anything like this. The problem with prefetch is that it varies with the architecture and, on ARM, with the implementation of the architecture.

Please:
1) Get solid performance numbers; use microbenchmarks if needed.
2) Include the performance analysis in the ChangeLog.
3) Test on various CPUs if you have access to them.
Comment 6 Darin Adler 2012-10-08 18:33:44 PDT
Comment on attachment 167661 [details]
Proposed patch

View in context: https://bugs.webkit.org/attachment.cgi?id=167661&action=review

Thanks for submitting this. It’s cool that this can make DOM traversal much faster. I’m sure we can figure out a form of this that can go into the tree.

There are lots of design decisions that are not explained in this code. Need more comments explaining why, not just mystery code.

Clearly this prefetching code could make some operations slower. It’s good that it makes benchmarks faster, but we should also construct the pathological cases that become slower so we can see how much slower.

> Source/WTF/wtf/Compiler.h:222
> +#define PREFETCH(x)

This isn’t valuable. Seems highly unlikely we will want to compile code that uses the PREFETCH macro without also having an ENABLE conditional around it. I think we can just leave this out.

> Source/WTF/wtf/Platform.h:1033
> +/* Prefetch to minimize cache-miss latency by moving data into a cache before it is accessed */
> +#if !defined(BUILTIN_PREFETCH) && CPU(ARM) && COMPILER_SUPPORTS(BUILTIN_PREFETCH)
> +#define ENABLE_PREFETCH 1
> +#endif

I don’t understand the !defined(BUILTIN_PREFETCH) in this expression.

I don’t like the use of CPU(ARM) here, without a comment mentioning why it’s appropriate.

I don’t think ENABLE(PREFETCH) is specific enough. There are so many things that prefetch could mean. We should choose a name that is specific enough to make it clear that this is memory prefetching.

> Source/WebCore/dom/ContainerNode.h:279
> +    prefetch();

This function name is unclear. It doesn’t say what is being fetched. It’s not appropriate to have a function named just “prefetch” on a node. Need a human-comprehensible name.

At each call site we need to say why it’s right to call prefetch there. Otherwise it just looks like you made magic decisions about where to put these calls, and no one in the future can figure out whether it’s safe to remove them and where they need to be added then.

> Source/WebCore/dom/ContainerNode.h:287
> +    prefetch();

Not good to put this above the comment below, which should be at the top of the function.

> Source/WebCore/dom/Node.h:103
> +const int prefetchStride = 6;

The mystery magic constant needs a comment explaining where it came from. If you determined the value by testing, it needs to explain that. If you determined it by some theoretical means it needs to explain that.

> Source/WebCore/dom/Node.h:268
> +    void updatePrefetchTarget();
> +    void setPrefetchTarget(Node *prefetch) { m_prefetchTarget = prefetch; }

These new functions shouldn’t be crammed in with the setPrevious/NextSibling call. They should go in their own paragraph with comments that help explain the mechanism.

> Source/WebCore/dom/Node.h:269
> +    void setNextSibling(Node* next) { m_next = next; updatePrefetchTarget(); }

This is an inappropriate level to add the prefetch updating. It should be done at call sites, not in this setter, unless that’s completely impractical. This low level setter needs to remain a low level setter.

> Source/WebCore/dom/Node.h:815
> +#if ENABLE(PREFETCH)
> +    Node* m_prefetchTarget;
> +#endif

Is it important to have this data member here? Could it be put at the end of the data members instead, where it would not sit in the middle of the others, making them confusing?

I don’t think m_prefetchTarget is the right name for this. Instead I would like it to be something that expresses more clearly what it’s supposed to represent, rather than how it’s used.

> Source/WebCore/dom/Node.h:909
> +    if (m_next) {
> +        Node* from = this;
> +        Node* n = from->traversePreviousNodePostOrder();
> +        for (int skew = prefetchStride; skew && n; skew--) {
> +            from = n;
> +            n = n->traversePreviousNodePostOrder();
> +        }
> +        from->setPrefetchTarget(m_next);
> +    }

This function is completely mysterious to me. There’s no why comment explaining why going backwards is what we want to do. No explanation of what “from” means as a variable name, or what “skew” means.

I gather that m_prefetchTarget is a sort of inaccurate “skip” pointer that contains an often incorrect, often correct pointer to the sixth sibling from the current sibling.

I can make guesses about why it does what it does, but I should not have to guess. You should be explaining this in comments.

> Source/WebCore/dom/Node.h:916
> +    if (LIKELY(m_prefetchTarget != 0))

Need a comment here explaining it’s OK to have a bad pointer as the prefetch target; otherwise this code looks like a bug because it uses a dangling pointer.

Do we really want this branch? If it’s OK to prefetch with any pointer, then why not initialize m_prefetchTarget to this rather than having 0 as a magic value?

> Source/WebCore/dom/Node.h:917
> +        PREFETCH(((char *) m_prefetchTarget));

The cast to (char *) should be something that’s part of the PREFETCH macro rather than done at each call site, since it’s part of the idiosyncrasy of a particular compiler’s feature and might be different in other compilers.
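As an illustration of that suggestion, here is a minimal sketch of how the macro could absorb both the cast and the feature guard. The guard name and the disabled-path fallback are assumptions for the sketch, not WebKit's actual macros:

```cpp
// Stand-in for WTF's ENABLE()/COMPILER_SUPPORTS() machinery; the real
// guard names and conditions in Platform.h/Compiler.h would differ.
#define ENABLE_MEMORY_PREFETCH 1

#if ENABLE_MEMORY_PREFETCH && (defined(__GNUC__) || defined(__clang__))
// The void* cast lives inside the macro, so call sites can pass any
// object pointer type without a per-call cast.
#define PREFETCH(pointer) __builtin_prefetch(static_cast<const void*>(pointer))
#else
// Expands to an evaluated-but-ignored expression where unsupported.
#define PREFETCH(pointer) ((void)(pointer))
#endif
```

A call site would then read `PREFETCH(m_prefetchTarget);` with no compiler-specific casting.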
Comment 7 Darin Adler 2012-10-08 18:34:35 PDT
(In reply to comment #4)
> > http://dromaeo.com/?id=181274,181299
> > 
> > 181299 - Stock WebKit
> > 181274 - With prefetch optimization
> 
> Please explain how you get from 181299 VS 181274 to 16% improvement.

Those aren’t performance numbers, they are Dromaeo run numbers. Follow the links to see the Dromaeo results.
Comment 8 Benjamin Poulain 2012-10-08 20:28:10 PDT
> > Please explain how you get from 181299 VS 181274 to 16% improvement.
> 
> Those aren’t performance numbers, they are Dromaeo run numbers. Follow the links to see the Dromaeo results.

Whoops, my apologies; that'll teach me to skim over emails instead of reading Bugzilla.

Ignore my comments, they were based on the assumption that you were adding prefetch for less than 1% improvement.
16% is an amazing improvement.
Comment 9 Benjamin Poulain 2012-10-10 01:04:44 PDT
(In reply to comment #0)
> DOM tree traversal seems to be slow on ARM platforms, probably due to their limited CPU cache sizes and high cache latencies.

Any update on your work? I am quite interested in seeing this patch progress.
Comment 10 Kulanthaivel Palanichamy 2012-10-10 11:08:41 PDT
(In reply to comment #9)
> (In reply to comment #0)
> > DOM tree traversal seems to be slow on ARM platforms, probably due to their limited CPU cache sizes and high cache latencies.
> 
> Any update on your work? I am quite interested in seeing this patch progress.

I'm working on running webkit perf test suite on various HW/SW combinations. I'll update you with the results as well as a detailed explanation of PLD and the algorithm we used to calculate the prefetch target nodes shortly.
Comment 11 Kulanthaivel Palanichamy 2012-10-11 16:49:23 PDT
Created attachment 168317 [details]
Design doc
Comment 12 Kulanthaivel Palanichamy 2012-10-11 16:50:46 PDT
Some background info about PLD:

From GCC Doc:

— Built-in Function: void __builtin_prefetch (const void *addr, ...)
This function is used to minimize cache-miss latency by moving data into a cache before it is accessed.
You can insert calls to __builtin_prefetch into code for which you know addresses of data in memory that
is likely to be accessed soon. If the target supports them, data prefetch instructions will be generated.
If the prefetch is done early enough before the access then the data will be in the cache by the time it is accessed.

http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html

From ARM Guide:

Data cache preload instruction
ARMv7-A specifies the PLD instruction as a preload hint instruction. The processor uses the PLD instruction
to preload cache lines to the L2 cache. If the PLD instruction results in a L1 cache hit, L2 cache hit, or TLB miss
no more action is taken. If a cache miss and TLB hit result, the line is retrieved from external memory and is loaded
into the L2 memory cache.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344i/Cbbbdaed.html

Some interesting facts about PLD
- No alignment checking is performed for preload instructions.
- This is a hint instruction, and its implementation is optional. If it is not implemented, it executes as NOP.
- PLD does not generate faults if addr is invalid.
- The amount of data brought in by a single PLD instruction depends on the cache line size and CPU architecture.
  For example, the ARM Cortex-A9 implements 64-byte cache lines, while Qualcomm's Scorpion and Krait
  architectures, which support the same ISA (ARMv7-A), implement 128-byte cache lines. In all of these
  variants, only one cache line is filled by a single PLD instruction. Most x86-based architectures
  implement 64-byte cache lines.
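As a minimal illustration of the builtin described above (the function name and the prefetch distance are invented for this sketch, not taken from the patch):

```cpp
#include <cstddef>

// Sum an array while issuing a prefetch hint a fixed distance ahead,
// so the data is (hopefully) in cache by the time the loop reaches it.
// kPrefetchDistance is illustrative only; as the PLD notes above
// suggest, a real value would need tuning per CPU implementation.
long sumWithPrefetch(const int* data, std::size_t count) {
    const std::size_t kPrefetchDistance = 16;
    long sum = 0;
    for (std::size_t i = 0; i < count; ++i) {
        // Prefetch never faults, so the bound check is not for safety;
        // it just avoids issuing hints past the end of the array.
        if (i + kPrefetchDistance < count)
            __builtin_prefetch(&data[i + kPrefetchDistance]);
        sum += data[i];
    }
    return sum;
}
```

Whether the hint helps depends on the access pattern and the CPU; for a small array like the one below it is effectively a no-op.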

DOM PLD Optimization:

Design:

Any tree data structure with a very high number of dynamically allocated nodes will have a sparse memory access pattern,
and all of its data may not fit into the CPU cache.
(Note: here our focus is on low-end processors (all ARM and some x86-based ultra-mobile processors) with very limited CPU cache size.)
Usually, traversing such a tree involves only a small amount of memory access to do key matching and to move to the next node.
Any tree traversal algorithm on such a tree could be affected, positively or negatively, on a given CPU by the following factors:
	•	CPU cache size
	•	CPU cache line size
	•	CPU cache latency
	•	External memory latency

Our optimization follows these simple steps to speed up DOM tree traversal:

1. Add an extra pointer to the WebCore::Node class to store the prefetch target (skip pointer) for each DOM node.

2. Calculate the prefetch stride N based on the average CPU cycles it takes to reach the Nth node (in a normal pre-order traversal)
   and the average CPU wait cycles it takes to prefetch the required data from external memory into the CPU cache.
   Based on our experiments, on most ARM variants the optimum value of N is somewhere between 6 and 9.

3. During DOM construction/modification, for pre-order traversal, set the prefetch target of the node N steps
   backwards in post-order to the current node.
 
4. During traversal, do a speculative prefetch of each visited node's prefetch target into the CPU cache.


Refer to attachment https://bugs.webkit.org/attachment.cgi?id=168317 for an explanation of why post-order is used to calculate prefetch targets for pre-order traversal.
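The four steps above can be sketched in greatly simplified form: a flat next-pointer chain stands in for the DOM tree, and targets are computed in one forward pass rather than via the backward post-order walk the patch uses. All names here are illustrative, not WebCore's:

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for WebCore::Node: each node carries a skip
// pointer to the node kPrefetchStride positions ahead in traversal
// order (step 1 of the design).
struct Node {
    Node* next = nullptr;
    Node* preloadTarget = nullptr; // hint only; may go stale after mutations
    int value = 0;
};

const int kPrefetchStride = 6; // step 2: empirically 6-9 on most ARM variants

// Steps 1 and 3: link the nodes in traversal order and point each
// node's preload target at the node kPrefetchStride steps ahead.
void linkWithPreloadTargets(std::vector<Node>& nodes) {
    for (std::size_t i = 0; i + 1 < nodes.size(); ++i)
        nodes[i].next = &nodes[i + 1];
    for (std::size_t i = 0; i + kPrefetchStride < nodes.size(); ++i)
        nodes[i].preloadTarget = &nodes[i + kPrefetchStride];
}

// Step 4: speculatively prefetch each visited node's target while
// doing the real traversal work (here, summing values).
int traverseAndSum(Node* head) {
    int sum = 0;
    for (Node* n = head; n; n = n->next) {
        if (n->preloadTarget) // NULL check: measured faster in comments 13-14
            __builtin_prefetch(static_cast<const void*>(n->preloadTarget));
        sum += n->value;
    }
    return sum;
}
```

The real patch has to keep these skip pointers roughly correct across DOM mutations, which is where the regressions discussed below come from.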
Comment 13 Kulanthaivel Palanichamy 2012-10-11 16:54:28 PDT
Created attachment 168319 [details]
Performance tests results on x86

  First Column : Stock WebKit
  Second Column : With optimization, but without the NULL check in prefetch()
  Third Column : With optimization, with the NULL check in prefetch()
Comment 14 Kulanthaivel Palanichamy 2012-10-11 16:55:09 PDT
Test Results for the current patch:

  System Configuration:
  Mac Pro, 2 X 6-Core Intel Xeon @ 2.66 GHz, 24 GB RAM
  L2 Cache (per Core):	256 KB
  L3 Cache (per Processor):	12 MB
  WebKit revision : r130776
  
  Dromaeo : http://dromaeo.com/?id=181757,181771
  Stock WebKit - Dromaeo id 181771
  With optimization - Dromaeo id 181757
  
  WebKit Performance test suite: https://bugs.webkit.org/attachment.cgi?id=168319 (attachment)
  First Column : Stock WebKit
  Second Column : With optimization, but without the NULL check in prefetch()
  Third Column : With optimization, with the NULL check in prefetch()
    
    Observations: 
    - Not making a NULL check for the PLD address seems to have a very negative impact on both x86- and ARM-based architectures.
    - We have a few cases where performance regressed heavily due to the overhead of the PLD target node calculation.
Comment 15 Kulanthaivel Palanichamy 2012-10-11 16:59:09 PDT
Created attachment 168321 [details]
Performance tests results on ARM
Comment 16 Kulanthaivel Palanichamy 2012-10-11 17:01:40 PDT
Test Results for the current patch:
  
  System Configuration:
  Android, 2-Core Qualcomm Krait S4 @ 1.5 GHz, 1 GB RAM
  L0 Cache :	4kB (data) + 4kB (instruction) direct mapped
  L1 Cache :	16 kB (data) + 16 kB (instruction)
  L2 Cache :    1 MB
  Build : Chromium content shell, release build
  
    WebKit Performance test suite: https://bugs.webkit.org/attachment.cgi?id=168321
  run-perf-test is not yet working on the Chromium Content Shell build on Android, so these results were collected manually.
Comment 17 Benjamin Poulain 2012-10-11 17:11:28 PDT
>     - Not making a NULL check for the PLD address seems to have a very negative impact on both x86- and ARM-based architectures.

This is very surprising for ARM. The design document says PLD was designed explicitly with the branch issue in mind: loading an invalid address has no impact, while the branch will cost cycles on mispredictions.

Have you only tested on Krait, or did you get to try a "regular" implementation of ARMv7? Krait is more the exception than the norm.

Have you tried having Node::updatePrefetchTarget() implemented but as a no-op? Node::updatePrefetchTarget() influences the cache, so you are not measuring the PLD alone.

Very nice work in any case.
I think we may have to tweak the stride size by ARM implementation.
Comment 18 Kulanthaivel Palanichamy 2012-10-16 16:37:48 PDT
Created attachment 169056 [details]
Proposed patch

Addressed most of the review comments in this patch.

Note: This is still a work in progress and I'm trying to minimize the regression on some of the WebKit perf tests.
Comment 19 Kulanthaivel Palanichamy 2012-10-16 16:39:11 PDT
(In reply to comment #6)
> (From update of attachment 167661 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=167661&action=review
> 
> Thanks for submitting this. It’s cool that this can make DOM traversal much faster. I’m sure we can figure out a form of this that can go into the tree.
> 

Thanks, this is very encouraging.

> There are lots of design decisions that are not explained in this code. Need more comments explaining why, not just mystery code.
> 
> Clearly this prefetching code could make some operations slower. It’s good that it makes benchmarks faster, but we should also construct the pathological cases that become slower so we can see how much slower.
> 
This code definitely makes some of the tests in our WebKit performance suite slower. It may not be possible to reach a point where there is
no regression at all, but I'm working on addressing them one by one to reduce the regressions to an acceptable level.

> > Source/WTF/wtf/Compiler.h:222
> > +#define PREFETCH(x)
> 
> This isn’t valuable. Seems highly unlikely we will want to compile code that uses the PREFETCH macro without also having an ENABLE conditional around it. I think we can just leave this out.
> 
Renamed it to BUILTIN_PREFETCH to match the compiler __builtin_XXX naming format.

> > Source/WTF/wtf/Platform.h:1033
> > +/* Prefetch to minimize cache-miss latency by moving data into a cache before it is accessed */
> > +#if !defined(BUILTIN_PREFETCH) && CPU(ARM) && COMPILER_SUPPORTS(BUILTIN_PREFETCH)
> > +#define ENABLE_PREFETCH 1
> > +#endif
> 
> I don’t understand the !defined(BUILTIN_PREFETCH) in this expression.
It was a mistake. It was supposed to be !defined(ENABLE_PREFETCH); it is now renamed to ENABLE_PLD_DOM_OPTIMIZATION.

> 
> I don’t like the use of CPU(ARM) here, without a comment mentioning why it’s appropriate.
> 
Initially I tested this patch only on ARM; it looks like it benefits x86 as well.

> I don’t think ENABLE(PREFETCH) is specific enough. There are so many things that prefetch could mean. We should choose a name that is specific enough to make it clear that this is memory prefetching.
> 
> > Source/WebCore/dom/ContainerNode.h:279
> > +    prefetch();
> 
> This function name is unclear. It doesn’t say what is being fetched. It’s not appropriate to have a function named just “prefetch” on a node. Need a human-comprehensible name.
> 
Renamed it to preloadNextNodeIntoCPUCache(). Though it doesn't preload the very next node into the CPU cache, it preloads a node that is a few nodes ahead in the traversal path.

> At each call site we need to say why it’s right to call prefetch there. Otherwise it just looks like you made magic decisions about where to put these calls, and no one in the future can figure out whether it’s safe to remove them and where they need to be added then.
> 
> > Source/WebCore/dom/ContainerNode.h:287
> > +    prefetch();
> 
> Not good to put this above the comment below, which should be at the top of the function.
> 
Corrected.

> > Source/WebCore/dom/Node.h:103
> > +const int prefetchStride = 6;
> 
> The mystery magic constant needs a comment explaining where it came from. If you determined the value by testing, it needs to explain that. If you determined it by some theoretical means it needs to explain that.
> 
Done.

> > Source/WebCore/dom/Node.h:268
> > +    void updatePrefetchTarget();
> > +    void setPrefetchTarget(Node *prefetch) { m_prefetchTarget = prefetch; }
> 
> These new functions shouldn’t be crammed in with the setPrevious/NextSibling call. They should go in their own paragraph with comments that help explain the mechanism.
>
Done.
 
> > Source/WebCore/dom/Node.h:269
> > +    void setNextSibling(Node* next) { m_next = next; updatePrefetchTarget(); }
> 
> This is an inappropriate level to add the prefetch updating. It should be done at call sites, not in this setter, unless that’s completely impractical. This low level setter needs to remain a low level setter.
>
Done. Now updateCPUCachePreloadTarget() is getting called from proper call sites.
 
> > Source/WebCore/dom/Node.h:815
> > +#if ENABLE(PREFETCH)
> > +    Node* m_prefetchTarget;
> > +#endif
> 
> Is it important to have this data member here? Could it be put at the end of the data members instead, where it would not sit in the middle of the others, making them confusing?
>
Moved it to the end.
 
> I don’t think m_prefetchTarget is the right name for this. Instead I would like it to be something that expresses more clearly what it’s supposed to represent, rather than how it’s used.
>
Renamed it to m_cpuCachePreloadTarget.
 
> > Source/WebCore/dom/Node.h:909
> > +    if (m_next) {
> > +        Node* from = this;
> > +        Node* n = from->traversePreviousNodePostOrder();
> > +        for (int skew = prefetchStride; skew && n; skew--) {
> > +            from = n;
> > +            n = n->traversePreviousNodePostOrder();
> > +        }
> > +        from->setPrefetchTarget(m_next);
> > +    }
> 
> This function is completely mysterious to me. There’s no why comment explaining why going backwards is what we want to do. No explanation of what “from” means as a variable name, or what “skew” means.
>
I've added some minimal comments in this function, but explaining the complete logic in comments seems like too much for Node.h, so I've added the bugzilla link in the comments
for further reference.
 
> I gather that m_prefetchTarget is a sort of inaccurate “skip” pointer that contains an often incorrect, often correct pointer to the sixth sibling from the current sibling.
> 
> I can make guesses about why it does what it does, but I should not have to guess. You should be explaining this in comments.
> 
> > Source/WebCore/dom/Node.h:916
> > +    if (LIKELY(m_prefetchTarget != 0))
> 
> Need a comment here explaining it’s OK to have a bad pointer as the prefetch target; otherwise this code looks like a bug because it uses a dangling pointer.
> 
> Do we really want this branch? If it’s OK to prefetch with any pointer, then why not initialize m_prefetchTarget to this rather than having 0 as a magic value?
> 
Removing the NULL check regresses many of the performance tests.
Technically, passing NULL to PLD should result in a NOP, but it looks like it carries some penalty on both ARM and x86.

> > Source/WebCore/dom/Node.h:917
> > +        PREFETCH(((char *) m_prefetchTarget));
> 
> The cast to (char *) should be something that’s part of the PREFETCH macro rather than done at each call site, since it’s part of the idiosyncrasy of a particular compiler’s feature and might be different in other compilers.
Done.
Comment 20 Kulanthaivel Palanichamy 2012-10-19 11:56:22 PDT
Created attachment 169667 [details]
Proposed patch

Minimized some of the regressions seen in WebKit performance test suite.
Comment 21 Benjamin Poulain 2012-10-19 17:01:05 PDT
Comment on attachment 169667 [details]
Proposed patch

View in context: https://bugs.webkit.org/attachment.cgi?id=169667&action=review

I am unconvinced. I think there is too much tradeoff.
I also think it should be by implementation at first, not just CPU(ARM).

> Source/WTF/ChangeLog:3
> +        [Performance] Speed-up DOM tree traversal on ARM platforms

The title needs to be updated, you are now also enabling this on x86 and x86_64.

> Source/WTF/wtf/Platform.h:1039
> +/* Prefetch to minimize cache-miss latency by moving DOM nodes into CPU cache before they are accessed.
> +   Since ARM & x86 seem to benefit most out of this optimizaiton, this is enabled only on those platforms.
> +   Note: This optimization is highly sensitive to CPU micro architecture design choises such as
> +   L0/L1/L2 cache size & latency, L0/L1/L2 cache line size, PLD and external memory (DDR) bandwidth/latency. */
> +#if !defined(ENABLE_PLD_DOM_OPTIMIZATION) && (CPU(X86) || CPU(X86_64) || CPU(ARM)) && COMPILER_SUPPORTS(BUILTIN_PREFETCH)
> +#define ENABLE_PLD_DOM_OPTIMIZATION 1
> +#endif

Such definition would be more suited inside WebCore than in Platform.h. You could probably move it to Node.h.
The name of the macro contains "PLD", which is an ARM instruction. Why not use "Prefetch" like for the builtin?

Typo: optimizaiton.
No need to say "(DDR)" for the memory.

> Source/WebCore/ChangeLog:25
> +        and external memory (DDR) latency.

"(DDR)" here is unnecessary.

> Source/WebCore/ChangeLog:41
> +        CSS Selector Tests (Total Score)        +16% (ARM), +11% (x86)
> +        Bindings/event-target-wrapper           -11% (ARM), -10% (x86)
> +        Bindings/undefined-get-element-by-id    -10% (ARM), -08% (x86)
> +        Bindings/undefined-id-getter            +01% (ARM), +14% (x86)
> +        DOM/Accessors                           +10% (ARM), +06% (x86)
> +        DOM/GetElement                          -09% (ARM), -10% (x86)
> +        DOM/CloneNodes                          -01% (ARM), -07% (x86)
> +        DOM/DOMTable                            +06% (ARM), +05% (x86)
> +        Dromaeo/dom-attr                        +05% (ARM), +03% (x86)
> +        Dromaeo/dom-modify                      -04% (ARM), -06% (x86)
> +        Dromaeo/jslib-attr-prototype            -00% (ARM), -05% (x86)
> +        Dromaeo/jslib-style-prototype           +04% (ARM), +19% (x86)
> +        CSS/StyleSheetInsert                    +31% (ARM), +07% (x86)
> +        Parser/query-selector-last              +09% (ARM), -04% (x86)

I quickly tried the patch on a ARMv7 core, here are my results:
-DOM/Accessors: -1.7%
-DOM/GetElement: -3.3%
-Dromaeo dom-attr: +2.8%

> Source/WebCore/dom/Node.h:108
> +const int prefetchStride = 7;

This should be unsigned.

> Source/WebCore/dom/Node.h:918
> +    for (int i = prefetchStride; i && previous; i--) {

This should be unsigned.

> Source/WebCore/dom/Node.h:934
> +    // Note: Since passing dangling/bad pointer to PLD results in NOP, it is ok to have
> +    // bad pointer in m_cpuCachePreloadTarget.

This is true for PLD, what about x86 prefetch?

> Source/WebCore/dom/Node.h:936
> +    if (LIKELY(m_cpuCachePreloadTarget != 0))

I am still unconvinced by this. :)

> Source/WebCore/dom/Node.h:937
> +        __builtin_prefetch((char *)(m_cpuCachePreloadTarget));

Use a C++ cast instead of a C cast.
__builtin_prefetch takes a void*
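A hedged sketch of how those last two points might land together (the member and function names follow the patch excerpts quoted above, but the surrounding class is a stand-in, not WebCore's Node):

```cpp
// Stand-in class illustrating the reviewed call site only.
struct Node {
    Node* m_cpuCachePreloadTarget = nullptr;

    void preloadNextNodeIntoCPUCache() const {
        // __builtin_prefetch takes const void*, so a C++ static_cast is
        // enough; no char* detour is needed. The branch stays because
        // removing the NULL check regressed the tests discussed above.
        if (m_cpuCachePreloadTarget)
            __builtin_prefetch(static_cast<const void*>(m_cpuCachePreloadTarget));
    }
};
```

The prefetch is purely a hint, so the function has no observable effect beyond cache state.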
Comment 22 Benjamin Poulain 2012-10-19 17:02:11 PDT
Ryosuke, any opinion?
Comment 23 Kulanthaivel Palanichamy 2012-11-01 14:44:52 PDT
(In reply to comment #21)
> (From update of attachment 169667 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=169667&action=review
> 
> I am unconvinced. I think there is too much tradeoff.
> I also think it should be by implementation at first, not just CPU(ARM).
> 
> > Source/WTF/ChangeLog:3
> > +        [Performance] Speed-up DOM tree traversal on ARM platforms
> 
> The title needs to be updated, you are now also enabling this on x86 and x86_64.
> 
> > Source/WTF/wtf/Platform.h:1039
> > +/* Prefetch to minimize cache-miss latency by moving DOM nodes into CPU cache before they are accessed.
> > +   Since ARM & x86 seem to benefit most out of this optimizaiton, this is enabled only on those platforms.
> > +   Note: This optimization is highly sensitive to CPU micro architecture design choises such as
> > +   L0/L1/L2 cache size & latency, L0/L1/L2 cache line size, PLD and external memory (DDR) bandwidth/latency. */
> > +#if !defined(ENABLE_PLD_DOM_OPTIMIZATION) && (CPU(X86) || CPU(X86_64) || CPU(ARM)) && COMPILER_SUPPORTS(BUILTIN_PREFETCH)
> > +#define ENABLE_PLD_DOM_OPTIMIZATION 1
> > +#endif
> 
> Such definition would be more suited inside WebCore than in Platform.h. You could probably move it to Node.h.
But I thought we needed some explanation (per the previous review comment) about this flag in the place where it is defined.

> The name of the macro contains "PLD", which is an ARM instruction. Why not use "Prefetch" like for the builtin?
> 
x86 has 'PREFETCH' instructions that are equivalent to PLD on ARM. However, it is now renamed to PREFETCH_DOM_OPTIMIZATION.

> Typo: optimizaiton.
> No need to say "(DDR)" for the memory.
Done.
> 
> > Source/WebCore/ChangeLog:25
> > +        and external memory (DDR) latency.
> 
> "(DDR)" here is unnecessary.
Removed "(DDR)".
> 
> > Source/WebCore/ChangeLog:41
> > +        CSS Selector Tests (Total Score)        +16% (ARM), +11% (x86)
> > +        Bindings/event-target-wrapper           -11% (ARM), -10% (x86)
> > +        Bindings/undefined-get-element-by-id    -10% (ARM), -08% (x86)
> > +        Bindings/undefined-id-getter            +01% (ARM), +14% (x86)
> > +        DOM/Accessors                           +10% (ARM), +06% (x86)
> > +        DOM/GetElement                          -09% (ARM), -10% (x86)
> > +        DOM/CloneNodes                          -01% (ARM), -07% (x86)
> > +        DOM/DOMTable                            +06% (ARM), +05% (x86)
> > +        Dromaeo/dom-attr                        +05% (ARM), +03% (x86)
> > +        Dromaeo/dom-modify                      -04% (ARM), -06% (x86)
> > +        Dromaeo/jslib-attr-prototype            -00% (ARM), -05% (x86)
> > +        Dromaeo/jslib-style-prototype           +04% (ARM), +19% (x86)
> > +        CSS/StyleSheetInsert                    +31% (ARM), +07% (x86)
> > +        Parser/query-selector-last              +09% (ARM), -04% (x86)
> 
> I quickly tried the patch on a ARMv7 core, here are my results:
> -DOM/Accessors: -1.7%
> -DOM/GetElement: -3.3%
> -Dromaeo dom-attr: +2.8%
Finally I had a chance to test it on a Samsung Galaxy S III, which has a quad-core Exynos ARM Cortex-A9
clocked at 1.4 GHz; I saw only a ~5% improvement on that device for the CSS Selector tests.
I think we will see wide variations in performance gain/regression even among different ARMv7 cores
from Nvidia, Samsung, TI, and Qualcomm, and I guess the same is true for different generations of AMD/Intel x86 cores as well.

> 
> > Source/WebCore/dom/Node.h:108
> > +const int prefetchStride = 7;
> 
> This should be unsigned.
Corrected.

> 
> > Source/WebCore/dom/Node.h:918
> > +    for (int i = prefetchStride; i && previous; i--) {
> 
> This should be unsigned.
Corrected.

> 
> > Source/WebCore/dom/Node.h:934
> > +    // Note: Since passing dangling/bad pointer to PLD results in NOP, it is ok to have
> > +    // bad pointer in m_cpuCachePreloadTarget.
> 
> This is true for PLD, what about x86 prefetch?
I just checked Intel's documentation; it looks like x86 behaves in a similar way.

> 
> > Source/WebCore/dom/Node.h:936
> > +    if (LIKELY(m_cpuCachePreloadTarget != 0))
> 
> I am still unconvinced by this. :)
Me neither, but removing the NULL check regresses performance significantly on both x86 and ARM.
We could remove this NULL check if we maintained the integrity of these preload pointers
across all possible DOM modification scenarios, but doing such strict preload pointer bookkeeping
would regress most of the DOM- and attribute-modification performance tests.

> 
> > Source/WebCore/dom/Node.h:937
> > +        __builtin_prefetch((char *)(m_cpuCachePreloadTarget));
> 
> Use a C++ cast instead of a C cast.
> __builtin_prefetch takes a void*
Corrected.
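For reference, the pattern under discussion can be sketched as follows. This is an illustrative reconstruction, not WebCore's actual code: `ListNode` and `sumWithPrefetch` are hypothetical names, and the traversal is a plain singly linked list standing in for DOM tree order. It shows the three review points together: an unsigned stride constant, the null-guarded prefetch, and a C++ cast instead of a C cast.

```cpp
#include <cassert>

// Hypothetical node type, standing in for WebCore's Node.
struct ListNode {
    int value;
    ListNode* next;
};

// How far ahead of the current node to issue the prefetch; the patch
// uses 7, but the best value is microarchitecture dependent.
const unsigned prefetchStride = 7;

// Walk the list, prefetching the node prefetchStride links ahead of the
// one being visited. __builtin_prefetch is a GCC/Clang builtin and is a
// pure hint (a stale target cannot fault), but we still null-check so
// that we never dereference ->next through a null "ahead" pointer.
int sumWithPrefetch(ListNode* head)
{
    // Prime the prefetch cursor prefetchStride nodes ahead of head.
    ListNode* ahead = head;
    for (unsigned i = prefetchStride; i && ahead; --i)
        ahead = ahead->next;

    int sum = 0;
    for (ListNode* node = head; node; node = node->next) {
        if (ahead) {
            __builtin_prefetch(static_cast<void*>(ahead)); // C++ cast, per review.
            ahead = ahead->next;
        }
        sum += node->value;
    }
    return sum;
}
```

The prefetch is purely a performance hint: removing the `if (ahead)` guard or the priming loop changes nothing about the result, only about which cache lines are warmed.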
Comment 24 Kulanthaivel Palanichamy 2012-11-01 14:45:54 PDT
Created attachment 171935 [details]
Proposed patch
Comment 25 Ryosuke Niwa 2012-11-01 15:34:50 PDT
I think we need a lot more samples than just one computer. If we're enabling this on x86 processors, I'd like to see samples on the following architectures:

AMD: K8 (was one of the most popular processor lines for AMD), K10 (still used in many laptops), Turion or Fusion, and Bulldozer.

Intel: Nehalem or Westmere (there are still quite a few NetBurst-architecture CPUs), the original Core, Core 2, and Sandy Bridge.

Quite frankly, the variation in cache size, bus speed, branch prediction mechanisms and accuracy, etc. between these architectures is so large that I'm highly skeptical that this optimization is helpful on average for x86.
Comment 26 Allan Sandfeld Jensen 2012-11-07 08:42:37 PST
(In reply to comment #25)
> I think we need a lot more samples than just one computer. If we're enabling this on x86 processors, I'd like to see samples on the following architectures:
> 
> AMD: K8 (was one of the most popular processor lines for AMD), K10 (still used in many laptops), Turion or Fusion, and Bulldozer.
> 
> Intel: Nehalem or Westmere (there are still quite a few NetBurst-architecture CPUs), the original Core, Core 2, and Sandy Bridge.
> 
> Quite frankly, the variation in cache size, bus speed, branch prediction mechanisms and accuracy, etc. between these architectures is so large that I'm highly skeptical that this optimization is helpful on average for x86.

If I remember correctly, prefetch is a NOP on Intel chips because it used to be an AMD-only extension (until AMD64), but it does help on AMD chips and is part of x86-64/amd64.
Comment 27 Kulanthaivel Palanichamy 2012-11-09 10:59:03 PST
(In reply to comment #26)
> (In reply to comment #25)
> > I think we need a lot more samples than just one computer. If we're enabling this on x86 processors, I'd like to see samples on the following architectures:
> > 
> > AMD: K8 (was one of the most popular processor lines for AMD), K10 (still used in many laptops), Turion or Fusion, and Bulldozer.
> > 
> > Intel: Nehalem or Westmere (there are still quite a few NetBurst-architecture CPUs), the original Core, Core 2, and Sandy Bridge.
> > 
> > Quite frankly, the variation in cache size, bus speed, branch prediction mechanisms and accuracy, etc. between these architectures is so large that I'm highly skeptical that this optimization is helpful on average for x86.
> 
> If I remember correctly, prefetch is a NOP on Intel chips because it used to be an AMD-only extension (until AMD64), but it does help on AMD chips and is part of x86-64/amd64.

That may be true for Intel's legacy architectures, but from the Pentium III onwards it seems they have implemented PREFETCH fully. Even then, among their PIII, Core 2, Sandy Bridge and other recent architectures there are many variations in the implementation, such as the prefetch cache level, prefetch cache line size, etc.
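Because the builtin is only a hint, one defensive option is to compile it in only where it is known to lower to a real, useful instruction (PLD on ARMv7, PREFETCHh on SSE-capable x86) and to a no-op elsewhere. A minimal sketch, with a hypothetical helper name (`preloadCacheLine` is not WebCore's API, and the exact macro guards would need tuning):

```cpp
// Guarded prefetch helper: emits __builtin_prefetch where the target
// architecture has a real prefetch instruction, and compiles to nothing
// otherwise. Passing a stale or null pointer is safe either way, since
// prefetch instructions cannot fault.
#if defined(__GNUC__) && (defined(__ARM_ARCH_7A__) || defined(__SSE__))
inline void preloadCacheLine(const void* address)
{
    __builtin_prefetch(address);
}
#else
inline void preloadCacheLine(const void*)
{
    // No suitable prefetch instruction; deliberately do nothing.
}
#endif
```

Callers never need to care which branch was compiled: the function has no observable effect on program state in either case.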
Comment 28 Kulanthaivel Palanichamy 2012-11-09 11:08:44 PST
(In reply to comment #25)
> I think we need a lot more samples than just one computer. If we're enabling this on x86 processors, I'd like to see samples on the following architectures:
> 
> AMD: K8 (was one of the most popular processor lines for AMD), K10 (still used in many laptops), Turion or Fusion, and Bulldozer.
> 
> Intel: Nehalem or Westmere (there are still quite a few NetBurst-architecture CPUs), the original Core, Core 2, and Sandy Bridge.
> 
> 
I do not have access to that many architectures to test this patch. I have a couple of Intel Xeon machines, a Sandy Bridge based machine and a few ARM variants.

> Quite frankly, the variation of cache size, bus speed, branch prediction mechanisms and accuracy etc... between these architectures is so large that I'm highly skeptical that this optimization is helpful on the average for x86.

Is it acceptable to detect the CPU at run time and enable this patch only on those known architectures that benefit from it?

I know V8 and JavaScriptCore already do similar run-time detection for JITed code generation based on the CPU ID.
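Such run-time gating could be sketched with GCC/Clang's x86 CPU-detection builtins. This is an illustration only: `prefetchIsProfitable` is a hypothetical name, and the particular CPU families checked here are placeholders, not a measured whitelist.

```cpp
// Decide once, at run time, whether the prefetch path should be taken.
// __builtin_cpu_init / __builtin_cpu_is / __builtin_cpu_supports are
// GCC/Clang builtins available on x86 targets; on other architectures
// we fall back to compile-time knowledge.
static bool prefetchIsProfitable()
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init(); // must be called before the other cpu builtins
    // Example policy only: enable on AMD vendors and on any core with
    // SSE (which guarantees the PREFETCHh instructions exist).
    return __builtin_cpu_is("amd") || __builtin_cpu_supports("sse");
#elif defined(__arm__) || defined(__aarch64__)
    return true; // PLD is architectural on ARMv7 and later.
#else
    return false;
#endif
}
```

The result could be cached in a static bool at startup so the traversal hot path pays only a predictable branch, much like the JIT engines consult their CPU-feature flags once and bake the decision into generated code.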
Comment 29 Anders Carlsson 2014-02-05 11:08:24 PST
Comment on attachment 171935 [details]
Proposed patch

Clearing review flag on patches from before 2014. If this patch is still relevant, please reset the r? flag.