Bug 17510

Summary: Acid3 test 26 takes >33ms
Product: WebKit
Component: JavaScriptCore
Reporter: Darin Adler <darin>
Assignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED
Severity: Normal
Priority: P2
Version: 528+ (Nightly build)
Hardware: Mac
OS: OS X 10.5
CC: ap, ian, mjs, mrowe, puurtuur, sam, webmaster, wjmoore
Depends on: 19239
Blocks: 17064
Attachments:
- Standalone version of test 26, for profiling goodness
- alternate version, larger, non-variable loop count so easier to profile
- better version of Darin's extended version
- not quite working version of optimizing "document" to a static slot

Description Darin Adler 2008-02-24 01:40:34 PST
While running test 26, Acid3 pauses. That's because the test forces garbage collection as a sort of mini performance test. I don't know how fixable this is; even on my fast computer with a Release build the pause is noticeable. The problem is that the faster we run the other tests, the more noticeable even a slight pause becomes, so I'm not sure this is practical to fix.
Comment 1 Maciej Stachowiak 2008-03-05 00:08:49 PST
Reopening, I closed this by mistake.
Comment 2 Maciej Stachowiak 2008-03-05 00:11:42 PST
Some testing with a local copy shows that we'd have to speed up the loop part of test 26 by 3-4x to avoid a noticeable pause. I'm not sure how practical that is.
Comment 3 Darin Adler 2008-03-05 08:28:47 PST
3-4x will be a serious challenge given how much faster we are already. I'd like to know what kind of loop it is and whether it's pure JavaScript or involves the DOM.
Comment 4 vasi 2008-03-05 12:18:58 PST
The loop does indeed involve the DOM. Here's the JS, for ease of reference:

      var loops = ((new Date().valueOf() - 1.1e12) / 32e9) * 0x500; // increases linearly over time
      for (var i = 0; i < loops; i += 1) {
        // we want to force a GC here, so we use up lots of memory
        // we take the opportunity to sneak in a perf test to make DOM and JS stuff faster...
        d = new Date();
        d = new (function (x) { return { toString: function () { return x.toString() } } })(d.valueOf());
        d = document.createTextNode("iteration " + i + " at " + d);
        document.createElement('a').appendChild(d);
        d = d.parentNode;
        document.body.insertBefore(d, document.getElementById('bucket1').parentNode);
        assert(document.getElementById('bucket2').nextSibling.parentNode.previousSibling.firstChild.data.match(/AT\W/i), "iteration " + i + " failed");
        d.setAttribute('class', d.textContent);
        document.body.removeChild(d);
      }

Comment 5 Ian 'Hixie' Hickson 2008-03-05 12:53:44 PST
What comes up on a profile of that loop?
Comment 6 Maciej Stachowiak 2008-03-16 19:54:56 PDT
Created attachment 19820
Standalone version of test 26, for profiling goodness
Comment 7 Darin Adler 2008-03-17 14:28:07 PDT
I started looking at the profile. There's no one huge item here--a lot of stuff is pretty fundamental--but there are various things we could improve. One recurring area seems to be conversions from numbers to strings and from WebCore strings to JavaScriptCore strings.

In this profile:

    - The top item on the list is 4.3% in KJS::Lookup::findEntry; the hash tables we use to look up built-in properties are not working well enough. Some kind of single-element cache might save the day (a sketch of one appears at the end of this comment), or we could come up with a more efficient technique for this.

    - We are spending 2.7% of the time in UString::from(double), converting the loop count into a string so it can be appended to another string to create a text node.

    - We are spending 2.3% of the time in StringImpl::create, allocating new WebCore::String objects. If you include ~StringImpl (and we should), it's 3.4%. Most of this time, 1.8%, is in JSHTMLDocument::canGetItemsForName; we should change that code path so it doesn't involve allocating a String. There's also 0.2% for setAttribute, 0.2% for getElementById, and 0.1% for createTextNode.

    - 1.2% of the time is in Document::isValidName. We've got to be able to do that one faster!

    - 1.1% of the time is in JSHTMLDocument::canGetItemsForName, not counting the string creation. We've got to be able to do that one faster!

    - 1.1% of the time is spent in getDOMNodeForDocument, finding wrappers for the DOM nodes. Two hash table lookups every time. Maybe we should abandon the purity of it and put wrapper pointers into DOM::Node?

    - 1.1% of the time is in the GC allocator.

    - 1.1% of the time is in parseClassAttribute, which has to fold case and break up an arbitrary string into a list of class names. Maybe we could defer that parsing until style resolution time? Or add a fast case for when all the letters are non-upper-case ASCII.

    - 0.9% of the time is spent changing a char* into an AtomicString in JSHTMLAnchorElementPrototype::self and JSTextPrototype::self. Seems fixable.

    - 0.8% of the time is spent concatenating UString objects; because of the createTextNode and the assert in this function, it's a stress test of string appending.

    - We are spending 0.7% of our time in Document::isValidName checking the string "class" to see if it's a valid name, in an expensive ICU u_getIntPropertyValue call.

    - 0.7% is spent just in JSObject::type.

    - 0.5% of the time is spent changing an ID into an AtomicString in getElementById. And an additional 0.3% each is spent in setAttribute and createElement doing the same thing. Since these are the same short ASCII strings over and over again we should think about how to do it faster.

    - 0.3% of the time is converting a DOM string to a JavaScript string in JSCharacterData::getValueProperty; presumably this is the ".data" call in the assert.

There's also some more of the pure JavaScript interpreter overhead, looking up properties and such.

It's costing us performance that all calls to toJS go through EventTarget* first and only then on to Node*. We need to change the Node* overload's parameter from PassRefPtr to a plain Node* and then eliminate all the ambiguity, so that calls all go straight to the Node* version!
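
Here's the single-element cache idea from the findEntry item above as a standalone sketch. LookupTable and HashEntry are stand-ins, not the real KJS types, and it assumes identifiers are interned so pointer equality implies string equality; the tight loop in test 26 looks up the same handful of built-in property names over and over, so the fast path would hit almost every time.

    #include <string>
    #include <unordered_map>

    struct HashEntry { int value = 0; };

    struct LookupTable {
        std::unordered_map<std::string, HashEntry> entries;

        // Single-element cache: the last identifier we resolved and its entry.
        // Assumes identifiers are interned, so pointer equality is enough.
        mutable const std::string* lastKey = nullptr;
        mutable const HashEntry* lastEntry = nullptr;

        const HashEntry* findEntry(const std::string& identifier) const
        {
            if (&identifier == lastKey)
                return lastEntry;                // fast path: repeat lookup
            auto it = entries.find(identifier);  // slow path: full hash probe
            lastKey = &identifier;
            lastEntry = (it == entries.end()) ? nullptr : &it->second;
            return lastEntry;
        }
    };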
Comment 8 Ian 'Hixie' Hickson 2008-03-17 16:24:50 PDT
Woot, now we're talking. That's the kind of scrutiny I was hoping this subtest would bring to the table! :-)
Comment 9 Darin Adler 2008-03-17 21:41:58 PDT
Created attachment 19858
alternate version, larger, non-variable loop count so easier to profile
Comment 10 Darin Adler 2008-03-17 21:42:38 PDT
(In reply to comment #8)
> Woot, now we're talking. That's the kind of scrutiny I was hoping this subtest
> would bring to the table! :-)

I'm sure we'll be able to get a lot faster, but I still think 3x may be impossible.
Comment 11 Darin Adler 2008-03-26 21:32:52 PDT
I don't see a pause at all on my Mac Pro now (in Release builds). So on faster machines this is resolved. Maciej was thinking of setting a target for how much faster we need to get it.
Comment 12 Maciej Stachowiak 2008-03-26 22:26:24 PDT
I get around 75 milliseconds on my MacBook Pro, but I do not see a visible pause. I would like to hear what the Acid3 editor thinks of the animation smoothness in a recent release build.
Comment 13 Ian 'Hixie' Hickson 2008-03-26 23:51:32 PDT
I got about 87ms. I definitely saw a noticeable skip during the second box.
Comment 14 Darin Adler 2008-03-27 10:10:07 PDT
Hixie's put a 33ms target into the test. Retitling this bug appropriately.

On my Mac Pro, we're currently at about 60ms, so it seems we still need to double our speed on this to make the target on this fast computer.
Comment 15 Brady Eidson 2008-03-27 11:44:49 PDT
Has anyone taken a position on the performance aspect of this test depending on the hardware it's run on?

Say we almost double the performance so that Darin's Mac Pro squeaks by at 32ms. WebKit is still supported on some pretty old sub-GHz PPC machines, and test 26 would certainly take longer than 33ms there.
Comment 16 Ian 'Hixie' Hickson 2008-03-27 13:37:36 PDT
I wouldn't focus so much on the pass/fail aspect of this. The idea is to improve performance as much as possible. Darin pointed out a number of things that could be done to improve perf, and many of them would help across many pages, which is what matters.
Comment 17 Ian 'Hixie' Hickson 2008-03-27 13:39:35 PDT
BTW, it's psychologically hard to see how WebKit could make things faster when it's already the fastest. Obviously another browser can see that it's possible to go faster, since WebKit does so. What would you do if one of the other browsers managed to squeeze a few more milliseconds out and beat WebKit?
Comment 18 Darin Adler 2008-03-27 21:57:28 PDT
(In reply to comment #16)
> I wouldn't focus so much on the pass/fail aspect of this. The idea is to
> improve performance to the best possible. Darin pointed out a number of things
> that could be done to improve perf, and many of them would help across many
> pages, which is what matters.

Yes, and I've already addressed almost all the things I listed. It's a long road of small, separate performance increases. A target is in fact, in my experience, very important for a project like this.
Comment 19 Ian 'Hixie' Hickson 2008-03-27 22:09:56 PDT
Ok, um, 33ms on state-of-the-art Apple laptop hardware, I guess. I'm just saying that I wouldn't prioritise this above correctness bugs. The goal is arbitrary. So long as WebKit is faster than everything else, it's hard to give a convincing reason why it should be faster still.
Comment 20 Arthur Langereis 2008-03-28 00:43:02 PDT
I can give a reason why it should be faster: iPhone web development. The iPhone is probably the slowest device running WebKit on the market right now, and anything, _anything_ that increases JS execution speed on it will help.

Don't get me wrong, I love all the progress made here, and WK really _is_ the fastest kid on the block right now, but the iPhone is both very popular and very slow (relatively speaking). Speeding up commonly used code by a factor of 1.5x to 2x through common sense and overhead elimination is never a bad thing in my book.

And even setting the iPhone aside, I really wouldn't mind more JS speed any day on any machine. JS code takes about a 1000x performance hit compared to native code, and any and all efforts to narrow that gap will be much appreciated by anyone doing advanced client-side programming.

Sorry for the rant, I hope my arguments are valid. Thanks for all the effort.
Comment 21 Arthur Langereis 2008-03-28 03:51:36 PDT
Sorry for the spam, but before I get lynched I'd like to correct my 1000x speed-hit statement. This may have been true as of Safari 1, but currently the speed hit seems to be around 50x (purely language-wise). I tested this by comparing a native C implementation of my bits-in-byte tests to the JS versions that have been incorporated into SunSpider. Also, my JS implementation of my NES emulator can now emulate up to 30 frames of CPU activity (no audio or graphics) per second, which is really quite respectable.
Comment 22 Maciej Stachowiak 2008-03-28 17:35:56 PDT
Improving JS execution speed for the iPhone is certainly something we plan to do. I am not sure this benchmark is the best metric to drive that though. Getting faster on this test would be all about Acid3.

Now that Opera has posted a build that otherwise passes Acid3, we can see that we are about 5x faster than them on this test on reasonable laptop hardware.
Comment 23 Darin Adler 2008-03-28 17:37:28 PDT
(In reply to comment #22)
> Now that Opera has posted a build that otherwise passes Acid3, we can see that
> we are about 5x faster than them on this test on reasonable laptop hardware.

Not that it's important, but what I saw was 3.5x, not 5x.
Comment 24 Waleof Suous 2008-05-16 08:40:10 PDT
(In reply to comment #22)
> Improving JS execution speed for the iPhone is certainly something we plan to
> do. I am not sure this benchmark is the best metric to drive that though.
> Getting faster on this test would be all about Acid3.
> 
> Now that Opera has posted a build that otherwise passes Acid3, we can see that
> we are about 5x faster than them on this test on reasonable laptop hardware.
> 

WebKit is faster than Opera on the two "failed" tests, but Opera is faster than WebKit in overall speed. When they first released their build, WebKit took more than 4 seconds total to finish the test at 100/100, while Opera took around 1 second.
Comment 25 Maciej Stachowiak 2008-05-23 20:25:31 PDT
Created attachment 21325
better version of Darin's extended version

New version of the test content for profiling. Properly puts the test in function scope.
Comment 26 Maciej Stachowiak 2008-05-23 21:17:18 PDT
Some optimization ideas:

1) stringProtoFuncMatch spends 3.4% of its time calling jsString inside RegExpObjectImp::arrayOfMatches to generate the match array. But nothing in the test uses the match array, except to test whether it is null. So we could make a special lazy array subclass that fills in the array only if you request any of the matches (a sketch follows this list). (Total time in stringProtoFuncMatch is 6.8%, so there could be more to juice here. Also, since this is the major source of garbage strings, it should reduce all the time we are spending in StringImp::~StringImp.)

2) 1% of time is spent in KJS::resolve_skip; likely much of this is retrieving the global "document" property, though some of it is "Date". Since "document" is DontDelete and very commonly accessed, we could create it at the time the Window is created and statically put it in a var slot, much like the Math object.

3) JSNode::getOwnPropertySlot and JSEventTarget::getOwnPropertySlot show up with self time, but are always called from a parent, so they should probably be inlined.

4) 0.2% of time is spent calling strlen to make a String from a char* in Node::textContent. Yuck. We should pre-make these strings.

5) 0.7% in Node::isReadOnlyNode(). This is solely to support entity and entity reference nodes in the DOM. But we expand entities, so this doesn't normally even happen. So it sucks that we have to do this expensive check at every DOM mutation. Unfortunately there are no spare bits in Node. However, Element has a 3-bit bitfield, so we could move a flag that only applies to Element (such as m_tabIndexSetExplicitly) and then have an m_readOnly flag which is set only for EntityReference and Entity nodes (they will never have children, so there's no need to propagate it to their descendants). Alternately we could just make isReadOnlyNode inline and always return false.
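
To make idea 1 concrete, here's a hedged standalone sketch; std::string and std::vector stand in for the engine's types, and this is not the real RegExpObjectImp code. The array remembers match offsets and only builds a string when an element is actually read, so code that merely tests the match result against null never pays for the strings.

    #include <cstddef>
    #include <optional>
    #include <string>
    #include <utility>
    #include <vector>

    struct LazyMatchArray {
        std::string subject;                                    // the matched input
        std::vector<std::pair<std::size_t, std::size_t>> ranges; // (offset, length) per match
        mutable std::vector<std::optional<std::string>> cells;   // materialized on demand

        LazyMatchArray(std::string s, std::vector<std::pair<std::size_t, std::size_t>> r)
            : subject(std::move(s)), ranges(std::move(r)), cells(ranges.size()) {}

        // Only here do we pay for string creation, and only for the index asked for.
        const std::string& at(std::size_t i) const
        {
            if (!cells[i])
                cells[i] = subject.substr(ranges[i].first, ranges[i].second);
            return *cells[i];
        }
    };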

Comment 27 Alexey Proskuryakov 2008-05-24 14:34:29 PDT
(In reply to comment #26)
> 1)  RegExpObjectImp::arrayOfMatches

Did that, testing now.

> 4) 0.2% of time calling strlen making a String from a char* in
> Node::textContent. Yuck. We should pre-make these strings.

Did that and more in <http://trac.webkit.org/changeset/34108>.
Comment 28 Darin Adler 2008-05-24 21:49:34 PDT
(In reply to comment #26)
> 5) 0.7% in Node::isReadOnlyNode(). This is solely to support entity and entity
> reference nodes in the DOM. But we expand entities so this doesn't normally
> even happen. So it sucks that we have to do this expensive check always at DOM
> mutation time. Unfortunately there are no spare bits in Node. However, Element
> has a 3-bit bitfield, so we could move a flag that only applies to Element
> (such as m_tabIndexSetExplicitly) and then have an m_readOnly flag which is set
> only for EntityReference and Entity nodes (they will never have children so no
> need to do propagation to their descendants). Alternately we could just make
> isReadOnlyNode inline and always returning false.

I'd love to make isReadOnlyNode inline and always return false. But there's a problem with that. We do support the Document::createEntityReference function. So while there is a guarantee there are no Entity nodes in the DOM tree, there is no such guarantee for EntityReference nodes.
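
For the record, a minimal sketch of the flag variant from idea 5, which stays correct even with createEntityReference around; the field and class names here are illustrative, not the actual WebCore members:

    class Node {
    public:
        // Inline fast path: a bit test instead of a virtual nodeType() dispatch.
        bool isReadOnlyNode() const { return m_isReadOnly; }
    protected:
        explicit Node(bool isReadOnly = false) : m_isReadOnly(isReadOnly) {}
    private:
        bool m_isReadOnly : 1; // set only for Entity and EntityReference nodes
    };

    class EntityReference : public Node {
    public:
        // These nodes never gain children, so the flag never needs to
        // propagate to descendants.
        EntityReference() : Node(true) {}
    };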
Comment 29 Darin Adler 2008-05-24 22:03:10 PDT
(In reply to comment #28)
> We do support the Document::createEntityReference function.

On the other hand, that function is useless, because we don't do the work to look up the entity and insert child nodes, which is how it's supposed to work. I think it'd be better if we didn't half-support that function.
Comment 30 Ian 'Hixie' Hickson 2008-05-25 00:08:31 PDT
If we get the Web DOM Core effort off the ground, I plan to advocate for the removal of EntityReference and Attr nodes outright, fwiw. Lack of support and removal of support in browsers is a good start. :-)
Comment 31 Maciej Stachowiak 2008-05-25 06:37:38 PDT
I have a fix for this one:

> 2) 1% of time is spent in KJS::resolve_skip, likely much of this is retrieving
> the global "document" property, though some of it is "Date". Since "document"
> is DontDelete and very commonly accessed, we could create it at the time the
> Window is created and statically put it in a var slot much like the Math
> object.
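
Schematically, the fix trades a scope-chain hash lookup for an indexed load. This sketch uses invented names and is not the actual patch, just the shape of the idea:

    #include <cstddef>
    #include <vector>

    struct JSValue {};

    struct Window {
        enum Slot : std::size_t { DocumentSlot, MathSlot, SlotCount };
        std::vector<JSValue*> staticSlots;

        // "document" is DontDelete, so it can be created eagerly when the
        // window is built and pinned into a static var slot, like Math.
        Window(JSValue* document, JSValue* math)
            : staticSlots(SlotCount)
        {
            staticSlots[DocumentSlot] = document;
            staticSlots[MathSlot] = math;
        }

        // What resolving the global "document" would compile down to.
        JSValue* document() const { return staticSlots[DocumentSlot]; }
    };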
Comment 32 Darin Adler 2008-05-25 12:34:43 PDT
(In reply to comment #26)
> 5) 0.7% in Node::isReadOnlyNode().

I have a patch that cuts this down to about 0.1% attached to the related bug. (I noticed unnecessary multiple calls to nodeType() in Node::appendTextContent.)

Other optimization thoughts (not the kind of great ideas Maciej posted, but still perhaps useful):

A) 1.5% of the time is taken converting the string "class" from a UString to an AtomicString. If we could find some way to avoid searching the hash table over and over again here, that'd be a big win.

B) DOM getOwnPropertySlot functions are taking a lot of time, more than 5%. The big costs are the PIC branch used to get access to the global table and the memory accesses in that table. And the size for the hash table for JSDOMNode, for example, is huge -- 4096 slots for the 19 values just to avoid the need to rehash. Maybe there's a better data structure we can use?

C) Overhead for render objects we never end up rendering is still a big part of the profile, even with Maciej's tear-down optimization. For example, CSSStyleSelector::styleForElement is 4.7% of the time and the total time for createRendererIfNeeded is 12.5%. And this is all for an element that's going to be removed before the next layout. If we can find a way to defer creating the renderer and never end up doing it at all then we will get a *big* speedup. Maybe we can change things so that renderers can be created at layout time.

D) Tons of time is spent inside toJS, making wrappers for the text node and HTMLAnchorElement objects that are created and looking for the existing wrapper for the result of parentNode, nextSibling, and others. We could cut down a lot of hash table overhead if we were willing to put a pointer in every WebCore::Node to the corresponding JSNode.

E) We could avoid making a lot of JS string wrappers if we could create code paths that would keep the UString in the engine register until the value needs to be used as a generic value. The expression "iteration " + i + " failed" results in creation of many string wrappers and it's never used for anything at all. Similarly, the expression "iteration " + i + " at " + d results in the creation of many string wrappers and ultimately it's passed to createTextNode, which takes a string parameter rather than an arbitrary object value. I suspect that we could speed up by more than 5% by reducing the number of string wrappers if we could get the strings to the "+" operator and to the function call without converting to a wrapper, but that's probably a tall order.

F) The StringImp::toString function, 0.5%, would benefit from omit-frame-pointer.

G) The defaultValue function doesn't need to be a virtual function. Right now the bridge is using a custom implementation of defaultValue, but that's not really helpful. In fact, some of the overrides are actually using the wrong function signature (including ExecState*)! On the other hand, what's really slow here is all the overhead for doing a function call, and that's probably due to the crazy toString/defaultValue test case that's in test 26, so it might not be easy to optimize.

H) 0.6% of the time is in Document::body(). We can easily add code so that's cached.

I) We spend considerable time repeatedly checking whether "class" is a valid name. Maybe we can cache that (a fast-path sketch follows this list). We also pay function-call overhead to get the length and characters pointer from a String object. Those should probably be inlined.

J) Frame::page() is 0.3% -- we should probably move the page pointer from the private pointer to the top level of the Frame object and then inline the getter.

K) Inside the code to implement new Date we spend a lot of time in the floor call inside getCurrentUTCTime. This could be avoided by changing that code path to divide tv_usec by 1000 as an integer operation instead of a floating-point one. That's only 0.1%, though.

L) NumberImp::toObject is taking 0.7% of the time. That's all being done to call toString() on the value of the date. Is there a way to avoid the toObject in that op_get_by_id opcode when it's a built-in function on Number? We already have to branch to convert to an object, but we could use that branch to instead go to a special version of the get operation for the non-object JavaScript types that avoids creating an object. Seems like it could be a 1% speedup.

M) documentBeingDestroyed() is showing up as 0.2% on the profile. It's simple and can be inlined.

N) It looks like the empty case of ~PropertyMap should be inlined. The 0.2% of time spent in that function seems to all be in that empty case.

O) numberProtoFuncToString could be changed around so that the "no radix passed" case is slightly faster, for up to a 0.3% speedup.
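
On the isValidName cost (comment 7 and item I above), here's a sketch of one possible attack; this is an assumed shape, not the actual WebCore code. All-ASCII names like "class" get validated with cheap byte tests, and only strings containing non-ASCII characters fall through to the full ICU check:

    #include <cstddef>
    #include <string>

    static inline bool isASCIINameStart(unsigned char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == ':';
    }

    static inline bool isASCIINamePart(unsigned char c)
    {
        return isASCIINameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    enum class NameCheck { Valid, Invalid, NeedsUnicodeCheck };

    NameCheck checkNameASCII(const std::string& name)
    {
        if (name.empty())
            return NameCheck::Invalid;
        for (std::size_t i = 0; i < name.size(); ++i) {
            unsigned char c = static_cast<unsigned char>(name[i]);
            if (c >= 0x80)
                return NameCheck::NeedsUnicodeCheck; // defer to the slow ICU path
            if (i == 0 ? !isASCIINameStart(c) : !isASCIINamePart(c))
                return NameCheck::Invalid;
        }
        return NameCheck::Valid;
    }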
Comment 33 Maciej Stachowiak 2008-05-25 14:51:47 PDT
(In reply to comment #32)
> (In reply to comment #26)
> > 5) 0.7% in Node::isReadOnlyNode().
> 
> I have a patch that cuts this down to about 0.1% attached to the related bug.
> (I noticed unnecessary multiple calls to nodeType() in
> Node::appendTextContent.)
> 
> Other optimization thoughts (not the kind of great ideas Maciej posted, but
> still perhaps useful):

Comments on a few of these.

> B) DOM getOwnPropertySlot functions are taking a lot of time, more than 5%. The
> big costs are the PIC branch used to get access to the global table and the
> memory accesses in that table. And the size for the hash table for JSDOMNode,
> for example, is huge -- 4096 slots for the 19 values just to avoid the need to
> rehash. Maybe there's a better data structure we can use?

I think just inlining some of them (especially the Node and EventTargetNode ones, which are never called directly) would help.

> C) Overhead for render objects we never end up rendering is still a big part of
> the profile, even with Maciej's tear-down optimization. For example,
> CSSStyleSelector::styleForElement is 4.7% of the time and the total time for
> createRendererIfNeeded is 12.5%. And this is all for an element that's going to
> be removed before the next layout. If we can find a way to defer creating the
> renderer and never end up doing it at all then we will get a *big* speedup.
> Maybe we can change things so that renderers can be created at layout time.

I have thought about this but I am not sure it is good for the normal case.

> D) Tons of time is spent inside toJS, making wrappers for the text node and
> HTMLAnchorElement objects that are created and looking for the existing wrapper
> for the result of parentNode, nextSibling, and others. We could cut down a lot
> of hash table overhead if we were willing to put a pointer in every
> WebCore::Node to the corresponding JSNode.

I think two things would speed this up a fair bit without the extra pointer:

D.i) Make a version of toJS that takes a node which is known to be newly created, and in that case skip the hash lookup for the existing wrapper, since there can't be one (sketched after D.ii below).

D.ii) Add type-specific overloads for Text* and perhaps Element* or even HTMLElement* to avoid dispatching on the node type in the newly created case.
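
A standalone sketch of D.i with stand-in types (not the real bindings code, and lifetime management is elided): the general toJS must probe the wrapper map first, while the newly-created variant can skip straight to insertion.

    #include <cassert>
    #include <unordered_map>

    struct Node {};
    struct Wrapper { Node* impl; };

    static std::unordered_map<Node*, Wrapper*> wrapperMap; // models the DOM-to-JS wrapper table

    // General path: must check for an existing wrapper first (hash lookup).
    Wrapper* toJS(Node* node)
    {
        auto it = wrapperMap.find(node);
        if (it != wrapperMap.end())
            return it->second;
        Wrapper* w = new Wrapper{node};
        wrapperMap.emplace(node, w);
        return w;
    }

    // D.i: for a node we just created ourselves, no wrapper can exist yet,
    // so skip the find() and go straight to insertion.
    Wrapper* toJSNewlyCreated(Node* node)
    {
        assert(wrapperMap.find(node) == wrapperMap.end());
        Wrapper* w = new Wrapper{node};
        wrapperMap.emplace(node, w);
        return w;
    }

The createTextNode and createElement call sites construct the node themselves, so they could use the newly-created variant; D.ii's type-specific overloads would additionally avoid the node-type dispatch when choosing the wrapper class.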

> E) We could avoid making a lot of JS string wrappers if we could create code
> paths that would keep the UString in the engine register until the value needs
> to be used as a generic value. The expression "iteration " + i + " failed"
> results in creation of many string wrappers and it's never used for anything at
> all. Similarly, the expression "iteration " + i + " at " + d results in the
> creation of many string wrappers and ultimately it's passed to createTextNode,
> which takes a string parameter rather than an arbitrary object value. I suspect
> that we could speed up by more than 5% by reducing the number of string
> wrappers if we could get the strings to the "+" operator and to the function
> call without converting to a wrapper, but that's probably a tall order.

I think we can make a significant improvement without changing the register representation, and adding only two opcodes. But I am not sure how general the optimization is. Here's the idea:

a) The condition for this to apply is an AddNode chain where the leftmost operand is a constant string (this means every addition is a string add). You have to be careful though because "xx" + (i + z) can't assume all string adds.

b) In this case, emit code to load the constant string operands and convert non-string operands to string (with a new tostring opcode) to sequential registers.

c) Emit a new append_strings opcode which assumes all operands are primitive strings, and concatenates them at one go into a fresh UString. This would both reduce allocation work in UString (can do multiple appends all at once) and reduce the number of temporaries created (in "iteration " + i + " at " + d, i and d are still converted to string resulting in temporaries, but the GC values for "iteration " + i and "iteration " + i + " at " need never be created, cutting the number of unobserved string temporaries in half).

I am not sure whether this optimization is general enough to be worth adding two opcodes.
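
To make the payoff of (c) concrete, here's a hedged standalone sketch of the single-pass concatenation the append_strings opcode would perform, with std::string standing in for UString:

    #include <cstddef>
    #include <string>
    #include <vector>

    std::string appendStrings(const std::vector<std::string>& operands)
    {
        std::size_t total = 0;
        for (const auto& s : operands)
            total += s.size();

        std::string result;
        result.reserve(total); // one sized allocation for the final string
        for (const auto& s : operands)
            result += s;       // no intermediate concatenation results
        return result;
    }

    // "iteration " + i + " at " + d becomes one append_strings over
    // {"iteration ", toString(i), " at ", toString(d)}; the partial results
    // "iteration " + i and "iteration " + i + " at " are never materialized.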

> L) NumberImp::toObject is taking 0.7% of the time. That's all being done to
> call toString() on the value of the date. Is there a way to avoid the toObject
> in that op_get_by_id opcode when it's a built-in function on Number? We already
> have to branch to convert to an object, but we could use that branch to instead
> go to a special version of the get operation for the non-object JavaScript
> types that avoids creating an object. Seems like it could be a 1% speedup.

It's possible to optimize property access on primitives, but it's a bit tricky. If the property turns out to be a manually added prototype method that uses the "this" value and possibly even stores it, then the toObject conversion has to actually happen on entry to that function. Geoff and I have discussed ideas for how to do this. Note that this would in general help strings more than numbers. It would help some of the string tests on SunSpider quite a bit. I can give more detail if needed.
Comment 34 Ian 'Hixie' Hickson 2008-05-25 15:44:23 PDT
I'd like to echo the sentiment underlying mjs' most recent comments -- when looking at optimisations here please consider how they would affect the real Web first. It doesn't matter how fast Acid3 is really, the idea is just that it might be a (poor) proxy for real Web usage.

Having said that, regarding the strings specifically, concatenating strings is pretty common. Why would that idea not generalise well?
Comment 35 Maciej Stachowiak 2008-05-25 16:11:59 PDT
(In reply to comment #33)
> 
> 
> > E) We could avoid making a lot of JS string wrappers if we could create code
> > paths that would keep the UString in the engine register until the value needs
> > to be used as a generic value. The expression "iteration " + i + " failed"
> > results in creation of many string wrappers and it's never used for anything at
> > all. Similarly, the expression "iteration " + i + " at " + d results in the
> > creation of many string wrappers and ultimately it's passed to createTextNode,
> > which takes a string parameter rather than an arbitrary object value. I suspect
> > that we could speed up by more than 5% by reducing the number of string
> > wrappers if we could get the strings to the "+" operator and to the function
> > call without converting to a wrapper, but that's probably a tall order.
> 
> I think we can make a significant improvement without changing the register
> representation, and adding only two opcodes. But I am not sure how general the
> optimization is. Here's the idea:

Some clarification.

> 
> a) The condition for this to apply is an AddNode chain where the leftmost
> operand is a constant string (this means every addition is a string add). You
> have to be careful though because "xx" + (i + z) can't assume all string adds.
> 
> b) In this case, emit code to load the constant string operands and convert
> non-string operands to string (with a new tostring opcode) to sequential
> registers.

On second thought, the append_strings opcode could do any needed string conversion, so it could be done with only one extra opcode.

> c) Emit a new append_strings opcode which assumes all operands are primitive
> strings, and concatenates them at one go into a fresh UString. This would both
> reduce allocation work in UString (can do multiple appends all at once) and
> reduce the number of temporaries created (in "iteration " + i + " at " + d, i
> and d are still converted to string resulting in temporaries, but the GC values
> for "iteration " + i and "iteration " + i + " at " need never be created,
> cutting the number of unobserved string temporaries in half).

The append_strings opcode would take a starting register and a register count (or ending register) as its parameters. Outputting to the right registers could be handled as for function call parameters.
Comment 36 Maciej Stachowiak 2008-05-25 16:18:17 PDT
(In reply to comment #34)
> I'd like to echo the sentiment underlying mjs' most recent comments -- when
> looking at optimisations here please consider how they would affect the real
> Web first. It doesn't matter how fast Acid3 is really, the idea is just that it
> might be a (poor) proxy for real Web usage.

In all honesty, the test 26 timing loop is not a very good real-world benchmark. As Darin mentions, one of the most effective ways to speed it up is to optimize handling of DOM nodes that are inserted into the document and then removed again before they can lay out or paint, something that very rarely happens in real situations. It is unlikely that highly effective general optimizations will take test 26 into the win category; for instance, our massive JS interpreter rewrite had minimal impact here even though it speeds up many more realistic benchmarks by factors of 1.2x to 2.1x.

That being said, we'd like to win anyway, and hopefully can do it with optimizations that at least do not harm the general case or may be useful for other reasons.

> Having said that, regarding the strings specifically, concatenating strings is
> pretty common. Why would that idea not generalise well?

It might. Note that my suggestion only covers some very specific patterns. For instance, these cases could not be optimized:

1) var s = x + y + " some constant string";
2) var s = "foo"; s += x; s += y;
3) var s = "foo" + x; // no intermediate values avoided here

The question is how common such patterns are compared to var s = "foo" + x + y; or similar. If the optimizable pattern isn't that common as a hot spot in real code, then it is unwise to add complexity to the bytecode interpreter's main loop to handle it. I am not really sure of the frequency distribution.
Comment 37 Maciej Stachowiak 2008-05-25 19:48:32 PDT
Created attachment 21344
not quite working version of optimizing "document" to a static slot

This still fails some layout tests. I probably have the timing wrong.
Comment 38 Maciej Stachowiak 2008-05-25 20:19:25 PDT
Besides the obvious simple inlines and additional caches, I like D and L from Darin's ideas (as modified by my suggestions) though L is tricky to implement. E also sounds like it could be good, though maybe also tricky to implement.

I have a feeling we won't win this one without some form of C, but we will need some input from Hyatt.
Comment 39 Geoffrey Garen 2008-05-26 13:05:23 PDT
As for E, I believe that type specialization in the bytecode engine would naturally remove all string wrappers created by 

"iteration " + i + " at " + d

Since "iteration" and " at " are known strings at compile time, we know that this expression is a string add, so i and d should produce strings, not generic values.

We would still need to create a string wrapper for the sake of calling createTextNode, though. Specializing for the expected types of arbitrary DOM APIs would be very hard. (I'm not even sure how we would do that.)
Comment 40 Maciej Stachowiak 2008-05-26 15:58:07 PDT
(In reply to comment #39)
> As for E, I believe that type specialization in the bytecode engine would
> naturally remove all string wrappers created by 
> 
> "iteration " + i + " at " + d
> 
> Since "iteration" and " at " are known strings at compile time, we know that
> this expression is a string add, so i and d should produce strings, not generic
> values.

i and d have to produce generic values first because they have to go through toString/valueOf. d in particular has a custom toString() function in this case, so that's not just a theoretical issue. This approach would in theory be able to remove the same temporary wrappers as my proposed multi-concatenation opcode, but not the UString temporaries. So I think optimizing multiple concatenations might be worthwhile even when we have type specialization.

> We would still need to create a string wrapper for the sake of calling
> createTextNode, though. Specializing for the expected types of arbitrary DOM
> APIs would be very hard. (I'm not even sure how we would do that.)

If registers are type-coded in an obvious way, then it could still be a UString in the arguments vector. However, UString is problematic to use in register values at all because of the refcounting. Unlike primitive values or JSValue*, when a register containing such a value is killed we'd have to deref.
Comment 41 Maciej Stachowiak 2008-05-26 22:37:52 PDT
I took a fresh profile and looked at a top-down view to get a rough estimate of how readily we can improve times on this test. We need to cut out approximately 23% of the current execution time (a 1.3x speedup).

Top-down, I see the following mutually exclusive categories of time:

7.5% self time in KJS::Machine::privateExecute (could potentially be reduced by optimizing JS)
16.8% total time in JSNode::insertBefore
8.5% total time in JSNode::removeChild
8.4% total time in JSObject::defaultValue
7.7% total time in jsString()
4.8% total time in JSElement::setAttribute
4.0% total time in jsDocumentPrototypeFunctionCreateElement
3.1% total time in stringProtoFuncMatch
3.1% total time in JSNode::getValueProperty
2.3% total time in JSNode::appendChild
2.0% total time in jsDocumentPrototypeFunctionCreateTextNode
1.6% total time in makeFunction
1.6% total time in UString concatenation constructor
1.5% total time in jsDocumentPrototypeFunctionGetElementById
1.3% total time in DateObjectImp::construct
1.1% total time in JSCharacterData::getValueProperty

It seems like optimizing string concatenation could plausibly remove half the jsString() time and half the UString constructor time, so that is a potential big win. Devirtualizing defaultValue could also have a good payoff. Reducing the insertBefore or removeChild time seems like a challenge, though it may be necessary to win.
Comment 42 Maciej Stachowiak 2008-05-26 23:01:36 PDT
Comment on attachment 21344
not quite working version of optimizing "document" to a static slot

Obsoleting this, since I have landed a working version.
Comment 43 Peter Neumayr 2008-06-01 08:50:27 PDT
With the latest nightly build (34279) the test passes on my MacBook in most cases; sometimes test 26 takes up to 37ms, but since a 2GHz dual-core MacBook should be considerably slower than the latest state-of-the-art Apple laptop, this should be OK. So, how fast do you want to be before announcing that you've won the race?
Comment 44 Mark Rowe (bdash) 2008-06-06 14:51:27 PDT
With TOT WebKit, test 26 passes comfortably on a 2.4GHz MacBook Pro, which is slower than what Hixie has chosen as the reference hardware for the performance aspect of the test. Closing as fixed.