Bug 94810 - Vector::shrinkToFit should use realloc when canMoveWithMemcpy is true
Summary: Vector::shrinkToFit should use realloc when canMoveWithMemcpy is true
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Template Framework (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-08-23 07:15 PDT by Yong Li
Modified: 2012-08-30 14:56 PDT (History)
12 users (show)

See Also:


Attachments
the patch (3.52 KB, patch)
2012-08-27 10:03 PDT, Yong Li
benjamin: review-
buildbot: commit-queue-
Details | Formatted Diff | Diff
only for blocks >=1k (3.69 KB, patch)
2012-08-28 11:07 PDT, Yong Li
yong.li.webkit: review-
Details | Formatted Diff | Diff
Try again (4.17 KB, patch)
2012-08-28 12:17 PDT, Yong Li
no flags Details | Formatted Diff | Diff
should use canMoveWithMemcpy (4.17 KB, patch)
2012-08-28 12:26 PDT, Yong Li
no flags Details | Formatted Diff | Diff
the test code I'm using... (1.52 KB, text/x-c++src)
2012-08-28 12:33 PDT, Yong Li
no flags Details
Patch (4.01 KB, patch)
2012-08-28 12:52 PDT, Yong Li
benjamin: review-
buildbot: commit-queue-
Details | Formatted Diff | Diff
Updated (4.17 KB, patch)
2012-08-29 08:37 PDT, Yong Li
buildbot: commit-queue-
Details | Formatted Diff | Diff
try again (4.39 KB, patch)
2012-08-29 09:35 PDT, Yong Li
peter+ews: commit-queue-
Details | Formatted Diff | Diff
Again (4.38 KB, patch)
2012-08-29 11:16 PDT, Yong Li
benjamin: review+
benjamin: commit-queue-
Details | Formatted Diff | Diff
To commit (see if GTK can build now) (4.38 KB, patch)
2012-08-30 10:09 PDT, Yong Li
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Yong Li 2012-08-23 07:15:38 PDT
This should be safe. But I don't know why there are both canCopyWithMemcpy and canMoveWithMemcpy. Is there any case they are not equivalent?
Comment 1 Alexey Proskuryakov 2012-08-23 11:52:20 PDT
Can realloc result in more memory fragmentation?
Comment 2 Yong Li 2012-08-23 12:07:31 PDT
(In reply to comment #1)
> Can realloc result in more memory fragmentation?

It could... But that is up to the implementation of realloc, because realloc doesn't have to keep the address unchanged when shrinking.
Comment 3 Alexey Proskuryakov 2012-08-23 12:14:54 PDT
I think that at least OS X man page for realloc implies that it cannot move the allocation when shrinking.
Comment 4 Yong Li 2012-08-23 12:22:17 PDT
(In reply to comment #3)
> I think that at least OS X man page for realloc implies that it cannot move the allocation when shrinking.

Hm... good to know. For large blocks, the memory allocator can decommit/munmap the free space. For small blocks, probably we don't care that much?

Also StringBuilder/StringImpl is using fastRealloc to shrink memory anyway.
Comment 5 Yong Li 2012-08-27 10:03:42 PDT
Created attachment 160738 [details]
the patch

It doesn't show any noticeable diff on sunspider and octane, though.

Any other suggested perf tests?
Comment 6 Build Bot 2012-08-27 10:38:33 PDT
Comment on attachment 160738 [details]
the patch

Attachment 160738 [details] did not pass win-ews (win):
Output: http://queues.webkit.org/results/13619208
Comment 7 Benjamin Poulain 2012-08-27 22:19:57 PDT
Comment on attachment 160738 [details]
the patch

r- because you do not provide any rationale for making the change.

1) Please microbenchmark this to provide numbers for the improvement.
2) Please find what is the memory impact. Exactly how does fastMalloc work with those Realloc.


> Also StringBuilder/StringImpl is using fastRealloc to shrink memory anyway.

This is pretty much irrelevant. The memory allocated there is small on average and the lost memory on shrink is at most a little less the size of the useful payload.
Comment 8 Yong Li 2012-08-28 07:33:22 PDT
(In reply to comment #7)
> (From update of attachment 160738 [details])
> r- because you do not provide any rationale for making the change.
> 
> 1) Please microbenchmark this to provide numbers for the improvement.
> 2) Please find what is the memory impact. Exactly how does fastMalloc work with those Realloc.

OK

> 
> 
> > Also StringBuilder/StringImpl is using fastRealloc to shrink memory anyway.
> 
> This is pretty much irrelevant. The memory allocated there is small on average and the lost memory on shrink is at most a little less the size of the useful payload.

Isn't this the worst case of using realloc (causing fragments)? If the memory allocated isn't small, and realloc() would keep the address unchanged, it saves a memcpy on a large block. If realloc() uses different addresses, it is same as malloc/memcpy/free.

I understand this depends very much on the actual malloc algorithm. But what Vector is trying to do here is just a re-alloc. It is very natural to use realloc, and lets the malloc algorithm to decide what to do for best performance. Probably adding an ENABLE switch is better?
Comment 9 Benjamin Poulain 2012-08-28 08:37:59 PDT
> Isn't this the worst case of using realloc (causing fragments)? If the memory allocated isn't small, and realloc() would keep the address unchanged, it saves a memcpy on a large block. If realloc() uses different addresses, it is same as malloc/memcpy/free.
> 
> I understand this depends very much on the actual malloc algorithm. But what Vector is trying to do here is just a re-alloc. It is very natural to use realloc, and lets the malloc algorithm to decide what to do for best performance.

I am not sure what you arguing for? (against?)
What you say is correct, and I agree using realloc seems like the good thing to do.

But you should not make changes like this one without measuring the benefits. You are modifying a abstract data type used everywhere in WebKit in a non obvious way, you should be able to prove the change is correct.

It is not obvious if
1) This will increase memory fragmentation because realloc is not flexible for shrink.
2) This will not provide any improvement and it is just extra code to support.
3) Fastmalloc does the right thing and the patch avoid some memcopy.

(3) should go in WebKit, (1) and (2) should just be discarded. I only ask you make sure you have the third case.

> Probably adding an ENABLE switch is better?

No, please. ENABLE switch is a bad idea here.
Comment 10 Yong Li 2012-08-28 10:20:26 PDT
(In reply to comment #9)
> It is not obvious if
> 1) This will increase memory fragmentation because realloc is not flexible for shrink.
> 2) This will not provide any improvement and it is just extra code to support.
> 3) Fastmalloc does the right thing and the patch avoid some memcopy.
> 
> (3) should go in WebKit, (1) and (2) should just be discarded. I only ask you make sure you have the third case.
> 
> > Probably adding an ENABLE switch is better?
> 
> No, please. ENABLE switch is a bad idea here.

It seems FastMalloc is OK, as its realloc() allocates a new block when "the new size is significantly smaller than the old size".

  // Reallocate if the new size is larger than the old size,
  // or if the new size is significantly smaller than the old size.
  if ((new_size > old_size) || (AllocationSize(new_size) < old_size)) {

But not all ports are using same memory allocator. Some ports have USE_SYSTEM_MALLOC defined. As Alexey said, "at least OS X man page for realloc implies that it cannot move the allocation when shrinking".

PLATFORM(BLACKBERRY) also uses the system memory allocator. I'm trying to get some numbers on QNX, but the results won't apply to all ports.


BTW, I've seen in reality WebKit runs out-of-memory in shrinkToFit, which is a little bit weird: it is trying to shrink memory usage, but ends up with allocating more memory and runs OOM even faster. Reusing the old memory address for larges blocks can also reduce memory spikes in some cases.
Comment 11 Yong Li 2012-08-28 11:06:06 PDT
I tested with the following code, and found that realloc() gives noticeable performance boost only on large blocks (>=1k).  It is even slower than malloc/memcpy/free solution on very small blocks. The larger the blocks are, the more performance boost realloc() gives. So it seems it is better to do this only when both old and new capacities are larger than 1k.

        double start = currentTime();
        for (int i = 0; i < 100000; ++i) {
            Vector<char> vector;
            vector.reserveCapacity(<capacity>);
            vector.resize(<size>);
            vector.shrinkToFit();
        }
        double elapsed = currentTime() - start;
Comment 12 Yong Li 2012-08-28 11:07:10 PDT
Created attachment 161015 [details]
only for blocks >=1k

Can anyone also try the patch?
Comment 13 Yong Li 2012-08-28 11:29:35 PDT
Comment on attachment 161015 [details]
only for blocks >=1k

actually we should use canMoveWithMemcpy
Comment 14 Benjamin Poulain 2012-08-28 11:36:58 PDT
View in context: https://bugs.webkit.org/attachment.cgi?id=161015&action=review

So you picked the 1K based on measurement on your platform, one of the few that does not use USE_SYSTEM_MALLOC.

And you set the patch for review for all platforms. Don't you see anything wrong with that?

> Source/WTF/ChangeLog:10
> +        Use realloc to shrink buffer when it is changing from a large malloc buffer
> +        to another large one (> 1k), and canCopyWithMemcpy is true.
> +

You should give the performance analysis and the numbers in your ChangeLog.

Here is an example of a good ChangeLog of performance work: http://trac.webkit.org/changeset/124990
Comment 15 Yong Li 2012-08-28 11:45:53 PDT
(In reply to comment #14)
> View in context: https://bugs.webkit.org/attachment.cgi?id=161015&action=review
> 
> So you picked the 1K based on measurement on your platform, one of the few that does not use USE_SYSTEM_MALLOC.
> 
> And you set the patch for review for all platforms. Don't you see anything wrong with that?

yes I do :). Probably I shouldn't set it for review. This is why I suggested to use some macros, and I'm asking for help to investigate this plan on other ports...
Comment 16 Benjamin Poulain 2012-08-28 11:50:19 PDT
> > So you picked the 1K based on measurement on your platform, one of the few that does not use USE_SYSTEM_MALLOC.
> > 
> > And you set the patch for review for all platforms. Don't you see anything wrong with that?
> 
> yes I do :). Probably I shouldn't set it for review. This is why I suggested to use some macros, and I'm asking for help to investigate this plan on other ports...

Having some #ifdef for Blackberry is perfectly reasonable here. It is the only platform for which you can demonstrate your patch is an improvement.
Comment 17 Benjamin Poulain 2012-08-28 12:00:30 PDT
> Having some #ifdef for Blackberry is perfectly reasonable here. It is the only platform for which you can demonstrate your patch is an improvement.

To be clear, I think ENABLE() is a bad idea, PLATFORM(BLACKBERRY) or something even more specific to the allocator is perfectly reasonable.

E.g.

#if USE(SYSTEM_MALLOC) && PLATFORM(BLACKBERRY)
const unsigned reallocOnShrinkSizeThreshold = 1024;
#define VECTOR_REALLOC_ON_SHRINK 
#endif // USE(SYSTEM_MALLOC) && PLATFORM(BLACKBERRY)
Comment 18 Yong Li 2012-08-28 12:17:23 PDT
Created attachment 161029 [details]
Try again

How about this? Hopefully we can get some numbers for realloc vs malloc/memcpy/free comparison on other ports very soon...
Comment 19 Yong Li 2012-08-28 12:26:17 PDT
Created attachment 161031 [details]
should use canMoveWithMemcpy

actually it is not making copies, but just moving.
Comment 20 Yong Li 2012-08-28 12:33:44 PDT
Created attachment 161033 [details]
the test code I'm using...

It is a little bit ugly I know :)

Basically in every cycle, it creates a vector, allocates buffer, sets the size, and then call shrinkToFit(), and destructs the buffer.
Comment 21 Yong Li 2012-08-28 12:52:58 PDT
Created attachment 161042 [details]
Patch

Sorry for keeping changing it... Actually we don't want the 1K threshold. 1) the measurement for small blocks is not stable. It is actually very close as realloc() doesn't nothing but malloc/memcpy/free. 2) if realloc() was slower, it should be fixed and improved anyway.
Comment 22 Build Bot 2012-08-28 18:00:54 PDT
Comment on attachment 161042 [details]
Patch

Attachment 161042 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/13656552
Comment 23 Benjamin Poulain 2012-08-28 22:55:31 PDT
Comment on attachment 161042 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=161042&action=review

I just r- for the minor issues, but everything looks good really. 

I'll have to double check the VectorTraits<T>::canMoveWithMemcpy, I do not remember by heart if it is the right one.

> Source/WTF/ChangeLog:8
> +        Only tested on BlackBerry. So it is wrapped with PLATFORM(BLACKBERRY) at the mean time.

"at the mean time" -> "in the meantime"/"for now"

> Source/WTF/ChangeLog:12
> +        Use realloc to shrink buffer when it is changing from a large malloc buffer
> +        to another large one (> 1k), and canMoveWithMemcpy is true. When a Vector<char> shrinks
> +        capacity from 2K to 1K, this gives more than 10x times performance boost on shrinkToFit()
> +        (Tested on BlackBerry).

This is outdated, your removed the 1K limit.
Please use a new paragraph for the performance detail, that'll make it easier to track.

10 times faster!!! I should really try your bench with fastMalloc!

> Source/WTF/wtf/Vector.h:288
> +            // FIXME: Return true when realloc() can give better performance.

The "when" is ambiguous IMHO. More like:
// FIXME: Return true on the platform where realloc() give better performance.
?

+ UNUSED_PARAM(newCapacity); (for the Mac break).
Comment 24 Yong Li 2012-08-29 07:47:11 PDT
(In reply to comment #23)
> (From update of attachment 161042 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=161042&action=review
> I'll have to double check the VectorTraits<T>::canMoveWithMemcpy, I do not remember by heart if it is the right one.

If I understand correctly, this is for the objects with something like ref-counter. For example, RefPtr<T> cannot be C++-copied with memcpy, as creating a copy should increase the ref-count, but memcpy doesn't do that. However, it can be moved with memcpy as it is nothing but a pointer to the RefCounted object.

> 10 times faster!!! I should really try your bench with fastMalloc!

That is only for large blocks. And it is reasonable because it just releases the free space to system in this case, so it takes no time compared to allocating a new block, memcpying, and freeing the old block.

Thanks a lot for reviewing and all the suggestions!
Comment 25 Yong Li 2012-08-29 08:37:19 PDT
Created attachment 161235 [details]
Updated

I changed the number "10x" because it is actually an estimate after deducting the extra malloc/free in every cycle. Also I made a mistake that it was on 8K->4K shrinking. To be clear, for 2K-1K shrinking, the time it spends given by the test reduces from 100% to 60-70%, For 8k-4k shrinking, the elapsed time reduces from 100% to ~55%.. I'm not sure how to calculate this performance boost percentage. Now I'm using this formula: (<old time cost> - <new time cost>) / <old time cost>. Then this cannot explain why we see "2x times faster" in some logs...
Comment 26 Build Bot 2012-08-29 09:14:58 PDT
Comment on attachment 161235 [details]
Updated

Attachment 161235 [details] did not pass win-ews (win):
Output: http://queues.webkit.org/results/13680259
Comment 27 Build Bot 2012-08-29 09:19:44 PDT
Comment on attachment 161235 [details]
Updated

Attachment 161235 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/13690259
Comment 28 Yong Li 2012-08-29 09:35:43 PDT
Created attachment 161248 [details]
try again
Comment 29 Peter Beverloo (cr-android ews) 2012-08-29 11:09:40 PDT
Comment on attachment 161248 [details]
try again

Attachment 161248 [details] did not pass cr-android-ews (chromium-android):
Output: http://queues.webkit.org/results/13687288
Comment 30 Early Warning System Bot 2012-08-29 11:11:40 PDT
Comment on attachment 161248 [details]
try again

Attachment 161248 [details] did not pass qt-ews (qt):
Output: http://queues.webkit.org/results/13689318
Comment 31 Yong Li 2012-08-29 11:13:02 PDT
(In reply to comment #29)
> (From update of attachment 161248 [details])
> Attachment 161248 [details] did not pass cr-android-ews (chromium-android):
> Output: http://queues.webkit.org/results/13687288

:(. So many senseless warnings!
Comment 32 Yong Li 2012-08-29 11:16:55 PDT
Created attachment 161271 [details]
Again
Comment 33 Yong Li 2012-08-30 07:55:03 PDT
GTK build failure:

  CXXLD  Programs/TestWebKitAPI/TestWebKit2
make[1]: Leaving directory `/home/phil/gst/jhbuild/makes/WebKit/WebKitBuild/Release'
Segmentation fault


Don't think it is any issue in the patch.

Ping reviewers?
Comment 34 Benjamin Poulain 2012-08-30 09:44:28 PDT
Comment on attachment 161271 [details]
Again

View in context: https://bugs.webkit.org/attachment.cgi?id=161271&action=review

The patch looks correct. Just make sure there is no problem with GTK when you land it.

> Source/WTF/ChangeLog:8
> +        Only tested on BlackBerry. So it is wrapped with PLATFORM(BLACKBERRY) at the mean time.

"at the mean time" -> in the meantime.
Comment 35 Yong Li 2012-08-30 10:05:09 PDT
Comment on attachment 161271 [details]
Again

Thanks Ben. Commit bot should be able to verify it builds. I'll keep an eye on it.
Comment 36 Benjamin Poulain 2012-08-30 10:06:12 PDT
Comment on attachment 161271 [details]
Again

You also need to update the changelog.
Comment 37 Yong Li 2012-08-30 10:09:25 PDT
Created attachment 161504 [details]
To commit (see if GTK can build now)
Comment 38 WebKit Review Bot 2012-08-30 13:51:45 PDT
Comment on attachment 161504 [details]
To commit (see if GTK can build now)

Clearing flags on attachment: 161504

Committed r127186: <http://trac.webkit.org/changeset/127186>
Comment 39 WebKit Review Bot 2012-08-30 13:51:51 PDT
All reviewed patches have been landed.  Closing bug.
Comment 40 Darin Adler 2012-08-30 14:56:39 PDT
(In reply to comment #0)
> But I don't know why there are both canCopyWithMemcpy and canMoveWithMemcpy. Is there any case they are not equivalent?

Yes. A class like RefPtr can return true for canMoveWithMemcpy but must return false for canCopyWithMemcpy because a copy requires incrementing the reference count.