Bug 194811 - [bmalloc] bmalloc::Cache should not be instantiated if we are using system malloc
Summary: [bmalloc] bmalloc::Cache should not be instantiated if we are using system ma...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: bmalloc (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords: InRadar
: 194794 (view as bug list)
Depends on: 194856
Blocks:
  Show dependency treegraph
 
Reported: 2019-02-19 01:29 PST by Yusuke Suzuki
Modified: 2019-03-20 12:42 PDT (History)
5 users (show)

See Also:


Attachments
Patch (12.90 KB, patch)
2019-02-19 16:00 PST, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (12.92 KB, patch)
2019-02-19 16:01 PST, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (12.95 KB, patch)
2019-02-19 16:26 PST, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (12.55 KB, patch)
2019-02-19 17:24 PST, Yusuke Suzuki
mark.lam: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Yusuke Suzuki 2019-02-19 01:29:01 PST
bmalloc::Cache etc. is anyway initialized, and it takes too much memory.
Comment 1 Geoffrey Garen 2019-02-19 09:02:10 PST
A few thoughts:

(1) I don't understand why DebugHeap has m_sizeMap. We should probably remove that.

(2) I guess you could have the static Cache allocation / deallocation functions allocate / deallocate through the debug heap directly, without ever creating a per-thread cache. That's acceptable for performance because, when bmalloc is enabled, each thread will do it only once. But I'm not sure how you would synchronize on creating the DebugHeap. It's probably too slow just to take a global lock on every allocation / deallocation when using system malloc.
Comment 2 Yusuke Suzuki 2019-02-19 11:36:56 PST
(In reply to Geoffrey Garen from comment #1)
> A few thoughts:
> 
> (1) I don't understand why DebugHeap has m_sizeMap. We should probably
> remove that.

It seems that DebugHeap is using this map to record the requested size for memalignLarge, and we need to use this size when deallocating this region with munmap.

> 
> (2) I guess you could have the static Cache allocation / deallocation
> functions allocate / deallocate through the debug heap directly, without
> ever creating a per-thread cache. That's acceptable for performance because,
> when bmalloc is enabled, each thread will do it only once. But I'm not sure
> how you would synchronize on creating the DebugHeap. It's probably too slow
> just to take a global lock on every allocation / deallocation when using
> system malloc.

DebugHeap is PerProcess<>. So creation itself is OK. If the cache you said means DebugHeap* in Cache (::m_debugHeap), right, I want to keep this too.
I'm OK having per-thread Cache, if sizeof(bmalloc::Cache) is <= page size (mmap memory region), or even smaller (and allocated from system malloc in system malloc configuration).
I want to reduce the size of bmalloc::Cache etc. sizeof(bmalloc::Cache) is 14000 (13KB), and we have this per HeapKind => PerHeapKind<bmalloc::Cache> is 42000 (41KB). Some of this part is touched unfortunately, and vmmap says "WebKit Malloc" has 72KB dirty memory even while we are using system malloc configuration "Malloc=1".
We can see large mmap is executed in bmalloc by using dtruss (`sudo dtruss -t mmap -p $PID`). The following is that simple hello_world.js case.

mmap(0x0, 0x17, 0x1, 0x2, 0x3, 0x0)              = 0x110DA8000 0
mmap(0x0, 0xB000, 0x3, 0x1002, 0x35000000, 0x0)          = 0x110DA9000 0
mmap(0x0, 0x1000, 0x3, 0x1002, 0x35000000, 0x0)          = 0x110DB4000 0
mmap(0x0, 0x3000, 0x3, 0x1002, 0x35000000, 0x0)          = 0x110DB5000 0
mmap(0x0, 0x1000, 0x3, 0x1002, 0x35000000, 0x0)          = 0x110DB8000 0
mmap(0x0, 0x1000, 0x3, 0x1002, 0x35000000, 0x0)          = 0x110DE5000 0
mmap(0x0, 0x1000, 0x3, 0x1002, 0x35000000, 0x0)          = 0x110DE6000 0
mmap(0x3220BE000000, 0x40002000, 0x7, 0x1802, 0x40000000, 0x0)           = 0x3220BE000000 0 (This is protection change of the JIT region. Does not affect on dirty memory)
mmap(0x3220BE000000, 0x1000, 0x0, 0x1012, 0x40000000, 0x0)               = 0x3220BE000000 0 (...)
mmap(0x3220FE001000, 0x1000, 0x0, 0x1012, 0x40000000, 0x0)               = 0x3220FE001000 0 (...)
mmap(0x0, 0x1000, 0x1, 0x1, 0x4, 0x0)            = 0x110DE7000 0
mmap(0x0, 0xB000, 0x3, 0x1002, 0x35000000, 0x0)          = 0x110DE8000 0

If we use string_lists.js, our "WebKit Malloc" region is even extended, "380K". This is because Heap Helper Threads starts waking up and touch Cache, then Cache is created per thread. Now, I'm considering that our large thread memory footprint in JSC is mainly occupied by "system malloc", "stack" and this "WebKit Malloc".
Comment 3 Yusuke Suzuki 2019-02-19 14:05:16 PST
I've experimentally reduced the bmalloc::Cache size by modifying Sizes.h to verify bmalloc data structures introduce dirty memory region. vmmap tells that the size shrinks from 72KB to 24KB ("WebKit Malloc", we can reduce further).
And we get footprint reduction.

BEFORE:
Code finished, language environment destroyed..
    Current Footprint: 2363392
    Peak Footprint: 2363392
Press enter to continue.

AFTER:
Code finished, language environment destroyed..
    Current Footprint: 2273280
    Peak Footprint: 2273280
Press enter to continue.
Comment 4 Geoffrey Garen 2019-02-19 15:00:59 PST
> > (1) I don't understand why DebugHeap has m_sizeMap. We should probably
> > remove that.
> 
> It seems that DebugHeap is using this map to record the requested size for
> memalignLarge, and we need to use this size when deallocating this region
> with munmap.

Can we remove the code in DebugHeap::memalignLarge and DebugHeap::freeLarge, and call through to posix_memalign() and free() instead?

> DebugHeap is PerProcess<>. So creation itself is OK. If the cache you said
> means DebugHeap* in Cache (::m_debugHeap), right, I want to keep this too.

If DebugHeap is PerProcss, maybe the Cache slow path can test for it directly before allocating the Cache.
Comment 5 Geoffrey Garen 2019-02-19 15:02:23 PST
> I've experimentally reduced the bmalloc::Cache size by modifying Sizes.h to
> verify bmalloc data structures introduce dirty memory region.

It's very likely that reducing those values will cause a regression in PerformanceTests/MallocBench, especially on machines with many CPUs.

Probably better to avoid creating the Cache when we don't use it, but keep the Cache big when we need it -- if possible.
Comment 6 Yusuke Suzuki 2019-02-19 15:18:16 PST
(In reply to Geoffrey Garen from comment #5)
> > I've experimentally reduced the bmalloc::Cache size by modifying Sizes.h to
> > verify bmalloc data structures introduce dirty memory region.
> 
> It's very likely that reducing those values will cause a regression in
> PerformanceTests/MallocBench, especially on machines with many CPUs.

Agreed. The simple experiment shows that Cache's (and possibly Heap's) size is actually the problem.
We should have some way to avoid creation of them in debug heap mode.

> 
> Probably better to avoid creating the Cache when we don't use it, but keep
> the Cache big when we need it -- if possible.

Yes. We should move debugHeap->malloc/free things into slow path of Cache.
And instead of initializing Cache by using getSlow(), use debugHeap at the slow path.
Then, Cache is kept large, and bmalloc still goes the fast path since Cache::getFast() succeeds.
And DebugHeap case goes to the static slow path functions in Cache, and allocate memory there from DebugHeap.
I think DebugHeap case does not introduce any performance regression too, because DebugHeap case anyway goes to the slow case of the allocation right now (bump allocators always say fast path is not usable).
Comment 7 Yusuke Suzuki 2019-02-19 15:19:58 PST
(In reply to Yusuke Suzuki from comment #6)
> Yes. We should move debugHeap->malloc/free things into slow path of Cache.
> And instead of initializing Cache by using getSlow(), use debugHeap at the
> slow path.
> Then, Cache is kept large, and bmalloc still goes the fast path since
> Cache::getFast() succeeds.
> And DebugHeap case goes to the static slow path functions in Cache, and
> allocate memory there from DebugHeap.
> I think DebugHeap case does not introduce any performance regression too,
> because DebugHeap case anyway goes to the slow case of the allocation right
> now (bump allocators always say fast path is not usable).

I mean, in debug heap mode, instead of creating Caches, just using debugHeap in Caches instantiation path. And keep Cache::getFast() returning nullptr.
Comment 8 Yusuke Suzuki 2019-02-19 15:30:01 PST
https://bugs.webkit.org/show_bug.cgi?id=194836 spawned. In this issue, I first do it for bmalloc::Cache.
Comment 9 Yusuke Suzuki 2019-02-19 15:51:10 PST
(In reply to Geoffrey Garen from comment #4)
> Can we remove the code in DebugHeap::memalignLarge and DebugHeap::freeLarge,
> and call through to posix_memalign() and free() instead?

It seems like that some code in JSC assumes that allocated memory from that is VM region. I need to investigate more why they assume so, and attempt to remove this from DebugHeap in a separate patch.
Comment 10 Yusuke Suzuki 2019-02-19 15:52:21 PST
Spawned https://bugs.webkit.org/show_bug.cgi?id=194838
Comment 11 Geoffrey Garen 2019-02-19 15:54:13 PST
(In reply to Yusuke Suzuki from comment #9)
> (In reply to Geoffrey Garen from comment #4)
> > Can we remove the code in DebugHeap::memalignLarge and DebugHeap::freeLarge,
> > and call through to posix_memalign() and free() instead?
> 
> It seems like that some code in JSC assumes that allocated memory from that
> is VM region. I need to investigate more why they assume so, and attempt to
> remove this from DebugHeap in a separate patch.

If there's code that really needs a VM allocation, I'd suggest using the PageReservation class, or similar, directly, rather than calling malloc/free.
Comment 12 Geoffrey Garen 2019-02-19 15:56:54 PST
> I mean, in debug heap mode, instead of creating Caches, just using debugHeap in Caches instantiation path. And keep Cache::getFast() returning nullptr.

Sounds good.

In theory, all we need is a global TriState: { Uninitialized, Enabled, Disabled }. Any thread that sees Uninitialized runs the function to decide if bmalloc is enabled and then sets Enabled or Disabled. Any thread that sees Enabled uses bmalloc. And any thread that sees Disabled uses the DebugHeap. Such a global wouldn't really need any synchronization or locking, since it's a single word, and no two threads will ever set it to differing values.

I'm not sure if that's exactly the best implementation, but it's the essence of the idea.

Imagine one global and two functions:

bool computeBmallocEnabled();

DebugHeap* debugHeap();

And this TriState:

{ Uninitialized, Enabled, Disabled } s_isBmallocEnabled;

Here's the implementation of debugHeap():

if (s_isBmallocEnabled == Enabled)
    return nullptr;

if (s_isBmallocEnabled == Uninitialized) {
    s_isBmallocEnabled = computeBmallocEnabled();
    if (s_isBmallocEnabled == Enabled)
        return nullptr;
}

return PerProcess<DebugHeap>::get();

Here's the implementation of every Cache slow path:

if (DebugHeap* debugHeap = debugHeap())
    return debugHeap->allocationFunction(...);

return PerThread<PerHeapKind<Cache>>::getSlowCase()->allocationFunction(...);
Comment 13 Yusuke Suzuki 2019-02-19 16:00:21 PST
Created attachment 362446 [details]
Patch
Comment 14 Yusuke Suzuki 2019-02-19 16:01:52 PST
Created attachment 362447 [details]
Patch
Comment 15 Yusuke Suzuki 2019-02-19 16:13:17 PST
Comment on attachment 362447 [details]
Patch

Still crashing. Checking...
Comment 16 Mark Lam 2019-02-19 16:13:52 PST
Comment on attachment 362447 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=362447&action=review

> Source/bmalloc/bmalloc/Cache.cpp:51
> +    if (debugHeap)

UNLIKELY(debugHeap) ?

> Source/bmalloc/bmalloc/Cache.cpp:53
> +    if (PerProcess<Environment>::get()->isDebugHeapEnabled()) {

UNLIKELY too?
Comment 17 Yusuke Suzuki 2019-02-19 16:26:18 PST
Created attachment 362451 [details]
Patch
Comment 18 Mark Lam 2019-02-19 16:30:39 PST
Comment on attachment 362451 [details]
Patch

r=me
Comment 19 Yusuke Suzuki 2019-02-19 16:59:02 PST
(In reply to Geoffrey Garen from comment #12)
> > I mean, in debug heap mode, instead of creating Caches, just using debugHeap in Caches instantiation path. And keep Cache::getFast() returning nullptr.
> 
> Sounds good.
> 
> In theory, all we need is a global TriState: { Uninitialized, Enabled,
> Disabled }. Any thread that sees Uninitialized runs the function to decide
> if bmalloc is enabled and then sets Enabled or Disabled. Any thread that
> sees Enabled uses bmalloc. And any thread that sees Disabled uses the
> DebugHeap. Such a global wouldn't really need any synchronization or
> locking, since it's a single word, and no two threads will ever set it to
> differing values.

Thanks. Yes! I used DebugHeap* and Cache status for this flag.
In my patch, I put `static DebugHeap* debugHeap` in Cache.cpp.
TriState is represented as follows.

1. Uninitialized Cache::getFast() => nullptr, and DebugHeap* => nullptr
2. Enabled => Cache::getFast() => pointer, and DebugHeap* => nullptr
3. Disabled => Cache::getFast() => nullptr, and DebugHeap* => actual DebugHeap

Important thing is Cache::getFast() is TLS status (per-thread), and DebugHeap* is global variable so it's per-process status.
And setting `debugHeap = PerProcess<DebugHeap>::get()` does not have synchronization. (While DebugHeap creation needs and has synchronization in PerProcess<>).
The synchronization for setting `debugHeap` is unnecessary since, as you said, no two threads will ever set it to differing values.

> 
> I'm not sure if that's exactly the best implementation, but it's the essence
> of the idea.
> 
> Imagine one global and two functions:
> 
> bool computeBmallocEnabled();

Yeah, we have `PerProcess<Environment>::get()->isDebugHeapEnabled()`. It caches the result of `computeIsDebugHeapEnabled()`.

> 
> DebugHeap* debugHeap();
> 
> And this TriState:
> 
> { Uninitialized, Enabled, Disabled } s_isBmallocEnabled;
> 
> Here's the implementation of debugHeap():
> 
> if (s_isBmallocEnabled == Enabled)
>     return nullptr;
> 
> if (s_isBmallocEnabled == Uninitialized) {
>     s_isBmallocEnabled = computeBmallocEnabled();
>     if (s_isBmallocEnabled == Enabled)
>         return nullptr;
> }
> 
> return PerProcess<DebugHeap>::get();
> 
> Here's the implementation of every Cache slow path:
> 
> if (DebugHeap* debugHeap = debugHeap())
>     return debugHeap->allocationFunction(...);
> 
> return PerThread<PerHeapKind<Cache>>::getSlowCase()->allocationFunction(...);

Yeah, my uploaded patch takes similar approach :)


So, my patch is like,


void* allocate(...)
{
    if (Cache* cache = Cache::getFast())
        return cache->allocate();
    return allocateSlowPath();
}

static DebugHeap* debugHeap { nullptr };
Cache* getCache()
{
    if (debugHeap)
        return nullptr;
    if (isDebugHeapEnabled()) {   // BTW, this status computation is cached by PerProcess<Environment>. So it is also cheap.
        debugHeap = PerProcess<DebugHeap>::get();
        return nullptr;
    }
    return Cache::getSlow();
}

void* allocateSlowPath(...)
{
    if (Cache* cache = getCache())
       return cache->allocate();
    return debugHeap->allocate();
}
Comment 20 Yusuke Suzuki 2019-02-19 17:21:26 PST
if (auto* heap = debugHeap())
   ...
return Cache::getSlow()->at(...)->...;

seems less diff, and easy to read. I'll update the patch with Geoff's suggestion :)
Comment 21 Yusuke Suzuki 2019-02-19 17:24:48 PST
Created attachment 362459 [details]
Patch
Comment 22 Mark Lam 2019-02-19 17:32:37 PST
Comment on attachment 362459 [details]
Patch

r=me
Comment 23 Yusuke Suzuki 2019-02-19 17:52:00 PST
Committed r241789: <https://trac.webkit.org/changeset/241789>
Comment 24 Radar WebKit Bug Importer 2019-02-19 17:54:38 PST
<rdar://problem/48222791>
Comment 25 WebKit Commit Bot 2019-02-20 10:02:53 PST
Re-opened since this is blocked by bug 194856
Comment 26 Yusuke Suzuki 2019-02-20 11:18:31 PST
I found the reason.

`bool bmalloc::api::isEnabled(HeapKind kind)` implementation is this.

bool isEnabled(HeapKind kind)
{
    kind = mapToActiveHeapKind(kind);
    std::unique_lock<Mutex> lock(Heap::mutex());
    return !PerProcess<PerHeapKind<Heap>>::getFastCase()->at(kind).debugHeap();
}

This API implicitly assumes that `PerProcess<PerHeapKind<Heap>>` is initialized by someone when `isEnabled` is called. But with this patch, we do not instantiate Cache.
Since Cache accesses Heap, Heap is instantiated by Cache. Removing Cache instantiation makes Heap not-instantiated status, and this `getFastCase()` returns nullptr when it is used from WebProcess:: initializeWebProcess -> MemoryPressureHandler::setShouldUsePeriodicMemoryMonitor -> bmalloc::api::isEnabled.

But relying on Heap instantiation done by somewhere else is not good. We should just use `PerProcess<Environment>::get()->isDebugHeapEnabled()` here.
Comment 27 Yusuke Suzuki 2019-02-20 13:29:00 PST
(In reply to Yusuke Suzuki from comment #26)
> I found the reason.
> 
> `bool bmalloc::api::isEnabled(HeapKind kind)` implementation is this.
> 
> bool isEnabled(HeapKind kind)
> {
>     kind = mapToActiveHeapKind(kind);
>     std::unique_lock<Mutex> lock(Heap::mutex());
>     return
> !PerProcess<PerHeapKind<Heap>>::getFastCase()->at(kind).debugHeap();
> }
> 
> This API implicitly assumes that `PerProcess<PerHeapKind<Heap>>` is
> initialized by someone when `isEnabled` is called. But with this patch, we
> do not instantiate Cache.
> Since Cache accesses Heap, Heap is instantiated by Cache. Removing Cache
> instantiation makes Heap not-instantiated status, and this `getFastCase()`
> returns nullptr when it is used from WebProcess:: initializeWebProcess ->
> MemoryPressureHandler::setShouldUsePeriodicMemoryMonitor ->
> bmalloc::api::isEnabled.
> 
> But relying on Heap instantiation done by somewhere else is not good. We
> should just use `PerProcess<Environment>::get()->isDebugHeapEnabled()` here.

I'll land this patch again with the above change for `isEnabled(HeapKind)`. I've grepped the source code with `getFastCase()` and the other use cases are OK.
Comment 28 Yusuke Suzuki 2019-02-20 13:30:52 PST
Committed r241832: <https://trac.webkit.org/changeset/241832>
Comment 29 David Kilzer (:ddkilzer) 2019-03-20 12:42:31 PDT
*** Bug 194794 has been marked as a duplicate of this bug. ***