WebKit Bugzilla
RESOLVED FIXED
194811
[bmalloc] bmalloc::Cache should not be instantiated if we are using system malloc
https://bugs.webkit.org/show_bug.cgi?id=194811
Summary
Yusuke Suzuki
Reported
2019-02-19 01:29:01 PST
bmalloc::Cache etc. is initialized anyway, and it takes too much memory.
Attachments
Patch (12.90 KB, patch), 2019-02-19 16:00 PST, Yusuke Suzuki, no flags
Patch (12.92 KB, patch), 2019-02-19 16:01 PST, Yusuke Suzuki, no flags
Patch (12.95 KB, patch), 2019-02-19 16:26 PST, Yusuke Suzuki, no flags
Patch (12.55 KB, patch), 2019-02-19 17:24 PST, Yusuke Suzuki, mark.lam: review+
Geoffrey Garen
Comment 1
2019-02-19 09:02:10 PST
A few thoughts: (1) I don't understand why DebugHeap has m_sizeMap. We should probably remove that. (2) I guess you could have the static Cache allocation / deallocation functions allocate / deallocate through the debug heap directly, without ever creating a per-thread cache. That's acceptable for performance because, when bmalloc is enabled, each thread will do it only once. But I'm not sure how you would synchronize on creating the DebugHeap. It's probably too slow just to take a global lock on every allocation / deallocation when using system malloc.
Yusuke Suzuki
Comment 2
2019-02-19 11:36:56 PST
(In reply to Geoffrey Garen from comment #1)
> A few thoughts:
>
> (1) I don't understand why DebugHeap has m_sizeMap. We should probably
> remove that.
It seems that DebugHeap uses this map to record the requested size for memalignLarge; we need that size when deallocating the region with munmap.
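(For context, a minimal sketch of that size-tracking pattern, not the actual DebugHeap code; the function names and the mutex-protected std::map are assumptions:)

#include <sys/mman.h>
#include <map>
#include <mutex>

static std::mutex sizeMapMutex;
static std::map<void*, size_t> sizeMap;

void* allocateTracked(size_t size)
{
    // mmap-backed allocations must remember their length so munmap can be
    // given the same length at deallocation time.
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return nullptr;
    std::lock_guard<std::mutex> lock(sizeMapMutex);
    sizeMap[p] = size;
    return p;
}

void deallocateTracked(void* p)
{
    size_t size;
    {
        std::lock_guard<std::mutex> lock(sizeMapMutex);
        size = sizeMap.at(p);
        sizeMap.erase(p);
    }
    munmap(p, size); // munmap requires the original length back
}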
>
> (2) I guess you could have the static Cache allocation / deallocation
> functions allocate / deallocate through the debug heap directly, without
> ever creating a per-thread cache. That's acceptable for performance because,
> when bmalloc is enabled, each thread will do it only once. But I'm not sure
> how you would synchronize on creating the DebugHeap. It's probably too slow
> just to take a global lock on every allocation / deallocation when using
> system malloc.
DebugHeap is PerProcess<>, so creation itself is OK. If the cache you mean is the DebugHeap* in Cache (::m_debugHeap), right, I want to keep that too. I'm OK with having a per-thread Cache if sizeof(bmalloc::Cache) is <= page size (mmap memory region), or even smaller (and allocated from system malloc in the system-malloc configuration).

I want to reduce the size of bmalloc::Cache etc. sizeof(bmalloc::Cache) is 14000 (13KB), and we have this per HeapKind, so PerHeapKind<bmalloc::Cache> is 42000 (41KB). Unfortunately, some of this memory is touched, and vmmap says "WebKit Malloc" has 72KB of dirty memory even while we are using the system-malloc configuration "Malloc=1".

We can see the large mmaps executed by bmalloc by using dtruss (`sudo dtruss -t mmap -p $PID`). The following is the simple hello_world.js case.

mmap(0x0, 0x17, 0x1, 0x2, 0x3, 0x0) = 0x110DA8000 0
mmap(0x0, 0xB000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DA9000 0
mmap(0x0, 0x1000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DB4000 0
mmap(0x0, 0x3000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DB5000 0
mmap(0x0, 0x1000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DB8000 0
mmap(0x0, 0x1000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DE5000 0
mmap(0x0, 0x1000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DE6000 0
mmap(0x3220BE000000, 0x40002000, 0x7, 0x1802, 0x40000000, 0x0) = 0x3220BE000000 0 (This is a protection change of the JIT region. It does not affect dirty memory.)
mmap(0x3220BE000000, 0x1000, 0x0, 0x1012, 0x40000000, 0x0) = 0x3220BE000000 0 (...)
mmap(0x3220FE001000, 0x1000, 0x0, 0x1012, 0x40000000, 0x0) = 0x3220FE001000 0 (...)
mmap(0x0, 0x1000, 0x1, 0x1, 0x4, 0x0) = 0x110DE7000 0
mmap(0x0, 0xB000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DE8000 0

If we use string_lists.js, our "WebKit Malloc" region is extended even further, to 380K. This is because the Heap Helper Threads start waking up and touch the Cache, so a Cache is created per thread. Now I'm considering that our large per-thread memory footprint in JSC is mainly occupied by "system malloc", "stack", and this "WebKit Malloc".
Yusuke Suzuki
Comment 3
2019-02-19 14:05:16 PST
I experimentally reduced the bmalloc::Cache size by modifying Sizes.h to verify that the bmalloc data structures introduce the dirty memory region. vmmap shows that the size shrinks from 72KB to 24KB ("WebKit Malloc"; we could reduce it further), and we get a footprint reduction.

BEFORE:
Code finished, language environment destroyed..
Current Footprint: 2363392
Peak Footprint: 2363392
Press enter to continue.

AFTER:
Code finished, language environment destroyed..
Current Footprint: 2273280
Peak Footprint: 2273280
Press enter to continue.
Geoffrey Garen
Comment 4
2019-02-19 15:00:59 PST
> > (1) I don't understand why DebugHeap has m_sizeMap. We should probably
> > remove that.
>
> It seems that DebugHeap is using this map to record the requested size for
> memalignLarge, and we need to use this size when deallocating this region
> with munmap.
Can we remove the code in DebugHeap::memalignLarge and DebugHeap::freeLarge, and call through to posix_memalign() and free() instead?
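(For reference, a minimal sketch of that alternative; nothing here is from the actual patch. Memory obtained from posix_memalign() is released with plain free(), so no per-allocation size bookkeeping would be needed:)

#include <stdlib.h>

void* systemMemalign(size_t alignment, size_t size)
{
    // alignment must be a power of two and a multiple of sizeof(void*).
    void* result = nullptr;
    if (posix_memalign(&result, alignment, size))
        return nullptr; // non-zero return code means failure
    return result;
}

void systemFreeAligned(void* p)
{
    free(p); // free() needs no size, unlike munmap()
}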
> DebugHeap is PerProcess<>. So creation itself is OK. If the cache you said
> means DebugHeap* in Cache (::m_debugHeap), right, I want to keep this too.
If DebugHeap is PerProcess, maybe the Cache slow path can test for it directly before allocating the Cache.
Geoffrey Garen
Comment 5
2019-02-19 15:02:23 PST
> I've experimentally reduced the bmalloc::Cache size by modifying Sizes.h to
> verify bmalloc data structures introduce dirty memory region.
It's very likely that reducing those values will cause a regression in PerformanceTests/MallocBench, especially on machines with many CPUs. Probably better to avoid creating the Cache when we don't use it, but keep the Cache big when we need it -- if possible.
Yusuke Suzuki
Comment 6
2019-02-19 15:18:16 PST
(In reply to Geoffrey Garen from comment #5)
> > I've experimentally reduced the bmalloc::Cache size by modifying Sizes.h to
> > verify bmalloc data structures introduce dirty memory region.
>
> It's very likely that reducing those values will cause a regression in
> PerformanceTests/MallocBench, especially on machines with many CPUs.
Agreed. The simple experiment shows that the Cache's (and possibly the Heap's) size is actually the problem. We should have some way to avoid creating them in debug-heap mode.
>
> Probably better to avoid creating the Cache when we don't use it, but keep
> the Cache big when we need it -- if possible.
Yes. We should move the debugHeap->malloc/free calls into the slow path of Cache. Instead of initializing the Cache via getSlow(), we use the debugHeap at the slow path. Then the Cache is kept large, and bmalloc still takes the fast path since Cache::getFast() succeeds. The DebugHeap case goes to the static slow-path functions in Cache and allocates memory there from DebugHeap. I don't think the DebugHeap case introduces any performance regression either, because the DebugHeap case already goes to the slow path of allocation right now (the bump allocators always say the fast path is not usable).
Yusuke Suzuki
Comment 7
2019-02-19 15:19:58 PST
(In reply to Yusuke Suzuki from comment #6)
> Yes. We should move debugHeap->malloc/free things into slow path of Cache.
> And instead of initializing Cache by using getSlow(), use debugHeap at the
> slow path.
> Then, Cache is kept large, and bmalloc still goes the fast path since
> Cache::getFast() succeeds.
> And DebugHeap case goes to the static slow path functions in Cache, and
> allocate memory there from DebugHeap.
> I think DebugHeap case does not introduce any performance regression too,
> because DebugHeap case anyway goes to the slow case of the allocation right
> now (bump allocators always say fast path is not usable).
I mean, in debug-heap mode, instead of creating Caches, we just use the debugHeap in the Cache instantiation path, and keep Cache::getFast() returning nullptr.
Yusuke Suzuki
Comment 8
2019-02-19 15:30:01 PST
https://bugs.webkit.org/show_bug.cgi?id=194836 spawned. In this issue, I'll do it first for bmalloc::Cache.
Yusuke Suzuki
Comment 9
2019-02-19 15:51:10 PST
(In reply to Geoffrey Garen from comment #4)
> Can we remove the code in DebugHeap::memalignLarge and DebugHeap::freeLarge,
> and call through to posix_memalign() and free() instead?
It seems that some code in JSC assumes that the memory allocated from those functions is a VM region. I need to investigate more about why it assumes so, and will attempt to remove this from DebugHeap in a separate patch.
Yusuke Suzuki
Comment 10
2019-02-19 15:52:21 PST
Spawned https://bugs.webkit.org/show_bug.cgi?id=194838
Geoffrey Garen
Comment 11
2019-02-19 15:54:13 PST
(In reply to Yusuke Suzuki from comment #9)
> (In reply to Geoffrey Garen from comment #4)
> > Can we remove the code in DebugHeap::memalignLarge and DebugHeap::freeLarge,
> > and call through to posix_memalign() and free() instead?
>
> It seems like that some code in JSC assumes that allocated memory from that
> is VM region. I need to investigate more why they assume so, and attempt to
> remove this from DebugHeap in a separate patch.
If there's code that really needs a VM allocation, I'd suggest using the PageReservation class, or similar, directly, rather than calling malloc/free.
Geoffrey Garen
Comment 12
2019-02-19 15:56:54 PST
> I mean, in debug heap mode, instead of creating Caches, just using debugHeap in Caches instantiation path. And keep Cache::getFast() returning nullptr.
Sounds good.

In theory, all we need is a global TriState: { Uninitialized, Enabled, Disabled }. Any thread that sees Uninitialized runs the function to decide if bmalloc is enabled and then sets Enabled or Disabled. Any thread that sees Enabled uses bmalloc. And any thread that sees Disabled uses the DebugHeap. Such a global wouldn't really need any synchronization or locking, since it's a single word, and no two threads will ever set it to differing values.

I'm not sure if that's exactly the best implementation, but it's the essence of the idea.

Imagine one global and two functions:

bool computeBmallocEnabled();
DebugHeap* debugHeap();

And this TriState:

{ Uninitialized, Enabled, Disabled } s_isBmallocEnabled;

Here's the implementation of debugHeap():

    if (s_isBmallocEnabled == Enabled)
        return nullptr;

    if (s_isBmallocEnabled == Uninitialized) {
        s_isBmallocEnabled = computeBmallocEnabled();
        if (s_isBmallocEnabled == Enabled)
            return nullptr;
    }

    return PerProcess<DebugHeap>::get();

Here's the implementation of every Cache slow path:

    if (DebugHeap* debugHeap = debugHeap())
        return debugHeap->allocationFunction(...);

    return PerThread<PerHeapKind<Cache>>::getSlowCase()->allocationFunction(...);
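(A self-contained sketch of that TriState idea, purely illustrative rather than the bmalloc code that eventually landed; the enum, the global, and computeBmallocEnabled() are stand-ins:)

#include <cstdlib>

enum class BmallocState { Uninitialized, Enabled, Disabled };

// Single word; every thread that races to initialize it computes the same
// answer, so no locking is required.
static BmallocState s_bmallocState = BmallocState::Uninitialized;

static bool computeBmallocEnabled()
{
    // Stand-in for the real check (e.g. the Malloc environment variable).
    return !std::getenv("Malloc");
}

bool isBmallocEnabled()
{
    if (s_bmallocState == BmallocState::Uninitialized)
        s_bmallocState = computeBmallocEnabled() ? BmallocState::Enabled : BmallocState::Disabled;
    return s_bmallocState == BmallocState::Enabled;
}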
Yusuke Suzuki
Comment 13
2019-02-19 16:00:21 PST
Created attachment 362446 [details]
Patch
Yusuke Suzuki
Comment 14
2019-02-19 16:01:52 PST
Created attachment 362447 [details]
Patch
Yusuke Suzuki
Comment 15
2019-02-19 16:13:17 PST
Comment on attachment 362447 [details]
Patch

Still crashing. Checking...
Mark Lam
Comment 16
2019-02-19 16:13:52 PST
Comment on attachment 362447 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=362447&action=review
> Source/bmalloc/bmalloc/Cache.cpp:51
> +    if (debugHeap)
UNLIKELY(debugHeap) ?
> Source/bmalloc/bmalloc/Cache.cpp:53
> +    if (PerProcess<Environment>::get()->isDebugHeapEnabled()) {
UNLIKELY too?
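(For context, UNLIKELY is the usual branch-prediction hint; a sketch of how such a macro is typically defined and used, not necessarily bmalloc's exact definition:)

#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

// In the Cache slow path, the hint tells the compiler that the debug-heap
// branch is rare, keeping the normal bmalloc path hot:
//     if (UNLIKELY(debugHeap))
//         return debugHeap->malloc(size);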
Yusuke Suzuki
Comment 17
2019-02-19 16:26:18 PST
Created attachment 362451 [details]
Patch
Mark Lam
Comment 18
2019-02-19 16:30:39 PST
Comment on attachment 362451 [details]
Patch

r=me
Yusuke Suzuki
Comment 19
2019-02-19 16:59:02 PST
(In reply to Geoffrey Garen from comment #12)
> > I mean, in debug heap mode, instead of creating Caches, just using debugHeap in Caches instantiation path. And keep Cache::getFast() returning nullptr.
>
> Sounds good.
>
> In theory, all we need is a global TriState: { Uninitialized, Enabled,
> Disabled }. Any thread that sees Uninitialized runs the function to decide
> if bmalloc is enabled and then sets Enabled or Disabled. Any thread that
> sees Enabled uses bmalloc. And any thread that sees Disabled uses the
> DebugHeap. Such a global wouldn't really need any synchronization or
> locking, since it's a single word, and no two threads will ever set it to
> differing values.
Thanks. Yes! I used the DebugHeap* and the Cache status for this flag. In my patch, I put `static DebugHeap* debugHeap` in Cache.cpp. The TriState is represented as follows.

1. Uninitialized: Cache::getFast() => nullptr, and DebugHeap* => nullptr
2. Enabled: Cache::getFast() => pointer, and DebugHeap* => nullptr
3. Disabled: Cache::getFast() => nullptr, and DebugHeap* => actual DebugHeap

The important thing is that Cache::getFast() is TLS status (per-thread), while DebugHeap* is a global variable, so it is per-process status. Setting `debugHeap = PerProcess<DebugHeap>::get()` does not have synchronization (while DebugHeap creation needs, and has, synchronization in PerProcess<>). The synchronization for setting `debugHeap` is unnecessary since, as you said, no two threads will ever set it to differing values.
>
> I'm not sure if that's exactly the best implementation, but it's the essence
> of the idea.
>
> Imagine one global and two functions:
>
> bool computeBmallocEnabled();
Yeah, we have `PerProcess<Environment>::get()->isDebugHeapEnabled()`. It caches the result of `computeIsDebugHeapEnabled()`.
>
> DebugHeap* debugHeap();
>
> And this TriState:
>
> { Uninitialized, Enabled, Disabled } s_isBmallocEnabled;
>
> Here's the implementation of debugHeap():
>
>     if (s_isBmallocEnabled == Enabled)
>         return nullptr;
>
>     if (s_isBmallocEnabled == Uninitialized) {
>         s_isBmallocEnabled = computeBmallocEnabled();
>         if (s_isBmallocEnabled == Enabled)
>             return nullptr;
>     }
>
>     return PerProcess<DebugHeap>::get();
>
> Here's the implementation of every Cache slow path:
>
>     if (DebugHeap* debugHeap = debugHeap())
>         return debugHeap->allocationFunction(...);
>
>     return PerThread<PerHeapKind<Cache>>::getSlowCase()->allocationFunction(...);
Yeah, my uploaded patch takes a similar approach :)

So, my patch is like:

void* allocate(...)
{
    if (Cache* cache = Cache::getFast())
        return cache->allocate();
    return allocateSlowPath();
}

static DebugHeap* debugHeap { nullptr };

Cache* getCache()
{
    if (debugHeap)
        return nullptr;
    if (isDebugHeapEnabled()) { // BTW, this status computation is cached by PerProcess<Environment>, so it is also cheap.
        debugHeap = PerProcess<DebugHeap>::get();
        return nullptr;
    }
    return Cache::getSlow();
}

void* allocateSlowPath(...)
{
    if (Cache* cache = getCache())
        return cache->allocate();
    return debugHeap->allocate();
}
Yusuke Suzuki
Comment 20
2019-02-19 17:21:26 PST
if (auto* heap = debugHeap())
    ...
return Cache::getSlow()->at(...)->...;

seems to produce a smaller diff and is easier to read. I'll update the patch with Geoff's suggestion :)
Yusuke Suzuki
Comment 21
2019-02-19 17:24:48 PST
Created attachment 362459 [details]
Patch
Mark Lam
Comment 22
2019-02-19 17:32:37 PST
Comment on attachment 362459 [details]
Patch

r=me
Yusuke Suzuki
Comment 23
2019-02-19 17:52:00 PST
Committed r241789: <https://trac.webkit.org/changeset/241789>
Radar WebKit Bug Importer
Comment 24
2019-02-19 17:54:38 PST
<rdar://problem/48222791>
WebKit Commit Bot
Comment 25
2019-02-20 10:02:53 PST
Re-opened since this is blocked by bug 194856
Yusuke Suzuki
Comment 26
2019-02-20 11:18:31 PST
I found the reason. The implementation of `bool bmalloc::api::isEnabled(HeapKind kind)` is this:

bool isEnabled(HeapKind kind)
{
    kind = mapToActiveHeapKind(kind);
    std::unique_lock<Mutex> lock(Heap::mutex());
    return !PerProcess<PerHeapKind<Heap>>::getFastCase()->at(kind).debugHeap();
}

This API implicitly assumes that `PerProcess<PerHeapKind<Heap>>` has been initialized by someone by the time `isEnabled` is called. But with this patch, we do not instantiate the Cache. Since the Cache accesses the Heap, the Heap is instantiated by the Cache; removing Cache instantiation leaves the Heap uninstantiated, so this `getFastCase()` returns nullptr when it is used from WebProcess::initializeWebProcess -> MemoryPressureHandler::setShouldUsePeriodicMemoryMonitor -> bmalloc::api::isEnabled.

But relying on the Heap being instantiated somewhere else is not good. We should just use `PerProcess<Environment>::get()->isDebugHeapEnabled()` here.
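(A minimal sketch of the change described above, not necessarily the exact code that landed:)

bool isEnabled(HeapKind kind)
{
    (void)kind; // the answer is per-process, not per-heap-kind
    // Ask the Environment directly instead of requiring that the Heap has
    // already been instantiated by something else.
    return !PerProcess<Environment>::get()->isDebugHeapEnabled();
}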
Yusuke Suzuki
Comment 27
2019-02-20 13:29:00 PST
(In reply to Yusuke Suzuki from comment #26)
> I found the reason.
>
> `bool bmalloc::api::isEnabled(HeapKind kind)` implementation is this.
>
>     bool isEnabled(HeapKind kind)
>     {
>         kind = mapToActiveHeapKind(kind);
>         std::unique_lock<Mutex> lock(Heap::mutex());
>         return !PerProcess<PerHeapKind<Heap>>::getFastCase()->at(kind).debugHeap();
>     }
>
> This API implicitly assumes that `PerProcess<PerHeapKind<Heap>>` is
> initialized by someone when `isEnabled` is called. But with this patch, we
> do not instantiate Cache.
> Since Cache accesses Heap, Heap is instantiated by Cache. Removing Cache
> instantiation makes Heap not-instantiated status, and this `getFastCase()`
> returns nullptr when it is used from WebProcess::initializeWebProcess ->
> MemoryPressureHandler::setShouldUsePeriodicMemoryMonitor ->
> bmalloc::api::isEnabled.
>
> But relying on Heap instantiation done by somewhere else is not good. We
> should just use `PerProcess<Environment>::get()->isDebugHeapEnabled()` here.
I'll land this patch again with the above change for `isEnabled(HeapKind)`. I've grepped the source code for `getFastCase()`, and the other use cases are OK.
Yusuke Suzuki
Comment 28
2019-02-20 13:30:52 PST
Committed r241832: <https://trac.webkit.org/changeset/241832>
David Kilzer (:ddkilzer)
Comment 29
2019-03-20 12:42:31 PDT
*** Bug 194794 has been marked as a duplicate of this bug. ***