194811 – [bmalloc] bmalloc::Cache should not be instantiated if we are using system malloc

RESOLVED FIXED 194811

[bmalloc] bmalloc::Cache should not be instantiated if we are using system malloc

https://bugs.webkit.org/show_bug.cgi?id=194811

Summary [bmalloc] bmalloc::Cache should not be instantiated if we are using system ma...

Yusuke Suzuki

Reported 2019-02-19 01:29:01 PST

bmalloc::Cache etc. is anyway initialized, and it takes too much memory.

Attachments
Patch (12.90 KB, patch) 2019-02-19 16:00 PST, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (12.92 KB, patch) 2019-02-19 16:01 PST, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (12.95 KB, patch) 2019-02-19 16:26 PST, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (12.55 KB, patch) 2019-02-19 17:24 PST, Yusuke Suzuki	mark.lam: review+	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Geoffrey Garen

Comment 1 2019-02-19 09:02:10 PST

A few thoughts: (1) I don't understand why DebugHeap has m_sizeMap. We should probably remove that. (2) I guess you could have the static Cache allocation / deallocation functions allocate / deallocate through the debug heap directly, without ever creating a per-thread cache. That's acceptable for performance because, when bmalloc is enabled, each thread will do it only once. But I'm not sure how you would synchronize on creating the DebugHeap. It's probably too slow just to take a global lock on every allocation / deallocation when using system malloc.

Yusuke Suzuki

Comment 2 2019-02-19 11:36:56 PST

(In reply to Geoffrey Garen from comment #1) > A few thoughts: > > (1) I don't understand why DebugHeap has m_sizeMap. We should probably > remove that. It seems that DebugHeap is using this map to record the requested size for memalignLarge, and we need to use this size when deallocating this region with munmap. > > (2) I guess you could have the static Cache allocation / deallocation > functions allocate / deallocate through the debug heap directly, without > ever creating a per-thread cache. That's acceptable for performance because, > when bmalloc is enabled, each thread will do it only once. But I'm not sure > how you would synchronize on creating the DebugHeap. It's probably too slow > just to take a global lock on every allocation / deallocation when using > system malloc. DebugHeap is PerProcess<>. So creation itself is OK. If the cache you said means DebugHeap* in Cache (::m_debugHeap), right, I want to keep this too. I'm OK having per-thread Cache, if sizeof(bmalloc::Cache) is <= page size (mmap memory region), or even smaller (and allocated from system malloc in system malloc configuration). I want to reduce the size of bmalloc::Cache etc. sizeof(bmalloc::Cache) is 14000 (13KB), and we have this per HeapKind => PerHeapKind<bmalloc::Cache> is 42000 (41KB). Some of this part is touched unfortunately, and vmmap says "WebKit Malloc" has 72KB dirty memory even while we are using system malloc configuration "Malloc=1". We can see large mmap is executed in bmalloc by using dtruss (`sudo dtruss -t mmap -p $PID`). The following is that simple hello_world.js case. mmap(0x0, 0x17, 0x1, 0x2, 0x3, 0x0) = 0x110DA8000 0 mmap(0x0, 0xB000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DA9000 0 mmap(0x0, 0x1000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DB4000 0 mmap(0x0, 0x3000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DB5000 0 mmap(0x0, 0x1000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DB8000 0 mmap(0x0, 0x1000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DE5000 0 mmap(0x0, 0x1000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DE6000 0 mmap(0x3220BE000000, 0x40002000, 0x7, 0x1802, 0x40000000, 0x0) = 0x3220BE000000 0 (This is protection change of the JIT region. Does not affect on dirty memory) mmap(0x3220BE000000, 0x1000, 0x0, 0x1012, 0x40000000, 0x0) = 0x3220BE000000 0 (...) mmap(0x3220FE001000, 0x1000, 0x0, 0x1012, 0x40000000, 0x0) = 0x3220FE001000 0 (...) mmap(0x0, 0x1000, 0x1, 0x1, 0x4, 0x0) = 0x110DE7000 0 mmap(0x0, 0xB000, 0x3, 0x1002, 0x35000000, 0x0) = 0x110DE8000 0 If we use string_lists.js, our "WebKit Malloc" region is even extended, "380K". This is because Heap Helper Threads starts waking up and touch Cache, then Cache is created per thread. Now, I'm considering that our large thread memory footprint in JSC is mainly occupied by "system malloc", "stack" and this "WebKit Malloc".

Yusuke Suzuki

Comment 3 2019-02-19 14:05:16 PST

I've experimentally reduced the bmalloc::Cache size by modifying Sizes.h to verify bmalloc data structures introduce dirty memory region. vmmap tells that the size shrinks from 72KB to 24KB ("WebKit Malloc", we can reduce further). And we get footprint reduction. BEFORE: Code finished, language environment destroyed.. Current Footprint: 2363392 Peak Footprint: 2363392 Press enter to continue. AFTER: Code finished, language environment destroyed.. Current Footprint: 2273280 Peak Footprint: 2273280 Press enter to continue.

Geoffrey Garen

Comment 4 2019-02-19 15:00:59 PST

> > (1) I don't understand why DebugHeap has m_sizeMap. We should probably > > remove that. > > It seems that DebugHeap is using this map to record the requested size for > memalignLarge, and we need to use this size when deallocating this region > with munmap. Can we remove the code in DebugHeap::memalignLarge and DebugHeap::freeLarge, and call through to posix_memalign() and free() instead? > DebugHeap is PerProcess<>. So creation itself is OK. If the cache you said > means DebugHeap* in Cache (::m_debugHeap), right, I want to keep this too. If DebugHeap is PerProcss, maybe the Cache slow path can test for it directly before allocating the Cache.

Geoffrey Garen

Comment 5 2019-02-19 15:02:23 PST

> I've experimentally reduced the bmalloc::Cache size by modifying Sizes.h to > verify bmalloc data structures introduce dirty memory region. It's very likely that reducing those values will cause a regression in PerformanceTests/MallocBench, especially on machines with many CPUs. Probably better to avoid creating the Cache when we don't use it, but keep the Cache big when we need it -- if possible.

Yusuke Suzuki

Comment 6 2019-02-19 15:18:16 PST

(In reply to Geoffrey Garen from comment #5) > > I've experimentally reduced the bmalloc::Cache size by modifying Sizes.h to > > verify bmalloc data structures introduce dirty memory region. > > It's very likely that reducing those values will cause a regression in > PerformanceTests/MallocBench, especially on machines with many CPUs. Agreed. The simple experiment shows that Cache's (and possibly Heap's) size is actually the problem. We should have some way to avoid creation of them in debug heap mode. > > Probably better to avoid creating the Cache when we don't use it, but keep > the Cache big when we need it -- if possible. Yes. We should move debugHeap->malloc/free things into slow path of Cache. And instead of initializing Cache by using getSlow(), use debugHeap at the slow path. Then, Cache is kept large, and bmalloc still goes the fast path since Cache::getFast() succeeds. And DebugHeap case goes to the static slow path functions in Cache, and allocate memory there from DebugHeap. I think DebugHeap case does not introduce any performance regression too, because DebugHeap case anyway goes to the slow case of the allocation right now (bump allocators always say fast path is not usable).

Yusuke Suzuki

Comment 7 2019-02-19 15:19:58 PST

(In reply to Yusuke Suzuki from comment #6) > Yes. We should move debugHeap->malloc/free things into slow path of Cache. > And instead of initializing Cache by using getSlow(), use debugHeap at the > slow path. > Then, Cache is kept large, and bmalloc still goes the fast path since > Cache::getFast() succeeds. > And DebugHeap case goes to the static slow path functions in Cache, and > allocate memory there from DebugHeap. > I think DebugHeap case does not introduce any performance regression too, > because DebugHeap case anyway goes to the slow case of the allocation right > now (bump allocators always say fast path is not usable). I mean, in debug heap mode, instead of creating Caches, just using debugHeap in Caches instantiation path. And keep Cache::getFast() returning nullptr.

Yusuke Suzuki

Comment 8 2019-02-19 15:30:01 PST

https://bugs.webkit.org/show_bug.cgi?id=194836 spawned. In this issue, I first do it for bmalloc::Cache.

Yusuke Suzuki

Comment 9 2019-02-19 15:51:10 PST

(In reply to Geoffrey Garen from comment #4) > Can we remove the code in DebugHeap::memalignLarge and DebugHeap::freeLarge, > and call through to posix_memalign() and free() instead? It seems like that some code in JSC assumes that allocated memory from that is VM region. I need to investigate more why they assume so, and attempt to remove this from DebugHeap in a separate patch.

Yusuke Suzuki

Comment 10 2019-02-19 15:52:21 PST

Spawned https://bugs.webkit.org/show_bug.cgi?id=194838

Geoffrey Garen

Comment 11 2019-02-19 15:54:13 PST

(In reply to Yusuke Suzuki from comment #9) > (In reply to Geoffrey Garen from comment #4) > > Can we remove the code in DebugHeap::memalignLarge and DebugHeap::freeLarge, > > and call through to posix_memalign() and free() instead? > > It seems like that some code in JSC assumes that allocated memory from that > is VM region. I need to investigate more why they assume so, and attempt to > remove this from DebugHeap in a separate patch. If there's code that really needs a VM allocation, I'd suggest using the PageReservation class, or similar, directly, rather than calling malloc/free.

Geoffrey Garen

Comment 12 2019-02-19 15:56:54 PST

> I mean, in debug heap mode, instead of creating Caches, just using debugHeap in Caches instantiation path. And keep Cache::getFast() returning nullptr. Sounds good. In theory, all we need is a global TriState: { Uninitialized, Enabled, Disabled }. Any thread that sees Uninitialized runs the function to decide if bmalloc is enabled and then sets Enabled or Disabled. Any thread that sees Enabled uses bmalloc. And any thread that sees Disabled uses the DebugHeap. Such a global wouldn't really need any synchronization or locking, since it's a single word, and no two threads will ever set it to differing values. I'm not sure if that's exactly the best implementation, but it's the essence of the idea. Imagine one global and two functions: bool computeBmallocEnabled(); DebugHeap* debugHeap(); And this TriState: { Uninitialized, Enabled, Disabled } s_isBmallocEnabled; Here's the implementation of debugHeap(): if (s_isBmallocEnabled == Enabled) return nullptr; if (s_isBmallocEnabled == Uninitialized) { s_isBmallocEnabled = computeBmallocEnabled(); if (s_isBmallocEnabled == Enabled) return nullptr; } return PerProcess<DebugHeap>::get(); Here's the implementation of every Cache slow path: if (DebugHeap* debugHeap = debugHeap()) return debugHeap->allocationFunction(...); return PerThread<PerHeapKind<Cache>>::getSlowCase()->allocationFunction(...);

Yusuke Suzuki

Comment 13 2019-02-19 16:00:21 PST

Created attachment 362446 [details] Patch

Yusuke Suzuki

Comment 14 2019-02-19 16:01:52 PST

Created attachment 362447 [details] Patch

Yusuke Suzuki

Comment 15 2019-02-19 16:13:17 PST

Comment on attachment 362447 [details] Patch Still crashing. Checking...

Mark Lam

Comment 16 2019-02-19 16:13:52 PST

Comment on attachment 362447 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=362447&action=review > Source/bmalloc/bmalloc/Cache.cpp:51 > + if (debugHeap) UNLIKELY(debugHeap) ? > Source/bmalloc/bmalloc/Cache.cpp:53 > + if (PerProcess<Environment>::get()->isDebugHeapEnabled()) { UNLIKELY too?

Yusuke Suzuki

Comment 17 2019-02-19 16:26:18 PST

Created attachment 362451 [details] Patch

Mark Lam

Comment 18 2019-02-19 16:30:39 PST

Comment on attachment 362451 [details] Patch r=me

Yusuke Suzuki

Comment 19 2019-02-19 16:59:02 PST

(In reply to Geoffrey Garen from comment #12) > > I mean, in debug heap mode, instead of creating Caches, just using debugHeap in Caches instantiation path. And keep Cache::getFast() returning nullptr. > > Sounds good. > > In theory, all we need is a global TriState: { Uninitialized, Enabled, > Disabled }. Any thread that sees Uninitialized runs the function to decide > if bmalloc is enabled and then sets Enabled or Disabled. Any thread that > sees Enabled uses bmalloc. And any thread that sees Disabled uses the > DebugHeap. Such a global wouldn't really need any synchronization or > locking, since it's a single word, and no two threads will ever set it to > differing values. Thanks. Yes! I used DebugHeap* and Cache status for this flag. In my patch, I put `static DebugHeap* debugHeap` in Cache.cpp. TriState is represented as follows. 1. Uninitialized Cache::getFast() => nullptr, and DebugHeap* => nullptr 2. Enabled => Cache::getFast() => pointer, and DebugHeap* => nullptr 3. Disabled => Cache::getFast() => nullptr, and DebugHeap* => actual DebugHeap Important thing is Cache::getFast() is TLS status (per-thread), and DebugHeap* is global variable so it's per-process status. And setting `debugHeap = PerProcess<DebugHeap>::get()` does not have synchronization. (While DebugHeap creation needs and has synchronization in PerProcess<>). The synchronization for setting `debugHeap` is unnecessary since, as you said, no two threads will ever set it to differing values. > > I'm not sure if that's exactly the best implementation, but it's the essence > of the idea. > > Imagine one global and two functions: > > bool computeBmallocEnabled(); Yeah, we have `PerProcess<Environment>::get()->isDebugHeapEnabled()`. It caches the result of `computeIsDebugHeapEnabled()`. > > DebugHeap* debugHeap(); > > And this TriState: > > { Uninitialized, Enabled, Disabled } s_isBmallocEnabled; > > Here's the implementation of debugHeap(): > > if (s_isBmallocEnabled == Enabled) > return nullptr; > > if (s_isBmallocEnabled == Uninitialized) { > s_isBmallocEnabled = computeBmallocEnabled(); > if (s_isBmallocEnabled == Enabled) > return nullptr; > } > > return PerProcess<DebugHeap>::get(); > > Here's the implementation of every Cache slow path: > > if (DebugHeap* debugHeap = debugHeap()) > return debugHeap->allocationFunction(...); > > return PerThread<PerHeapKind<Cache>>::getSlowCase()->allocationFunction(...); Yeah, my uploaded patch takes similar approach :) So, my patch is like, void* allocate(...) { if (Cache* cache = Cache::getFast()) return cache->allocate(); return allocateSlowPath(); } static DebugHeap* debugHeap { nullptr }; Cache* getCache() { if (debugHeap) return nullptr; if (isDebugHeapEnabled()) { // BTW, this status computation is cached by PerProcess<Environment>. So it is also cheap. debugHeap = PerProcess<DebugHeap>::get(); return nullptr; } return Cache::getSlow(); } void* allocateSlowPath(...) { if (Cache* cache = getCache()) return cache->allocate(); return debugHeap->allocate(); }

Yusuke Suzuki

Comment 20 2019-02-19 17:21:26 PST

if (auto* heap = debugHeap()) ... return Cache::getSlow()->at(...)->...; seems less diff, and easy to read. I'll update the patch with Geoff's suggestion :)

Yusuke Suzuki

Comment 21 2019-02-19 17:24:48 PST

Created attachment 362459 [details] Patch

Mark Lam

Comment 22 2019-02-19 17:32:37 PST

Comment on attachment 362459 [details] Patch r=me

Yusuke Suzuki

Comment 23 2019-02-19 17:52:00 PST

Committed r241789: <https://trac.webkit.org/changeset/241789>

Radar WebKit Bug Importer

Comment 24 2019-02-19 17:54:38 PST

<rdar://problem/48222791>

WebKit Commit Bot

Comment 25 2019-02-20 10:02:53 PST

Re-opened since this is blocked by bug 194856

Yusuke Suzuki

Comment 26 2019-02-20 11:18:31 PST

I found the reason. `bool bmalloc::api::isEnabled(HeapKind kind)` implementation is this. bool isEnabled(HeapKind kind) { kind = mapToActiveHeapKind(kind); std::unique_lock<Mutex> lock(Heap::mutex()); return !PerProcess<PerHeapKind<Heap>>::getFastCase()->at(kind).debugHeap(); } This API implicitly assumes that `PerProcess<PerHeapKind<Heap>>` is initialized by someone when `isEnabled` is called. But with this patch, we do not instantiate Cache. Since Cache accesses Heap, Heap is instantiated by Cache. Removing Cache instantiation makes Heap not-instantiated status, and this `getFastCase()` returns nullptr when it is used from WebProcess:: initializeWebProcess -> MemoryPressureHandler::setShouldUsePeriodicMemoryMonitor -> bmalloc::api::isEnabled. But relying on Heap instantiation done by somewhere else is not good. We should just use `PerProcess<Environment>::get()->isDebugHeapEnabled()` here.

Yusuke Suzuki

Comment 27 2019-02-20 13:29:00 PST

(In reply to Yusuke Suzuki from comment #26) > I found the reason. > > `bool bmalloc::api::isEnabled(HeapKind kind)` implementation is this. > > bool isEnabled(HeapKind kind) > { > kind = mapToActiveHeapKind(kind); > std::unique_lock<Mutex> lock(Heap::mutex()); > return > !PerProcess<PerHeapKind<Heap>>::getFastCase()->at(kind).debugHeap(); > } > > This API implicitly assumes that `PerProcess<PerHeapKind<Heap>>` is > initialized by someone when `isEnabled` is called. But with this patch, we > do not instantiate Cache. > Since Cache accesses Heap, Heap is instantiated by Cache. Removing Cache > instantiation makes Heap not-instantiated status, and this `getFastCase()` > returns nullptr when it is used from WebProcess:: initializeWebProcess -> > MemoryPressureHandler::setShouldUsePeriodicMemoryMonitor -> > bmalloc::api::isEnabled. > > But relying on Heap instantiation done by somewhere else is not good. We > should just use `PerProcess<Environment>::get()->isDebugHeapEnabled()` here. I'll land this patch again with the above change for `isEnabled(HeapKind)`. I've grepped the source code with `getFastCase()` and the other use cases are OK.

Yusuke Suzuki

Comment 28 2019-02-20 13:30:52 PST

Committed r241832: <https://trac.webkit.org/changeset/241832>

David Kilzer (:ddkilzer)

Comment 29 2019-03-20 12:42:31 PDT

*** Bug 194794 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component bmalloc

Assignee

Yusuke Suzuki

Reported

2019-02-19 01:29 PST

Modified

2019-03-20 12:42 PDT History

CC List

5 users Show

URL

Keywords InRadar

Duplicates (1)

194794 View as bug list

Depends on

194856

Blocks

Dependencies

tree graph