WebKit Bugzilla
Bug 261821: Resizable array buffer stress tests run out of memory on ppc64le
https://bugs.webkit.org/show_bug.cgi?id=261821
Status: NEW
Michael Catanzaro
Reported
2023-09-20 08:00:29 PDT
Using a cloop build on ppc64le, the following JSC stress tests crash due to running out of memory:

stress/resizable-byteoffset.js.default
stress/resizable-byteoffset.js.bytecode-cache
stress/resizable-byteoffset.js.mini-mode
stress/resizable-byteoffset.js.lockdown
stress/resizable-bytelength.js.default
stress/resizable-bytelength.js.bytecode-cache
stress/resizable-bytelength.js.mini-mode
stress/resizable-bytelength.js.lockdown
stress/resizable-length.js.default
stress/resizable-length.js.bytecode-cache
stress/resizable-length.js.mini-mode
stress/resizable-length.js.lockdown

All of the failures look like this:

Running stress/resizable-byteoffset.js.default
stress/resizable-byteoffset.js.default: Crashing because current footprint: 631242752 exceeds limit: 629145600
stress/resizable-byteoffset.js.default: test_script_452: line 2: 1830913 Aborted (core dumped) ( "$@" /home/jenkins/workspace/WebKit-JSC/label/ppc64le/WebKitBuild/Release/bin/jsc --useFTLJIT\=false --useFunctionDotArguments\=true --validateExceptionChecks\=true --useDollarVM\=true --maxPerThreadStackUsage\=1572864 --useFTLJIT\=true --useResizableArrayBuffer\=1 resizable-byteoffset.js )
stress/resizable-byteoffset.js.default: ERROR: Unexpected exit code: 134
FAIL: stress/resizable-byteoffset.js.default

My first instinct was to mark the tests as memoryLimited, since we commonly do this whenever a test requires a large amount of memory to run:

//@ skip if $memoryLimited

But then I realized that the tests do not look like they should require very much memory at all. E.g. the resizable-length test only creates one ArrayBuffer at a time, then resizes it up to 1024 bytes. It does this 10,000 times. I'm OK with skipping the tests, but I wonder if it's appropriate; perhaps something is wrong with garbage collection on this architecture, for example? Would any sort of debug info be helpful to decide?
Radar WebKit Bug Importer
Comment 1
2023-09-27 08:01:11 PDT
<rdar://problem/116121832>
Michael Catanzaro
Comment 2
2023-10-03 06:27:22 PDT
(In reply to Michael Catanzaro from comment #0)
> But then I realized that the tests do not look like they should require very
> much memory at all. E.g. the resizable-length test only creates one
> ArrayBuffer at a time, then resizes it up to 1024 bytes. It does this 10,000
> times. I'm OK with skipping the tests, but I wonder if it's appropriate;
> perhaps something is wrong with garbage collection on this architecture, for
> example? Would any sort of debug info be helpful to decide?
I've also asked for guidance in Slack and didn't receive any response, so I'm going to mark these tests as memory limited.
Michael Catanzaro
Comment 3
2023-10-03 06:30:46 PDT
Pull request:
https://github.com/WebKit/WebKit/pull/18548
Michael Catanzaro
Comment 4
2023-10-04 09:08:46 PDT
I got access to a ppc64le machine to run the test there, but it has a huge amount of RAM and the test passes. I made two modifications to the test:

diff --git a/JSTests/stress/resizable-byteoffset.js b/JSTests/stress/resizable-byteoffset.js
index c25d9cbf22bb..cf26153ed8fb 100644
--- a/JSTests/stress/resizable-byteoffset.js
+++ b/JSTests/stress/resizable-byteoffset.js
@@ -15,7 +15,8 @@ function test() {
         else
             shouldBe(array.byteOffset, 64);
     }
+    gc();
 }
 
-for (var i = 0; i < 1e4; ++i)
+for (var i = 0; i < 1e5; ++i)
     test();

The change to 1e5 is to make the test run longer. With this change, I see RAM usage increase to just above 5 GiB before garbage collection kicks in. If I additionally add the gc() to trigger garbage collection, then RAM usage stays nice and low the whole time.

So my conclusion is clear: the garbage collector works, but it waits far too long before running, long enough that it runs out of RAM on the builder that runs the CI. I wonder how JSC decides when to run garbage collection?

Notably, bugs specific to ppc64le almost always turn out to be related to page size, which for ppc64le is 64 KiB, much larger than the 4 KiB generally used on x86_64 and aarch64 or the 16 KiB used on Apple platforms. I wonder if that could be related.
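As an aside, the base page size is easy to confirm. This is just an illustrative snippet, not part of the test suite or of JSC, that prints what the kernel reports:

// Illustrative only: print the system's base page size.
// Typical ppc64le Linux kernels report 65536 (64 KiB); x86_64 reports 4096.
#include <unistd.h>
#include <cstdio>

int main()
{
    long pageSize = sysconf(_SC_PAGESIZE); // base page size in bytes
    std::printf("page size: %ld bytes\n", pageSize);
    return 0;
}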
Mark Lam
Comment 5
2023-10-04 13:49:30 PDT
How does it behave on x86_64 in contrast?
Michael Catanzaro
Comment 6
2023-10-04 14:00:02 PDT
(In reply to Mark Lam from comment #5)
> How does it behave on x86_64 in contrast?
Good question. Memory use grows to about 400 MB, then it drops (presumably because garbage collection runs). So it looks like the threshold to trigger garbage collection is way lower on x86_64 than on ppc64le.
Michael Catanzaro
Comment 7
2023-10-04 14:05:50 PDT
Hm, it looks like JSC considers total system RAM when deciding how frequently to run GC. In this example, the ppc64le test system has >200 GB of RAM while the x86_64 system has 64 GB. So maybe this could be a red herring? :/
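For illustration, here is a hedged sketch, not the WebKit source, of how a heuristic like this can query total physical RAM on Linux (WebKit's own helper for this is, I believe, WTF's ramSize()):

// Hedged sketch, not JSC/WTF source: ask the OS for total physical RAM.
#include <unistd.h>
#include <cstddef>
#include <cstdio>

static size_t totalPhysicalRam()
{
    long pages = sysconf(_SC_PHYS_PAGES);   // physical pages in the system (glibc/Linux)
    long pageSize = sysconf(_SC_PAGE_SIZE); // bytes per page
    if (pages < 0 || pageSize < 0)
        return 0; // callers should fall back to a conservative default
    return static_cast<size_t>(pages) * static_cast<size_t>(pageSize);
}

int main()
{
    std::printf("total RAM: %zu bytes\n", totalPhysicalRam());
    return 0;
}

A GC that scales its trigger thresholds off this number would behave very differently on a >200 GB machine than on a 32 GB CI VM.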
Mark Lam
Comment 8
2023-10-04 14:10:30 PDT
(In reply to Michael Catanzaro from comment #7)
> Hm, it looks like JSC considers total system RAM when deciding how
> frequently to run GC. In this example, the ppc64le test system has >200 GB
> of RAM while the x86_64 system has 64 GB. So maybe this could be a red
> herring? :/
Find the code that determines available system RAM. Find out how it determines this for actual devices with limited memory, e.g. ARMv7 and MIPS:

- Do they override with a lower limit somehow, or does the OS give the correct value to guide the heuristics?
- Does PPC64's OS give a reasonable value here, or is it overly optimistic about available memory?
- Does PPC64's port need to override with a lower limit here?
- Is there already a mechanism to override the limit?
Michael Catanzaro
Comment 9
2023-10-05 13:02:18 PDT
Thanks for the tips. I found Heap::collectIfNecessaryOrDefer, and I'm guessing this is where the decision whether to garbage collect based on the critical memory threshold is made:

    size_t bytesAllowedThisCycle = m_maxEdenSize;

#if USE(BMALLOC_MEMORY_FOOTPRINT_API)
    if (overCriticalMemoryThreshold())
        bytesAllowedThisCycle = std::min(m_maxEdenSizeWhenCritical, bytesAllowedThisCycle);
#endif

    if (m_bytesAllocatedThisCycle <= bytesAllowedThisCycle)
        return;

Notably, the overCriticalMemoryThreshold() call depends on USE(BMALLOC_MEMORY_FOOTPRINT_API), and bmalloc doesn't work on ppc64le due to page size, bug #209360. So I think there is no code looking at available system RAM.

Next, I changed Heap::updateAllocationLimits to enable the verbose debug prints, expecting it to print something useful. After changing the loop from 1e4 to 1e5 to make the test run for longer, the function is called only once:

bytesAllocatedThisCycle = 33554629
totalBytesVisited = 97776, currentHeapSize = 97776
extraMemorySize() = 12678, currentHeapSize = 110454
Full: maxHeapSize = 33554432
Full: maxEdenSize = 33443978
Full: sizeAfterLastFullCollect = 110454
Full: bytesAbandonedSinceLastFullCollect = 0
sizeAfterLastCollect = 110454

33554629 bytes is almost exactly 32 MiB, but in fact jsc is using over 5 GiB of resident memory, so I'm suspicious about m_bytesAllocatedThisCycle, which is updated by Heap::didAllocate. That's called from a variety of different locations, so that's as far as I've taken this for now.

On x86_64, when a collection occurs, it looks like this:

bytesAllocatedThisCycle = 33554533
totalBytesVisited = 97824, currentHeapSize = 97824
extraMemorySize() = 12678, currentHeapSize = 110502
Full: maxHeapSize = 33554432
Full: maxEdenSize = 33443930
Full: sizeAfterLastFullCollect = 110502
Full: bytesAbandonedSinceLastFullCollect = 0
sizeAfterLastCollect = 110502

The numbers look almost identical, but on x86_64 the jsc process is using roughly 400 MB of RAM when the GC runs, rather than 5 GB on ppc64le. So it looks like the lack of BMALLOC_MEMORY_FOOTPRINT_API doesn't actually matter: JSC is reaching the bytesAllocatedThisCycle limit on both architectures, but this limit corresponds to only a small portion of the process memory. Yet all of that memory really is freed by the GC. I wonder if there is a simple explanation for this that I am missing. It looks like m_bytesAllocatedThisCycle is wrong by about one order of magnitude on x86_64, and two orders of magnitude on ppc64le.
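To make the suspicion concrete, here is a toy model of the accounting described above. The names mirror the quoted JSC code, but this is an assumption-laden illustration, not the real implementation: the trigger counter only advances when an allocation is reported to the heap, so memory the counter never hears about grows the process footprint without bringing a collection any closer.

// Toy model, not JSC source: a GC trigger that counts only reported bytes.
#include <cstddef>

struct ToyHeap {
    size_t bytesAllocatedThisCycle { 0 };
    size_t maxEdenSize { 32 * 1024 * 1024 }; // ~32 MiB, matching the log above

    // Allocation sites must call this for the collector to "see" the memory.
    void didAllocate(size_t bytes) { bytesAllocatedThisCycle += bytes; }

    // The trigger check: nothing here looks at actual process RSS.
    bool shouldCollect() const { return bytesAllocatedThisCycle > maxEdenSize; }
};

If the real counter under-counts large buffer allocations, or if its accounting interacts badly with the 64 KiB page size, that would match the footprint blowup seen above.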
Michael Catanzaro
Comment 10
2023-10-05 13:05:38 PDT
Hm, immediately after I posted that long comment, I realized there are multiple Heap objects and that the debug output only prints for the particular Heap that happens to reach its size limit. So that explains the size discrepancy.
Michael Catanzaro
Comment 11
2023-10-05 13:12:23 PDT
(In reply to Michael Catanzaro from comment #10)
> Hm, immediately after I posted that long comment, I realized there are
> multiple Heap objects and that the debug output only prints for the
> particular Heap that happens to reach its size limit. So that explains the
> size discrepancy.
Ah well, this seemed like a great theory, but I added a debug print to the Heap constructor and there is actually only one Heap. Sorry for the noise.
Mark Lam
Comment 12
2023-10-05 14:22:39 PDT
Michael, you're debugging on a ppc64le which is configured differently (in terms of RAM at least) than the test machine that this issue manifested on. Can you check what the available RAM is on the test machine where this is failing?

FWIW, I don't think your approach of increasing the loop iteration count to 1e5 is necessarily valid. You may be looking at one symptom that increases memory usage this way, but it may or may not be what the test machine is experiencing. Do you have access to the test machine that is failing the test? If so, can you just ssh in and run the test there and poke at it instead? If not, is there some way to force the OS on the ppc64le to only use the same (lesser) amount of RAM as that on the test device?

Another tip: If you're running the jsc shell, you can pass it `--logGC=1` to see if GC is run or not.

Another tip: I see that ppc64le is using the cloop. x86_64 is not. Maybe try forcing the x86_64 build to use cloop also and see if there are any changes in memory use behavior?
Michael Catanzaro
Comment 13
2023-10-05 15:39:43 PDT
(In reply to Mark Lam from comment #12)
> Michael, you're debugging on a ppc64le which is configured differently (in
> terms of RAM at least) than the test machine that this issue manifested on.
> Can you check what the available RAM is on the test machine where this is
> failing?
Yes. I will ask how much RAM is available on the machine where it's failing. (That said, based on what I've seen so far, I'm thinking available system RAM is not a consideration at all except when bmalloc is used.)
> Do you have access to the test machine that is failing the test? If so, can
> you just ssh in and run the test there and poke at it instead? If not, is
> there some way to force the OS on the ppc64le to only use the same (lesser)
> amount of RAM as that on the test device?
I can probably access the CI machine where the test is failing, yes. And yes, it's just a bunch of VMs, so we can also reduce the available memory if needed.
> Another tip: If you're running the jsc shell, you can pass it `--logGC=1` to
> see if GC is run or not.
Neat. Didn't know this.
> Another tip: I see that ppc64le is using the cloop. x86_64 is not. Maybe
> try forcing the x86_64 build to use cloop also and see if there are any
> changes in memory use behavior?
Actually I built x86_64 also using cloop to ensure the only differences in test results should be due to processor architecture. The CI also disables bmalloc, but I didn't disable bmalloc in my local testing, which could be relevant. I should disable that to ensure the only difference is architecture.
Michael Catanzaro
Comment 14
2023-10-06 06:50:05 PDT
(In reply to Michael Catanzaro from comment #13)
> Yes. I will ask how much RAM is available on the machine where it's failing.
32 GB. Since this test never seems to use more than 5 GB, I guess other tests are running at the same time.
Michael Catanzaro
Comment 15
2023-10-06 11:55:11 PDT
(In reply to Michael Catanzaro from comment #13)
> Actually I built x86_64 also using cloop to ensure the only differences in
> test results should be due to processor architecture. The CI also disables
> bmalloc, but I didn't disable bmalloc in my local testing, which could be
> relevant. I should disable that to ensure the only difference is
> architecture.
Whoops, I really did have bmalloc disabled in the x86_64 build, so it was an apples-to-apples comparison.