RESOLVED FIXED 211809
[bmalloc] Introduce lock-less ObjectType query
https://bugs.webkit.org/show_bug.cgi?id=211809
Summary [bmalloc] Introduce lock-less ObjectType query
Yusuke Suzuki
Reported 2020-05-12 15:24:38 PDT
Talked with Phil, and we now agree that we will introduce this even if this is not the part of contention right now (not measured) as a pure improvement.
Attachments
Patch (27.64 KB, patch)
2020-05-12 19:20 PDT, Yusuke Suzuki
no flags
Patch (27.64 KB, patch)
2020-05-12 19:24 PDT, Yusuke Suzuki
no flags
Patch (27.64 KB, patch)
2020-05-12 19:41 PDT, Yusuke Suzuki
mark.lam: review+
Patch for landing (28.26 KB, patch)
2020-05-13 14:06 PDT, Yusuke Suzuki
no flags
Patch for landing (28.26 KB, patch)
2020-05-13 14:08 PDT, Yusuke Suzuki
no flags
Patch for landing (28.43 KB, patch)
2020-05-13 15:52 PDT, Yusuke Suzuki
no flags
Patch for landing (28.41 KB, patch)
2020-05-13 15:59 PDT, Yusuke Suzuki
no flags
Yusuke Suzuki
Comment 1 2020-05-12 19:20:50 PDT
Yusuke Suzuki
Comment 2 2020-05-12 19:24:02 PDT
Yusuke Suzuki
Comment 3 2020-05-12 19:30:15 PDT
Comment on attachment 399231 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=399231&action=review > Source/bmalloc/bmalloc/ObjectTypeTable.cpp:64 > + size_t size = vmSize(sizeof(Bits) + (roundUpToMultipleOf<size_t>(ObjectTypeTable::Bits::bitCountPerWord, count) / 8)); > + BASSERT(isPowerOf2(size)); > + size = roundUpToPowerOf2(size, vmPageSize()); Only thing I'm not sure is whether I should use vmPageSizePhysical or vmPageSize here. I'm using vmPageSize() since bmalloc::Vector<> is using that.
Yusuke Suzuki
Comment 4 2020-05-12 19:41:07 PDT
Created attachment 399232 [details] Patch Fix debug build
Mark Lam
Comment 5 2020-05-12 21:09:07 PDT
Comment on attachment 399231 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=399231&action=review r=me with suggested improvements. > Source/bmalloc/ChangeLog:9 > + This patch introduces ObjectTypeTable, which allows lock-less ObjectType query for Chunk*. > + It has bit-vector to store ObjectType per address. And 1bit represents 1MB VA region since Please add a comment somewhere around here to explain that the reason we can represent the ObjectType with a bit is because there only 2 ObjectTypes: Small and Large. /1bit/each bit/ /1MB VA/1MB of VA/ > Source/bmalloc/ChangeLog:12 > + is limited and it does not waste memory so much because Chunk's size is enough large (1MB). Since each /waste memory so much/waste much memory/ > Source/bmalloc/ChangeLog:14 > + happens. I ensured that 4KB page can handle memory allocation in JetStream2 and Gmail. I think you meant "verified" instead of "ensured". I presume you confirmed that they already fit, and not that you changed anything to make them fit. Is that right? /allocation/allocations/ > Source/bmalloc/bmalloc/Algorithm.h:224 > + return ((size + alignment - 1) & -alignment); Please add an assertion here that alignment isPowerOf2 since this algorithm depends on that for correctness. > Source/bmalloc/bmalloc/ObjectTypeTable.cpp:45 > + // This is initial allocation of ObjectTypeTable. In this case, it would be possible that the first registration could be > + // possible that some VAs are already allocated for different purpose, and later it will be reused for bmalloc. In that case, > + // soon, we will see smaller index request than the initial one. > + // We add 128MB offset to the initial newBegin to cover such patterns without extending table too quickly. /it would be possible that the first registration could be possible that some VAs are already/it could be possible that for the first registration, some VAs are already/ /for different purpose/for a different purpose/ /add 128MB/subtract a 128MB/ /we see smaller index request than the initial one/we will see a smaller index request than this initial one/ > Source/bmalloc/bmalloc/ObjectTypeTable.cpp:51 > + newBegin = std::min<unsigned>(index, (bits->begin() - (bits->end() - bits->begin()))); I suggest you add: inline unsigned Bits::count() { return end() - begin(); } ... and express this line instead as: newBegin = std::min<unsigned>(index, bits->begin() - bits->count()); I think this reads clearer as doubling the Bits size below begin(). I would have used "size" over "count", but you're using "count" in the local var below. So, let's just keep it consistent. > Source/bmalloc/bmalloc/ObjectTypeTable.cpp:57 > + newEnd = std::max<unsigned>(index + 1, bits->end() + (bits->end() - bits->begin())); Ditto: newEnd = std::max<unsigned>(index + 1, bits->end() + bits->size()); > Source/bmalloc/bmalloc/ObjectTypeTable.cpp:71 > + for (unsigned index = bits->begin(); index < bits->end(); ++index) > + newBits->set(index, bits->get(index)); Above, you only round the size up. Instead, why not round down newBegin to a WordType boundary first, and then round up newEnd as above. By doing this, you guarantee that a Nth bit in a word will always be the Nth bit in some word no matter how many new empty words you grow on each end. With that, you can compute how many words you added before the old begin, and how many words after. With this info, you can do a memcpy of the old Bits words into the new one instead of this slow per bit iteration. > Source/bmalloc/bmalloc/ObjectTypeTable.cpp:77 > + BASSERT(static_cast<ObjectType>(value) == objectType); Nice. I was wondering if there's a way to ensure that there can only be 2 ObjectTypes: Small and Large, so that if we ever add a Medium type, we would fail here. This assertion is even better than the naive one I thought of. > Source/bmalloc/bmalloc/ObjectTypeTable.h:57 > + return static_cast<unsigned>((address & addressMask) >> shiftAmount); Do we really need to & with addressMask here? By definition, chunk / address cannot have bits above BOS_EFFECTIVE_ADDRESS_WIDTH, right? Or are we storing some bits there? Secondly, why not shift first, and then mask against an indexMask (where indexMask = addressMask >> shiftAmount)? The indexMask is a smaller constant and may allow for more efficient instruction encoding. Additionally, the indexMask's low bit aligns perfectly on a 16-bit boundary. For example, on ARM64, indexMask would be 16bits. A single movz can initialize the constant in a register. But for addressMask, because the low bit is bit 20, we all need at least to 1 movz + 1 movk to form the constant. > Source/bmalloc/bmalloc/ObjectTypeTable.h:65 > + using WordType = unsigned; How about using WordType = UCPURegister? It's not a big advantage to do so, but will make the memcpy I suggested above just a bit faster.
Yusuke Suzuki
Comment 6 2020-05-13 14:04:03 PDT
Comment on attachment 399231 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=399231&action=review >> Source/bmalloc/ChangeLog:9 >> + It has bit-vector to store ObjectType per address. And 1bit represents 1MB VA region since > > Please add a comment somewhere around here to explain that the reason we can represent the ObjectType with a bit is because there only 2 ObjectTypes: Small and Large. > > /1bit/each bit/ > /1MB VA/1MB of VA/ Fixed. >> Source/bmalloc/ChangeLog:12 >> + is limited and it does not waste memory so much because Chunk's size is enough large (1MB). Since each > > /waste memory so much/waste much memory/ Fixed. >> Source/bmalloc/ChangeLog:14 >> + happens. I ensured that 4KB page can handle memory allocation in JetStream2 and Gmail. > > I think you meant "verified" instead of "ensured". I presume you confirmed that they already fit, and not that you changed anything to make them fit. Is that right? > /allocation/allocations/ Right, fixed. >> Source/bmalloc/bmalloc/Algorithm.h:224 >> + return ((size + alignment - 1) & -alignment); > > Please add an assertion here that alignment isPowerOf2 since this algorithm depends on that for correctness. Added. >> Source/bmalloc/bmalloc/ObjectTypeTable.cpp:45 >> + // We add 128MB offset to the initial newBegin to cover such patterns without extending table too quickly. > > /it would be possible that the first registration could be possible that some VAs are already/it could be possible that for the first registration, some VAs are already/ > /for different purpose/for a different purpose/ > /add 128MB/subtract a 128MB/ > /we see smaller index request than the initial one/we will see a smaller index request than this initial one/ Fixed. >> Source/bmalloc/bmalloc/ObjectTypeTable.cpp:51 >> + newBegin = std::min<unsigned>(index, (bits->begin() - (bits->end() - bits->begin()))); > > I suggest you add: > inline unsigned Bits::count() { return end() - begin(); } > ... and express this line instead as: > newBegin = std::min<unsigned>(index, bits->begin() - bits->count()); > > I think this reads clearer as doubling the Bits size below begin(). I would have used "size" over "count", but you're using "count" in the local var below. So, let's just keep it consistent. Fixed. >> Source/bmalloc/bmalloc/ObjectTypeTable.cpp:57 >> + newEnd = std::max<unsigned>(index + 1, bits->end() + (bits->end() - bits->begin())); > > Ditto: > newEnd = std::max<unsigned>(index + 1, bits->end() + bits->size()); Fixed. >> Source/bmalloc/bmalloc/ObjectTypeTable.cpp:71 >> + newBits->set(index, bits->get(index)); > > Above, you only round the size up. Instead, why not round down newBegin to a WordType boundary first, and then round up newEnd as above. By doing this, you guarantee that a Nth bit in a word will always be the Nth bit in some word no matter how many new empty words you grow on each end. > > With that, you can compute how many words you added before the old begin, and how many words after. With this info, you can do a memcpy of the old Bits words into the new one instead of this slow per bit iteration. Fixed. >> Source/bmalloc/bmalloc/ObjectTypeTable.h:57 >> + return static_cast<unsigned>((address & addressMask) >> shiftAmount); > > Do we really need to & with addressMask here? By definition, chunk / address cannot have bits above BOS_EFFECTIVE_ADDRESS_WIDTH, right? Or are we storing some bits there? > > Secondly, why not shift first, and then mask against an indexMask (where indexMask = addressMask >> shiftAmount)? The indexMask is a smaller constant and may allow for more efficient instruction encoding. Additionally, the indexMask's low bit aligns perfectly on a 16-bit boundary. For example, on ARM64, indexMask would be 16bits. A single movz can initialize the constant in a register. But for addressMask, because the low bit is bit 20, we all need at least to 1 movz + 1 movk to form the constant. Yeah, this addressMask is not necessary. Removed. >> Source/bmalloc/bmalloc/ObjectTypeTable.h:65 >> + using WordType = unsigned; > > How about using WordType = UCPURegister? It's not a big advantage to do so, but will make the memcpy I suggested above just a bit faster. Since extension is extremely rare (never happens in practice), using UCPURegister does not matter. I keep using unsigned here because it is simpler.
Yusuke Suzuki
Comment 7 2020-05-13 14:06:01 PDT
Created attachment 399297 [details] Patch for landing Patch for landing
Yusuke Suzuki
Comment 8 2020-05-13 14:08:04 PDT
Created attachment 399298 [details] Patch for landing Patch for landing
Mark Lam
Comment 9 2020-05-13 14:14:27 PDT
Comment on attachment 399297 [details] Patch for landing View in context: https://bugs.webkit.org/attachment.cgi?id=399297&action=review > Source/bmalloc/bmalloc/ObjectTypeTable.cpp:56 > + newBegin = static_cast<unsigned>(roundDownToMultipleOf<size_t>(ObjectTypeTable::Bits::bitCountPerWord, bits->begin())); This rounding doesn't hurt, but is not needed. The old bits->begin() should already be properly aligned. You can just assert it if you like. Alternatively, instead of doing the rounding down in all 3 cases here, just do it below after this if else statement.
Yusuke Suzuki
Comment 10 2020-05-13 15:52:00 PDT
Created attachment 399306 [details] Patch for landing Patch for landing
Yusuke Suzuki
Comment 11 2020-05-13 15:59:41 PDT
Created attachment 399311 [details] Patch for landing Patch for landing
EWS
Comment 12 2020-05-13 18:09:24 PDT
Committed r261667: <https://trac.webkit.org/changeset/261667> All reviewed patches have been landed. Closing bug and clearing flags on attachment 399311 [details].
Radar WebKit Bug Importer
Comment 13 2020-05-13 18:10:18 PDT
Note You need to log in before you can comment on or make changes to this bug.