Destructing a CodeBlock in Speedometer appears to take about 1 µs, which seems far too slow. That cost may be inherent. If it is, we should consider parallelizing the destruction. More generally, we could make parallel destruction easy for any type that isn't tied to thread-unsafe ref counting.
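
To make the idea concrete, here is a rough sketch of an opt-in parallel destruction helper. It is plain standard C++, not existing WebKit/JSC code: `IsSafeToDestroyOffThread`, `destroyInParallel`, `FakeCodeBlock`, and `destroyedCount` are all hypothetical names invented for illustration. It assumes the doomed objects are uniquely owned and that their destructors touch no thread-bound state.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical opt-in trait: specialize to true_type only for types whose
// destructors never touch thread-unsafe ref counts or other thread-bound state.
template<typename T>
struct IsSafeToDestroyOffThread : std::false_type { };

// Destroys every object in `doomed`, splitting the work into contiguous
// chunks with one worker thread per chunk. Blocks until all workers finish.
template<typename T>
void destroyInParallel(std::vector<std::unique_ptr<T>> doomed, unsigned workerCount)
{
    static_assert(IsSafeToDestroyOffThread<T>::value,
        "T must opt in to off-thread destruction");

    if (!workerCount)
        workerCount = 1;

    // Round up so every element lands in some chunk.
    std::size_t chunkSize = (doomed.size() + workerCount - 1) / workerCount;

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < workerCount; ++i) {
        std::size_t begin = std::min(i * chunkSize, doomed.size());
        std::size_t end = std::min(begin + chunkSize, doomed.size());
        if (begin == end)
            break;
        // Each worker resets a disjoint range of elements, so no locking is needed.
        workers.emplace_back([&doomed, begin, end] {
            for (std::size_t j = begin; j < end; ++j)
                doomed[j].reset();
        });
    }

    for (auto& worker : workers)
        worker.join();
}

// Hypothetical stand-in for an expensive-to-destroy type such as CodeBlock.
std::atomic<int> destroyedCount { 0 };

struct FakeCodeBlock {
    ~FakeCodeBlock() { destroyedCount.fetch_add(1, std::memory_order_relaxed); }
};

// FakeCodeBlock opts in because its destructor only touches an atomic counter.
template<>
struct IsSafeToDestroyOffThread<FakeCodeBlock> : std::true_type { };
```

A caller would collect the dead objects into a vector and hand it off, for example `destroyInParallel(std::move(deadBlocks), 4)`. The `static_assert` keeps types that have not opted in from being destroyed off-thread by accident. Whether this beats serial destruction depends on how thread startup cost compares with the per-object cost, so a real version would probably reuse an existing worker pool rather than spawn threads per batch.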