add generic list api with slab and rc impls #48291
Conversation
Benchmarks ran on an M2 Pro
There is indeed room for improvement. As expected, list operations that don't preallocate the nodes, so that rc new and drop-with-destruct (refs == 1) operations are included in the measurements, show the arena-based impl outperforming. Now the results make more sense. I also switched to using

Looking at the assembly, the overhead of twiddling link refs is about as minimal as one would expect. The chain of method calls, including trait methods, gets inlined down to a handful of instructions. Maybe the unstable `Cell::get_cloned` could knock off an instruction or two someday. In any case, the 8x overhead seems like the cost of doing business if we want ref-counted nodes. That said, nodes very often get reused, and the overhead of the actual list operations alone is more like 2.5x. That may be a more acceptable trade for the convenience of ref-counting.
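For reference, the link-ref twiddling boils down to something like this (a minimal sketch; the real node and link types differ). Links live in a `Cell<Option<Rc<...>>>`, which can't hand out a borrow, so cloning a link is a take/clone/set round trip, and that is presumably where `Cell::get_cloned` could shave an instruction or two:

```rust
use std::cell::Cell;
use std::rc::Rc;

// Sketch: cloning a link stored in a Cell. The Cell can't be borrowed
// in place, so the value is moved out, cloned, and put back.
fn clone_link<T>(link: &Cell<Option<Rc<T>>>) -> Option<Rc<T>> {
    let taken = link.take();    // move the Rc out, leaving None behind
    let cloned = taken.clone(); // bump the refcount
    link.set(taken);            // restore the original link
    cloned
}
```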
Rust uses glibc's allocator on Linux. A cursory look at glibc suggests it does contain optimizations for "small" allocations, likely its per-thread tcache, which keeps recently freed small chunks in thread-local bins, i.e. nearly the same thing our arena is doing. This could explain the competitive performance. It would be good to investigate these small-allocation optimizations more deeply and understand how they apply in our case. I would bet most of our list nodes are at most a few hundred bytes, within the threshold of the optimized pathway.
TL;DR: Work-in-progress new linked list implementation that doesn't require a fixed capacity.
Background
Currently, our linked list implementation stores nodes in preallocated slabs in order to avoid heap operations at runtime. This approach is performant, but it requires knowing in advance how many nodes will be needed, and that isn't always easy to know. Notably, reactor registrations are kept in a list, and determining how many registrations the whole app needs pretty much requires reading the entire codebase, since registrations can occur anywhere.
Approach
This PR aims to provide a linked list that has a dynamic capacity while remaining performant. It introduces a new list type capable of working with ref-counted nodes, either `arena::Rc` (from our core lib) or `std::rc::Rc`. `arena::Rc` is like `std::rc::Rc` but it allocates into a slab. This approach lets us continue to use preallocated slabs for node memory, with the advantages that not all of a list's nodes have to live in the same slab, and that if slabs are full we can fall back to the heap.

The intended way to use the new list is to allocate per-task slabs for node memory, with some kind of automatic right-sizing for the slabs. Each task would create nodes within its own slabs, even if the nodes will be added to lists shared among multiple tasks. If a task wants to create a node in a slab that's full, it can create the node on the heap instead and note somewhere that the next spawned task will need a larger slab.
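To illustrate, the right-sizing policy could look something like this. This is a hypothetical sketch: `NodeAlloc`, its counters, and the grow hint are all assumptions, and `std::rc::Rc` stands in for `arena::Rc` so the sketch compiles on its own.

```rust
use std::rc::Rc;

// Hypothetical per-task node allocator. The real slab path would hand
// out an arena::Rc backed by the task's slab; std::rc::Rc stands in
// here to keep the sketch self-contained.
struct NodeAlloc {
    capacity: usize, // slab capacity this task was spawned with
    used: usize,     // nodes handed out from the slab so far
    grow: bool,      // set when we had to overflow to the heap
}

impl NodeAlloc {
    fn alloc<T>(&mut self, value: T) -> Rc<T> {
        if self.used < self.capacity {
            // Fast path: the node fits in the task-local slab.
            self.used += 1;
            Rc::new(value)
        } else {
            // Slab is full: fall back to the heap and remember that
            // the next spawned task should get a larger slab.
            self.grow = true;
            Rc::new(value)
        }
    }
}
```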
The ref-counted nodes are `!Send`, and there is at least one place where this would cause us trouble: connmgr's `Pool`, which is shared between threads. The simplest solution to that problem is to keep using a list implementation based on a single slab there. In order to avoid having multiple linked list implementations, the new list is generic over a `Backend` which supplies the indexing and linking logic.

Two backends are provided: one using `usize` indexes with a single slab for node memory (by implementing the trait directly on the `Slab` type: `impl<T> Backend for Slab<SlabNode<T>>`), and one using `arena::Rc`/`std::rc::Rc` for indexes with node memory living wherever (the `RcBackend` zero-sized type).
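For a feel of the shape of that abstraction, the trait might look roughly like this (a sketch only; the actual method set and signatures are assumptions here):

```rust
// Sketch of a Backend trait: the list stores only opaque indexes and
// delegates node lookup and link updates to the backend.
trait Backend<T> {
    /// usize for the single-slab backend; an Rc-like handle for the
    /// rc-based backend.
    type Index;

    /// Read a node's neighbor links.
    fn next(&self, index: &Self::Index) -> Option<Self::Index>;
    fn prev(&self, index: &Self::Index) -> Option<Self::Index>;

    /// Rewrite a node's neighbor links.
    fn set_next(&mut self, index: &Self::Index, next: Option<Self::Index>);
    fn set_prev(&mut self, index: &Self::Index, prev: Option<Self::Index>);
}
```

The single-slab impl lives directly on `Slab<SlabNode<T>>`, while `RcBackend` can be zero-sized, presumably because ref-counted nodes carry their own links.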
Compatibility / perf

The API of the new list is basically the same as the current one, except `head` and `tail` are now methods instead of fields.

Care is taken to ensure the API doesn't require unnecessary cloning of ref-counted nodes, mainly in case we ever want to add an `Arc`-based backend. For example, the `remove()` method takes an index reference (`&RcNode` when using the rc-based backend) rather than an owned index.

At the same time, we don't want to have to pass a `&usize` when using the single slab backend, as that adds unnecessary indirection. To work around this, the index reference type is made generic: for the single slab backend, the index type is `usize` and the index reference type is also `usize`, whereas for the rc-based backend, the index type is `RcNode` and the index reference type is `&RcNode`.

In theory, being able to index with a `usize` by value should let the single slab backend remain as performant as the current list implementation, which does the same, though the generified code is a bit noisy (`<Backend::Index as Index>::Ref` all over the place). The benchmarks appear to support this.
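One way to express that generic index reference is with a generic associated type; this is a sketch under that assumption, not the actual trait:

```rust
use std::rc::Rc;

// Sketch: an Index trait whose Ref type lets usize travel by value
// while rc-based handles travel by borrow.
trait Index {
    type Ref<'a>: Copy
    where
        Self: 'a;

    fn as_ref(&self) -> Self::Ref<'_>;
}

// Single-slab backend: the index and its "reference" are both usize,
// so no indirection is added.
impl Index for usize {
    type Ref<'a> = usize
    where
        Self: 'a;

    fn as_ref(&self) -> usize {
        *self
    }
}

// Rc-based backend: the "reference" is a real borrow, so passing an
// index to remove() never bumps the refcount. RcNode here is a
// stand-in for the real handle type.
struct RcNode(Rc<()>);

impl Index for RcNode {
    type Ref<'a> = &'a RcNode
    where
        Self: 'a;

    fn as_ref(&self) -> &RcNode {
        self
    }
}
```

A method like `remove()` can then accept `<Backend::Index as Index>::Ref`, which resolves to a plain `usize` for the slab backend and to `&RcNode` for the rc backend.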
Benchmarks

Some benchmarks are included that do 1000 pushes/pops against the various implementations.
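The benches are roughly this shape (a minimal sketch with a plain timing loop; `std::collections::LinkedList` stands in for the lists under test, and the real benches may use a proper harness):

```rust
use std::collections::LinkedList;
use std::time::Instant;

// Sketch of the bench shape: 1000 pushes then 1000 pops per iteration.
fn bench_push_pop(iters: u32) {
    let start = Instant::now();
    for _ in 0..iters {
        let mut list = LinkedList::new();
        for i in 0..1000 {
            list.push_back(i);
        }
        while list.pop_front().is_some() {}
    }
    println!("avg per iteration: {:?}", start.elapsed() / iters);
}

fn main() {
    bench_push_pop(10_000);
}
```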
Results on Linux:

The first benchmark is of the current implementation (single slab) and the second is of the new implementation with the single slab backend, and the numbers are very close. This makes sense: they share the same logic, and with static dispatch and good inlining they should compile down to more or less the same thing.
The rc-based benches (arena and std) are ~8x slower.
For the arena bench, the slowness is surprising: its overhead should only be a bunch of ref-counting and `RefCell` logic. Also surprising is that the std bench performs about the same as the arena bench despite layering heap operations on top of that.

TODO
Both of the benchmark surprises deserve deeper investigation. Maybe the rc logic could be further optimized. Maybe the allocator on Linux is just very fast (for small allocations?).