add generic list api with slab and rc impls #48291
Conversation
Benchmarks ran on an M2 Pro
There is indeed room for improvement. As expected, list operations that don't preallocate the nodes, so that rc new and drop-with-destruct (refs == 1) operations are included in the measurements, show the arena-based impl outperforming. Now the results make more sense. I also switched to using

Looking at the assembly, the overhead of twiddling link refs is about as minimal as one would expect. The chain of method calls, including trait methods, gets inlined down to a handful of instructions. Maybe the unstable `Cell::get_cloned` could knock off an instruction or two someday. In any case, the 8x overhead seems like the cost of doing business if we want ref-counted nodes. That said, nodes very often get reused, and the overhead of the actual list operations alone is more like 2.5x. That may be a more acceptable trade for the convenience of ref-counting.
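For reference, the link-ref twiddling boils down to something like this (a minimal sketch; the real node and link types differ). Links live in a `Cell<Option<Rc<...>>>`, which can't hand out a borrow, so cloning a link is a take/clone/set round trip, and that is presumably where `Cell::get_cloned` could shave an instruction or two:

```rust
use std::cell::Cell;
use std::rc::Rc;

// Sketch: cloning a link stored in a Cell. The Cell can't be borrowed
// in place, so the value is moved out, cloned, and put back.
fn clone_link<T>(link: &Cell<Option<Rc<T>>>) -> Option<Rc<T>> {
    let taken = link.take();    // move the Rc out, leaving None behind
    let cloned = taken.clone(); // bump the refcount
    link.set(taken);            // restore the original link
    cloned
}
```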
Rust uses glibc's allocator on Linux. A cursory look at glibc suggests it does contain optimizations for "small" allocations, likely its per-thread tcache, which keeps recently freed small chunks in thread-local bins, i.e. nearly the same thing our arena is doing. This could explain the competitive performance. It would be good to investigate these small-allocation optimizations more deeply and understand how they apply in our case. I would bet most of our list nodes are at most a few hundred bytes, within the threshold of the optimized pathway.
TL;DR: Work-in-progress new linked list implementation that doesn't require a fixed capacity.
Background
Currently, our linked list implementation stores nodes in preallocated slabs in order to avoid heap operations at runtime. This approach is performant, but it requires knowing in advance how many nodes will be needed, and that isn't always easy to know. Notably, reactor registrations are kept in a list, and determining how many registrations the whole app needs pretty much requires reading the entire codebase, since registrations can occur anywhere.
Approach
This PR aims to provide a linked list that has a dynamic capacity while remaining performant. It introduces a new list type capable of working with ref-counted nodes, either `arena::Rc` (from our core lib) or `std::rc::Rc`. `arena::Rc` is like `std::rc::Rc` but it allocates into a slab. This approach lets us continue to use preallocated slabs for node memory, with the advantages that not all of a list's nodes have to live in the same slab, and that if slabs are full we can fall back to the heap.

The intended way to use the new list is to allocate per-task slabs for node memory, with some kind of automatic right-sizing for the slabs. Each task would create nodes within its own slabs, even if the nodes will be added to lists shared among multiple tasks. If a task wants to create a node in a slab that's full, it can create the node on the heap instead and note somewhere that the next spawned task will need a larger slab.
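To illustrate, the right-sizing policy could look something like this. This is a hypothetical sketch: `NodeAlloc`, its counters, and the grow hint are all assumptions, and `std::rc::Rc` stands in for `arena::Rc` so the sketch compiles on its own.

```rust
use std::rc::Rc;

// Hypothetical per-task node allocator. The real slab path would hand
// out an arena::Rc backed by the task's slab; std::rc::Rc stands in
// here to keep the sketch self-contained.
struct NodeAlloc {
    capacity: usize, // slab capacity this task was spawned with
    used: usize,     // nodes handed out from the slab so far
    grow: bool,      // set when we had to overflow to the heap
}

impl NodeAlloc {
    fn alloc<T>(&mut self, value: T) -> Rc<T> {
        if self.used < self.capacity {
            // Fast path: the node fits in the task-local slab.
            self.used += 1;
            Rc::new(value)
        } else {
            // Slab is full: fall back to the heap and remember that
            // the next spawned task should get a larger slab.
            self.grow = true;
            Rc::new(value)
        }
    }
}
```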
The ref-counted nodes are `!Send`, and there is at least one place where this would cause us trouble: connmgr's `Pool`, which is shared between threads. The simplest solution to that problem is to keep using a list implementation based on a single slab there. In order to avoid having multiple linked list implementations, the new list is generic over a `Backend` which supplies the indexing and linking logic.

Two backends are provided: one using `usize` indexes with a single slab for node memory (by implementing the trait directly on the `Slab` type: `impl<T> Backend for Slab<SlabNode<T>>`), and one using `arena::Rc`/`std::rc::Rc` for indexes with node memory living wherever (the `RcBackend` zero-sized type).
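For a feel of the shape of that abstraction, the trait might look roughly like this (a sketch only; the actual method set and signatures are assumptions here):

```rust
// Sketch of a Backend trait: the list stores only opaque indexes and
// delegates node lookup and link updates to the backend.
trait Backend<T> {
    /// usize for the single-slab backend; an Rc-like handle for the
    /// rc-based backend.
    type Index;

    /// Read a node's neighbor links.
    fn next(&self, index: &Self::Index) -> Option<Self::Index>;
    fn prev(&self, index: &Self::Index) -> Option<Self::Index>;

    /// Rewrite a node's neighbor links.
    fn set_next(&mut self, index: &Self::Index, next: Option<Self::Index>);
    fn set_prev(&mut self, index: &Self::Index, prev: Option<Self::Index>);
}
```

The single-slab impl lives directly on `Slab<SlabNode<T>>`, while `RcBackend` can be zero-sized, presumably because ref-counted nodes carry their own links.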
Compatibility / perf

The API of the new list is basically the same as the current one, except `head` and `tail` are now methods instead of fields.

Care is taken to ensure the API doesn't require unnecessary cloning of ref-counted nodes, mainly in case we ever want to add an `Arc`-based backend. For example, the `remove()` method takes an index reference (`&RcNode` when using the rc-based backend) rather than an owned index.

At the same time, we don't want to have to pass a `&usize` when using the single slab backend, as that adds unnecessary indirection. To work around this, the index reference type is made generic: for the single slab backend, the index type is `usize` and the index reference type is also `usize`, whereas for the rc-based backend, the index type is `RcNode` and the index reference type is `&RcNode`.

In theory, being able to index with a `usize` by value should let the single slab backend remain as performant as the current list implementation, which does the same, though the generified code is a bit noisy (`<Backend::Index as Index>::Ref` all over the place). The benchmarks appear to support this.
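One way to express that generic index reference is with a generic associated type; this is a sketch under that assumption, not the actual trait:

```rust
use std::rc::Rc;

// Sketch: an Index trait whose Ref type lets usize travel by value
// while rc-based handles travel by borrow.
trait Index {
    type Ref<'a>: Copy
    where
        Self: 'a;

    fn as_ref(&self) -> Self::Ref<'_>;
}

// Single-slab backend: the index and its "reference" are both usize,
// so no indirection is added.
impl Index for usize {
    type Ref<'a> = usize
    where
        Self: 'a;

    fn as_ref(&self) -> usize {
        *self
    }
}

// Rc-based backend: the "reference" is a real borrow, so passing an
// index to remove() never bumps the refcount. RcNode here is a
// stand-in for the real handle type.
struct RcNode(Rc<()>);

impl Index for RcNode {
    type Ref<'a> = &'a RcNode
    where
        Self: 'a;

    fn as_ref(&self) -> &RcNode {
        self
    }
}
```

A method like `remove()` can then accept `<Backend::Index as Index>::Ref`, which resolves to a plain `usize` for the slab backend and to `&RcNode` for the rc backend.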
Benchmarks

Some benchmarks are included that do 1000 pushes/pops against the various implementations.
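The benches are roughly this shape (a minimal sketch with a plain timing loop; `std::collections::LinkedList` stands in for the lists under test, and the real benches may use a proper harness):

```rust
use std::collections::LinkedList;
use std::time::Instant;

// Sketch of the bench shape: 1000 pushes then 1000 pops per iteration.
fn bench_push_pop(iters: u32) {
    let start = Instant::now();
    for _ in 0..iters {
        let mut list = LinkedList::new();
        for i in 0..1000 {
            list.push_back(i);
        }
        while list.pop_front().is_some() {}
    }
    println!("avg per iteration: {:?}", start.elapsed() / iters);
}

fn main() {
    bench_push_pop(10_000);
}
```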
Results on Linux:

The first benchmark is of the current implementation (single slab) and the second is of the new implementation with the single slab backend, and the numbers are very close. This makes sense: they share the same logic, and with static dispatch and good inlining they should compile down to more or less the same thing.
The rc-based benches (arena and std) are ~8x slower.
For the arena bench, the slowness is surprising: its overhead should only be a bunch of ref-counting and `RefCell` logic. Also surprising is that the std bench performs about the same as the arena bench despite layering heap operations on top of that.

TODO
Both of the benchmark surprises deserve deeper investigation. Maybe the rc logic could be further optimized. Maybe the allocator on Linux is just very fast (for small allocations?).