core: consider using an atomic linked list for the callsite registry

## Feature Request

### Crates

- `tracing-core`

### Motivation

`tracing-core`'s callsite registry stores every active dispatcher, and a `&'static` reference to every callsite, so that the callsite can be re-registered when the `Interest` cache is invalidated.

Currently, this is represented as a pair of `Vec`s inside a `Mutex`:
https://github.com/tokio-rs/tracing/blob/aa5c5239ef0fe2027988f356e863de3558d17270/tracing-core/src/callsite.rs#L16-L19

The callsite registry is accessed in a few ways:
1. New callsites are added to the registry. This happens the _first_ time a callsite is hit.
2. A `Dispatch` is added or removed from the registry. This happens whenever a `Dispatch` is created or dropped, which is infrequent — if a program only uses the global default dispatcher, a `Dispatch` is created once in the process' lifetime, and is never dropped. This will iterate over every registered callsite to register it with the new `Dispatch`.
2. The registered callsites are iterated over to invalidate the cache. Depending on the program, this is either somewhat infrequent (a filtering change is triggered by user input), or never occurs (all filter configurations are static and don't change).

Of these, adding new callsites to the registry is the hottest path. Although it is not *terribly* frequent, it still happens several orders of magnitude than iterating over the registered callsites. Because the callsite registry is behind a `Mutex`, every registration is forced to synchronize. This means that if multiple threads hit different callsites for the first time, they must all contend the lock. 

In most programs, registrations will happen more frequently at the beginning of the process' lifetime, and decrease over time as more and more callsites are hit for the first time. In long-running programs, the impact of contending the registry lock is highly amortized over the programs lifetime, and has a fairly small performance impact. However, in shorter-running programs, like CLI tools, it may be more meaningful. Also, even in longer-running programs, decreased performance at startup may be an issue.


### Proposal

We may want to consider replacing the `Mutex<Vec<...>>` representation for the callsite registry with an atomic singly-linked list. With the current implementation, _iterating_ over the registry is fast (because it's a `Vec`) but inserting must always contend the lock. A linked list is slower to iterate, but there is no `Mutex` in the insertion path. This probably makes more sense in this use case, since (as we discussed above) adding new callsites to the registry occurs much more frequently than iterating over currently registered callsites. 

Since we only need to iterate in one direction, and never *remove* from the middle of the list (or from *anywhere* in the list, in point of fact), a singly-linked list is fine. This means that we can use an atomic linked list without too much pain. We don't even need to worry about the ABA problem, because (as I mentioned) callsites are never removed, and have unique addresses. We would treat the list somewhat like a stack (except without pops), storing the pointer to the _most_ recently appended callsite, and use this as the head for both appends and iterations. This means that we don't have to link-hop in appends.

If we want to be fancy, we could even make this an intrusive list, by adding a next pointer to the callsites themselves. That way, the static callsite would store the next pointer's address, rather than putting it on the heap. Then, we wouldn't need to make any allocations. I don't think the performance impact of allocations in this code path is terribly significant — getting rid of the lock is probably good enough for the callsite registration path. However, it _does_ show up as a "leak" in tools like `valgrind`, since any allocations we make to register a callsite are never freed, and this can look scary to users who aren't aware of what this code is doing. It could be nice for this to not show up in valgrind. A side benefit is that this would also move `tracing-core` one step closer to compiling on `no_std` _without_ `liballoc`. I don't know if there's really any current demand for this (and it would take some additional work to pull off, such as making the global dispatcher work without `Arc`) but it would be cool to be able to say we _can_ do it.

Obviously, this change should be guided by benchmarking. It's possible my intuition is just wrong, and the `Mutex` is faster. Of course, we would want to benchmark this under concurrency, since that's when the `Mutex` contention might actually matter.

### Alternatives

- We could just _not_ do this. Since registering callsites only happens once, it _is_ entirely possible that there are bigger things to focus on optimizing...and they may not require writing Yet Another Atomic Linked List, too.
- We could get rid of _runtime_ callsite registration entirely, and — @nagisa will love this one, I think — use a [`#[link_section]` attribute](https://doc.rust-lang.org/reference/abi.html#the-link_section-attribute) to get the linker to put a reference to each callsite in a special section, which we would then treat as an array at runtime. This is similar to what @mystor did here: https://github.com/hawkw/mycelium/blob/03ef7cdefb2563d04c2c22219d24da32e8db387f/util/src/testing.rs#L38-L69. 
  + The main advantages of this approach are: 
    * We could iterate over the section as an array immediately when the first subscriber is created, so every callsite will already be registered the first time it's hit — no registration overhead in user code at all
    * There's no need for any concurrency control of the callsite registry, since it will already be generated for us and _never_ need to be mutated
    * Array iteration is fast
    * This would never use the heap at all
    * It's a cool hack that would make me feel like a _real_ systems programmer
  + However, it has some downsides:
    * Adds a bunch of cool new unsafe code that some people will definitely side-eye
    * I have no idea how portable this is. My understanding is that `#[link_section]` is not specific to a _particular_ linker like `lld` or GNU `ld`, but I don't know if this would work at all on exotic targets like WASM, or if someone tries to test code using `tracing` in Miri. We _could_ try to guard the whole system behind conditional compilation flags, and fall back to runtime registration if we can't guarantee that Cool Linker Tricks will work on the target platform...but that is a whole new can of worms in and of itself...


	static ref REGISTRY: Mutex<Registry> = Mutex::new(Registry {
	callsites: Vec::new(),
	dispatchers: Vec::new(),
	});

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

core: consider using an atomic linked list for the callsite registry #860

Feature Request

Crates

Motivation

Proposal

Alternatives

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

core: consider using an atomic linked list for the callsite registry #860

Description

Feature Request

Crates

Motivation

Proposal

Alternatives

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions