-
Notifications
You must be signed in to change notification settings - Fork 847
Description
Feature Request
Crates
tracing-core
Motivation
tracing-core's callsite registry stores every active dispatcher, and a &'static reference to every callsite, so that the callsite can be re-registered when the Interest cache is invalidated.
Currently, this is represented as a pair of Vecs inside a Mutex:
tracing/tracing-core/src/callsite.rs
Lines 16 to 19 in aa5c523
| static ref REGISTRY: Mutex<Registry> = Mutex::new(Registry { | |
| callsites: Vec::new(), | |
| dispatchers: Vec::new(), | |
| }); |
The callsite registry is accessed in a few ways:
- New callsites are added to the registry. This happens the first time a callsite is hit.
- A
Dispatchis added or removed from the registry. This happens whenever aDispatchis created or dropped, which is infrequent — if a program only uses the global default dispatcher, aDispatchis created once in the process' lifetime, and is never dropped. This will iterate over every registered callsite to register it with the newDispatch. - The registered callsites are iterated over to invalidate the cache. Depending on the program, this is either somewhat infrequent (a filtering change is triggered by user input), or never occurs (all filter configurations are static and don't change).
Of these, adding new callsites to the registry is the hottest path. Although it is not terribly frequent, it still happens several orders of magnitude than iterating over the registered callsites. Because the callsite registry is behind a Mutex, every registration is forced to synchronize. This means that if multiple threads hit different callsites for the first time, they must all contend the lock.
In most programs, registrations will happen more frequently at the beginning of the process' lifetime, and decrease over time as more and more callsites are hit for the first time. In long-running programs, the impact of contending the registry lock is highly amortized over the programs lifetime, and has a fairly small performance impact. However, in shorter-running programs, like CLI tools, it may be more meaningful. Also, even in longer-running programs, decreased performance at startup may be an issue.
Proposal
We may want to consider replacing the Mutex<Vec<...>> representation for the callsite registry with an atomic singly-linked list. With the current implementation, iterating over the registry is fast (because it's a Vec) but inserting must always contend the lock. A linked list is slower to iterate, but there is no Mutex in the insertion path. This probably makes more sense in this use case, since (as we discussed above) adding new callsites to the registry occurs much more frequently than iterating over currently registered callsites.
Since we only need to iterate in one direction, and never remove from the middle of the list (or from anywhere in the list, in point of fact), a singly-linked list is fine. This means that we can use an atomic linked list without too much pain. We don't even need to worry about the ABA problem, because (as I mentioned) callsites are never removed, and have unique addresses. We would treat the list somewhat like a stack (except without pops), storing the pointer to the most recently appended callsite, and use this as the head for both appends and iterations. This means that we don't have to link-hop in appends.
If we want to be fancy, we could even make this an intrusive list, by adding a next pointer to the callsites themselves. That way, the static callsite would store the next pointer's address, rather than putting it on the heap. Then, we wouldn't need to make any allocations. I don't think the performance impact of allocations in this code path is terribly significant — getting rid of the lock is probably good enough for the callsite registration path. However, it does show up as a "leak" in tools like valgrind, since any allocations we make to register a callsite are never freed, and this can look scary to users who aren't aware of what this code is doing. It could be nice for this to not show up in valgrind. A side benefit is that this would also move tracing-core one step closer to compiling on no_std without liballoc. I don't know if there's really any current demand for this (and it would take some additional work to pull off, such as making the global dispatcher work without Arc) but it would be cool to be able to say we can do it.
Obviously, this change should be guided by benchmarking. It's possible my intuition is just wrong, and the Mutex is faster. Of course, we would want to benchmark this under concurrency, since that's when the Mutex contention might actually matter.
Alternatives
- We could just not do this. Since registering callsites only happens once, it is entirely possible that there are bigger things to focus on optimizing...and they may not require writing Yet Another Atomic Linked List, too.
- We could get rid of runtime callsite registration entirely, and — @nagisa will love this one, I think — use a
#[link_section]attribute to get the linker to put a reference to each callsite in a special section, which we would then treat as an array at runtime. This is similar to what @mystor did here: https://github.com/hawkw/mycelium/blob/03ef7cdefb2563d04c2c22219d24da32e8db387f/util/src/testing.rs#L38-L69.- The main advantages of this approach are:
- We could iterate over the section as an array immediately when the first subscriber is created, so every callsite will already be registered the first time it's hit — no registration overhead in user code at all
- There's no need for any concurrency control of the callsite registry, since it will already be generated for us and never need to be mutated
- Array iteration is fast
- This would never use the heap at all
- It's a cool hack that would make me feel like a real systems programmer
- However, it has some downsides:
- Adds a bunch of cool new unsafe code that some people will definitely side-eye
- I have no idea how portable this is. My understanding is that
#[link_section]is not specific to a particular linker likelldor GNUld, but I don't know if this would work at all on exotic targets like WASM, or if someone tries to test code usingtracingin Miri. We could try to guard the whole system behind conditional compilation flags, and fall back to runtime registration if we can't guarantee that Cool Linker Tricks will work on the target platform...but that is a whole new can of worms in and of itself...
- The main advantages of this approach are: