Description
The self-profiling feature is going to make profiling the compiler's performance a lot easier. However, a recent first stab at collecting more detailed information (see #58085) still has too much overhead.
Here are some of the things that could be improved:
- Move post-processing of the collected data out of the `rustc` process as much as possible. `SelfProfiler::get_results()` does a lot of work to generate the statistics from the collected events. All of this should probably be moved to a separate tool that runs after profiling is done.
- Reduce the amount of dispatch and locking that needs to be done for each event. For each event we have to get exclusive access to the profiler (`RefCell`/`parking_lot` mutex) and then look up the event stream for the current thread in an `FxHashMap`. This should probably be solved via thread-local data somehow.
- Reduce the size of events. Events are quite big (32 bytes on `x86_64` would be my guess). The timestamp can be reduced to 64 bits if we just measure the time from process start. The `&str` containing the query name can be replaced by a 4-byte tag.
- Persist events to disk in a binary format. We should probably open a memory-mapped file per thread that we write events to directly. If events don't contain pointers, they can be written to disk verbatim. The post-processing tool can then convert them to something platform-independent.
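To make the last three points concrete, here is a rough sketch of how a compact event, a thread-local event buffer, and a verbatim binary encoding could fit together. Everything in it is an assumption for illustration: `RawEvent`, `record`, `encode_thread_events`, and the start/end `kind` encoding are made-up names, not existing rustc APIs, and a plain `Vec<u8>` stands in for the memory-mapped file.

```rust
use std::cell::RefCell;
use std::mem::size_of;
use std::sync::OnceLock;
use std::time::Instant;

/// Hypothetical compact event: 16 bytes, no pointers, so it can be
/// copied to disk as-is.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct RawEvent {
    /// Nanoseconds since the profiler was first used
    /// (a stand-in for "time since process start").
    timestamp_ns: u64,
    /// 4-byte tag that replaces the `&str` query name.
    query_tag: u32,
    /// Hypothetical encoding: 0 = query start, 1 = query end.
    kind: u32,
}

/// Reference point for all timestamps, initialized on first use.
static START: OnceLock<Instant> = OnceLock::new();

thread_local! {
    /// Each thread owns its event stream: no mutex and no hash map
    /// lookup on the recording path.
    static EVENTS: RefCell<Vec<RawEvent>> = RefCell::new(Vec::new());
}

/// Record one event into the current thread's buffer.
fn record(query_tag: u32, kind: u32) {
    let start = *START.get_or_init(Instant::now);
    let timestamp_ns = start.elapsed().as_nanos() as u64;
    EVENTS.with(|events| {
        events.borrow_mut().push(RawEvent { timestamp_ns, query_tag, kind });
    });
}

/// Encode the current thread's events in native byte order, field by
/// field. A real implementation would write into a memory-mapped file.
fn encode_thread_events() -> Vec<u8> {
    EVENTS.with(|events| {
        let events = events.borrow();
        let mut out = Vec::with_capacity(events.len() * size_of::<RawEvent>());
        for e in events.iter() {
            out.extend_from_slice(&e.timestamp_ns.to_ne_bytes());
            out.extend_from_slice(&e.query_tag.to_ne_bytes());
            out.extend_from_slice(&e.kind.to_ne_bytes());
        }
        out
    })
}
```

Calling `record(tag, 0)` and `record(tag, 1)` around a query and then `encode_thread_events()` yields 16 bytes per event. Because the bytes are in native order, the post-processing tool would be the place to convert them to a platform-independent form.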
Some time soon we also want to record query keys per event. This can already be done efficiently by storing the 32-bit `DepNodeIndex` that corresponds to a query (which also obviates the need to store the query name in each event). However, in order for the `DepNodeIndex` to be useful, we'll need to create and persist a mapping of `DepNodeIndex -> String` at some point before the `tcx` is destroyed (i.e. in the middle of the compilation process). I expect that creating this map will not be entirely cheap :/
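As a sketch of what persisting such a mapping might look like, assuming a simple length-prefixed layout: the `DepNodeIndex` newtype below is only a stand-in for the real rustc type, and `IndexToString` and its on-disk format are hypothetical, not an existing design.

```rust
use std::collections::BTreeMap;

/// Stand-in for rustc's 32-bit `DepNodeIndex` (not the real type).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct DepNodeIndex(u32);

/// Hypothetical table mapping each index to a readable query key.
#[derive(Default)]
struct IndexToString {
    names: BTreeMap<DepNodeIndex, String>,
}

impl IndexToString {
    fn insert(&mut self, index: DepNodeIndex, description: &str) {
        self.names.insert(index, description.to_owned());
    }

    fn get(&self, index: DepNodeIndex) -> Option<&str> {
        self.names.get(&index).map(|s| s.as_str())
    }

    /// Assumed layout per entry:
    /// [index: u32 LE][length: u32 LE][UTF-8 bytes].
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::new();
        for (index, name) in &self.names {
            out.extend_from_slice(&index.0.to_le_bytes());
            out.extend_from_slice(&(name.len() as u32).to_le_bytes());
            out.extend_from_slice(name.as_bytes());
        }
        out
    }

    /// Parse the layout above; returns `None` on truncated or
    /// non-UTF-8 input.
    fn from_bytes(mut data: &[u8]) -> Option<Self> {
        let mut names = BTreeMap::new();
        while !data.is_empty() {
            if data.len() < 8 {
                return None;
            }
            let index = u32::from_le_bytes([data[0], data[1], data[2], data[3]]);
            let len = u32::from_le_bytes([data[4], data[5], data[6], data[7]]) as usize;
            let rest = &data[8..];
            if rest.len() < len {
                return None;
            }
            let text = std::str::from_utf8(&rest[..len]).ok()?;
            names.insert(DepNodeIndex(index), text.to_owned());
            data = &rest[len..];
        }
        Some(IndexToString { names })
    }
}
```

The compiler would fill the table while the `tcx` is still alive and write `to_bytes()` next to the event files. The post-processing tool would then call `from_bytes()` to resolve the indices stored in events. The cost concern above still applies, since building the strings is the expensive part and this sketch does nothing to reduce it.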
@Mark-Simulacrum, as you can see the whole workflow around self-profiling will change quite a bit, so I think it's too early to add infrastructure for it to perf.rlo just yet.
cc @wesleywiser @nnethercote (and @rust-lang/wg-compiler-performance for good measure)