Description
While working on #61486 / #87194 I came across backrefs in byte array values (with repeating elements), which I hadn't thought about (in terms of implications on optimizing the length of a symbol), but which is otherwise entirely expected. In order to demonstrate an edge case, I encoded some...longer...path::<{ &[0, 0, 0, 0] }>
to get KRAh0_B12_B12_B12_E
, which is parsed as follows:
K // const generic
R // &
A // [
h0_ // 0u8
B12_ // backref to "h0_" / `0u8`
B12_ // backref to "h0_" / `0u8`
B12_ // backref to "h0_" / `0u8`
E // ]
Because this is a long enough symbol, the backrefs end up being (at least) 4 bytes each (to encode positions greater than 62), whereas an integer leaf constant with a 0..=15
value gets encoded as 3 bytes, so the backrefs are each adding an extra byte (and making more pointless work for the demangler).
We could avoid this by tracking the range, not just the start, of an encoded path/type/const, and one of:
- skip caching the backref position when encoding that position would require more bytes than the inline size
- record the range and make
print_backref
decide between actually printing the backref, and duplicating the contents of that range, inline, depending on which would be shorter
However, this isn't an urgent concern IMO because of the combination of these aspects:
- the encoder gets to choose whether to use a backref or to repeat the encoding inline, we don't have to update decoders if we change our backref generation heuristic
- most backrefs are either 3 or 4 bytes (for positions up to
62
or62² = 3844
, respectively) - we already special-case one-character leaves, to never record themselves for backrefs, so only less-trivial path/type/const syntax, which will produce 2 bytes or more, is relevant
- paths have to include a crate root leaf (at least ~16 bytes) or a backref (at least 3 bytes), so the shortest possible paths are 5 bytes (e.g. inherent
impl
on a builtin type, using a backref for theimpl
parent module)- this further implies that only types/consts that don't reference paths can be 2-3 bytes (e.g.
&str
isRe
,[u8]
isSh
,fn()
isFEu
, etc.)
- this further implies that only types/consts that don't reference paths can be 2-3 bytes (e.g.
It might be interesting to measure the size difference across some larger projects (including rustc
and Cargo), by making this change (even inefficiently), after v0 becomes the default, but I don't expect a difference bigger than a handful of bytes per symbol, if that.