Skip to content

v0 mangling should avoid backrefs when they're not shorter than their target. #87511

Open
@eddyb

Description

@eddyb

While working on #61486 / #87194 I came across backrefs in byte array values (with repeating elements), which I hadn't thought about (in terms of implications on optimizing the length of a symbol), but which is otherwise entirely expected. In order to demonstrate an edge case, I encoded some...longer...path::<{ &[0, 0, 0, 0] }> to get KRAh0_B12_B12_B12_E, which is parsed as follows:

K       // const generic
 R      //  &
  A     //   [
   h0_  //    0u8
   B12_ //    backref to "h0_" / `0u8`
   B12_ //    backref to "h0_" / `0u8`
   B12_ //    backref to "h0_" / `0u8`
  E     //   ]

Because this is a long enough symbol, the backrefs end up being (at least) 4 bytes each (to encode positions greater than 62), whereas an integer leaf constant with a 0..=15 value gets encoded as 3 bytes, so the backrefs are each adding an extra byte (and making more pointless work for the demangler).

We could avoid this by tracking the range, not just the start, of an encoded path/type/const, and one of:

  • skip caching the backref position when encoding that position would require more bytes than the inline size
  • record the range and make print_backref decide between actually printing the backref, and duplicating the contents of that range, inline, depending on which would be shorter

However, this isn't an urgent concern IMO because of the combination of these aspects:

  • the encoder gets to choose whether to use a backref or to repeat the encoding inline, we don't have to update decoders if we change our backref generation heuristic
  • most backrefs are either 3 or 4 bytes (for positions up to 62 or 62² = 3844, respectively)
  • we already special-case one-character leaves, to never record themselves for backrefs, so only less-trivial path/type/const syntax, which will produce 2 bytes or more, is relevant
  • paths have to include a crate root leaf (at least ~16 bytes) or a backref (at least 3 bytes), so the shortest possible paths are 5 bytes (e.g. inherent impl on a builtin type, using a backref for the impl parent module)
    • this further implies that only types/consts that don't reference paths can be 2-3 bytes (e.g. &str is Re, [u8] is Sh, fn() is FEu, etc.)

It might be interesting to measure the size difference across some larger projects (including rustc and Cargo), by making this change (even inefficiently), after v0 becomes the default, but I don't expect a difference bigger than a handful of bytes per symbol, if that.

cc @michaelwoerister

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-linkageArea: linking into static, shared libraries and binariesC-enhancementCategory: An issue proposing an enhancement or a PR with one.I-heavyIssue: Problems and improvements with respect to binary size of generated code.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions