Skip to content

Streamline Symbol, InternedString, and LocalInternedString. #60869

Closed
@nnethercote

Description

We currently have three closely-related symbol types.

Symbol is the fundamental type. A Symbol is an index. All operations work on that index. StableHash is not implemented for it, but there's no reason why it couldn't be. A Symbol can be a gensym, which gets special treatment -- it's a guaranteed unique index, even if its chars have been seen before.

InternedString is a thin wrapper around Symbol. You can convert a Symbol to an InternedString. It has two differences with Symbol.

  • Its PartialOrd/Ord/Hash impls use the chars, rather than the index.
  • Gensym-ness is ignored/irrelevant.

LocalInternedString is an alternative that contains a &str. You can convert both Symbol and InternedString to LocalInternedString. Its PartialOrd/Ord/Hash impls (plus PartialEq/Eq) naturally work on chars. Its main use is to provide a way to look some or all of the individual chars within a Symbol or InternedString, which is sometimes necessary.

I have always found the differences between these types confusing and hard to remember. Furthermore, the distinction between Symbol and InternedString is subtle and has caused
bugs.

Also, gensyms in general make things a lot more complicated, and it would be great to eliminate them.

Here's what I would like as a final state.

  • Symbol exists.
  • InternedString does not exist.
  • LocalInternedString perhaps exists, but is only used temporarily when code needs access to the chars within a Symbol. Alternatively, Symbol could provide a with() method (like InternedString currently has) that provides access to the chars, and then LocalInternedString wouldn't be needed.
  • Symbol's impl of Hash uses the index, and its impl of StableHash uses the chars.
  • Not sure about Symbol's impl of PartialOrd/Ord. If a stable ordering is really needed (perhaps for error messages?) we could introduce a StableOrd trait and use that in the relevant places, or do a custom sort, or something.
  • Gensyms don't really exist. They are simulated: when you call gensym(), it just appends a unique suffix. It's worth noting that gensyms are always identifiers, and so the unique suffix can use a non-identifier char. And Interner could keep a counter. So "foo" would gensym to something lke "foo$1", "foo$2", etc. Once the suffix is added, they would just be treated as normal symbols (in terms of hashing, etc.) I would hope that identifier gensyms would never be compared with non-identifier symbols, so a false positive equality match should be impossible. (Different types for identifier symbols and non-identifier symbols would protect against that, but might cause other difficulties.) Alternatively, syntax_pos::symbol::Symbol::gensym() is incompatible with stable hashing. #49300 talks about other ways of dealing with gensyms.
  • All this should also help performance, because we'd end up with more operations on indexes, and only the necessary ones on chars (which require TLS lookups).

I haven't even touched on the way lifetimes work in the interner, which are subtle and error-prone. But having fewer types would only make improvements on that front simpler.

Thoughts?

CC @petrochenkov @Zoxc @eddyb @Mark-Simulacrum @michaelwoerister

Metadata

Assignees

No one assigned

    Labels

    C-enhancementCategory: An issue proposing an enhancement or a PR with one.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions