Skip to content

Use pointer hash rather than str hash when interning strings #969

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

tgnottingham
Copy link
Contributor

Guarantee that ArenaStrs are interned so that hashing can be
pointer-based and interning can be more efficient.

Rename ArenaStr to InternedArenaStr.

Guarantee that `ArenaStr`s are interned so that hashing can be
pointer-based and interning can be more efficient.
@tgnottingham
Copy link
Contributor Author

The difference this makes is small, admittedly, but since we're already interning for efficiency, may as well make it a little more so.

This mainly moves the bulk of preloaded and intern into InternedArenaStr static methods to guarantee that any instances are interned, and updates Hash, PartialEq, and Eq implementations appropriately. The Hash implementation for InternedArenaStr is the one that was changed from being str-based to pointer-based.

@tgnottingham
Copy link
Contributor Author

CI issues seem to be unrelated to the change, as the same thing is happening on another PR. A bunch of No such file errors.

[2021-08-15T23:31:57Z INFO  cargo::core::compiler::fingerprint] fingerprint error for await-call-tree v0.1.0 (/tmp/.tmpDpl4hN)/Doc { deps: false }/TargetInner { ..: lib_target("await-call-tree", ["lib"], "/tmp/.tmpDpl4hN/src/lib.rs", Edition2018) }
[2021-08-15T23:31:57Z INFO  cargo::core::compiler::fingerprint]     err: failed to read `/tmp/.tmpDpl4hN/target/debug/.fingerprint/await-call-tree-f7705a294dd743d7/doc-lib-await-call-tree`
    
    Caused by:
        No such file or directory (os error 2)

@Mark-Simulacrum
Copy link
Member

Those fingerprint errors are normal; the real problem is in piston-image somehow. My guess is a recent change to fingerprints/depinfo in rustc or cargo is causing this (it's also failing on master builds - https://perf.rust-lang.org/status.html)

Copy link
Member

@rylev rylev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally feel a little uncomfortable with the usage of unsafe in this crate especially considering the performance implications aren't really clear to me. However, that predates this PR so we shouldn't block on that. In the meantime, I'd much prefer if we could thoroughly document the usage of unsafe.

@@ -9,29 +9,21 @@ use std::ptr::NonNull;
use std::sync::Arc;

pub trait InternString {
unsafe fn to_interned(s: ArenaStr) -> Self;
unsafe fn to_interned(s: InternedArenaStr) -> Self;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this wasn't documented before, but can we add documentation on the safety requirements for this unsafe function?

= (ArcSwap::new(Arc::new(HashSet::new())), Mutex::new((HashSet::new(), Bump::new())));
}

pub fn preloaded<T: InternString>(value: &str) -> Option<T> {
let set = INTERNED.0.load();
unsafe { Some(T::to_interned(*set.get(value)?)) }
InternedArenaStr::preloaded(value).map(|s| unsafe { T::to_interned(s) })
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, it would be great to document the use of unsafe and why this particular usage is safe. We should make sure we document all usages of unsafe thoroughly.

@rylev
Copy link
Member

rylev commented Aug 16, 2021

@Mark-Simulacrum @tgnottingham Should we file an issue in rust-lang/rust for the fingerprinting issue?

@Mark-Simulacrum
Copy link
Member

Yeah, that seems good -- worth additional investigation.

@tgnottingham
Copy link
Contributor Author

I just noticed that this change is incorrect due to InternedArenaStr implementing Borrow<str> but not implementing the Hash, etc. traits equivalently to str. This breaks interning in the sense that it results in multiple copies of the same string being allocated.

I think it was also incorrect before this change, because PartialEq wasn't equivalent with str's, plus PartialEq and Hash were inconsistent (former was pointer-based, latter was str-based). Interestingly, this didn't seem to break interning. I assume this is just due to implementation details, and that the inconsistency should be fixed, but please correct me if I'm misunderstanding.

I'll close this and address the above issue in another PR if necessary. I'll also look at documenting the unsafe code.

@Mark-Simulacrum
Copy link
Member

Yeah I think I agree with @rylev that we should see about dropping the interning -- it was added back when we were still loading all of our data on each start up, and I'd guess is significantly much less necessary/effective today. I think some quick experiments to check whether it's necessary would be good.

@tgnottingham
Copy link
Contributor Author

Opened #974 for reevaluating interning.

@tgnottingham tgnottingham deleted the interned-arena-str-hashing-perf branch August 16, 2021 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants