Caching hashes #591

Closed · Solumin opened this issue Nov 21, 2023 · 6 comments · Fixed by #596



Solumin commented Nov 21, 2023

Description

The Use Case

We have an AST that needs to be serialized. Currently we're defining the AST classes using attrs and serializing it with pickle, and we're in the middle of moving to msgspec for both aspects. (We really want the speed of Struct, and we need tagged unions.)

When processing the AST, we do a lot of hashing: hundreds of thousands of __hash__() calls. With attrs this is easily solved by the cache_hash option, which cuts the number of hash computations significantly and saves us a good chunk of time.
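
For reference, the attrs feature we're leaning on looks roughly like this (illustrative class, not our real AST):

import attrs

@attrs.define(frozen=True, cache_hash=True)
class Name:
    first: str
    last: str

n = Name("Ada", "Lovelace")
hash(n)  # computed once and stored on the instance
hash(n)  # later calls reuse the stored value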

The Feature

We'd really like msgspec to provide something like cache_hash. I think it should be easily implementable once private fields (#199) are available: _hash (or similar) would be a private field that's calculated when the struct is initialized, and __hash__() would just return _hash. (Or it could be calculated the first time __hash__() is called. Doesn't matter much, I think.)
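
To make the intent concrete, here's a rough plain-Python sketch of those semantics (illustrative only, not msgspec API; the Node class and _hash attribute are made up):

class Node:
    def __init__(self, name: str, position: int):
        self.name = name
        self.position = position
        self._hash = None  # private; should never be encoded or decoded

    def __hash__(self) -> int:
        if self._hash is None:  # computed lazily on the first call
            self._hash = hash((type(self).__name__, self.name, self.position))
        return self._hash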

It would also be nice if this was usable with a custom __hash__() implementation, as the default implementation does not work well for us.

I think this would be another configuration option, like frozen. Alternatively, it could be enabled by default when frozen=True.

This is almost implementable by users already, except that frozen Structs can't be mutated (even in __post_init__) and the cached hash shouldn't be encoded/decoded.


jcrist commented Nov 25, 2023

Thanks for opening this; this use case makes sense to me. I think we should be able to implement cache_hash matching attrs semantics in a fairly straightforward way.


Solumin commented Nov 25, 2023

I feel like I should expand on this a bit more:

It would also be nice if this was usable with a custom __hash__() implementation, as the default implementation does not work well for us.

We have some Structs that have the same structure, e.g.:

import msgspec

class A(msgspec.Struct, frozen=True):  # frozen=True makes instances hashable
    name: str
    position: int

class B(msgspec.Struct, frozen=True):
    name: str
    offset: int

assert hash(A("example", 0)) == hash(B("example", 0))

The two hashes are identical, which is not ideal. It's not a hard requirement that two objects that are not equal have different hashes, but it's something we try to avoid for performance reasons. We've gotten around this with custom __hash__ functions that also hash self.__class__.__name__ alongside the instance's attributes. (This is also more or less what attrs does.)
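
A rough plain-Python sketch of that scheme (illustrative, not our actual classes):

class A:
    def __init__(self, name: str, position: int):
        self.name = name
        self.position = position

    def __eq__(self, other) -> bool:
        return type(other) is type(self) and (
            (self.name, self.position) == (other.name, other.position)
        )

    def __hash__(self) -> int:
        # Mixing in the class name keeps A("example", 0) and a structurally
        # identical B("example", 0) from hashing to the same value.
        return hash((self.__class__.__name__, self.name, self.position))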


jcrist commented Nov 27, 2023

The two hashes are identical, which is not ideal. It's not a hard requirement that two objects that are not equal have different hashes, but it's something we try to avoid for performance reasons.

Makes sense. Using disparate types as keys all in the same set/dict is not a use case I had in mind when I wrote the original hash implementation, but it's one we can support fairly easily. This should be fixed by #595 (I'll address cache_hash in a later PR).


Solumin commented Nov 27, 2023

Thank you for addressing this so quickly! I really appreciate it.


jcrist commented Dec 13, 2023

This has now been released as part of version 0.18.5.


Solumin commented Dec 21, 2023

Some quick benchmarks show that our tool is now 5-10% faster with msgspec than with attrs+pickle. Thank you again!!
