fingerprinting base case handles empty __dict__ #1243
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #1242; copying reply from the issues thread
Problem
Using the cache with these two functions would create a collision and both would return
pd.Timestamp("2021-01-01")on the 2nd execution (i.e., when retrieving values from cache)Solution
The source of the bug is an oddity of
pandas. Some objects have an empty__dict__attached. A one line condition check now properly displays the intended warning messages.When encountering this case, Hamilton gives a random UUID instead of hashing the value. Caching can still work because the cache key is
NODE_NAME-CODE_VERSION-DATA_VERSIONwhere data version is now a random UUIDInvestigation
dr.cache.code_versionsshow thatfirst()andsecond()are produce differentcode_versionhashdr.cache.data_versionsproduce the samedata_versionhash (source of the bug)hamilton.caching.fingerprintinghash_pandas_obj()expectspd.Seriesorpd.DataFrameand doesn't handlepd.Timestamp(it would handle a series ofpd.Timestampvalues without collisions though)data_versionsmatch the result of the default implementationhash_value()and falls underhash_value(pd.Timestamp(...).__dict__)pd.Timestamp(...).__dict__ == {}, which is an odd behavior by pandas (it shouldn't be defined instead of empty)hash_value()handles objects without a__dict__(i.e., uses slots), but doesn't account for empty.__dict__attributeside notes