Use numpy representation for hashing #436
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
ChristianGeng left a comment
Review
As we need a way to identify objects, and the introduction of parquet brings non-deterministic compression, we can no longer use md5sums "on disk" for that purpose.
In order not to depend on pandas-specific implementation details in `pd.util.hash_pandas_object`, this MR introduces md5-based hashing that uses numpy representations and the Python builtin `hashlib.md5` directly.
I checked the oldest numpy version compatible with pandas 2.2.2 (1.22.4), and the line `md5.update(bytes(str(y.to_numpy()), "utf-8"))` does not introduce any changes. So my understanding is that this is safe, and therefore I will approve.
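For reference, a minimal sketch of the check described above; the helper name `series_md5` is illustrative and not taken from the patch:

```python
import hashlib

import pandas as pd


def series_md5(y: pd.Series) -> str:
    md5 = hashlib.md5()
    # The line under discussion: hash the string form
    # of the column's numpy representation
    md5.update(bytes(str(y.to_numpy()), "utf-8"))
    return md5.hexdigest()


print(series_md5(pd.Series([0.1, 0.2, 0.3])))
```

Running this under numpy 1.22.4 and under a current numpy release should produce the same digest, which is what the compatibility check above verified.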
* Ensure correct boolean dtype in misc table index
* Remove unneeded code
* Use pyarrow to read CSV files
* Start debugging
* Continue debugging
* Fix tests
* Remove unneeded code
* Improve code
* Fix test for older pandas versions
* Exclude benchmark folder from tests
* Test other implementation
* Remove support for Python 3.8
* Store tables as PARQUET
* Cleanup code + Table.levels
* Use dict for CSV dtype mappings
* Rename helper function
* Simplify code
* Add helper function for CSV schema
* Fix typo in docstring
* Remove levels attribute
* Merge stash
* Remove levels from doctest output
* Convert method to property
* Add comment
* Simplify code
* Simplify code
* Add test for md5sum of parquet file
* Switch back to snappy compression
* Fix linter
* Store hash inside parquet file
* Fix code coverage
* Stay with CSV as default table format
* Test pyarrow==15.0.2
* Test pyarrow==14.0.2
* Test pyarrow==13.0
* Test pyarrow==12.0
* Test pyarrow==11.0
* Test pyarrow==10.0
* Test pyarrow==10.0.1
* Require pyarrow>=10.0.1
* Test pandas<2.1.0
* Add explanations for requirements
* Add test using minimum pip requirements
* Fix alphabetical order of requirements
* Enhance test matrix definition
* Debug failing test
* Test different hash method
* Use different hashing approach
* Require pandas>=2.2.0 and fix hashes
* CI: re-enable all minimal requirements
* Hashing algorithm to respect row order
* Clean up tests
* Fix minimum install of audiofile
* Fix docstring of Table.load()
* Fix docstring of Database.load()
* Ensure correct order in time when storing tables
* Simplify comment
* Add docstring to _load_pickle()
* Fix _save_parquet() docstring
* Improve comment in _dataframe_hash()
* Document arguments of test_table_update...
* Relax test for table saving order
* Update audformat/core/table.py

  Co-authored-by: ChristianGeng <christian.c.geng@gmail.com>

* Revert "Update audformat/core/table.py"

  This reverts commit 3f21e3c.

* Use numpy representation for hashing (#436)

  * Use numpy representation for hashing
  * Enable tests and require pandas>=1.4.1
  * Use numpy<2.0 in minimum test
  * Skip doctests in minimum
  * Require pandas>=2.1.0
  * Require numpy<=2.0.0 in minimum test
  * Remove print statements
  * Fix numpy<2.0.0 for minimum test
  * Remove max_rows argument
  * Simplify code
  * Use test class
  * CI: remove pyarrow from branch to start test

---------

Co-authored-by: ChristianGeng <christian.c.geng@gmail.com>
This change is in response to a comment from #419 (review) and has as target branch `pyarrow` from #419.

Response to the question
The result of the hashing has indeed changed recently in pandas, see pandas-dev/pandas#58999. That is the reason why I require `pandas>2.2.0` in #419, as with lower versions of pandas you get a different hash. I also find this worrisome, as it suggests they might change it again in the future.
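For context, a rough sketch of the kind of hash this replaces, assuming the output of `pandas.util.hash_pandas_object()` is serialized with `pickle.dumps()` as described for #419; the exact composition used there may differ:

```python
import hashlib
import pickle

import pandas as pd


def dataframe_md5_old(df: pd.DataFrame) -> str:
    # Per-row uint64 hashes computed by pandas; their values depend on
    # pandas internals and changed with pandas 2.2.x (see the issue above)
    row_hashes = pd.util.hash_pandas_object(df)
    return hashlib.md5(pickle.dumps(row_hashes.to_numpy())).hexdigest()
```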
I also checked what changed between pandas 2.1.x and 2.2.x inside `pandas/core/util/hashing.py`, but it's only related to the documentation. Which means the change in the resulting hash with pandas 2.2.x must be related to some restructuring/fixes of the dataframes themselves.

Description of proposed changes
Instead of relying on the solution introduced in #419 with `pandas.util.hash_pandas_object()` and `pickle.dumps()`, which might also change, I suggest here to hash each column separately, by first converting it to a numpy array with `pandas.Series.to_numpy()` and then hashing its string representation (see the sketch below).

The behavior of `pandas.Series.to_numpy()` might change over time, and did already for pandas 2.2.x. We can handle that by adding custom code, as I do here for `Int64` dtypes. The same goes for the string representation of a numpy array, which might also change over time.

I could also lower the dependency for pandas to `pandas>=2.1.0`.

The new hash calculation is slightly slower than before: storing the big table to CSV now takes 5.0 s instead of 4.2 s in #419.
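A minimal sketch of the proposed column-wise scheme, in the spirit of the `_dataframe_hash()` helper mentioned in the commit log; the `Int64` normalization and the print-options guard below are illustrative additions, not the actual code from this MR:

```python
import hashlib
import sys

import numpy as np
import pandas as pd


def dataframe_md5(df: pd.DataFrame) -> str:
    """Hash a dataframe column by column via its numpy representation."""
    md5 = hashlib.md5()
    # Avoid numpy's "..." summarization of long arrays, which could make
    # different large columns produce the same string
    with np.printoptions(threshold=sys.maxsize):
        # reset_index() turns the index levels into columns,
        # so the index contributes to the hash as well
        for _, column in df.reset_index().items():
            if column.dtype == "Int64":
                # Placeholder for the custom Int64 handling mentioned above;
                # the actual normalization in this MR may differ
                column = column.astype("object")
            md5.update(bytes(str(column.to_numpy()), "utf-8"))
    return md5.hexdigest()
```

Hashing column by column via `to_numpy()` keeps the digest independent of pandas' internal row-hashing, at the cost of the small slowdown noted above.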