Conversation

@hagenw hagenw commented Jun 19, 2024

This change is in response to the question

The hash key's value seems kind of generic, but I wonder whether it has only been introduced recently, or whether it changes sometimes?

from #419 (review) and targets the pyarrow branch from #419.

Response to the question

The result of the hashing has indeed changed recently in pandas, see pandas-dev/pandas#58999. That is why I require pandas>2.2.0 in #419, as lower pandas versions produce a different hash.
I also find this worrisome, as it might indicate that the hashing will change again in the future.

I also checked what changed between pandas 2.1.x and 2.2.x inside pandas/core/util/hashing.py, but the changes only concern the documentation:

3a4
> 
108c109,110
<     Series of uint64, same length as the object
---
>     Series of uint64
>         Same length as the object.
244a247
>         The input array to hash.
256a260,264
> 
>     See Also
>     --------
>     util.hash_pandas_object : Return a data hash of the Index/Series/DataFrame.
>     util.hash_tuples : Hash an MultiIndex / listlike-of-tuples efficiently.

This means the change in the resulting hash with pandas 2.2.x must be related to some restructuring/fixes of the dataframes themselves.

Description of proposed changes

Instead of relying on the solution introduced in #419, which uses pandas.util.hash_pandas_object() and pickle.dumps() and might also change, I suggest to hash each column separately, by first converting it to a numpy array with pandas.Series.to_numpy() and then hashing its string representation.

The behavior of pandas.Series.to_numpy() might change over time (and already did for pandas 2.2.x). We can handle that by adding custom code, as I do here for Int64 dtypes. The same goes for the string representation of a numpy array, which might also change over time. A minimal sketch of the approach is shown below.
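
For illustration, a minimal sketch of this column-wise hashing could look as follows (the function name and the Int64 branch are placeholders; the actual helper in this pull request, _dataframe_hash(), may differ in detail):

import hashlib

import pandas as pd


def dataframe_hash(df: pd.DataFrame) -> str:
    # Build an MD5 hash from the numpy string representation of each column,
    # including the index columns via reset_index()
    md5 = hashlib.md5()
    for _, column in df.reset_index().items():
        if str(column.dtype) == "Int64":
            # Placeholder for the dtype-specific handling mentioned above;
            # the exact conversion used in the pull request may differ
            column = column.astype("object")
        # Note: numpy truncates the repr of very long arrays by default,
        # so a real implementation has to take care of that as well
        md5.update(bytes(str(column.to_numpy()), "utf-8"))
    return md5.hexdigest()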

I could also lower the pandas dependency to pandas>=2.1.0.

The new hash calculation is slightly slower than before. Storing the big table to CSV now takes 5.0 s instead of 4.2 s in #419.

@hagenw hagenw marked this pull request as draft June 19, 2024 12:05

codecov bot commented Jun 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.0%. Comparing base (2912f76) to head (575d2fe).

Additional details and impacted files
Files                   | Coverage Δ
audformat/core/table.py | 100.0% <100.0%> (ø)

@hagenw hagenw marked this pull request as ready for review June 19, 2024 13:14

@ChristianGeng ChristianGeng left a comment

Review

As we need a way to identify objects, and the introduction of parquet favors non-deterministic compression, we can no longer use md5sums "on disk" for that purpose.

In order not to depend on pandas-specific implementation details in pd.util.hash_pandas_object, this MR introduces md5-based hashing that uses numpy representations and the Python builtin hashlib.md5 directly.

I checked the oldest numpy version compatible with pandas 2.2.2 (1.22.4), and the line

md5.update(bytes(str(y.to_numpy()), "utf-8"))

does not introduce any changes there. So my understanding is that this is safe, and therefore I will approve.
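
As a rough illustration of that check (the column name and values below are made up), the resulting digest only depends on the numpy string representation:

import hashlib

import pandas as pd

# A hypothetical nullable-integer column, similar to the ones hashed in audformat
y = pd.Series([0, 1, 2], dtype="Int64", name="age")

md5 = hashlib.md5()
md5.update(bytes(str(y.to_numpy()), "utf-8"))
print(md5.hexdigest())  # identical as long as str(y.to_numpy()) does not change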

@hagenw hagenw merged commit c4c41ff into pyarrow Jun 19, 2024
@hagenw hagenw deleted the pyarrow-better-hashing branch June 19, 2024 14:23
hagenw added a commit that referenced this pull request Jun 19, 2024
* Ensure correct boolean dtype in misc table index

* Remove unneeded code

* Use pyarrow to read CSV files

* Start debugging

* Continue debugging

* Fix tests

* Remove unneeded code

* Improve code

* Fix test for older pandas versions

* Exclude benchmark folder from tests

* Test other implementation

* Remove support for Python 3.8

* Store tables as PARQUET

* Cleanup code + Table.levels

* Use dict for CSV dtype mappings

* Rename helper function

* Simplify code

* Add helper function for CSV schema

* Fix typo in docstring

* Remove levels attribute

* Merge stash

* Remove levels from doctest output

* Convert method to property

* Add comment

* Simplify code

* Simplify code

* Add test for md5sum of parquet file

* Switch back to snappy compression

* Fix linter

* Store hash inside parquet file

* Fix code coverage

* Stay with CSV as default table format

* Test pyarrow==15.0.2

* Test pyarrow==14.0.2

* Test pyarrow==13.0

* Test pyarrow==12.0

* Test pyarrow==11.0

* Test pyarrow==10.0

* Test pyarrow==10.0.1

* Require pyarrow>=10.0.1

* Test pandas<2.1.0

* Add explanations for requirements

* Add test using minimum pip requirements

* Fix alphabetical order of requirements

* Enhance test matrix definition

* Debug failing test

* Test different hash method

* Use different hashing approach

* Require pandas>=2.2.0 and fix hashes

* CI: re-enable all minimal requirements

* Hashing algorithm to respect row order

* Clean up tests

* Fix minimum install of audiofile

* Fix docstring of Table.load()

* Fix docstring of Database.load()

* Ensure correct order in time when storing tables

* Simplify comment

* Add docstring to _load_pickle()

* Fix _save_parquet() docstring

* Improve comment in _dataframe_hash()

* Document arguments of test_table_update...

* Relax test for table saving order

* Update audformat/core/table.py

Co-authored-by: ChristianGeng <christian.c.geng@gmail.com>

* Revert "Update audformat/core/table.py"

This reverts commit 3f21e3c.

* Use numpy representation for hashing (#436)

* Use numpy representation for hashing

* Enable tests and require pandas>=1.4.1

* Use numpy<2.0 in minimum test

* Skip doctests in minimum

* Require pandas>=2.1.0

* Require numpy<=2.0.0 in minimum test

* Remove print statements

* Fix numpy<2.0.0 for minimum test

* Remove max_rows argument

* Simplify code

* Use test class

* CI: remove pyarrow from branch to start test

---------

Co-authored-by: ChristianGeng <christian.c.geng@gmail.com>