Conversation

@hagenw hagenw commented Jun 19, 2024

This change is in response to the question

The hash key's value seems kind of generic, but I wonder whether it has only been introduced recently, or whether it changes sometimes?

from #419 (review) and targets the pyarrow branch from #419.

Response to the question

The result of the hashing has indeed changed recently in pandas, see pandas-dev/pandas#58999. That is why I require pandas>2.2.0 in #419, as lower pandas versions produce a different hash.
I also find this worrisome, as it might indicate that the hashing will change again in the future.

I also checked what changed between pandas 2.1.x and 2.2.x inside pandas/core/util/hashing.py, but the changes only concern the documentation:

3a4
> 
108c109,110
<     Series of uint64, same length as the object
---
>     Series of uint64
>         Same length as the object.
244a247
>         The input array to hash.
256a260,264
> 
>     See Also
>     --------
>     util.hash_pandas_object : Return a data hash of the Index/Series/DataFrame.
>     util.hash_tuples : Hash an MultiIndex / listlike-of-tuples efficiently.

This means the change in the resulting hash with pandas 2.2.x must be related to some restructuring/fixes of the dataframes themselves.

Description of proposed changes

Instead of relying on the solution introduced in #419, which uses pandas.util.hash_pandas_object() and pickle.dumps() and might also change, I suggest to hash each column separately, by first converting it to a numpy array with pandas.Series.to_numpy() and then hashing its string representation.

The behavior of pandas.Series.to_numpy() might change over time (and already did for pandas 2.2.x). We can handle that by adding custom code, as I do here for Int64 dtypes. The same goes for the string representation of a numpy array, which might also change over time. A minimal sketch of the approach is shown below.
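
For illustration, a minimal sketch of this column-wise hashing could look as follows (the function name and the Int64 branch are placeholders; the actual helper in this pull request, _dataframe_hash(), may differ in detail):

import hashlib

import pandas as pd


def dataframe_hash(df: pd.DataFrame) -> str:
    # Build an MD5 hash from the numpy string representation of each column,
    # including the index columns via reset_index()
    md5 = hashlib.md5()
    for _, column in df.reset_index().items():
        if str(column.dtype) == "Int64":
            # Placeholder for the dtype-specific handling mentioned above;
            # the exact conversion used in the pull request may differ
            column = column.astype("object")
        # Note: numpy truncates the repr of very long arrays by default,
        # so a real implementation has to take care of that as well
        md5.update(bytes(str(column.to_numpy()), "utf-8"))
    return md5.hexdigest()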

I could also lower the pandas dependency to pandas>=2.1.0.

The new hash calculation is slightly slower than before. Storing the big table to CSV now takes 5.0 s instead of 4.2 s in #419.

@hagenw hagenw marked this pull request as draft June 19, 2024 12:05

codecov bot commented Jun 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.0%. Comparing base (2912f76) to head (575d2fe).

Additional details and impacted files
Files                   | Coverage Δ
audformat/core/table.py | 100.0% <100.0%> (ø)

@hagenw hagenw marked this pull request as ready for review June 19, 2024 13:14

@ChristianGeng ChristianGeng left a comment

Review

As we need a way to identify objects, and the introduction of parquet favors non-deterministic compression, we can no longer use md5sums "on disk" for that purpose.

In order not to depend on pandas-specific implementation details in pd.util.hash_pandas_object, this MR introduces md5-based hashing that uses numpy representations and the Python builtin hashlib.md5 directly.

I checked the oldest numpy version compatible with pandas 2.2.2 (1.22.4), and the line

md5.update(bytes(str(y.to_numpy()), "utf-8"))

does not introduce any changes there. So my understanding is that this is safe, and therefore I will approve.
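
As a rough illustration of that check (the column name and values below are made up), the resulting digest only depends on the numpy string representation:

import hashlib

import pandas as pd

# A hypothetical nullable-integer column, similar to the ones hashed in audformat
y = pd.Series([0, 1, 2], dtype="Int64", name="age")

md5 = hashlib.md5()
md5.update(bytes(str(y.to_numpy()), "utf-8"))
print(md5.hexdigest())  # identical as long as str(y.to_numpy()) does not change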

@hagenw hagenw merged commit c4c41ff into pyarrow Jun 19, 2024
@hagenw hagenw deleted the pyarrow-better-hashing branch June 19, 2024 14:23
hagenw added a commit that referenced this pull request Jun 19, 2024
* Ensure correct boolean dtype in misc table index

* Remove unneeded code

* Use pyarrow to read CSV files

* Start debugging

* Continue debugging

* Fix tests

* Remove unneeded code

* Improve code

* Fix test for older pandas versions

* Exclude benchmark folder from tests

* Test other implementation

* Remove support for Python 3.8

* Store tables as PARQUET

* Cleanup code + Table.levels

* Use dict for CSV dtype mappings

* Rename helper function

* Simplify code

* Add helper function for CSV schema

* Fix typo in docstring

* Remove levels attribute

* Merge stash

* Remove levels from doctest output

* Convert method to property

* Add comment

* Simplify code

* Simplify code

* Add test for md5sum of parquet file

* Switch back to snappy compression

* Fix linter

* Store hash inside parquet file

* Fix code coverage

* Stay with CSV as default table format

* Test pyarrow==15.0.2

* Test pyarrow==14.0.2

* Test pyarrow==13.0

* Test pyarrow==12.0

* Test pyarrow==11.0

* Test pyarrow==10.0

* Test pyarrow==10.0.1

* Require pyarrow>=10.0.1

* Test pandas<2.1.0

* Add explanations for requirements

* Add test using minimum pip requirements

* Fix alphabetical order of requirements

* Enhance test matrix definition

* Debug failing test

* Test different hash method

* Use different hashing approach

* Require pandas>=2.2.0 and fix hashes

* CI: re-enable all minimal requirements

* Hashing algorithm to respect row order

* Clean up tests

* Fix minimum install of audiofile

* Fix docstring of Table.load()

* Fix docstring of Database.load()

* Ensure correct order in time when storing tables

* Simplify comment

* Add docstring to _load_pickle()

* Fix _save_parquet() docstring

* Improve comment in _dataframe_hash()

* Document arguments of test_table_update...

* Relax test for table saving order

* Update audformat/core/table.py

Co-authored-by: ChristianGeng <christian.c.geng@gmail.com>

* Revert "Update audformat/core/table.py"

This reverts commit 3f21e3c.

* Use numpy representation for hashing (#436)

* Use numpy representation for hashing

* Enable tests and require pandas>=1.4.1

* Use numpy<2.0 in minimum test

* Skip doctests in minimum

* Require pandas>=2.1.0

* Require numpy<=2.0.0 in minimum test

* Remove print statements

* Fix numpy<2.0.0 for minimum test

* Remove max_rows argument

* Simplify code

* Use test class

* CI: remove pyarrow from branch to start test

---------

Co-authored-by: ChristianGeng <christian.c.geng@gmail.com>