@hagenw hagenw commented Jan 23, 2026

Ensure consistent hashing behavior with pandas 3.0.

Summary by Sourcery

Align utils.hash() behavior with pandas 3.0 by normalizing string-like and categorical data before hashing and updating the hashing of string columns for stable results.

Bug Fixes:

  • Fix inconsistent hash values caused by pandas 3.0 changes to string and categorical dtypes, including empty DataFrames.

Enhancements:

  • Normalize string and categorical columns and explicitly handle schemas for empty DataFrames to ensure stable, version-independent hashing behavior.

Tests:

  • Extend hash tests to cover various categorical configurations and validate consistent hash values across dtype representations.

sourcery-ai bot commented Jan 23, 2026

Reviewer's Guide

Adjusts the audformat hashing utility to normalize string and categorical dtypes for stable hashes across pandas 3.0 (and PyArrow) versions, and extends tests to cover categorical cases and the new behavior.

File-Level Changes

Change: Normalize DataFrame string and categorical columns before converting to a PyArrow table to produce stable hashes across pandas/pyarrow versions, including for empty frames.
Details:
  • Convert pandas string-like dtypes to object dtype prior to building the PyArrow table and track their explicit PyArrow string type in a schema_fields list
  • For categorical columns, detect string-like categories and rebuild the column with object-typed categories via a new CategoricalDtype, while recording schema information
  • For empty DataFrames with any explicitly-typed columns, construct a PyArrow schema that maps each column either to its specified type (e.g., pa.string()) or to a type inferred via pa.from_numpy_dtype, and pass this schema to pa.Table.from_pandas
  • Fall back to the original pa.Table.from_pandas(df, preserve_index=False) path when no explicit schema is needed
Files: audformat/core/utils.py

Change: Ensure that data hashing treats string columns consistently by forcing object representation when computing the data MD5 digest.
Details:
  • For each column in the grouped data, when the dtype is string-like, convert to a NumPy array with dtype=object before stringifying and updating the MD5 digest
  • Preserve the existing special handling that converts nullable integer types to float for stable hashing across pandas versions
Files: audformat/core/utils.py

Change: Extend hash utility tests to cover categorical DataFrames and confirm consistent hashes for various categorical and index configurations under pandas 3.0.
Details:
  • Add multiple parametrized test cases for DataFrames with categorical columns backed by different category index dtypes (plain list, object Index, string Index) to assert identical hash outputs
  • Add a parametrized test case for an ordered integer categorical column to verify its distinct hash value
  • Keep the tests integrated into the existing test_hash parametrization for unified coverage
Files: tests/test_utils.py
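
The string normalization described in the changes above can be illustrated with a small sketch. This is not the code from audformat/core/utils.py: `normalize_strings` and `toy_hash` are hypothetical helpers, and the digest here deliberately includes the dtype name so that the effect of the normalization is visible. It only assumes that string-like columns are cast to object dtype before hashing.

```python
import hashlib

import pandas as pd


def normalize_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which string-like columns use object dtype."""
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_string_dtype(df[col].dtype):
            df[col] = df[col].astype("object")
    return df


def toy_hash(df: pd.DataFrame) -> str:
    """Illustrative MD5 digest over column names, dtype names, and values.

    Hypothetical stand-in, not the hashing scheme used by audformat.
    """
    md5 = hashlib.md5()
    for col in df.columns:
        md5.update(str(col).encode("utf-8"))
        md5.update(str(df[col].dtype).encode("utf-8"))
        values = df[col].to_numpy(dtype=object)
        md5.update(str(values.tolist()).encode("utf-8"))
    return md5.hexdigest()


# Same content, two different dtype representations
df_object = pd.DataFrame({"a": pd.Series(["x", "y"], dtype="object")})
df_string = pd.DataFrame({"a": pd.Series(["x", "y"], dtype="string")})

hash_object = toy_hash(normalize_strings(df_object))
hash_string = toy_hash(normalize_strings(df_string))
```

Without the normalization step the two frames produce different digests in this sketch, because the dtype name differs; after normalization both columns are object dtype and the digests agree.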

Possibly linked issues

  • #(not provided): PR changes utils.hash() normalization (string, categorical, schema) to make hashes stable across pandas versions as in issue.

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 2 issues, and left some high level feedback:

  • When normalizing categorical columns, you recreate the CategoricalDtype with just the new categories and drop the original ordered flag, which can change semantics; consider preserving ordered=df[col].dtype.ordered when constructing the new dtype.
  • In the empty-DataFrame schema path, pa.from_numpy_dtype(df[name].dtype) will be invoked for dtypes like category or object, which pyarrow may not support; it would be safer to special-case categoricals and objects there (e.g., map to pa.string() or derive the underlying categories dtype) instead of relying on from_numpy_dtype.

## Individual Comments

### Comment 1
<location> `audformat/core/utils.py:739-744` </location>
<code_context>
+            if pd.api.types.is_string_dtype(df[col].dtype):
+                df[col] = df[col].astype("object")
+                schema_fields.append((col, pa.string()))
+            elif isinstance(df[col].dtype, pd.CategoricalDtype):
+                # Normalize categorical with string categories to object
+                cat_dtype = df[col].dtype.categories.dtype
+                if pd.api.types.is_string_dtype(cat_dtype):
+                    new_categories = df[col].dtype.categories.astype("object")
+                    df[col] = df[col].astype(pd.CategoricalDtype(new_categories))
+                schema_fields.append((col, None))
+            else:
</code_context>

<issue_to_address>
**issue:** Preserving categorical ordering when rebuilding CategoricalDtype

Using `pd.CategoricalDtype(new_categories)` discards the original `ordered=True/False` setting, which changes the semantics of ordered categoricals. To preserve behavior, pass the original flag: `pd.CategoricalDtype(new_categories, ordered=df[col].dtype.ordered)` when rebuilding the dtype.
</issue_to_address>
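
The effect described in this comment can be reproduced in isolation. The following is a minimal sketch with made-up example data (the series `s` and the category names are illustrative, not taken from the PR); it only relies on `pd.CategoricalDtype` defaulting to `ordered=False`.

```python
import pandas as pd

# Hypothetical ordered categorical column
ordered_dtype = pd.CategoricalDtype(["low", "mid", "high"], ordered=True)
s = pd.Series(["low", "high"], dtype=ordered_dtype)

# Categories cast to object, as in the normalization step
new_categories = s.dtype.categories.astype("object")

# Rebuilding the dtype without the flag falls back to ordered=False
lossy = s.astype(pd.CategoricalDtype(new_categories))

# Passing the original flag through keeps the ordering semantics
kept = s.astype(pd.CategoricalDtype(new_categories, ordered=s.dtype.ordered))
```

Here `lossy.dtype.ordered` is `False` while `kept.dtype.ordered` is `True`; the values themselves are unchanged in both cases.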

### Comment 2
<location> `audformat/core/utils.py:749-756` </location>
<code_context>
+            else:
+                # Let pyarrow infer
+                schema_fields.append((col, None))
+        # Build schema for columns that need explicit types
+        if len(df) == 0 and any(f[1] is not None for f in schema_fields):
+            # For empty DataFrames, specify schema explicitly
+            schema = pa.schema(
+                [
+                    (
+                        name,
+                        typ if typ is not None else pa.from_numpy_dtype(df[name].dtype),
+                    )
+                    for name, typ in schema_fields
</code_context>

<issue_to_address>
**issue (bug_risk):** Using pa.from_numpy_dtype on non-NumPy / extension dtypes can raise for empty DataFrames

For empty DataFrames, this will call `pa.from_numpy_dtype(df[name].dtype)` for all columns with `typ is None`. For extension dtypes (nullable `Int64`, `boolean`, `CategoricalDtype`, etc.), `df[name].dtype` is not a NumPy dtype and `pa.from_numpy_dtype` can raise. In particular, the categorical branch above adds `(col, None)` to `schema_fields`, so this path is exercised. Consider restricting `pa.from_numpy_dtype` to actual NumPy dtypes (e.g. `isinstance(..., np.dtype)`), and either omitting the schema for extension dtypes or mapping them explicitly to Arrow types.
</issue_to_address>

codecov bot commented Jan 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.0%. Comparing base (fef5542) to head (6cf325c).
⚠️ Report is 1 commit behind head on dev.

Additional details and impacted files
Files with missing lines Coverage Δ
audformat/core/utils.py 100.0% <100.0%> (ø)

@hagenw hagenw merged commit 347fc69 into dev Jan 23, 2026
13 checks passed
@hagenw hagenw deleted the fix-hash branch January 23, 2026 13:28
hagenw added a commit that referenced this pull request Jan 24, 2026
* pandas 3.0: fix utils.hash()

* Fix comment

* Remove unneeded code

* Add more tests

* Preserve ordered setting

* Update comment
hagenw added a commit that referenced this pull request Jan 24, 2026
* pandas 3.0: segmented_index() and set_index_dtypes() (#490)

* Add failing test

* Make test pandas 3.0.0 compatible

* Fix set_index_dtypes() for pandas 3.0

* Add comment

* Fix doctests

* Update segmented_index()

* Use segmented_index in test

* Add test for segmented_index

* Avoid warning in testing.add_table() (#491)

* pandas 3.0: fix utils.hash() (#492)

* pandas 3.0: fix utils.hash()

* Fix comment

* Remove unneeded code

* Add more tests

* Preserve ordered setting

* Update comment

* Fix categorical dtype with Database.get() (#493)

* Fix categorical dtype with Database.get()

* Update tests

* Add additional test

* Improve code

* Clean up comment

* We converted to categorical data

* Simplify test

* Simplify string test

* Require timedelta64[ns] in assert_index() (#494)

* Require timedelta64[ns] in assert_index()

* Add tests for mixed cases

* pandas 3.0: fix doctests output