pandas 3.0: fix utils.hash() #492

hagenw · 2026-01-23T12:27:25Z

Ensure consistent hashing behavior with pandas 3.0.

Summary by Sourcery

Align utils.hash() behavior with pandas 3.0 by normalizing string-like and categorical data before hashing and updating the hashing of string columns for stable results.

Bug Fixes:

Fix inconsistent hash values caused by pandas 3.0 changes to string and categorical dtypes, including empty DataFrames.

Enhancements:

Normalize string and categorical columns and explicitly handle schemas for empty DataFrames to ensure stable, version-independent hashing behavior.

Tests:

Extend hash tests to cover various categorical configurations and validate consistent hash values across dtype representations.

sourcery-ai · 2026-01-23T12:27:32Z

Reviewer's Guide

Adjusts the audformat hashing utility to normalize string and categorical dtypes for stable hashes across pandas 3.0 (and PyArrow) versions, and extends tests to cover categorical cases and the new behavior.

File-Level Changes

Change	Details	Files
Normalize DataFrame string and categorical columns before converting to a PyArrow table to produce stable hashes across pandas/pyarrow versions, including for empty frames.	Convert pandas string-like dtypes to object dtype prior to building the PyArrow table and track their explicit PyArrow string type in a `schema_fields` list. For categorical columns, detect string-like categories and rebuild the column with object-typed categories via a new `CategoricalDtype`, while recording schema information. For empty DataFrames with any explicitly-typed columns, construct a PyArrow schema that maps each column either to its specified type (e.g., `pa.string()`) or to a type inferred via `pa.from_numpy_dtype`, and pass this schema to `pa.Table.from_pandas`. Fall back to the original `pa.Table.from_pandas(df, preserve_index=False)` path when no explicit schema is needed.	`audformat/core/utils.py`
Ensure that data hashing treats string columns consistently by forcing object representation when computing the data MD5 digest.	For each column in the grouped data, when the dtype is string-like, convert to a NumPy array with `dtype=object` before stringifying and updating the MD5 digest. Preserve the existing special handling that converts nullable integer types to float for stable hashing across pandas versions.	`audformat/core/utils.py`
Extend hash utility tests to cover categorical DataFrames and confirm consistent hashes for various categorical and index configurations under pandas 3.0.	Add multiple parametrized test cases for DataFrames with categorical columns backed by different category index dtypes (plain list, object Index, string Index) to assert identical hash outputs. Add a parametrized test case for an ordered integer categorical column to verify its distinct hash value. Keep the tests integrated into the existing `test_hash` parametrization for unified coverage.	`tests/test_utils.py`
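
To illustrate the second row of the table, here is a minimal sketch of why forcing an object representation can make the digest independent of the string dtype. The helper `column_md5` is hypothetical and is not the code in `audformat/core/utils.py`; it only assumes that the column values are stringified and fed to an MD5 digest, as described above.

```python
import hashlib

import pandas as pd


def column_md5(series: pd.Series) -> str:
    """Hypothetical helper: MD5 digest of a column's stringified values."""
    if pd.api.types.is_string_dtype(series.dtype):
        # Force a plain object array so that string-dtype and object-dtype
        # columns are stringified in the same way.
        values = series.to_numpy(dtype=object)
    else:
        values = series.to_numpy()
    md5 = hashlib.md5()
    md5.update(str(values.tolist()).encode("utf-8"))
    return md5.hexdigest()


# The same values stored under two different dtypes
as_string = pd.Series(["a", "b"], dtype="string")
as_object = pd.Series(["a", "b"], dtype="object")
same_digest = column_md5(as_string) == column_md5(as_object)
```

In this sketch both columns are reduced to the same list of Python strings before hashing, so the digest does not depend on which string representation pandas picked.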

Possibly linked issues

#(not provided): PR changes utils.hash() normalization (string, categorical, schema) to make hashes stable across pandas versions as in issue.

sourcery-ai

Hey - I've found 2 issues, and left some high-level feedback:

- When normalizing categorical columns, you recreate the `CategoricalDtype` with just the new categories and drop the original `ordered` flag, which can change semantics; consider preserving `ordered=df[col].dtype.ordered` when constructing the new dtype.
- In the empty-DataFrame schema path, `pa.from_numpy_dtype(df[name].dtype)` will be invoked for dtypes like `category` or `object`, which pyarrow may not support; it would be safer to special-case categoricals and objects there (e.g., map to `pa.string()` or derive the underlying categories dtype) instead of relying on `from_numpy_dtype`.

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `audformat/core/utils.py:739-744` </location>
<code_context>
+            if pd.api.types.is_string_dtype(df[col].dtype):
+                df[col] = df[col].astype("object")
+                schema_fields.append((col, pa.string()))
+            elif isinstance(df[col].dtype, pd.CategoricalDtype):
+                # Normalize categorical with string categories to object
+                cat_dtype = df[col].dtype.categories.dtype
+                if pd.api.types.is_string_dtype(cat_dtype):
+                    new_categories = df[col].dtype.categories.astype("object")
+                    df[col] = df[col].astype(pd.CategoricalDtype(new_categories))
+                schema_fields.append((col, None))
+            else:
</code_context>

<issue_to_address>
**issue:** Preserving categorical ordering when rebuilding CategoricalDtype

Using `pd.CategoricalDtype(new_categories)` discards the original `ordered=True/False` setting, which changes the semantics of ordered categoricals. To preserve behavior, pass the original flag: `pd.CategoricalDtype(new_categories, ordered=df[col].dtype.ordered)` when rebuilding the dtype.
</issue_to_address>
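
A small sketch of the behavior described in Comment 1, assuming nothing beyond the public `pd.CategoricalDtype` constructor. The variable names are illustrative and do not come from the PR.

```python
import pandas as pd

# An ordered categorical dtype whose categories use a string dtype
original = pd.CategoricalDtype(
    pd.Index(["low", "high"], dtype="string"),
    ordered=True,
)

# Normalize the categories to object dtype
new_categories = original.categories.astype("object")

# Rebuilding without the flag falls back to the default ordered=False
dropped = pd.CategoricalDtype(new_categories)

# Passing the original flag keeps the ordering semantics
kept = pd.CategoricalDtype(new_categories, ordered=original.ordered)
```

Here `dropped.ordered` is falsy while `kept.ordered` stays true, which is the difference the comment asks to preserve.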

### Comment 2
<location> `audformat/core/utils.py:749-756` </location>
<code_context>
+            else:
+                # Let pyarrow infer
+                schema_fields.append((col, None))
+        # Build schema for columns that need explicit types
+        if len(df) == 0 and any(f[1] is not None for f in schema_fields):
+            # For empty DataFrames, specify schema explicitly
+            schema = pa.schema(
+                [
+                    (
+                        name,
+                        typ if typ is not None else pa.from_numpy_dtype(df[name].dtype),
+                    )
+                    for name, typ in schema_fields
</code_context>

<issue_to_address>
**issue (bug_risk):** Using pa.from_numpy_dtype on non-NumPy / extension dtypes can raise for empty DataFrames

For empty DataFrames, this will call `pa.from_numpy_dtype(df[name].dtype)` for all columns with `typ is None`. For extension dtypes (nullable `Int64`, `boolean`, `CategoricalDtype`, etc.), `df[name].dtype` is not a NumPy dtype and `pa.from_numpy_dtype` can raise. In particular, the categorical branch above adds `(col, None)` to `schema_fields`, so this path is exercised. Consider restricting `pa.from_numpy_dtype` to actual NumPy dtypes (e.g. `isinstance(..., np.dtype)`), and either omitting the schema for extension dtypes or mapping them explicitly to Arrow types.
</issue_to_address>

codecov · 2026-01-23T12:29:26Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.0%. Comparing base (fef5542) to head (6cf325c).
⚠️ Report is 1 commit behind head on dev.

Additional details and impacted files

Files with missing lines	Coverage Δ
audformat/core/utils.py	`100.0% <100.0%> (ø)`

* pandas 3.0: segmented_index() and set_index_dtypes() (#490)
* Add failing test
* Make test pandas 3.0.0 compatible
* Fix set_index_dtypes() for pandas 3.0
* Add comment
* Fix doctests
* Update segmented_index()
* Use segmented_index in test
* Add test for segmented_index
* Avoid warning in testing.add_table() (#491)
* pandas 3.0: fix utils.hash() (#492)
* pandas 3.0: fix utils.hash()
* Fix comment
* Remove unneeded code
* Add more tests
* Preserve ordered setting
* Update comment
* Fix categorical dtype with Database.get() (#493)
* Fix categorical dtype with Database.get()
* Update tests
* Add additional test
* Improve code
* Clean up comment
* We converted to categorical data
* Simplify test
* Simplify string test
* Require timedelta64[ns] in assert_index() (#494)
* Require timedelta64[ns] in assert_index()
* Add tests for mixed cases
* pandas 3.0: fix doctests output

pandas 3.0: fix utils.hash()

e513bb3

sourcery-ai bot reviewed Jan 23, 2026

Fix comment

9706f1b

hagenw added 4 commits January 23, 2026 13:31

Remove unneeded code

88b064f

Add more tests

2553ea7

Preserve ordered setting

9b0de90

Update comment

6cf325c

hagenw merged commit 347fc69 into dev Jan 23, 2026
13 checks passed

hagenw deleted the fix-hash branch January 23, 2026 13:28

hagenw added a commit that referenced this pull request Jan 24, 2026

Fix utils.hash() to return old value (#492)

f915774

* pandas 3.0: fix utils.hash()
* Fix comment
* Remove unneeded code
* Add more tests
* Preserve ordered setting
* Update comment
