pandas 3.0: fix utils.hash() #492
Conversation
Reviewer's Guide
Adjusts the audformat hashing utility to normalize string and categorical dtypes for stable hashes across pandas 3.0 (and PyArrow) versions, and extends tests to cover categorical cases and the new behavior.
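The core idea, as a hedged sketch (`normalize_for_hashing` is an illustrative name, not audformat's API, and the real `utils.hash()` may serialize the frame differently): pandas 3.0 infers the new `str` dtype for string data by default, so string-like and categorical columns are converted back to `object`-based dtypes before hashing to keep digests stable across pandas versions.

```python
import pandas as pd


def normalize_for_hashing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose string-like columns use the legacy object dtype.

    Illustrative helper only; without this step the same table can hash
    differently under pandas 2.x and 3.x.
    """
    df = df.copy()
    for col in df.columns:
        dtype = df[col].dtype
        if isinstance(dtype, pd.CategoricalDtype):
            if pd.api.types.is_string_dtype(dtype.categories.dtype):
                # Rebuild the categorical with object categories,
                # keeping the ordered flag (see the review comments below)
                df[col] = df[col].astype(
                    pd.CategoricalDtype(
                        dtype.categories.astype("object"),
                        ordered=dtype.ordered,
                    )
                )
        elif pd.api.types.is_string_dtype(dtype):
            # "str" (pandas >= 3.0) or "string" columns -> plain object
            df[col] = df[col].astype("object")
    return df
```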
Hey - I've found 2 issues, and left some high level feedback:
- When normalizing categorical columns, you recreate the CategoricalDtype with just the new categories and drop the original `ordered` flag, which can change semantics; consider preserving `ordered=df[col].dtype.ordered` when constructing the new dtype.
- In the empty-DataFrame schema path, `pa.from_numpy_dtype(df[name].dtype)` will be invoked for dtypes like `category` or `object`, which pyarrow may not support; it would be safer to special-case categoricals and objects there (e.g., map to `pa.string()` or derive the underlying categories dtype) instead of relying on `from_numpy_dtype`.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- When normalizing categorical columns, you recreate the CategoricalDtype with just the new categories and drop the original `ordered` flag, which can change semantics; consider preserving `ordered=df[col].dtype.ordered` when constructing the new dtype.
- In the empty-DataFrame schema path, `pa.from_numpy_dtype(df[name].dtype)` will be invoked for dtypes like `category` or `object`, which pyarrow may not support; it would be safer to special-case categoricals and objects there (e.g., map to `pa.string()` or derive the underlying categories dtype) instead of relying on `from_numpy_dtype`.
## Individual Comments
### Comment 1
<location> `audformat/core/utils.py:739-744` </location>
<code_context>
+ if pd.api.types.is_string_dtype(df[col].dtype):
+ df[col] = df[col].astype("object")
+ schema_fields.append((col, pa.string()))
+ elif isinstance(df[col].dtype, pd.CategoricalDtype):
+ # Normalize categorical with string categories to object
+ cat_dtype = df[col].dtype.categories.dtype
+ if pd.api.types.is_string_dtype(cat_dtype):
+ new_categories = df[col].dtype.categories.astype("object")
+ df[col] = df[col].astype(pd.CategoricalDtype(new_categories))
+ schema_fields.append((col, None))
+ else:
</code_context>
<issue_to_address>
**issue:** Preserving categorical ordering when rebuilding CategoricalDtype
Using `pd.CategoricalDtype(new_categories)` discards the original `ordered=True/False` setting, which changes the semantics of ordered categoricals. To preserve behavior, pass the original flag: `pd.CategoricalDtype(new_categories, ordered=df[col].dtype.ordered)` when rebuilding the dtype.
</issue_to_address>
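For reference, a small self-contained snippet illustrating the point of Comment 1: rebuilding the dtype without passing `ordered=` falls back to the unordered default, while forwarding the original flag keeps the ordering semantics (the variable names here are illustrative, not code from the PR).

```python
import pandas as pd

s = pd.Series(
    ["a", "b", "c"],
    dtype=pd.CategoricalDtype(["a", "b", "c"], ordered=True),
)

# Rebuilding without ordered= uses the default (unordered) dtype
dropped = s.astype(pd.CategoricalDtype(s.cat.categories.astype("object")))

# Forwarding the original flag preserves the ordering
preserved = s.astype(
    pd.CategoricalDtype(s.cat.categories.astype("object"), ordered=s.dtype.ordered)
)

print(dropped.dtype.ordered)    # False
print(preserved.dtype.ordered)  # True
```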
### Comment 2
<location> `audformat/core/utils.py:749-756` </location>
<code_context>
+ else:
+ # Let pyarrow infer
+ schema_fields.append((col, None))
+ # Build schema for columns that need explicit types
+ if len(df) == 0 and any(f[1] is not None for f in schema_fields):
+ # For empty DataFrames, specify schema explicitly
+ schema = pa.schema(
+ [
+ (
+ name,
+ typ if typ is not None else pa.from_numpy_dtype(df[name].dtype),
+ )
+ for name, typ in schema_fields
</code_context>
<issue_to_address>
**issue (bug_risk):** Using pa.from_numpy_dtype on non-NumPy / extension dtypes can raise for empty DataFrames
For empty DataFrames, this will call `pa.from_numpy_dtype(df[name].dtype)` for all columns with `typ is None`. For extension dtypes (nullable `Int64`, `boolean`, `CategoricalDtype`, etc.), `df[name].dtype` is not a NumPy dtype and `pa.from_numpy_dtype` can raise. In particular, the categorical branch above adds `(col, None)` to `schema_fields`, so this path is exercised. Consider restricting `pa.from_numpy_dtype` to actual NumPy dtypes (e.g. `isinstance(..., np.dtype)`), and either omitting the schema for extension dtypes or mapping them explicitly to Arrow types.
</issue_to_address>
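One possible shape for the guard suggested in Comment 2, as a sketch (the `arrow_type` helper is illustrative and not code from the PR): call `pa.from_numpy_dtype` only for genuine NumPy dtypes and handle categoricals, strings, and other extension dtypes separately.

```python
import numpy as np
import pandas as pd
import pyarrow as pa


def arrow_type(dtype):
    """Return an explicit Arrow type, or None to let pyarrow infer.

    Illustrative sketch of the suggested guard, not the code in the PR.
    """
    if isinstance(dtype, pd.CategoricalDtype):
        # Derive the Arrow type from the underlying categories dtype
        return arrow_type(dtype.categories.dtype)
    if isinstance(dtype, np.dtype) and dtype != np.dtype("object"):
        # Safe: this is a real NumPy dtype
        return pa.from_numpy_dtype(dtype)
    if pd.api.types.is_string_dtype(dtype):
        # object / str / string columns assumed to hold strings
        return pa.string()
    # Nullable extension dtypes (Int64, boolean, ...): let pyarrow infer
    return None
```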
Codecov Report
✅ All modified and coverable lines are covered by tests.
* pandas 3.0: fix utils.hash()
* Fix comment
* Remove unneeded code
* Add more tests
* Preserve ordered setting
* Update comment
* pandas 3.0: segmented_index() and set_index_dtypes() (#490)
* Add failing test
* Make test pandas 3.0.0 compatible
* Fix set_index_dtypes() for pandas 3.0
* Add comment
* Fix doctests
* Update segmented_index()
* Use segmented_index in test
* Add test for segmented_index
* Avoid warning in testing.add_table() (#491)
* pandas 3.0: fix utils.hash() (#492)
* pandas 3.0: fix utils.hash()
* Fix comment
* Remove unneeded code
* Add more tests
* Preserve ordered setting
* Update comment
* Fix categorical dtype with Database.get() (#493)
* Fix categorical dtype with Database.get()
* Update tests
* Add additional test
* Improve code
* Clean up comment
* We converted to categorical data
* Simplify test
* Simplify string test
* Require timedelta64[ns] in assert_index() (#494)
* Require timedelta64[ns] in assert_index()
* Add tests for mixed cases
* pandas 3.0: fix doctests output
Ensure consistent hashing behavior with pandas 3.0.
Summary by Sourcery
Align utils.hash() behavior with pandas 3.0 by normalizing string-like and categorical data before hashing and updating the hashing of string columns for stable results.
Bug Fixes:
Enhancements:
Tests: