-
Notifications
You must be signed in to change notification settings - Fork 1
Fix categorical dtype with Database.get() #493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer's GuideNormalizes categorical dtypes to consistently use object-backed categories when retrieving data from the database, adding compatibility handling for pandas 3.0’s possible use of Flow diagram for dtypes_of_categories categorical dtype normalizationflowchart TD
A["Start dtypes_of_categories(objs)"] --> B["Initialize empty list dtypes"]
B --> C["Iterate over each obj in objs"]
C --> D{Is obj.dtype a CategoricalDtype?}
D -- No --> E["Skip obj"]
E --> F{More objs?}
D -- Yes --> G["Set dtype to obj.dtype.categories.dtype"]
G --> H{"Is str(dtype) equal to str?"}
H -- Yes --> I["Set dtype to pd.Series(dtype=object).dtype"]
H -- No --> J["Keep dtype unchanged"]
I --> K["Append dtype to dtypes"]
J --> K["Append dtype to dtypes"]
K --> F{More objs?}
F -- Yes --> C
F -- No --> L["Convert dtypes to set to deduplicate"]
L --> M["Sort unique dtypes using key=str"]
M --> N["Return sorted list of unique dtypes"]
Flow diagram for scheme_in_column categorical dtype normalizationflowchart TD
A["Start scheme_in_column with ys"] --> B["First pass: for each y in ys adjust dtype to match inferred dtype"]
B --> C["Second pass: normalize categorical dtypes"]
C --> D["Iterate over each y in ys with index n"]
D --> E["Get cat_dtype as y.dtype.categories.dtype"]
E --> F{"Is string representation of cat_dtype equal to 'str'?"}
F -- No --> G["Leave ys[n] unchanged"]
F -- Yes --> H["Convert categories to new_categories as y.dtype.categories.astype(object)"]
H --> I["Set ys[n] to y.astype(pd.CategoricalDtype(new_categories))"]
G --> J{More ys elements?}
I --> J{More ys elements?}
J -- Yes --> D
J -- No --> K["Build data list from y.array for all ys"]
K --> L["Compute union of categorical data from data"]
L --> M["Continue with rest of scheme_in_column processing"]
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey - I've found 3 issues, and left some high level feedback:
- The logic to normalize categorical string dtypes to
objectis duplicated indtypes_of_categories()andscheme_in_column(); consider extracting a small helper (e.g.,_normalize_string_categorical_dtype) to keep this behavior consistent and easier to maintain. - Instead of checking
str(dtype) == 'str'and constructing a dummySeriesto get anobjectdtype, consider usingpandas.api.types.is_string_dtypeandnp.dtype('O')(or similar) to make the dtype normalization more explicit and less reliant on string representations. - In
dtypes_of_categories(),sorted(list(set(dtypes)), key=str)sorts by stringified dtype, which may be surprising for non-string dtypes; if you only intend to normalize and deduplicate string-like categories, consider narrowing the normalization or documenting why ordering bystris acceptable here.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The logic to normalize categorical string dtypes to `object` is duplicated in `dtypes_of_categories()` and `scheme_in_column()`; consider extracting a small helper (e.g., `_normalize_string_categorical_dtype`) to keep this behavior consistent and easier to maintain.
- Instead of checking `str(dtype) == 'str'` and constructing a dummy `Series` to get an `object` dtype, consider using `pandas.api.types.is_string_dtype` and `np.dtype('O')` (or similar) to make the dtype normalization more explicit and less reliant on string representations.
- In `dtypes_of_categories()`, `sorted(list(set(dtypes)), key=str)` sorts by stringified dtype, which may be surprising for non-string dtypes; if you only intend to normalize and deduplicate string-like categories, consider narrowing the normalization or documenting why ordering by `str` is acceptable here.
## Individual Comments
### Comment 1
<location> `audformat/core/database.py:705-706` </location>
<code_context>
+ dtype = obj.dtype.categories.dtype
+ # Normalize string dtypes: treat 'str' and 'object' as equivalent
+ # for string categories (pandas 3.0 compatibility)
+ if str(dtype) == "str":
+ dtype = pd.Series(dtype="object").dtype
+ dtypes.append(dtype)
+ return sorted(list(set(dtypes)), key=str)
</code_context>
<issue_to_address>
**issue (bug_risk):** String-type detection via `str(dtype) == "str"` is brittle and may misclassify other string-like dtypes.
This approach couples behavior to the dtype’s string representation, which can change across pandas/numpy versions and may miss other string-like dtypes (`string[python]`, unicode `<U` variants, etc.). Prefer using pandas/numpy dtype introspection (e.g., `isinstance(dtype, pd.StringDtype)` or `pd.api.types.is_string_dtype(dtype)`) and then normalizing to a canonical object dtype to make this more robust and future-proof.
</issue_to_address>
### Comment 2
<location> `audformat/core/database.py:841-845` </location>
<code_context>
)
+ # Normalize all categorical dtypes to "object" for consistency
+ # (pandas 3.0 may infer "str" dtype for string categories)
+ for n, y in enumerate(ys):
+ cat_dtype = y.dtype.categories.dtype
+ if str(cat_dtype) == "str":
+ new_categories = y.dtype.categories.astype("object")
+ ys[n] = y.astype(pd.CategoricalDtype(new_categories))
# Find union of categorical data
data = [y.array for y in ys]
</code_context>
<issue_to_address>
**issue:** Normalization loop assumes all `ys` elements are categorical and may break if a non-categorical series slips through.
This code assumes `y.dtype` is categorical and will raise when a non-categorical series is present. To avoid hard-to-diagnose failures when mixed dtypes are passed, guard the normalization with a check like `isinstance(y.dtype, pd.CategoricalDtype)` before accessing `.categories` or skip normalization for non-categorical inputs.
</issue_to_address>
### Comment 3
<location> `tests/test_database_get.py:514` </location>
<code_context>
),
dtype=pd.CategoricalDtype(
- ["w1", "w2", "w3"],
+ pd.Index(["w1", "w2", "w3"], dtype="object"),
ordered=False,
),
</code_context>
<issue_to_address>
**suggestion (testing):** Add an explicit regression test that mixes `str` and `object` categorical dtypes and verifies the normalized output dtype
These changes only update existing expectations; they don’t directly test mixing categorical columns where one has `categories.dtype == 'str'` and the other `== 'object'`. Please add (or parametrize) a test that:
- Creates two categorical series for the same logical field, one with `str` categories and one with `object` categories
- Combines them via `db.get()`
- Asserts the resulting categorical has `categories.dtype == 'object'` and preserves labels
This will validate the new normalization behavior and protect against regressions.
Suggested implementation:
```python
dtype=pd.CategoricalDtype(
pd.Index(["w1", "w2", "w3"], dtype="object"),
ordered=False,
),
),
[0.2, 0.2, 0.5, 0.7],
),
dtype=pd.CategoricalDtype(
),
),
],
)
def test_get_mixed_str_and_object_categorical_dtype(db):
# Left side: categorical with default `str` categories.dtype
left = pd.DataFrame(
{
"id": [1, 2],
"w": pd.Categorical(
["w1", "w2"],
categories=["w1", "w2", "w3"],
ordered=False,
),
}
)
# Right side: categorical with explicitly `object` categories.dtype
right = pd.DataFrame(
{
"id": [3, 4],
"w": pd.Categorical(
pd.Index(["w1", "w3"], dtype="object"),
categories=pd.Index(["w1", "w2", "w3"], dtype="object"),
ordered=False,
),
}
)
# Register both tables in the test database and combine them via db.get()
db["left_mixed_cat"] = left
db["right_mixed_cat"] = right
result = db.get(["left_mixed_cat", "right_mixed_cat"])
# Ensure we didn't lose any labels
assert list(result["w"]) == ["w1", "w2", "w1", "w3"]
# Ensure the resulting categorical is normalized to object categories
result_cat = result["w"].dtype
assert isinstance(result_cat, pd.CategoricalDtype)
assert result_cat.categories.dtype == "object"
```
Depending on how `db` is implemented in your test suite, you may need to:
1. Replace the lines
```python
db["left_mixed_cat"] = left
db["right_mixed_cat"] = right
```
with the appropriate API for registering DataFrames as tables, e.g.:
```python
db.create_table("left_mixed_cat", left)
db.create_table("right_mixed_cat", right)
```
or similar helper functions already used elsewhere in `tests/test_database_get.py`.
2. Adjust the `db.get(...)` call to match the existing calling convention. For example, some implementations might require:
```python
result = db.get("left_mixed_cat", "right_mixed_cat")
```
or a different argument shape, rather than a list of table names.
3. If the existing tests use a specific schema or join key instead of a simple vertical concatenation, update the construction of `left` and `right` (column names and any extra columns) to follow that pattern while still ensuring:
- One side uses a `str`-typed categories index
- The other uses an `object`-typed categories index
- The combined result is asserted to have `result["w"].dtype.categories.dtype == "object"` and preserves the expected label order.
These adjustments will align the new regression test with the rest of the test harness while preserving the core intent of explicitly testing mixed `str`/`object` categorical normalization.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files
🚀 New features to boost your workflow:
|
* Fix categorical dtype with Database.get() * Update tests * Add additional test * Improve code * Clean up comment * We converted to categorical data * Simplify test * Simplify string test
* pandas 3.0: segmented_index() and set_index_dtypes() (#490) * Add failing test * Make test pandas 3.0.0 compatible * Fix set_index_dtypes() for pandas 3.0 * Add comment * Fix doctests * Update segmented_index() * Use segmented_index in test * Add test for segmented_index * Avoid warning in testing.add_table() (#491) * pandas 3.0: fix utils.hash() (#492) * pandas 3.0: fix utils.hash() * Fix comment * Remove unneeded code * Add more tests * Preserve ordered setting * Update comment * Fix categorical dtype with Database.get() (#493) * Fix categorical dtype with Database.get() * Update tests * Add additional test * Improve code * Clean up comment * We converted to categorical data * Simplify test * Simplify string test * Require timedelta64[ns] in assert_index() (#494) * Require timedelta64[ns] in assert_index() * Add tests for mixed cases * pandas 3.0: fix doctests output
Ensures consistency when requesting data with categorical labels by converting string always to object.
Summary by Sourcery
Normalize categorical dtypes for string categories to ensure consistent behavior of Database.get() across pandas versions.
Bug Fixes:
Tests: