Fix categorical dtype with Database.get() #493

hagenw · 2026-01-23T13:55:18Z

Ensures consistency when requesting data with categorical labels by converting string always to object.

Summary by Sourcery

Normalize categorical dtypes for string categories to ensure consistent behavior of Database.get() across pandas versions.

Bug Fixes:

Treat string and object dtypes as equivalent for categorical labels to avoid mismatches when combining series in Database.get().

Tests:

Update categorical test fixtures to use explicit index dtypes for string categories, covering both object and str representations.

sourcery-ai · 2026-01-23T13:55:24Z

Reviewer's Guide

Normalizes categorical dtypes to consistently use object-backed categories when retrieving data from the database, adding compatibility handling for pandas 3.0’s possible use of str dtypes and adjusting tests accordingly.

Flow diagram for dtypes_of_categories categorical dtype normalization

flowchart TD
    A["Start dtypes_of_categories(objs)"] --> B["Initialize empty list dtypes"]
    B --> C["Iterate over each obj in objs"]
    C --> D{Is obj.dtype a CategoricalDtype?}
    D -- No --> E["Skip obj"]
    E --> F{More objs?}
    D -- Yes --> G["Set dtype to obj.dtype.categories.dtype"]
    G --> H{"Is str(dtype) equal to str?"}
    H -- Yes --> I["Set dtype to pd.Series(dtype=object).dtype"]
    H -- No --> J["Keep dtype unchanged"]
    I --> K["Append dtype to dtypes"]
    J --> K["Append dtype to dtypes"]
    K --> F{More objs?}
    F -- Yes --> C
    F -- No --> L["Convert dtypes to set to deduplicate"]
    L --> M["Sort unique dtypes using key=str"]
    M --> N["Return sorted list of unique dtypes"]

Flow diagram for scheme_in_column categorical dtype normalization

flowchart TD
    A["Start scheme_in_column with ys"] --> B["First pass: for each y in ys adjust dtype to match inferred dtype"]
    B --> C["Second pass: normalize categorical dtypes"]
    C --> D["Iterate over each y in ys with index n"]
    D --> E["Get cat_dtype as y.dtype.categories.dtype"]
    E --> F{"Is string representation of cat_dtype equal to 'str'?"}
    F -- No --> G["Leave ys[n] unchanged"]
    F -- Yes --> H["Convert categories to new_categories as y.dtype.categories.astype(object)"]
    H --> I["Set ys[n] to y.astype(pd.CategoricalDtype(new_categories))"]
    G --> J{More ys elements?}
    I --> J{More ys elements?}
    J -- Yes --> D
    J -- No --> K["Build data list from y.array for all ys"]
    K --> L["Compute union of categorical data from data"]
    L --> M["Continue with rest of scheme_in_column processing"]

File-Level Changes

Change	Details	Files
Normalize categorical category dtypes so that string categories backed by `str` are treated as equivalent to `object`, ensuring consistent behavior in Database.get() and related logic.	Refactored categorical dtype collection to iterate explicitly, normalize any category dtype equal to `str` to an `object` Series dtype, and return the unique dtypes sorted by their string representation. Adjusted post-processing in `scheme_in_column` to re-cast categorical series whose category dtype is `str` to use categories with `object` dtype for consistency across series before computing their union.	`audformat/core/database.py`
Align tests with new categorical dtype normalization behavior by explicitly specifying the index dtype in categorical definitions.	Updated test categorical dtypes to construct categories via `pd.Index(..., dtype="object")` or `dtype="str"` to cover both representations now normalized by the implementation. Ensured tests for wrong scheme labels use explicit category index dtypes matching the new normalization semantics.	`tests/test_database_get.py`

Possibly linked issues

#: PR normalizes categorical dtypes in Database.get, directly addressing loss of category dtype described in issue.

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey - I've found 3 issues, and left some high level feedback:

The logic to normalize categorical string dtypes to object is duplicated in dtypes_of_categories() and scheme_in_column(); consider extracting a small helper (e.g., _normalize_string_categorical_dtype) to keep this behavior consistent and easier to maintain.
Instead of checking str(dtype) == 'str' and constructing a dummy Series to get an object dtype, consider using pandas.api.types.is_string_dtype and np.dtype('O') (or similar) to make the dtype normalization more explicit and less reliant on string representations.
In dtypes_of_categories(), sorted(list(set(dtypes)), key=str) sorts by stringified dtype, which may be surprising for non-string dtypes; if you only intend to normalize and deduplicate string-like categories, consider narrowing the normalization or documenting why ordering by str is acceptable here.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The logic to normalize categorical string dtypes to `object` is duplicated in `dtypes_of_categories()` and `scheme_in_column()`; consider extracting a small helper (e.g., `_normalize_string_categorical_dtype`) to keep this behavior consistent and easier to maintain.
- Instead of checking `str(dtype) == 'str'` and constructing a dummy `Series` to get an `object` dtype, consider using `pandas.api.types.is_string_dtype` and `np.dtype('O')` (or similar) to make the dtype normalization more explicit and less reliant on string representations.
- In `dtypes_of_categories()`, `sorted(list(set(dtypes)), key=str)` sorts by stringified dtype, which may be surprising for non-string dtypes; if you only intend to normalize and deduplicate string-like categories, consider narrowing the normalization or documenting why ordering by `str` is acceptable here.

## Individual Comments

### Comment 1
<location> `audformat/core/database.py:705-706` </location>
<code_context>
+                    dtype = obj.dtype.categories.dtype
+                    # Normalize string dtypes: treat 'str' and 'object' as equivalent
+                    # for string categories (pandas 3.0 compatibility)
+                    if str(dtype) == "str":
+                        dtype = pd.Series(dtype="object").dtype
+                    dtypes.append(dtype)
+            return sorted(list(set(dtypes)), key=str)
</code_context>

<issue_to_address>
**issue (bug_risk):** String-type detection via `str(dtype) == "str"` is brittle and may misclassify other string-like dtypes.

This approach couples behavior to the dtype’s string representation, which can change across pandas/numpy versions and may miss other string-like dtypes (`string[python]`, unicode `<U` variants, etc.). Prefer using pandas/numpy dtype introspection (e.g., `isinstance(dtype, pd.StringDtype)` or `pd.api.types.is_string_dtype(dtype)`) and then normalizing to a canonical object dtype to make this more robust and future-proof.
</issue_to_address>

### Comment 2
<location> `audformat/core/database.py:841-845` </location>
<code_context>
                     )
+            # Normalize all categorical dtypes to "object" for consistency
+            # (pandas 3.0 may infer "str" dtype for string categories)
+            for n, y in enumerate(ys):
+                cat_dtype = y.dtype.categories.dtype
+                if str(cat_dtype) == "str":
+                    new_categories = y.dtype.categories.astype("object")
+                    ys[n] = y.astype(pd.CategoricalDtype(new_categories))
             # Find union of categorical data
             data = [y.array for y in ys]
</code_context>

<issue_to_address>
**issue:** Normalization loop assumes all `ys` elements are categorical and may break if a non-categorical series slips through.

This code assumes `y.dtype` is categorical and will raise when a non-categorical series is present. To avoid hard-to-diagnose failures when mixed dtypes are passed, guard the normalization with a check like `isinstance(y.dtype, pd.CategoricalDtype)` before accessing `.categories` or skip normalization for non-categorical inputs.
</issue_to_address>

### Comment 3
<location> `tests/test_database_get.py:514` </location>
<code_context>
                 ),
                 dtype=pd.CategoricalDtype(
-                    ["w1", "w2", "w3"],
+                    pd.Index(["w1", "w2", "w3"], dtype="object"),
                     ordered=False,
                 ),
</code_context>

<issue_to_address>
**suggestion (testing):** Add an explicit regression test that mixes `str` and `object` categorical dtypes and verifies the normalized output dtype

These changes only update existing expectations; they don’t directly test mixing categorical columns where one has `categories.dtype == 'str'` and the other `== 'object'`. Please add (or parametrize) a test that:

- Creates two categorical series for the same logical field, one with `str` categories and one with `object` categories
- Combines them via `db.get()`
- Asserts the resulting categorical has `categories.dtype == 'object'` and preserves labels

This will validate the new normalization behavior and protect against regressions.

Suggested implementation:

```python
                dtype=pd.CategoricalDtype(
                    pd.Index(["w1", "w2", "w3"], dtype="object"),
                    ordered=False,
                ),
            ),
                            [0.2, 0.2, 0.5, 0.7],
                        ),
                        dtype=pd.CategoricalDtype(
                    ),
                ),
            ],
        )



def test_get_mixed_str_and_object_categorical_dtype(db):
    # Left side: categorical with default `str` categories.dtype
    left = pd.DataFrame(
        {
            "id": [1, 2],
            "w": pd.Categorical(
                ["w1", "w2"],
                categories=["w1", "w2", "w3"],
                ordered=False,
            ),
        }
    )

    # Right side: categorical with explicitly `object` categories.dtype
    right = pd.DataFrame(
        {
            "id": [3, 4],
            "w": pd.Categorical(
                pd.Index(["w1", "w3"], dtype="object"),
                categories=pd.Index(["w1", "w2", "w3"], dtype="object"),
                ordered=False,
            ),
        }
    )

    # Register both tables in the test database and combine them via db.get()
    db["left_mixed_cat"] = left
    db["right_mixed_cat"] = right

    result = db.get(["left_mixed_cat", "right_mixed_cat"])

    # Ensure we didn't lose any labels
    assert list(result["w"]) == ["w1", "w2", "w1", "w3"]

    # Ensure the resulting categorical is normalized to object categories
    result_cat = result["w"].dtype
    assert isinstance(result_cat, pd.CategoricalDtype)
    assert result_cat.categories.dtype == "object"

```

Depending on how `db` is implemented in your test suite, you may need to:

1. Replace the lines

   ```python
   db["left_mixed_cat"] = left
   db["right_mixed_cat"] = right
   ```

   with the appropriate API for registering DataFrames as tables, e.g.:

   ```python
   db.create_table("left_mixed_cat", left)
   db.create_table("right_mixed_cat", right)
   ```

   or similar helper functions already used elsewhere in `tests/test_database_get.py`.

2. Adjust the `db.get(...)` call to match the existing calling convention. For example, some implementations might require:

   ```python
   result = db.get("left_mixed_cat", "right_mixed_cat")
   ```

   or a different argument shape, rather than a list of table names.

3. If the existing tests use a specific schema or join key instead of a simple vertical concatenation, update the construction of `left` and `right` (column names and any extra columns) to follow that pattern while still ensuring:
   - One side uses a `str`-typed categories index
   - The other uses an `object`-typed categories index
   - The combined result is asserted to have `result["w"].dtype.categories.dtype == "object"` and preserves the expected label order.

These adjustments will align the new regression test with the rest of the test harness while preserving the core intent of explicitly testing mixed `str`/`object` categorical normalization.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

audformat/core/database.py

tests/test_database_get.py

codecov · 2026-01-24T09:23:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.0%. Comparing base (347fc69) to head (63674d1).
⚠️ Report is 1 commits behind head on dev.

Additional details and impacted files

Files with missing lines	Coverage Δ
audformat/core/database.py	`100.0% <100.0%> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

* Fix categorical dtype with Database.get() * Update tests * Add additional test * Improve code * Clean up comment * We converted to categorical data * Simplify test * Simplify string test

* pandas 3.0: segmented_index() and set_index_dtypes() (#490) * Add failing test * Make test pandas 3.0.0 compatible * Fix set_index_dtypes() for pandas 3.0 * Add comment * Fix doctests * Update segmented_index() * Use segmented_index in test * Add test for segmented_index * Avoid warning in testing.add_table() (#491) * pandas 3.0: fix utils.hash() (#492) * pandas 3.0: fix utils.hash() * Fix comment * Remove unneeded code * Add more tests * Preserve ordered setting * Update comment * Fix categorical dtype with Database.get() (#493) * Fix categorical dtype with Database.get() * Update tests * Add additional test * Improve code * Clean up comment * We converted to categorical data * Simplify test * Simplify string test * Require timedelta64[ns] in assert_index() (#494) * Require timedelta64[ns] in assert_index() * Add tests for mixed cases * pandas 3.0: fix doctests output

hagenw added 2 commits January 23, 2026 14:46

Fix categorical dtype with Database.get()

5eadb18

Update tests

6f680a7

sourcery-ai bot reviewed Jan 23, 2026

View reviewed changes

audformat/core/database.py Outdated Show resolved Hide resolved

audformat/core/database.py Show resolved Hide resolved

tests/test_database_get.py Show resolved Hide resolved

hagenw added 6 commits January 23, 2026 15:11

Add additional test

04fb2ea

Improve code

719cafe

Clean up comment

3190654

We converted to categorical data

5b09598

Simplify test

34e919a

Simplify string test

63674d1

hagenw merged commit 1fb9889 into dev Jan 24, 2026
13 checks passed

hagenw deleted the fix-database-get branch January 24, 2026 09:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix categorical dtype with Database.get() #493

Fix categorical dtype with Database.get() #493

Uh oh!

hagenw commented Jan 23, 2026 •

edited by sourcery-ai bot

Loading

Uh oh!

sourcery-ai bot commented Jan 23, 2026 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Jan 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix categorical dtype with Database.get() #493

Fix categorical dtype with Database.get() #493

Uh oh!

Conversation

hagenw commented Jan 23, 2026 • edited by sourcery-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Uh oh!

sourcery-ai bot commented Jan 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Flow diagram for dtypes_of_categories categorical dtype normalization

Flow diagram for scheme_in_column categorical dtype normalization

File-Level Changes

Possibly linked issues

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Jan 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hagenw commented Jan 23, 2026 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Jan 23, 2026 •

edited

Loading

codecov bot commented Jan 24, 2026 •

edited

Loading