Skip to content

Conversation

@hagenw
Copy link
Member

@hagenw hagenw commented Jan 23, 2026

Ensures consistency when requesting data with categorical labels by converting string always to object.

Summary by Sourcery

Normalize categorical dtypes for string categories to ensure consistent behavior of Database.get() across pandas versions.

Bug Fixes:

  • Treat string and object dtypes as equivalent for categorical labels to avoid mismatches when combining series in Database.get().

Tests:

  • Update categorical test fixtures to use explicit index dtypes for string categories, covering both object and str representations.

@sourcery-ai
Copy link
Contributor

sourcery-ai bot commented Jan 23, 2026

Reviewer's Guide

Normalizes categorical dtypes to consistently use object-backed categories when retrieving data from the database, adding compatibility handling for pandas 3.0’s possible use of str dtypes and adjusting tests accordingly.

Flow diagram for dtypes_of_categories categorical dtype normalization

flowchart TD
    A["Start dtypes_of_categories(objs)"] --> B["Initialize empty list dtypes"]
    B --> C["Iterate over each obj in objs"]
    C --> D{Is obj.dtype a CategoricalDtype?}
    D -- No --> E["Skip obj"]
    E --> F{More objs?}
    D -- Yes --> G["Set dtype to obj.dtype.categories.dtype"]
    G --> H{"Is str(dtype) equal to str?"}
    H -- Yes --> I["Set dtype to pd.Series(dtype=object).dtype"]
    H -- No --> J["Keep dtype unchanged"]
    I --> K["Append dtype to dtypes"]
    J --> K["Append dtype to dtypes"]
    K --> F{More objs?}
    F -- Yes --> C
    F -- No --> L["Convert dtypes to set to deduplicate"]
    L --> M["Sort unique dtypes using key=str"]
    M --> N["Return sorted list of unique dtypes"]
Loading

Flow diagram for scheme_in_column categorical dtype normalization

flowchart TD
    A["Start scheme_in_column with ys"] --> B["First pass: for each y in ys adjust dtype to match inferred dtype"]
    B --> C["Second pass: normalize categorical dtypes"]
    C --> D["Iterate over each y in ys with index n"]
    D --> E["Get cat_dtype as y.dtype.categories.dtype"]
    E --> F{"Is string representation of cat_dtype equal to 'str'?"}
    F -- No --> G["Leave ys[n] unchanged"]
    F -- Yes --> H["Convert categories to new_categories as y.dtype.categories.astype(object)"]
    H --> I["Set ys[n] to y.astype(pd.CategoricalDtype(new_categories))"]
    G --> J{More ys elements?}
    I --> J{More ys elements?}
    J -- Yes --> D
    J -- No --> K["Build data list from y.array for all ys"]
    K --> L["Compute union of categorical data from data"]
    L --> M["Continue with rest of scheme_in_column processing"]
Loading

File-Level Changes

Change Details Files
Normalize categorical category dtypes so that string categories backed by str are treated as equivalent to object, ensuring consistent behavior in Database.get() and related logic.
  • Refactored categorical dtype collection to iterate explicitly, normalize any category dtype equal to str to an object Series dtype, and return the unique dtypes sorted by their string representation.
  • Adjusted post-processing in scheme_in_column to re-cast categorical series whose category dtype is str to use categories with object dtype for consistency across series before computing their union.
audformat/core/database.py
Align tests with new categorical dtype normalization behavior by explicitly specifying the index dtype in categorical definitions.
  • Updated test categorical dtypes to construct categories via pd.Index(..., dtype="object") or dtype="str" to cover both representations now normalized by the implementation.
  • Ensured tests for wrong scheme labels use explicit category index dtypes matching the new normalization semantics.
tests/test_database_get.py

Possibly linked issues

  • #: PR normalizes categorical dtypes in Database.get, directly addressing loss of category dtype described in issue.

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 3 issues, and left some high level feedback:

  • The logic to normalize categorical string dtypes to object is duplicated in dtypes_of_categories() and scheme_in_column(); consider extracting a small helper (e.g., _normalize_string_categorical_dtype) to keep this behavior consistent and easier to maintain.
  • Instead of checking str(dtype) == 'str' and constructing a dummy Series to get an object dtype, consider using pandas.api.types.is_string_dtype and np.dtype('O') (or similar) to make the dtype normalization more explicit and less reliant on string representations.
  • In dtypes_of_categories(), sorted(list(set(dtypes)), key=str) sorts by stringified dtype, which may be surprising for non-string dtypes; if you only intend to normalize and deduplicate string-like categories, consider narrowing the normalization or documenting why ordering by str is acceptable here.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The logic to normalize categorical string dtypes to `object` is duplicated in `dtypes_of_categories()` and `scheme_in_column()`; consider extracting a small helper (e.g., `_normalize_string_categorical_dtype`) to keep this behavior consistent and easier to maintain.
- Instead of checking `str(dtype) == 'str'` and constructing a dummy `Series` to get an `object` dtype, consider using `pandas.api.types.is_string_dtype` and `np.dtype('O')` (or similar) to make the dtype normalization more explicit and less reliant on string representations.
- In `dtypes_of_categories()`, `sorted(list(set(dtypes)), key=str)` sorts by stringified dtype, which may be surprising for non-string dtypes; if you only intend to normalize and deduplicate string-like categories, consider narrowing the normalization or documenting why ordering by `str` is acceptable here.

## Individual Comments

### Comment 1
<location> `audformat/core/database.py:705-706` </location>
<code_context>
+                    dtype = obj.dtype.categories.dtype
+                    # Normalize string dtypes: treat 'str' and 'object' as equivalent
+                    # for string categories (pandas 3.0 compatibility)
+                    if str(dtype) == "str":
+                        dtype = pd.Series(dtype="object").dtype
+                    dtypes.append(dtype)
+            return sorted(list(set(dtypes)), key=str)
</code_context>

<issue_to_address>
**issue (bug_risk):** String-type detection via `str(dtype) == "str"` is brittle and may misclassify other string-like dtypes.

This approach couples behavior to the dtype’s string representation, which can change across pandas/numpy versions and may miss other string-like dtypes (`string[python]`, unicode `<U` variants, etc.). Prefer using pandas/numpy dtype introspection (e.g., `isinstance(dtype, pd.StringDtype)` or `pd.api.types.is_string_dtype(dtype)`) and then normalizing to a canonical object dtype to make this more robust and future-proof.
</issue_to_address>

### Comment 2
<location> `audformat/core/database.py:841-845` </location>
<code_context>
                     )
+            # Normalize all categorical dtypes to "object" for consistency
+            # (pandas 3.0 may infer "str" dtype for string categories)
+            for n, y in enumerate(ys):
+                cat_dtype = y.dtype.categories.dtype
+                if str(cat_dtype) == "str":
+                    new_categories = y.dtype.categories.astype("object")
+                    ys[n] = y.astype(pd.CategoricalDtype(new_categories))
             # Find union of categorical data
             data = [y.array for y in ys]
</code_context>

<issue_to_address>
**issue:** Normalization loop assumes all `ys` elements are categorical and may break if a non-categorical series slips through.

This code assumes `y.dtype` is categorical and will raise when a non-categorical series is present. To avoid hard-to-diagnose failures when mixed dtypes are passed, guard the normalization with a check like `isinstance(y.dtype, pd.CategoricalDtype)` before accessing `.categories` or skip normalization for non-categorical inputs.
</issue_to_address>

### Comment 3
<location> `tests/test_database_get.py:514` </location>
<code_context>
                 ),
                 dtype=pd.CategoricalDtype(
-                    ["w1", "w2", "w3"],
+                    pd.Index(["w1", "w2", "w3"], dtype="object"),
                     ordered=False,
                 ),
</code_context>

<issue_to_address>
**suggestion (testing):** Add an explicit regression test that mixes `str` and `object` categorical dtypes and verifies the normalized output dtype

These changes only update existing expectations; they don’t directly test mixing categorical columns where one has `categories.dtype == 'str'` and the other `== 'object'`. Please add (or parametrize) a test that:

- Creates two categorical series for the same logical field, one with `str` categories and one with `object` categories
- Combines them via `db.get()`
- Asserts the resulting categorical has `categories.dtype == 'object'` and preserves labels

This will validate the new normalization behavior and protect against regressions.

Suggested implementation:

```python
                dtype=pd.CategoricalDtype(
                    pd.Index(["w1", "w2", "w3"], dtype="object"),
                    ordered=False,
                ),
            ),
                            [0.2, 0.2, 0.5, 0.7],
                        ),
                        dtype=pd.CategoricalDtype(
                    ),
                ),
            ],
        )



def test_get_mixed_str_and_object_categorical_dtype(db):
    # Left side: categorical with default `str` categories.dtype
    left = pd.DataFrame(
        {
            "id": [1, 2],
            "w": pd.Categorical(
                ["w1", "w2"],
                categories=["w1", "w2", "w3"],
                ordered=False,
            ),
        }
    )

    # Right side: categorical with explicitly `object` categories.dtype
    right = pd.DataFrame(
        {
            "id": [3, 4],
            "w": pd.Categorical(
                pd.Index(["w1", "w3"], dtype="object"),
                categories=pd.Index(["w1", "w2", "w3"], dtype="object"),
                ordered=False,
            ),
        }
    )

    # Register both tables in the test database and combine them via db.get()
    db["left_mixed_cat"] = left
    db["right_mixed_cat"] = right

    result = db.get(["left_mixed_cat", "right_mixed_cat"])

    # Ensure we didn't lose any labels
    assert list(result["w"]) == ["w1", "w2", "w1", "w3"]

    # Ensure the resulting categorical is normalized to object categories
    result_cat = result["w"].dtype
    assert isinstance(result_cat, pd.CategoricalDtype)
    assert result_cat.categories.dtype == "object"

```

Depending on how `db` is implemented in your test suite, you may need to:

1. Replace the lines

   ```python
   db["left_mixed_cat"] = left
   db["right_mixed_cat"] = right
   ```

   with the appropriate API for registering DataFrames as tables, e.g.:

   ```python
   db.create_table("left_mixed_cat", left)
   db.create_table("right_mixed_cat", right)
   ```

   or similar helper functions already used elsewhere in `tests/test_database_get.py`.

2. Adjust the `db.get(...)` call to match the existing calling convention. For example, some implementations might require:

   ```python
   result = db.get("left_mixed_cat", "right_mixed_cat")
   ```

   or a different argument shape, rather than a list of table names.

3. If the existing tests use a specific schema or join key instead of a simple vertical concatenation, update the construction of `left` and `right` (column names and any extra columns) to follow that pattern while still ensuring:
   - One side uses a `str`-typed categories index
   - The other uses an `object`-typed categories index
   - The combined result is asserted to have `result["w"].dtype.categories.dtype == "object"` and preserves the expected label order.

These adjustments will align the new regression test with the rest of the test harness while preserving the core intent of explicitly testing mixed `str`/`object` categorical normalization.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@codecov
Copy link

codecov bot commented Jan 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.0%. Comparing base (347fc69) to head (63674d1).
⚠️ Report is 1 commits behind head on dev.

Additional details and impacted files
Files with missing lines Coverage Δ
audformat/core/database.py 100.0% <100.0%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hagenw hagenw merged commit 1fb9889 into dev Jan 24, 2026
13 checks passed
@hagenw hagenw deleted the fix-database-get branch January 24, 2026 09:28
hagenw added a commit that referenced this pull request Jan 24, 2026
* Fix categorical dtype with Database.get()

* Update tests

* Add additional test

* Improve code

* Clean up comment

* We converted to categorical data

* Simplify test

* Simplify string test
hagenw added a commit that referenced this pull request Jan 24, 2026
* pandas 3.0: segmented_index() and set_index_dtypes() (#490)

* Add failing test

* Make test pandas 3.0.0 compatible

* Fix set_index_dtypes() for pandas 3.0

* Add comment

* Fix doctests

* Update segmented_index()

* Use segmented_index in test

* Add test for segmented_index

* Avoid warning in testing.add_table() (#491)

* pandas 3.0: fix utils.hash() (#492)

* pandas 3.0: fix utils.hash()

* Fix comment

* Remove unneeded code

* Add more tests

* Preserve ordered setting

* Update comment

* Fix categorical dtype with Database.get() (#493)

* Fix categorical dtype with Database.get()

* Update tests

* Add additional test

* Improve code

* Clean up comment

* We converted to categorical data

* Simplify test

* Simplify string test

* Require timedelta64[ns] in assert_index() (#494)

* Require timedelta64[ns] in assert_index()

* Add tests for mixed cases

* pandas 3.0: fix doctests output
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants