@hagenw hagenw commented Jan 27, 2026

Closes #487

...

String updates

Changes in behavior

Output of print(obj.dtype)

| Command | pandas 2.3.3 | pandas 3.0.0 |
| --- | --- | --- |
| pd.Series([]) | object | object |
| pd.Series(["a"]) | object | str |
| pd.Series(["a", pd.NA]) | object | str |
| pd.Series(["a", np.nan]) | object | str |
| pd.Series(["a"], dtype="string") | string | string |
| pd.Series(["a"], dtype=str) | object | str |
| pd.Series(["a"], dtype="str") | object | str |

Output of obj.dtype

| Command | pandas 2.3.3 | pandas 3.0.0 |
| --- | --- | --- |
| pd.Series([]) | dtype('O') | dtype('O') |
| pd.Series(["a"]) | dtype('O') | <StringDtype(na_value=nan)> |
| pd.Series(["a", pd.NA]) | dtype('O') | <StringDtype(na_value=nan)> |
| pd.Series(["a", np.nan]) | dtype('O') | <StringDtype(na_value=nan)> |
| pd.Series(["a"], dtype="string") | string[python] | <StringDtype(na_value=<NA>)> |
| pd.Series(["a"], dtype=str) | dtype('O') | <StringDtype(na_value=nan)> |
| pd.Series(["a"], dtype="str") | dtype('O') | <StringDtype(na_value=nan)> |
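
The tables above suggest that only inferred dtypes and `dtype=str` change between versions, while an explicit `"string"` request prints the same in both. A minimal sketch of pinning the dtype explicitly, assuming the goal is simply to avoid version-dependent inference (the variable names are illustrative and not taken from the PR):

```python
import pandas as pd

# Request the dtype explicitly instead of relying on inference,
# which differs between pandas 2.x and pandas 3.x for string data.
as_object = pd.Series(["a"], dtype="object")
as_string = pd.Series(["a"], dtype="string")

print(as_object.dtype)  # object
print(as_string.dtype)  # string
```

This is the same idea the updated tests appear to follow when they pass `dtype='object'` or `dtype='string'` rather than relying on pandas defaults.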

Code to create a test table

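import audformat


def check_dtype(scheme):
    """Return the dtype of a column that uses the given scheme."""
    # Minimal database with one filewise table and one scheme-backed column
    db = audformat.Database("test")
    db["table"] = audformat.Table(audformat.filewise_index("f1"))
    db.schemes["scheme"] = scheme
    db["table"]["column"] = audformat.Column(scheme_id="scheme")
    # Assign a value so the column holds data of the scheme's dtype
    db["table"]["column"].set("a")
    return db["table"]["column"].get().dtype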

Data type of column (db["table"]["column"].get().dtype).

For pandas 2.3.3 I checked that main and this branch produce the same results.

| Scheme | pandas 2.3.3 | pandas 3.0.0 |
| --- | --- | --- |
| Scheme("object") | dtype('O') | dtype('O') |
| Scheme("str") | string[python] | <StringDtype(na_value=<NA>)> |
| Scheme("str", labels=["a", "b"]) | CategoricalDtype(categories=['a', 'b'], ordered=False, categories_dtype=object) | CategoricalDtype(categories=['a', 'b'], ordered=False, categories_dtype=str) |

Summary by Sourcery

Add compatibility adjustments for pandas 3.0, ensuring stable dtypes, hashing, and index behavior across pandas versions.

Enhancements:

  • Normalize string and categorical dtypes (including scheme categories) to consistent object/string forms for stable behavior across pandas versions.
  • Enforce segmented index start/end levels to use timedelta64[ns] and file levels to use string dtype, and adjust timedelta conversions accordingly.
  • Make hashing of pandas objects robust to pandas 3.0 string/categorical changes by normalizing column dtypes before converting to pyarrow tables.
  • Relax pandas upper bound in project configuration to allow pandas 3.x.

CI:

  • Run documentation, linter, and test workflows on both main and dev branches.

Tests:

  • Update and extend tests to account for pandas 3.0 dtype and index changes, including new coverage for categorical dtype normalization, segmented index timedelta dtypes, and various index dtype scenarios.

hagenw added 12 commits January 23, 2026 11:31
* Add failing test

* Make test pandas 3.0.0 compatible

* Fix set_index_dtypes() for pandas 3.0

* Add comment

* Fix doctests

* Update segmented_index()

* Use segmented_index in test

* Add test for segmented_index
* pandas 3.0: fix utils.hash()

* Fix comment

* Remove unneeded code

* Add more tests

* Preserve ordered setting

* Update comment
* Fix categorical dtype with Database.get()

* Update tests

* Add additional test

* Improve code

* Clean up comment

* We converted to categorical data

* Simplify test

* Simplify string test
* Require timedelta64[ns] in assert_index()

* Add tests for mixed cases
* pandas 3.0: segmented_index() and set_index_dtypes() (#490)

* Add failing test

* Make test pandas 3.0.0 compatible

* Fix set_index_dtypes() for pandas 3.0

* Add comment

* Fix doctests

* Update segmented_index()

* Use segmented_index in test

* Add test for segmented_index

* Avoid warning in testing.add_table() (#491)

* pandas 3.0: fix utils.hash() (#492)

* pandas 3.0: fix utils.hash()

* Fix comment

* Remove unneeded code

* Add more tests

* Preserve ordered setting

* Update comment

* Fix categorical dtype with Database.get() (#493)

* Fix categorical dtype with Database.get()

* Update tests

* Add additional test

* Improve code

* Clean up comment

* We converted to categorical data

* Simplify test

* Simplify string test

* Require timedelta64[ns] in assert_index() (#494)

* Require timedelta64[ns] in assert_index()

* Add tests for mixed cases

* pandas 3.0: fix doctests output
* Update test_utils.py

* Update test_misc_table

* Set index dtypes directly

* Fix test_table

* Update to_timedelta in index.py

* Fix conversion to timedelta in testing.py

* Update test_utils_concat.py

* Add comment

* Update to_timedelta()

sourcery-ai bot commented Jan 27, 2026

Reviewer's Guide

Adjust the core index, database, and table utilities and their tests for pandas 3.0's dtype changes (string vs. object, timedelta64[ns], categorical categories), update the hashing logic for stable pyarrow schemas, and update CI to run on the dev branch and allow pandas 3.x.

Class diagram for updated Database string and categorical handling

classDiagram
    class Database {
        +append_series(ys, y, column_id)
        +scheme_in_column(scheme_id, column, column_id)
    }

    class _is_string_like_dtype {
        <<function>>
        +_is_string_like_dtype(dtype) bool
    }

    class CategoricalDtype {
        +categories
        +ordered
    }

    class numpy_dtype {
    }

    class pandas_StringDtype {
    }

    Database ..> _is_string_like_dtype : uses
    Database ..> CategoricalDtype : normalizes_categories
    _is_string_like_dtype ..> pandas_StringDtype : checks_instance
    _is_string_like_dtype ..> numpy_dtype : returns_object_dtype
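
The relationship in the class diagram (a string-like check that maps string category dtypes to object) might look roughly like the following. This is a sketch under assumptions, not the code from `audformat.core.database`: the names `is_string_like` and `normalized_categories_dtype` are hypothetical, and the check only covers `pd.StringDtype`.

```python
import numpy as np
import pandas as pd


def is_string_like(dtype) -> bool:
    """Hypothetical check: True for pandas string extension dtypes."""
    return isinstance(dtype, pd.StringDtype)


def normalized_categories_dtype(dtype: pd.CategoricalDtype):
    """Report string-like category dtypes as plain object dtype."""
    cat_dtype = dtype.categories.dtype
    return np.dtype(object) if is_string_like(cat_dtype) else cat_dtype


# Two categorical dtypes with the same labels but different category dtypes
object_cats = pd.CategoricalDtype(pd.Index(["a", "b"], dtype="object"))
string_cats = pd.CategoricalDtype(pd.Index(["a", "b"], dtype="string"))

# After normalization both report the same category dtype,
# so the set of distinct dtypes collapses to one entry.
dtypes = {
    str(normalized_categories_dtype(d)) for d in (object_cats, string_cats)
}
print(sorted(dtypes))  # ['object']
```

Treating both variants as object lets mixed categoricals be compared or combined without a spurious dtype mismatch.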

Flow diagram for updated hash DataFrame normalization

flowchart TD
    A["Start hash(obj)"] --> B["Convert obj to DataFrame df with reset_index"]
    B --> C["Init schema_fields as empty list"]
    C --> D{"For each column col in df.columns"}
    D -->|string dtype| E["Cast df[col] to object dtype"]
    E --> F["Append (col, pa.string()) to schema_fields"]
    D -->|categorical dtype| G["cat_dtype = df[col].dtype.categories.dtype"]
    G --> H{"cat_dtype is string dtype"}
    H -->|yes| I["new_categories = categories.astype(object)"]
    I --> J["Rebuild categorical with new_categories and same ordered"]
    J --> K["Append (col, None) to schema_fields"]
    H -->|no| K
    D -->|other dtype| L["Append (col, None) to schema_fields"]
    F --> D
    K --> D
    L --> D
    D -->|done| M{"len(df) == 0 and any schema_fields has explicit type"}
    M -->|yes| N["Build pa.schema from schema_fields
    use explicit type if not None
    else pa.from_numpy_dtype(df[name].dtype)"]
    N --> O["table = pa.Table.from_pandas(df, preserve_index=false, schema=schema)"]
    M -->|no| P["table = pa.Table.from_pandas(df, preserve_index=false)"]
    O --> Q["schema_str = table.schema.to_string(excluding metadata)"]
    P --> Q
    Q --> R["Use schema_str and table content to compute hash"]
    R --> S["Return hash value"]
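
The column normalization in the flow diagram can be sketched without the pyarrow step. The function name `normalize_for_hashing` is hypothetical, and the sketch assumes that checking for `pd.StringDtype` is sufficient to detect string columns and string categories; the actual `utils.hash()` implementation may differ.

```python
import pandas as pd


def normalize_for_hashing(df: pd.DataFrame) -> pd.DataFrame:
    """Cast string columns and string categories to object dtype."""
    df = df.copy()
    for col in df.columns:
        dtype = df[col].dtype
        if isinstance(dtype, pd.StringDtype):
            # String columns become plain object columns
            df[col] = df[col].astype(object)
        elif isinstance(dtype, pd.CategoricalDtype):
            if isinstance(dtype.categories.dtype, pd.StringDtype):
                # Rebuild the categorical with object categories,
                # keeping the codes and the ordered flag unchanged
                new_dtype = pd.CategoricalDtype(
                    dtype.categories.astype(object),
                    ordered=dtype.ordered,
                )
                df[col] = pd.Categorical.from_codes(
                    df[col].cat.codes.to_numpy(),
                    dtype=new_dtype,
                )
    return df


df = pd.DataFrame(
    {
        "text": pd.Series(["a", "b"], dtype="string"),
        "label": pd.Series(
            ["a", "b"],
            dtype=pd.CategoricalDtype(
                pd.Index(["a", "b"], dtype="string"),
                ordered=True,
            ),
        ),
        "value": [1, 2],
    }
)
normalized = normalize_for_hashing(df)
```

Reusing the existing codes avoids re-encoding the values, so only the category dtype changes while the data and the ordering stay the same.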

File-Level Changes

Change Details Files
Normalize string and categorical dtypes for hashing and scheme handling to remain stable across pandas 2.x and 3.x.
  • In audformat.core.utils.hash, normalize pandas string columns to object dtype before conversion to pyarrow, adjust categorical columns with string categories to use object categories, and construct explicit pyarrow schemas for empty DataFrames as needed.
  • Ensure pyarrow Table creation does not depend on pandas 3.0’s new string/large_string mapping so that hash outputs stay stable.
  • Update doctest example for iter_by_file to create Series with explicit object dtype.
audformat/core/utils.py
Enforce consistent timedelta64[ns] and string index dtypes in index helpers to satisfy new pandas 3.0 dtype behavior.
  • Change to_timedelta in audformat.core.index to always return timedelta64[ns] (using as_unit/astype) regardless of input form.
  • Tighten audformat.core.index.assert_index checks to specifically require timedelta64[ns] for start/end levels.
  • Update segmented_index to construct FILE level as string-typed Index directly and rely on to_timedelta for START/END, then validate via assert_index.
  • Update random segment generation in core.testing.add_table to generate numeric seconds and use to_timedelta, keeping index dtypes consistent.
  • Ensure TimedeltaIndex created in table tests has explicit timedelta64[ns] dtype and MultiIndex timedelta levels in tests use .astype('timedelta64[ns]').
audformat/core/index.py
audformat/core/testing.py
audformat/core/table.py
tests/test_index.py
tests/test_table.py
Normalize string-like categorical dtypes across tables and during scheme handling so mixed str/object/string categories remain compatible under pandas 3.0.
  • Add helper _is_string_like_dtype in audformat.core.database to identify pandas string-like dtypes (StringDtype, str-like).
  • In Database.append_series, normalize categorical category dtypes so any string-like categories are treated as object for type union and error reporting, deduplicating via sorted(unique dtypes).
  • In Database.scheme_in_column, after aligning to the scheme dtype, normalize all categorical columns with string-like categories to use object categories before performing the union of categoricals.
  • Add regression test that mixes categorical dtypes with object vs string categories across tables and asserts combined result uses object categories and preserves label order.
audformat/core/database.py
tests/test_database_get.py
Tighten index/string dtype expectations in misc/table utilities and tests to match pandas 3.0 default string behavior.
  • Update multiple tests in test_misc_table, test_utils, test_utils_concat, and others to construct Index/MultiIndex and Series with explicit dtype='object', dtype='string', or specific numeric dtypes instead of relying on pandas defaults.
  • Remove/adjust parametrized cases that previously expected None/str dtypes when pandas inferred object, now explicitly passing 'object' for index/column dtype expectations in dtype_* tests.
  • Add new dtype conversion test for MultiIndex with empty levels being converted to timedelta64[ns] using set_index_dtypes and expand tests for segmented/timed indexes.
  • Adjust categorical tests to use pd.CategoricalDtype with categories defined via typed pd.Index (object/string) and ensure behavior is stable in pandas 3.0.
tests/test_misc_table.py
tests/test_utils.py
tests/test_utils_concat.py
Ensure CSV reading and index utilities stay compatible with pandas 3.0’s column dtype changes.
  • In test_read_csv, when expecting a DataFrame with Index result, cast columns to str if pandas>=3.0 to match new column dtype behavior before comparing.
  • In tests for index intersection/overlap and set_index_dtypes, construct indices using typed pd.Index levels (object, Int64, string, timedelta64[ns], datetime64[ns]) instead of relying on untyped lists, and adjust expected MultiIndex construction accordingly.
  • Extend set_index_dtypes to cast timedelta levels using pd.to_timedelta(...).astype(dtype) to avoid type errors when targeting timedelta64[ns] and to work under newer pandas behavior.
  • Add new test case to ensure converting empty MultiIndex with START/END int levels to timedelta64[ns] works and yields the correct typed empty index.
tests/test_utils.py
audformat/core/utils.py
Broaden CI and dependency constraints to test pandas 3.x, and run workflows on dev.
  • Relax pandas dependency in pyproject.toml from '<3.0' to no upper bound, enabling pandas 3.x in dev/test environments.
  • Update GitHub Actions workflows (doc.yml, linter.yml, test.yml) to trigger on both main and dev branches for push and pull_request.
pyproject.toml
.github/workflows/doc.yml
.github/workflows/linter.yml
.github/workflows/test.yml
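
The timedelta handling listed above (always ending up with timedelta64[ns] for start/end levels) can be illustrated with a small sketch. The helper name `to_timedelta_ns` is hypothetical and only handles list-like input in a single unit; the real `to_timedelta` in `audformat.core.index` accepts more input forms.

```python
import pandas as pd


def to_timedelta_ns(times, unit="s"):
    """Convert list-like numeric times to a timedelta64[ns] index."""
    # pd.to_timedelta returns a TimedeltaIndex for list-like input;
    # as_unit("ns") pins the resolution regardless of what was inferred.
    return pd.to_timedelta(list(times), unit=unit).as_unit("ns")


starts = to_timedelta_ns([0.0, 1.5])
print(starts.dtype)  # timedelta64[ns]
```

Pinning the unit after conversion keeps the index dtype independent of which resolution a given pandas version chooses by default.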

codecov bot commented Jan 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.0%. Comparing base (56a8268) to head (51ea81e).

Additional details and impacted files
| Files with missing lines | Coverage Δ |
| --- | --- |
| audformat/core/database.py | 100.0% <100.0%> (ø) |
| audformat/core/index.py | 100.0% <100.0%> (ø) |
| audformat/core/table.py | 100.0% <100.0%> (ø) |
| audformat/core/testing.py | 100.0% <100.0%> (ø) |
| audformat/core/utils.py | 100.0% <100.0%> (ø) |

Development

Successfully merging this pull request may close these issues.

Pandas 3.0.0 breaks timedelta precision
