Add support for pandas 3.0 #500

hagenw · 2026-01-27T12:57:27Z

Closes #487

...

String updates

Changes in behavior

Output of print(obj.dtype)

Command	pandas 2.3.3	pandas 3.0.0
pd.Series([])	object	object
pd.Series(["a"])	object	str
pd.Series(["a", pd.NA])	object	str
pd.Series(["a", np.nan])	object	str
pd.Series(["a"], dtype="string")	string	string
pd.Series(["a"], dtype=str)	object	str
pd.Series(["a"], dtype="str")	object	str

Output of obj.dtype

Command	pandas 2.3.3	pandas 3.0.0
pd.Series([])	`dtype('O')`	`dtype('O')`
pd.Series(["a"])	`dtype('O')`	`<StringDtype(na_value=nan)>`
pd.Series(["a", pd.NA])	`dtype('O')`	`<StringDtype(na_value=nan)>`
pd.Series(["a", np.nan])	`dtype('O')`	`<StringDtype(na_value=nan)>`
pd.Series(["a"], dtype="string")	`string[python]`	`<StringDtype(na_value=<NA>)>`
pd.Series(["a"], dtype=str)	`dtype('O')`	`<StringDtype(na_value=nan)>`
pd.Series(["a"], dtype="str")	`dtype('O')`	`<StringDtype(na_value=nan)>`
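
The tables above imply that comparing a dtype against `object` no longer identifies string data in pandas 3.0. A minimal sketch of a check that should give the same answer on both versions, assuming `pandas.api.types.is_string_dtype` is an acceptable test for the use case; the helper name `holds_strings` is hypothetical and not part of audformat:

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def holds_strings(series: pd.Series) -> bool:
    """Return True if the series stores strings (hypothetical helper).

    Avoids ``series.dtype == object``,
    which depends on the installed pandas version.
    """
    return bool(is_string_dtype(series))


inferred = pd.Series(["a"])  # object in pandas 2.3.3, str in pandas 3.0.0
explicit = pd.Series(["a"], dtype="string")  # string in both versions
numbers = pd.Series([1, 2])

print(pd.__version__, inferred.dtype, explicit.dtype)
print(holds_strings(inferred), holds_strings(explicit), holds_strings(numbers))
```

The printed dtypes differ between pandas versions, whereas the result of the helper should not.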

Code to create a test table

import audformat

def check_dtype(scheme):
    db = audformat.Database("test")
    db["table"] = audformat.Table(audformat.filewise_index("f1"))
    db.schemes["scheme"] = scheme
    db["table"]["column"] = audformat.Column(scheme_id="scheme")
    db["table"]["column"].set("a")
    return db["table"]["column"].get().dtype

Data type of column (db["table"]["column"].get().dtype).

For pandas 2.3.3 I checked that main and this branch produce the same results.

Scheme	pandas 2.3.3	pandas 3.0.0
`Scheme("object")`	`dtype('O')`	`dtype('O')`
`Scheme("str")`	`string[python]`	`<StringDtype(na_value=<NA>)>`
`Scheme("str", labels=["a", "b"])`	`CategoricalDtype(categories=['a', 'b'], ordered=False, categories_dtype=object)`	`CategoricalDtype(categories=['a', 'b'], ordered=False, categories_dtype=str)`

Summary by Sourcery

Add compatibility adjustments for pandas 3.0, ensuring stable dtypes, hashing, and index behavior across pandas versions.

Enhancements:

Normalize string and categorical dtypes (including scheme categories) to consistent object/string forms for stable behavior across pandas versions.
Enforce segmented index start/end levels to use timedelta64[ns] and file levels to use string dtype, and adjust timedelta conversions accordingly.
Make hashing of pandas objects robust to pandas 3.0 string/categorical changes by normalizing column dtypes before converting to pyarrow tables.
Relax pandas upper bound in project configuration to allow pandas 3.x.

CI:

Run documentation, linter, and test workflows on both main and dev branches.

Tests:

Update and extend tests to account for pandas 3.0 dtype and index changes, including new coverage for categorical dtype normalization, segmented index timedelta dtypes, and various index dtype scenarios.

sourcery-ai · 2026-01-27T12:57:33Z

Reviewer's Guide

Adjusts core index, database, and table utilities, together with their tests, for the dtype changes in pandas 3.0: strings are inferred as str instead of object, start/end index levels are required to be timedelta64[ns], and the categories of categorical dtypes can be str-typed. The PR also updates the hashing logic so that pyarrow schemas stay stable, runs CI on the dev branch, and allows pandas 3.x.

Class diagram for updated Database string and categorical handling

classDiagram
    class Database {
        +append_series(ys, y, column_id)
        +scheme_in_column(scheme_id, column, column_id)
    }

    class _is_string_like_dtype {
        <<function>>
        +_is_string_like_dtype(dtype) bool
    }

    class CategoricalDtype {
        +categories
        +ordered
    }

    class numpy_dtype {
    }

    class pandas_StringDtype {
    }

    Database ..> _is_string_like_dtype : uses
    Database ..> CategoricalDtype : normalizes_categories
    _is_string_like_dtype ..> pandas_StringDtype : checks_instance
    _is_string_like_dtype ..> numpy_dtype : returns_object_dtype
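
The relationship in the class diagram can be sketched as follows. This is an illustration under assumptions, not the code from `audformat/core/database.py`: the function names `is_string_like_dtype` and `category_dtypes` are hypothetical, and the sketch only covers the step of treating string-like category dtypes as object before comparing them.

```python
import numpy as np
import pandas as pd


def is_string_like_dtype(dtype) -> bool:
    """Return True for pandas string dtypes (hypothetical helper).

    In pandas 3.0 both the default str dtype
    and dtype="string" are instances of pd.StringDtype.
    """
    return isinstance(dtype, pd.StringDtype)


def category_dtypes(series_list) -> list:
    """Collect the category dtypes of categorical series as sorted names.

    String-like category dtypes are counted as object,
    so object and string categories are not reported as a mismatch.
    """
    names = set()
    for series in series_list:
        dtype = series.dtype.categories.dtype
        if is_string_like_dtype(dtype):
            dtype = np.dtype(object)
        names.add(str(dtype))
    return sorted(names)


a = pd.Series(pd.Categorical(["a"], categories=pd.Index(["a", "b"], dtype=object)))
b = pd.Series(pd.Categorical(["c"], categories=pd.Index(["c"], dtype="string")))
c = pd.Series(pd.Categorical([1], categories=[1, 2]))

print(category_dtypes([a, b]))
print(category_dtypes([a, b, c]))
```

With this normalization, `a` and `b` count as having the same category dtype, while `c` is still reported as different.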

Flow diagram for updated hash DataFrame normalization

flowchart TD
    A["Start hash(obj)"] --> B["Convert obj to DataFrame df with reset_index"]
    B --> C["Init schema_fields as empty list"]
    C --> D{"For each column col in df.columns"}
    D -->|string dtype| E["Cast df[col] to object dtype"]
    E --> F["Append (col, pa.string()) to schema_fields"]
    D -->|categorical dtype| G["cat_dtype = df[col].dtype.categories.dtype"]
    G --> H{"cat_dtype is string dtype"}
    H -->|yes| I["new_categories = categories.astype(object)"]
    I --> J["Rebuild categorical with new_categories and same ordered"]
    J --> K["Append (col, None) to schema_fields"]
    H -->|no| K
    D -->|other dtype| L["Append (col, None) to schema_fields"]
    F --> D
    K --> D
    L --> D
    D -->|done| M{"len(df) == 0 and any schema_fields has explicit type"}
    M -->|yes| N["Build pa.schema from schema_fields
    use explicit type if not None
    else pa.from_numpy_dtype(df[name].dtype)"]
    N --> O["table = pa.Table.from_pandas(df, preserve_index=false, schema=schema)"]
    M -->|no| P["table = pa.Table.from_pandas(df, preserve_index=false)"]
    O --> Q["schema_str = table.schema.to_string(excluding metadata)"]
    P --> Q
    Q --> R["Use schema_str and table content to compute hash"]
    R --> S["Return hash value"]
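
The dtype normalization branch of the flow diagram (nodes D to L) can be sketched in plain pandas. This is a sketch under assumptions, not the implementation of `audformat.utils.hash`: the function name `normalize_string_dtypes` is hypothetical, and the pyarrow schema and hashing steps are left out.

```python
import pandas as pd


def normalize_string_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Cast string columns and string categories to object (hypothetical helper)."""
    df = df.copy()
    for col in df.columns:
        dtype = df[col].dtype
        if isinstance(dtype, pd.StringDtype):
            # String column: store as object dtype
            df[col] = df[col].astype(object)
        elif isinstance(dtype, pd.CategoricalDtype) and isinstance(
            dtype.categories.dtype, pd.StringDtype
        ):
            # Categorical with string categories:
            # rebuild with object categories, keep codes and ordered flag
            df[col] = pd.Categorical.from_codes(
                df[col].cat.codes.to_numpy(),
                categories=dtype.categories.astype(object),
                ordered=dtype.ordered,
            )
        # All other dtypes are left untouched
    return df


df = pd.DataFrame(
    {
        "s": pd.Series(["a", "b"], dtype="string"),
        "c": pd.Categorical(
            ["a", "b"],
            categories=pd.Index(["a", "b"], dtype="string"),
            ordered=True,
        ),
        "n": [1, 2],
    }
)
out = normalize_string_dtypes(df)
print(out.dtypes)
```

Rebuilding the categorical from its codes keeps the values and the `ordered` setting unchanged, which matches the "same ordered" step in the diagram.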

File-Level Changes

Change	Details	Files
Normalize string and categorical dtypes for hashing and scheme handling to remain stable across pandas 2.x and 3.x.	In audformat.core.utils.hash, normalize pandas string columns to object dtype before conversion to pyarrow, adjust categorical columns with string categories to use object categories, and construct explicit pyarrow schemas for empty DataFrames as needed. Ensure pyarrow Table creation does not depend on pandas 3.0’s new string/large_string mapping so that hash outputs stay stable. Update doctest example for iter_by_file to create Series with explicit object dtype.	`audformat/core/utils.py`
Enforce consistent timedelta64[ns] and string index dtypes in index helpers to satisfy new pandas 3.0 dtype behavior.	Change to_timedelta in audformat.core.index to always return timedelta64[ns] (using as_unit/astype) regardless of input form. Tighten audformat.core.index.assert_index checks to specifically require timedelta64[ns] for start/end levels. Update segmented_index to construct FILE level as string-typed Index directly and rely on to_timedelta for START/END, then validate via assert_index. Update random segment generation in core.testing.add_table to generate numeric seconds and use to_timedelta, keeping index dtypes consistent. Ensure TimedeltaIndex created in table tests has explicit timedelta64[ns] dtype and MultiIndex timedelta levels in tests use .astype('timedelta64[ns]').	`audformat/core/index.py` `audformat/core/testing.py` `audformat/core/table.py` `tests/test_index.py` `tests/test_table.py`
Normalize string-like categorical dtypes across tables and during scheme handling so mixed str/object/string categories remain compatible under pandas 3.0.	Add helper _is_string_like_dtype in audformat.core.database to identify pandas string-like dtypes (StringDtype, str-like). In Database.append_series, normalize categorical category dtypes so any string-like categories are treated as object for type union and error reporting, deduplicating via sorted(unique dtypes). In Database.scheme_in_column, after aligning to the scheme dtype, normalize all categorical columns with string-like categories to use object categories before performing the union of categoricals. Add regression test that mixes categorical dtypes with object vs string categories across tables and asserts combined result uses object categories and preserves label order.	`audformat/core/database.py` `tests/test_database_get.py`
Tighten index/string dtype expectations in misc/table utilities and tests to match pandas 3.0 default string behavior.	Update multiple tests in test_misc_table, test_utils, test_utils_concat, and others to construct Index/MultiIndex and Series with explicit dtype='object', dtype='string', or specific numeric dtypes instead of relying on pandas defaults. Remove/adjust parametrized cases that previously expected None/str dtypes when pandas inferred object, now explicitly passing 'object' for index/column dtype expectations in dtype_* tests. Add new dtype conversion test for MultiIndex with empty levels being converted to timedelta64[ns] using set_index_dtypes and expand tests for segmented/timed indexes. Adjust categorical tests to use pd.CategoricalDtype with categories defined via typed pd.Index (object/string) and ensure behavior is stable in pandas 3.0.	`tests/test_misc_table.py` `tests/test_utils.py` `tests/test_utils_concat.py`
Ensure CSV reading and index utilities stay compatible with pandas 3.0’s column dtype changes.	In test_read_csv, when expecting a DataFrame with Index result, cast columns to str if pandas>=3.0 to match new column dtype behavior before comparing. In tests for index intersection/overlap and set_index_dtypes, construct indices using typed pd.Index levels (object, Int64, string, timedelta64[ns], datetime64[ns]) instead of relying on untyped lists, and adjust expected MultiIndex construction accordingly. Extend set_index_dtypes to cast timedelta levels using pd.to_timedelta(...).astype(dtype) to avoid type errors when targeting timedelta64[ns] and to work under newer pandas behavior. Add new test case to ensure converting empty MultiIndex with START/END int levels to timedelta64[ns] works and yields the correct typed empty index.	`tests/test_utils.py` `audformat/core/utils.py`
Broaden CI and dependency constraints to test pandas 3.x, and run workflows on dev.	Relax pandas dependency in pyproject.toml from '<3.0' to no upper bound, enabling pandas 3.x in dev/test environments. Update GitHub Actions workflows (doc.yml, linter.yml, test.yml) to trigger on both main and dev branches for push and pull_request.	`pyproject.toml` `.github/workflows/doc.yml` `.github/workflows/linter.yml` `.github/workflows/test.yml`
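
The index-related rows of the table can be illustrated with a short sketch. It assumes pandas 2.0 or newer (for `TimedeltaIndex.as_unit`) and uses the hypothetical names `to_timedelta_ns` and `segmented_index_sketch`; it is not the code of `audformat.core.index`, only an example of pinning the start/end levels to timedelta64[ns] and the file level to string dtype.

```python
import pandas as pd


def to_timedelta_ns(seconds) -> pd.TimedeltaIndex:
    """Convert numbers given in seconds to timedelta64[ns] (hypothetical helper)."""
    # as_unit("ns") fixes the resolution
    # independent of what pd.to_timedelta would infer
    return pd.to_timedelta(list(seconds), unit="s").as_unit("ns")


def segmented_index_sketch(files, starts, ends) -> pd.MultiIndex:
    """Build a file/start/end index with explicit level dtypes."""
    return pd.MultiIndex.from_arrays(
        [
            pd.Index(files, dtype="string"),
            to_timedelta_ns(starts),
            to_timedelta_ns(ends),
        ],
        names=["file", "start", "end"],
    )


index = segmented_index_sketch(["f1", "f2"], [0, 0.5], [1, 1.5])
print(index.dtypes)
```

Setting the dtypes while constructing the levels avoids a later conversion step and makes the result independent of the defaults of the installed pandas version.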

Possibly linked issues

Pandas 3.0.0 breaks timedelta precision #487: PR updates timedelta handling, segmented indexes, and dtype normalization so audformat works correctly with pandas 3.0.0.

codecov · 2026-01-27T12:59:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.0%. Comparing base (56a8268) to head (51ea81e).

Additional details and impacted files

Files with missing lines	Coverage Δ
audformat/core/database.py	`100.0% <100.0%> (ø)`
audformat/core/index.py	`100.0% <100.0%> (ø)`
audformat/core/table.py	`100.0% <100.0%> (ø)`
audformat/core/testing.py	`100.0% <100.0%> (ø)`
audformat/core/utils.py	`100.0% <100.0%> (ø)`

hagenw added 12 commits January 23, 2026 11:31

CI: run tests on dev branch

3c88176

Fix segmented_index() and set_index_dtypes() (#490)

f732b5d

* Add failing test
* Make test pandas 3.0.0 compatible
* Fix set_index_dtypes() for pandas 3.0
* Add comment
* Fix doctests
* Update segmented_index()
* Use segmented_index in test
* Add test for segmented_index

Avoid warning in testing.add_table() (#491)

8ce1358

Fix utils.hash() to return old value (#492)

f915774

* pandas 3.0: fix utils.hash()
* Fix comment
* Remove unneeded code
* Add more tests
* Preserve ordered setting
* Update comment

Fix categorical dtype with Database.get() (#493)

9bff331

* Fix categorical dtype with Database.get()
* Update tests
* Add additional test
* Improve code
* Clean up comment
* We converted to categorical data
* Simplify test
* Simplify string test

Require timedelta64[ns] in assert_index() (#494)

5c9b7c7

* Require timedelta64[ns] in assert_index()
* Add tests for mixed cases

TST: fix misc table tests (#496)

639d29d

TST: fix remaining tests (#497)

591af86

* Update test_utils.py
* Update test_misc_table
* Set index dtypes directly
* Fix test_table
* Update to_timedelta in index.py
* Fix conversion to timedelta in testing.py
* Update test_utils_concat.py
* Add comment
* Update to_timedelta()

DOC: show again full table output (#498)

f0a4f35

Remove deprecated copy from astype() (#499)

5990e02

TST: enable pandas 3.0 in tests

0ea46b8

hagenw added 2 commits January 27, 2026 16:32

Fix error message for non-matching categories

d18c884

Improve comments

51ea81e
