Add support for pandas 3.0 #500
base: main
* Add failing test
* Make test pandas 3.0.0 compatible
* Fix set_index_dtypes() for pandas 3.0
* Add comment
* Fix doctests
* Update segmented_index()
* Use segmented_index in test
* Add test for segmented_index

* pandas 3.0: fix utils.hash()
* Fix comment
* Remove unneeded code
* Add more tests
* Preserve ordered setting
* Update comment

* Fix categorical dtype with Database.get()
* Update tests
* Add additional test
* Improve code
* Clean up comment
* We converted to categorical data
* Simplify test
* Simplify string test

* Require timedelta64[ns] in assert_index()
* Add tests for mixed cases

* pandas 3.0: segmented_index() and set_index_dtypes() (#490)
* Add failing test
* Make test pandas 3.0.0 compatible
* Fix set_index_dtypes() for pandas 3.0
* Add comment
* Fix doctests
* Update segmented_index()
* Use segmented_index in test
* Add test for segmented_index
* Avoid warning in testing.add_table() (#491)
* pandas 3.0: fix utils.hash() (#492)
* pandas 3.0: fix utils.hash()
* Fix comment
* Remove unneeded code
* Add more tests
* Preserve ordered setting
* Update comment
* Fix categorical dtype with Database.get() (#493)
* Fix categorical dtype with Database.get()
* Update tests
* Add additional test
* Improve code
* Clean up comment
* We converted to categorical data
* Simplify test
* Simplify string test
* Require timedelta64[ns] in assert_index() (#494)
* Require timedelta64[ns] in assert_index()
* Add tests for mixed cases
* pandas 3.0: fix doctests output

* Update test_utils.py
* Update test_misc_table
* Set index dtypes directly
* Fix test_table
* Update to_timedelta in index.py
* Fix conversion to timedelta in testing.py
* Update test_utils_concat.py
* Add comment
* Update to_timedelta()
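Several of the commits above concern timedelta handling (`to_timedelta` in index.py, `timedelta64[ns]` in `assert_index()`). Since pandas 2.0, timedelta data can carry resolutions other than nanoseconds, so a conversion has to normalize the unit explicitly when `timedelta64[ns]` is required. The following is a minimal sketch of that idea, not the code from this PR: `to_timedelta_ns` is a hypothetical name, and the sketch assumes pandas 2.0 or later for `as_unit()`.

```python
import pandas as pd


def to_timedelta_ns(values) -> pd.TimedeltaIndex:
    """Convert list-like values to a TimedeltaIndex with nanosecond resolution.

    Hypothetical helper for illustration; requires pandas >= 2.0 for as_unit().
    """
    index = pd.to_timedelta(values)
    # Force nanosecond resolution regardless of the unit pandas inferred.
    return index.as_unit("ns")


index = to_timedelta_ns(["0.5s", "1s", "2min"])
print(index.dtype)
```

Normalizing at the point of conversion keeps a strict dtype check in a test helper meaningful, because every index reaches the check with the same resolution.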
Reviewer's Guide
Adjust core index, database, table utilities and tests to be compatible with pandas 3.0’s stricter dtypes (string vs object, timedelta64[ns], categorical categories) and relaxed string dtypes, update hashing logic for stable pyarrow schemas, and update CI to run on the dev branch and allow pandas 3.x.
Class diagram for updated Database string and categorical handling
classDiagram
class Database {
+append_series(ys, y, column_id)
+scheme_in_column(scheme_id, column, column_id)
}
class _is_string_like_dtype {
<<function>>
+_is_string_like_dtype(dtype) bool
}
class CategoricalDtype {
+categories
+ordered
}
class numpy_dtype {
}
class pandas_StringDtype {
}
Database ..> _is_string_like_dtype : uses
Database ..> CategoricalDtype : normalizes_categories
_is_string_like_dtype ..> pandas_StringDtype : checks_instance
_is_string_like_dtype ..> numpy_dtype : returns_object_dtype
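The diagram names a predicate `_is_string_like_dtype(dtype) -> bool` that checks for a pandas `StringDtype` and also relates to NumPy dtypes. Its body is not shown here, so the following is only a sketch under an assumption: that "string-like" means either any pandas `StringDtype` or the NumPy object dtype. The function name `is_string_like_dtype` below is illustrative and is not the project's implementation.

```python
import numpy as np
import pandas as pd


def is_string_like_dtype(dtype) -> bool:
    """Return True for pandas string dtypes and for NumPy object dtype.

    Assumption: object dtype counts as string-like, because pandas 2.x
    stores plain Python strings in object columns.
    """
    if isinstance(dtype, pd.StringDtype):
        return True
    return isinstance(dtype, np.dtype) and dtype == np.dtype(object)


print(is_string_like_dtype(pd.Series(["a"], dtype="string").dtype))
print(is_string_like_dtype(np.dtype(object)))
print(is_string_like_dtype(np.dtype("int64")))
```

A single predicate like this lets calling code treat object columns and string columns alike without checking the pandas version.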
Flow diagram for updated hash DataFrame normalization
flowchart TD
A["Start hash(obj)"] --> B["Convert obj to DataFrame df with reset_index"]
B --> C["Init schema_fields as empty list"]
C --> D{"For each column col in df.columns"}
D -->|string dtype| E["Cast df[col] to object dtype"]
E --> F["Append (col, pa.string()) to schema_fields"]
D -->|categorical dtype| G["cat_dtype = df[col].dtype.categories.dtype"]
G --> H{"cat_dtype is string dtype"}
H -->|yes| I["new_categories = categories.astype(object)"]
I --> J["Rebuild categorical with new_categories and same ordered"]
J --> K["Append (col, None) to schema_fields"]
H -->|no| K
D -->|other dtype| L["Append (col, None) to schema_fields"]
F --> D
K --> D
L --> D
D -->|done| M{"len(df) == 0 and any schema_fields has explicit type"}
M -->|yes| N["Build pa.schema from schema_fields
use explicit type if not None
else pa.from_numpy_dtype(df[name].dtype)"]
N --> O["table = pa.Table.from_pandas(df, preserve_index=false, schema=schema)"]
M -->|no| P["table = pa.Table.from_pandas(df, preserve_index=false)"]
O --> Q["schema_str = table.schema.to_string(excluding metadata)"]
P --> Q
Q --> R["Use schema_str and table content to compute hash"]
R --> S["Return hash value"]
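The per-column loop of the flowchart can be sketched in plain pandas. This is an illustration of the normalization step only, not the code from `utils.hash()`: the pyarrow table construction and the final hash computation are left out, the function name `normalize_for_hash` is hypothetical, and the marker `"string"` stands in for the `pa.string()` entry the diagram appends to `schema_fields`.

```python
import pandas as pd


def normalize_for_hash(obj: pd.DataFrame):
    """Return a normalized DataFrame and per-column schema hints.

    Hypothetical sketch of the loop in the flowchart above.
    """
    df = obj.reset_index()
    schema_fields = []
    for col in df.columns:
        dtype = df[col].dtype
        if isinstance(dtype, pd.StringDtype):
            # String columns are stored as object and flagged as strings.
            df[col] = df[col].astype(object)
            schema_fields.append((col, "string"))  # stands in for pa.string()
        elif isinstance(dtype, pd.CategoricalDtype):
            if isinstance(dtype.categories.dtype, pd.StringDtype):
                # Rebuild the categorical with object categories,
                # keeping the codes and the ordered flag unchanged.
                new_dtype = pd.CategoricalDtype(
                    dtype.categories.astype(object),
                    ordered=dtype.ordered,
                )
                codes = df[col].cat.codes.to_numpy()
                df[col] = pd.Categorical.from_codes(codes, dtype=new_dtype)
            schema_fields.append((col, None))
        else:
            schema_fields.append((col, None))
    return df, schema_fields


string_categories = pd.Index(["a", "b"], dtype="string")
frame = pd.DataFrame(
    {
        "s": pd.Series(["a", "b"], dtype="string"),
        "c": pd.Categorical.from_codes(
            [0, 1],
            dtype=pd.CategoricalDtype(string_categories, ordered=True),
        ),
        "n": [1, 2],
    }
)
normalized, fields = normalize_for_hash(frame)
print(normalized.dtypes)
print(fields)
```

Casting to object before building the table means the same logical data yields the same column types whether pandas stored it as object or as a string dtype, which is what makes the resulting schema stable across versions.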
Codecov Report
✅ All modified and coverable lines are covered by tests.
Closes #487
...
String updates
Changes in behavior
Output of print(obj.dtype) | Output of obj.dtype
dtype('O') | dtype('O')
dtype('O') | <StringDtype(na_value=nan)>
dtype('O') | <StringDtype(na_value=nan)>
dtype('O') | <StringDtype(na_value=nan)>
string[python] | <StringDtype(na_value=<NA>)>
dtype('O') | <StringDtype(na_value=nan)>
dtype('O') | <StringDtype(na_value=nan)>

Code to create a test table
Data type of column (db["table"]["column"].get().dtype).
For pandas 2.3.3 I checked that main and this branch produce the same results.

Scheme("object") | dtype('O') | dtype('O')
Scheme("str") | string[python] | <StringDtype(na_value=<NA>)>
Scheme("str", labels=["a", "b"]) | CategoricalDtype(categories=['a', 'b'], ordered=False, categories_dtype=object) | CategoricalDtype(categories=['a', 'b'], ordered=False, categories_dtype=str)

Summary by Sourcery
Add compatibility adjustments for pandas 3.0, ensuring stable dtypes, hashing, and index behavior across pandas versions.
Enhancements:
CI:
Tests:
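The dtype listing under "Changes in behavior" shows categorical dtypes that differ only in `categories_dtype` (object vs. str). A test that should pass on both pandas versions can compare the category values and the ordered flag instead of the dtype objects themselves. The helper below is a hypothetical sketch of such a comparison, not a function from this project.

```python
import pandas as pd


def same_categorical(a: pd.CategoricalDtype, b: pd.CategoricalDtype) -> bool:
    """Compare two categorical dtypes by category values and ordered flag.

    Hypothetical helper: the dtype of the categories index is ignored.
    """
    return a.ordered == b.ordered and list(a.categories) == list(b.categories)


object_based = pd.CategoricalDtype(pd.Index(["a", "b"], dtype=object))
string_based = pd.CategoricalDtype(pd.Index(["a", "b"], dtype="string"))
print(same_categorical(object_based, string_based))
```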