POC: ArrayManager -- array-based data manager for columnar store #36010

Merged
merged 42 commits into from
Jan 13, 2021
a51835b
POC: ArrayManager -- array-based data manager for columnar store
jorisvandenbossche Jun 1, 2020
591579b
Update with latest master + some fixes
jorisvandenbossche Aug 27, 2020
896080a
add pd.options.mode.data_manager to switch
jorisvandenbossche Sep 4, 2020
f9c4dda
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 5, 2020
d18082a
add apply_with_block workaround
jorisvandenbossche Sep 5, 2020
cf3c07a
fix alignment in apply
jorisvandenbossche Sep 5, 2020
b252c6d
reorder methods to match BlockManager
jorisvandenbossche Sep 5, 2020
0fb645e
skip json tests for now
jorisvandenbossche Sep 5, 2020
eb55fef
skip more json tests + to_csv with to_native_types
jorisvandenbossche Sep 5, 2020
d241f31
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 6, 2020
47c3ee3
support both ndarrays and ExtensionArrays
jorisvandenbossche Sep 17, 2020
75f7de2
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 17, 2020
f36e395
add unstack
jorisvandenbossche Sep 17, 2020
be20816
fix native types, skip quantile, hdf, stata tests
jorisvandenbossche Sep 17, 2020
8b7cc81
remove skip in the benchmarks
jorisvandenbossche Sep 17, 2020
a239f50
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 17, 2020
a0ccf9a
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 22, 2020
dc1b190
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Oct 16, 2020
55d38be
remove manager keyword from DataFrame constructor, add _as_manager in…
jorisvandenbossche Oct 16, 2020
3dea0d7
move new ArrayManager code to separate file
jorisvandenbossche Oct 16, 2020
1a61333
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Nov 10, 2020
9751d33
de-privatize
jbrockmendel Nov 10, 2020
e45b645
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Dec 11, 2020
3749c7d
try fix up typing
jorisvandenbossche Dec 11, 2020
af53040
add pytest option + add one github actions build to run them
jorisvandenbossche Dec 11, 2020
cc45673
fix pytest marks for skipping when using array-manager
jorisvandenbossche Dec 12, 2020
27cf215
several fixes - get tests/frame/methods tests passing
jorisvandenbossche Dec 12, 2020
f6a97df
ci - only run the tests/frame/methods tests
jorisvandenbossche Dec 12, 2020
67c4c2b
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Dec 12, 2020
670ed76
mypy fix
jorisvandenbossche Dec 12, 2020
5128ad1
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Dec 18, 2020
5c73688
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Jan 8, 2021
a9a8c2d
move to internals/construction.py
jorisvandenbossche Jan 8, 2021
c7898fb
update for latest changes - fix tests/mypy
jorisvandenbossche Jan 8, 2021
3430307
fix todo
jorisvandenbossche Jan 8, 2021
1a30013
fix import in tests
jorisvandenbossche Jan 8, 2021
ef86b1e
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Jan 10, 2021
c5548d9
add union alias to typing
jorisvandenbossche Jan 10, 2021
afe8f80
updates based on review
jorisvandenbossche Jan 10, 2021
b88c757
skip json tests to avoid segfaults
jorisvandenbossche Jan 10, 2021
ddc51d0
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Jan 12, 2021
9dc5600
fix for Label -> Hashable change in master
jorisvandenbossche Jan 12, 2021
remove skip in the benchmarks
jorisvandenbossche committed Sep 17, 2020
commit 8b7cc8157a3a8959f48c007f808a6198927ea9b3
3 changes: 0 additions & 3 deletions asv_bench/benchmarks/stat_ops.py
@@ -11,9 +11,6 @@ class FrameOps:
     param_names = ["op", "dtype", "axis"]

     def setup(self, op, dtype, axis):
-        if dtype == "Int64":
-            # XXX only dealing with numpy arrays in ArrayManager right now
-            raise NotImplementedError
         if op == "mad" and dtype == "Int64":
             # GH-33036, GH#33600
             raise NotImplementedError
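
For readers unfamiliar with the convention used in this benchmark: asv treats a `NotImplementedError` raised from `setup` as "skip this parameter combination". The following is a minimal stand-alone sketch of that idea, not the pandas benchmark itself. `FrameOpsSketch` and the `run_all` driver are hypothetical names invented for illustration, and the driver only imitates what a benchmark runner might do.

```python
import itertools


class FrameOpsSketch:
    # Hypothetical stand-in for an asv benchmark class (not the pandas one).
    params = [["mean", "mad"], ["float", "Int64"], [0, 1]]
    param_names = ["op", "dtype", "axis"]

    def setup(self, op, dtype, axis):
        if op == "mad" and dtype == "Int64":
            # Unsupported combination: signal "skip" instead of failing.
            raise NotImplementedError
        self.ready = (op, dtype, axis)


def run_all(bench):
    """Toy driver (assumption): call setup for every parameter combination."""
    ran, skipped = [], []
    for combo in itertools.product(*bench.params):
        try:
            bench.setup(*combo)
        except NotImplementedError:
            skipped.append(combo)
        else:
            ran.append(combo)
    return ran, skipped


ran, skipped = run_all(FrameOpsSketch())
print(len(ran), "ran;", len(skipped), "skipped")
```

Removing a skip, as this commit does for the plain `Int64` case, simply means those combinations reach the timed code again.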
2 changes: 2 additions & 0 deletions pandas/core/config_init.py
@@ -484,6 +484,8 @@ def use_inf_as_na_cb(key):
     )
     cf.register_option(
         "data_manager",
+        # TODO switch back to default of "block" before merging
+        # "block",
         "array",
         "internal manager type",
         validator=is_one_of_factory(["block", "array"]),
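
To illustrate what registering an option with a validator buys here, below is a minimal sketch of a validated option registry. It is not pandas' config module: `OptionRegistry` and `is_one_of` are hypothetical, simplified stand-ins, and the default of `"block"` in the example is just an illustrative choice.

```python
def is_one_of(legal_values):
    """Return a validator that accepts only the listed values."""

    def validator(value):
        if value not in legal_values:
            raise ValueError(
                f"Value must be one of {legal_values!r}, got {value!r}"
            )

    return validator


class OptionRegistry:
    """Toy registry (hypothetical); a real config system does much more."""

    def __init__(self):
        self._values = {}
        self._validators = {}

    def register_option(self, key, default, doc="", validator=None):
        if validator is not None:
            validator(default)  # the default itself must be legal
        self._values[key] = default
        self._validators[key] = validator

    def set_option(self, key, value):
        validator = self._validators[key]
        if validator is not None:
            validator(value)  # reject bad values before storing them
        self._values[key] = value

    def get_option(self, key):
        return self._values[key]


registry = OptionRegistry()
registry.register_option(
    "data_manager",
    "block",
    "internal manager type",
    validator=is_one_of(["block", "array"]),
)
registry.set_option("data_manager", "array")

try:
    registry.set_option("data_manager", "columnar")
    rejected = False
except ValueError:
    rejected = True
```

The point of the validator is that a typo in the manager name fails loudly at `set_option` time rather than silently selecting some fallback.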
4 changes: 2 additions & 2 deletions pandas/core/frame.py
@@ -437,8 +437,8 @@ def __init__(
         columns: Optional[Axes] = None,
         dtype: Optional[Dtype] = None,
         copy: bool = False,
-        # TODO setting default to "array" for testing purposes (the actual default
-        # needs to stay "block" initially of course for backwards compatibility)
+        # TODO do we want to keep this as a keyword as well? (I think it can be handy)
+        # can we somehow make it a "private" keyword? (`_manager` ?)
         manager: Optional[str] = None,
Contributor

I think I'd be OK with `flags` being an added keyword to the constructor (and you can then make `manager` a flag).

Contributor

If this is just for convenience (to easily change back and forth), I'd prefer a private method on DataFrame to change the manager (or return a new DataFrame with a new manager).

Member Author

The keyword currently makes it a bit more convenient to pick a specific manager in the tests, or to compare both versions side by side.

E.g., instead of

with pd.option_context("mode.data_manager", "block"):
    df = pd.DataFrame(...)

you can do

df = pd.DataFrame(..., manager="block")

if you want, e.g., a certain test to run using only a specific manager, regardless of the global setting.

I fully agree we should be careful about making this a public keyword, but I think that for internal use as well, e.g. in tests, it would be good to have a convenient way to do this.

With a private method, you are thinking of a class method like pd.DataFrame._construct_with_array_manager(...) / pd.DataFrame._construct_with_block_manager(..) ?

Contributor

> With a private method, you are thinking of a class method like pd.DataFrame._construct_with_array_manager(...) / pd.DataFrame._construct_with_block_manager(..) ?

I was thinking pd.DataFrame(...)._as_array_manager(). But whatever is easiest for testing.

Member Author

Ah, OK. For testing that should also be fine (as long as it's not testing the constructor ;)). It will be less efficient, since it only converts to the other format after the initial construction (which I do now anyway, but the goal is of course to fix that at some point), but for testing purposes that doesn't matter.

It could also be pd.DataFrame(..)._as_manager("block"/"array")?
We want to have both, because some tests might specifically test aspects of the block-based DataFrame, so even when running the tests with the global option set to the array manager, such a test should still use the block manager.
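
As a rough illustration of the "convert after construction" idea, here is a toy container with two storage layouts and an `_as_manager` method that returns a new object in the requested layout. This is a sketch under loud assumptions: `ToyFrame` is hypothetical, uses plain Python lists, and groups columns by element type only to mimic the general notion of blocks; it says nothing about how pandas implements either manager.

```python
class ToyFrame:
    """Hypothetical two-layout container; not pandas' DataFrame."""

    def __init__(self, columns, manager="block"):
        # `columns` is a list of equal-length lists, one per column.
        self.manager = manager
        if manager == "array":
            # One separate 1D list per column.
            self._data = [list(col) for col in columns]
        elif manager == "block":
            # Columns grouped by element type:
            # {type: ([column positions], [column values])}
            blocks = {}
            for pos, col in enumerate(columns):
                key = type(col[0]) if col else type(None)
                positions, values = blocks.setdefault(key, ([], []))
                positions.append(pos)
                values.append(list(col))
            self._data = blocks
        else:
            raise ValueError(f"unknown manager type: {manager!r}")

    def columns(self):
        """Return the columns in their original order, whatever the layout."""
        if self.manager == "array":
            return [list(col) for col in self._data]
        by_position = {}
        for positions, values in self._data.values():
            for pos, col in zip(positions, values):
                by_position[pos] = list(col)
        return [by_position[pos] for pos in sorted(by_position)]

    def _as_manager(self, typ):
        """Return a new ToyFrame holding the same data in layout `typ`."""
        return ToyFrame(self.columns(), manager=typ)


df = ToyFrame([[1, 2], [1.5, 2.5], [3, 4]])  # default "block" layout
arr = df._as_manager("array")
back = arr._as_manager("block")
```

Because the conversion happens after the object already exists, a test can pin one layout regardless of any global setting, at the cost of one extra copy.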

     ):
         if data is None: