Practical steps towards a simplified BlockManager

In the mailing list discussion on a simplified (non-consolidating) BlockManager, @jbrockmendel brought up the relevant question of how we could get there (incrementally?): https://mail.python.org/pipermail/pandas-dev/2020-May/001223.html

Since that is a more practical discussion, I am moving it to an issue here.

**Longer term**, there are multiple options for how such a BlockManager could be enabled by the user (listing 2 here, but there are probably other options as well):

- Go "all in" on extension arrays. We briefly discussed before the idea of using (nullable) extension dtypes for all dtypes by default in pandas 2.0. If we strive towards that, and assuming we keep the current 1D-restriction on ExtensionBlock, then we would "automatically" get a BlockManager with 1D blocks. And we could then focus on optimizing code paths specifically for the case of having all 1D ExtensionBlocks.
- A "consolidation policy" option, similar to the branch discussed in https://github.com/pandas-dev/pandas/issues/10556. Right now, that branch still uses 2D blocks (but separate 2D blocks of shape (1, n) per column) and not actually 1D blocks. So that might not be fully ideal. We could add 1D versions of our numeric blocks as well, but that would probably add a lot of complexity (although temporary) to the Blocks, so maybe not an ideal path forward.
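
To make the difference between the two layouts concrete, here is a toy sketch in plain Python (no pandas internals involved). The function names and the `(dtype, positions, values)` tuple layout are made up for illustration; the real Blocks hold numpy arrays or extension arrays.

```python
from collections import defaultdict


def consolidated_blocks(columns):
    """Toy consolidation: group all columns of the same dtype into one block."""
    by_dtype = defaultdict(list)
    for position, (dtype, values) in enumerate(columns):
        by_dtype[dtype].append((position, values))
    # One block per dtype: (dtype, column positions, list of column values).
    return [
        (dtype, [pos for pos, _ in items], [vals for _, vals in items])
        for dtype, items in by_dtype.items()
    ]


def per_column_blocks(columns):
    """Non-consolidating layout: one block per column, no grouping step."""
    return [
        (dtype, [position], [values])
        for position, (dtype, values) in enumerate(columns)
    ]


columns = [
    ("int64", [1, 2, 3]),
    ("float64", [0.1, 0.2, 0.3]),
    ("int64", [4, 5, 6]),
]

print(len(consolidated_blocks(columns)))  # two blocks: one per dtype
print(len(per_column_blocks(columns)))    # three blocks: one per column
```

Note that `per_column_blocks` still wraps each column in a one-element list, which loosely mirrors the shape (1, n) blocks of the branch mentioned above; a true 1D block would store `values` directly.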

For the "all extension arrays" option, we could also use light-weight EAs to store numpy arrays (like the "PandasArrays" we have now, only then actually using them), as long as we don't yet have actual extension arrays for all dtypes.
For both, we could have a constructor keyword and/or a global config option to enable it.

Now, **on the shorter term**, there are probably some work items we could tackle to **reduce the API surface / usage of blocks outside of the internals**, to make things like the above more realistic to implement (and make it easier to experiment with alternative BlockManager implementations):

- Continue removing cases where pandas code outside of the internals works directly on blocks (@jbrockmendel you already did a lot of work on this recently, do you have an idea / overview of which are still remaining and the potential stumbling blocks to further remove those cases?)
- Remove any "index label" related code in the internals, as the BlockManager in principle only needs to concern itself with integer locations. Although, @jbrockmendel, with your work in https://github.com/pandas-dev/pandas/pull/33052, https://github.com/pandas-dev/pandas/pull/33347, https://github.com/pandas-dev/pandas/pull/33332 (removal of BlockManager.get/set/delete) this might actually already be done?
- Limit the API surface of the BlockManager. It might be good to get an idea of which methods on the BlockManager are used outside of the internals, clearly list those (and make others private), and see whether we can reduce this list. 
  (Maybe actually reducing might not be possible, but the exercise can still be useful to get an explicit overview of what is needed to implement a BlockManager).
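
For the last item, one possible way to get a first overview is to scan the source for attribute accesses on the manager. The sketch below uses only the standard library `ast` module; the helper name `manager_attrs_used` is made up, and it assumes the manager is reached through an attribute called `_mgr`, so other access patterns (aliases, `getattr`, passing the manager around) would be missed.

```python
import ast


def manager_attrs_used(source, manager_attr="_mgr"):
    """Return names accessed as `<expr>.<manager_attr>.<name>` in `source`."""
    used = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Attribute)
            and node.value.attr == manager_attr
        ):
            used.add(node.attr)
    return used


# Made-up sample code; `some_method` is a placeholder, not a real method.
sample = """
def f(df):
    new = df._mgr.some_method(a, b)
    return df._mgr.blocks, new
"""

print(sorted(manager_attrs_used(sample)))
```

Running something like this over the modules outside the internals and taking the union of the results would give a rough candidate list of methods to keep public.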

It might be that there are already more concrete open issues about some of those aspects.

cc @pandas-dev/pandas-core 


Practical steps towards a simplified BlockManager #34669
