Skip to content

Practical steps towards a simplified BlockManager #34669

Open
@jorisvandenbossche

Description

In the mailing list discussion on a simplified (non-consolidating) BlockManager, @jbrockmendel brought up the relevant question of how we could get there (incrementally?): https://mail.python.org/pipermail/pandas-dev/2020-May/001223.html

Since that is a more practical discussion, moving that to an issue here.

Longer term, there are multiple options of how such a blockmanager could be enabled by the user (listing 2 here, but there are probably other options as well):

  • Go "all in" on extension arrays. We briefly discussed before the idea of using (nullable) extension dtypes for all dtypes by default in pandas 2.0. If we strive towards that, and assuming we keep the current 1D-restriction on ExtensionBlock, then we would "automatically" get a BlockManager with 1D blocks. And we could then focus on optimizing code paths specifically for the case of having all 1D ExtensionBlocks.
  • A "consolidation policy" option similarly as in the branch discussed in Expose the blocks API and disable automatic consolidation #10556. Right now, that branch still uses 2D blocks (but separate 2D blocks of shape (1, n) per column) and not actually 1D blocks. So that might not be fully ideal. We could add 1D versions of our numeric blocks as well, but that would probably add a lot of complexity, although
    temporary, to the Blocks, so maybe not an ideal path forward.

For the "all extension arrays" option, we could also use light-weight EAs to store numpy arrays (like the "PandasArrays" we have now, only then actually using it), as long as we don't yet have actual extension arrays for all dtypes.
For both, we could have a constructor keyword to enabled it, and/or a global config option to enable it.

Now, on the shorter term, there are probably some work items we could tackle to reduce the API surface / usage of blocks outside of the internals, to make things like the above more realistic to implement (and make it easier to experiment with alternative BlockManager implementations):

  • Continue removing cases where pandas code outside of the internals work directly on blocks (@jbrockmendel you already did a lot of work on this recently, do you have an idea / overview of which are still remaining and the potential stumble blocks to further remove those cases?)
  • Remove any "index label" related code in the internals, as the BlockManager in principle only needs to concern about integer locations. Although, @jbrockmendel, with your work in CLN: remove BlockManager.get #33052, REF: remove BlockManager.set #33347,
    REF: BlockManager.delete -> idelete #33332 (removal of BlockManager.get/set/delete) this might actually already be done?
  • Limit the API surface of the BlockManager. It might be good to get an idea of which methods on the BlockManager are used outside of the internals, clearly list those (and make others private), and see whether we can reduce this list.
    (Maybe actually reducing might not be possible, but the exercise can still be useful to get an explicit overview of what is needed to implement a BlockManager).

It might be that there are already more concrete open issues about some of those aspects.

cc @pandas-dev/pandas-core

Metadata

Assignees

No one assigned

    Labels

    Closing CandidateMay be closeable, needs more eyeballsInternalsRelated to non-user accessible pandas implementationNeeds DiscussionRequires discussion from core team before further action

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions