Moving to PyArrow dtypes by default #61618

Open
@datapythonista

There have been some discussions about this before, but as far as I know there is no plan and no decision has been made.

My understanding is (feel free to disagree):

  1. If we were starting pandas today, we would only use Arrow as storage for DataFrame columns / Series
  2. After all the great work that has been done on building new pandas types based on PyArrow, we are not considering other Arrow implementations
  3. Based on 1 and 2, we are moving towards pandas based on PyArrow, and the main question is what's the transition path

@jbrockmendel commented the following, and based on past interactions I think many others share this point of view:

"There's a path to making it feasible to use PyArrow types by default, but that path probably takes multiple major release cycles. Doing it for 3.0 would be, frankly, insane."

It would be interesting to know exactly why. I guess there are two main reasons:

  • Finishing the PyArrow types and making operations with them as reliable and fast as the original pandas types
  • Giving users time to adapt

I don't know the exact state of the PyArrow types, or how often users would face problems when using them instead of the original ones. As far as I can tell, there aren't any major efforts to improve them at this point. So I'm unsure whether the situation in that regard will be very different if we make the PyArrow types the default tomorrow or in two years.

My understanding is that the only person who is consistently paid to work on pandas is Matt, and he's doing an amazing job of keeping the project going, reviewing most of the PRs, keeping the CI in good shape... But I don't think he or anyone else is able to put hours into developing new things the way it used to be. For reference, this is the GitHub chart of pandas activity (commits) since pandas 2.0:

[Image: GitHub chart of pandas commit activity since pandas 2.0]

So, in my opinion, the existing problems with the PyArrow types will start to be addressed significantly once they become the default ones.

So, in my opinion our two main options are:

  • We move forward with the PyArrow transition. pandas 3.0 will surely not be the best pandas version ever if we start using PyArrow types, but pandas 3.1 will be much better, and pandas 3.2 may be as good as pandas 2 in reliability and speed, but much closer to what we would like pandas to be.

Of course, not all users are ready for pandas 3.0 with Arrow types. They can pin to pandas=2 until pandas 3 is more mature and they have made the required changes to their code. We could also add a flag, pandas.options.mode.use_arrow = False, that reverts the new default to the old status quo, so users can move to pandas 3.0 but stay with the old types until both we (pandas) and they are ready for the new default types. The transition from Python 2 to 3 (which is surely an example of what not to do) took more than 10 years. I don't think we need as long in our case. And if there is interest (aka money), we can also support the pandas 2 series for as long as needed.

  • The other option is to continue with the transition to the new nullable types, which my understanding is we implemented because PyArrow didn't exist at the time. That means continuing to put our limited resources into them, making users adapt their code to a new temporary status quo rather than the final one we envision, and staying in this transition period while delaying the move to PyArrow by, I assume, around 6 years (Brock mentioned multiple major release cycles, so I assume something like 3, at a rate of one major release every 2 years).

It would be great to hear other people's thoughts and ideal plans, and to see what makes the most sense. But to me personally, based on the above, it doesn't sound any more insane to move to PyArrow in pandas 3 than to move in pandas 6.

    Labels

    Arrow (pyarrow functionality), Needs Discussion (Requires discussion from core team before further action)