Add whatsnew for arrow #54476

Merged
merged 7 commits into from
Aug 9, 2023
40 changes: 40 additions & 0 deletions doc/source/whatsnew/v2.1.0.rst
@@ -14,6 +14,46 @@ including other versions of pandas.
Enhancements
~~~~~~~~~~~~

.. _whatsnew_210.enhancements.pyarrow_dependency:

PyArrow will become a required dependency with pandas 3.0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

`PyArrow <https://arrow.apache.org/docs/python/index.html>`_ will become a required
dependency of pandas starting with pandas 3.0. This decision was made based on
`PDEP 10 <https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html>`_.

This will enable further changes that benefit pandas users, including
but not limited to:

- Inferring strings as PyArrow-backed strings by default, which significantly
  reduces the memory footprint and improves performance.
- Inferring more complex dtypes with PyArrow by default, such as ``Decimal``, ``lists``,
  ``bytes``, ``structured data`` and more.
- Better interoperability with other libraries that depend on Apache Arrow.

We are collecting feedback on this decision `here <https://github.com/pandas-dev/pandas/issues/54466>`_.

.. _whatsnew_210.enhancements.infer_strings:

Avoid NumPy object dtype for strings by default
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, all strings were stored in columns with NumPy object dtype.
This release introduces an option, ``future.infer_string``, that infers all
strings as PyArrow-backed strings with dtype ``pd.ArrowDtype(pa.string())`` instead.
This option only works if PyArrow is installed. PyArrow-backed strings have a
significantly smaller memory footprint and perform considerably better than
strings stored with NumPy object dtype.

The option can be enabled with:

.. code-block:: python

    pd.options.future.infer_string = True

This behavior will become the default with pandas 3.0.

.. _whatsnew_210.enhancements.reduction_extension_dtypes:

DataFrame reductions preserve extension dtypes