API: Public data for Series and Index: .array and .to_numpy() #23623

Merged
merged 28 commits into from
Nov 29, 2018
7959eb6
API: Public data attributes for EA-backed containers
TomAugspurger Oct 30, 2018
5b15894
update
TomAugspurger Nov 6, 2018
4781a36
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 11, 2018
15cc0b7
more notes
TomAugspurger Nov 11, 2018
888853f
update
TomAugspurger Nov 11, 2018
2cfca30
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 11, 2018
3e76f02
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 13, 2018
7e43cf0
Squashed commit of the following:
TomAugspurger Nov 13, 2018
bceb612
DOC: updated docs
TomAugspurger Nov 13, 2018
c19c9bb
Added DataFrame.to_numpy
TomAugspurger Nov 17, 2018
fe813ff
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 17, 2018
8619790
clean
TomAugspurger Nov 17, 2018
639b6fb
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 21, 2018
95f19bc
doc update
TomAugspurger Nov 21, 2018
3292e43
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 21, 2018
5a905ab
update
TomAugspurger Nov 21, 2018
1e6eed4
fixed doctest
TomAugspurger Nov 21, 2018
4545d93
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 26, 2018
2d7abb4
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 27, 2018
a7a13a0
Fixed array error in docs
TomAugspurger Nov 27, 2018
c0a63c0
update docs
TomAugspurger Nov 27, 2018
661b9eb
Fixup for feedback
TomAugspurger Nov 28, 2018
52f5407
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 28, 2018
566a027
skip only on index box
TomAugspurger Nov 28, 2018
062c49f
Series.values
TomAugspurger Nov 28, 2018
78e5824
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 28, 2018
e805c26
remove stale todo
TomAugspurger Nov 28, 2018
f9eee65
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 29, 2018
doc update
TomAugspurger committed Nov 21, 2018
commit 95f19bc41d69ff74f13e24b2da88f8aa7887d62a
30 changes: 23 additions & 7 deletions doc/source/10min.rst
@@ -121,17 +121,33 @@ Display the index, columns:
df.index
df.columns

:attr:`DataFrame.values` gives a NumPy representation of the underlying data.
However, this can be an expensive operation when your :class:`DataFrame` has
columns with different data types. **NumPy arrays have a single dtype for
the entire array, so accessing ``df.values`` may have to coerce data**. We
recommend using ``df.values`` only when you know that your data has a single
data type.
:meth:`DataFrame.to_numpy` gives a NumPy representation of the underlying data.
Note that this can be an expensive operation when your :class:`DataFrame` has
columns with different data types, which comes down to a fundamental difference
between pandas and NumPy: **NumPy arrays have one dtype for the entire array,
while pandas DataFrames have one dtype per column**. When you call
:meth:`DataFrame.to_numpy`, pandas will find the NumPy dtype that can hold *all*
of the dtypes in the DataFrame. This may end up being ``object``, which requires
casting every value to a Python object.

For ``df``, our :class:`DataFrame` of all floating-point values,
:meth:`DataFrame.to_numpy` is fast and doesn't require copying data.
Member

Reading this, should we have a copy keyword to be able to force a copy? (can be added later)

Contributor Author

This is a good idea. Don't care whether we do it here or later.

I think we'll also want (type-specific?) keywords for controlling how the conversion is done (ndarray of Timestamps vs. datetime64[ns] for example). I'm not sure what the eventual signature should be.

Member

Yeah, if we decide to go for object array of Timestamps for datetimetz as default, it would be good to have the option to return datetime64

Regarding copy, would it actually make sense to have copy=True the default? Then you have at least a consistent default (it is never a view on the data)

Contributor Author

Yes, I think copy=True is a good default since it's the only one that can be ensured for all cases.
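
A minimal sketch of the two options discussed in this thread. It assumes a pandas release in which ``to_numpy`` accepts ``copy`` and ``dtype`` keywords; neither existed at this commit, and the frame and series below are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the PR.
df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# copy=True requests a fresh array, so writing to it cannot touch ``df``.
arr = df.to_numpy(copy=True)
arr[0, 0] = 99.0
print(df.loc[0, "A"])  # still 1.0

# A tz-aware Series converts to an object array of Timestamps by default;
# passing dtype="datetime64[ns]" asks for naive datetime64 values instead.
ser = pd.Series(pd.date_range("2000-01-01", periods=2, tz="CET"))
as_objects = ser.to_numpy()
as_datetime64 = ser.to_numpy(dtype="datetime64[ns]")
print(as_objects.dtype, as_datetime64.dtype)
```

Without ``copy=True``, whether the result is a view or a copy depends on how the data is stored, which is the inconsistency raised above.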


.. ipython:: python

df.values
df.to_numpy()

For ``df2``, the :class:`DataFrame` with multiple dtypes,
:meth:`DataFrame.to_numpy` is relatively expensive.

.. ipython:: python

df2.to_numpy()

.. note::

:meth:`DataFrame.to_numpy` does *not* include the index or column
labels in the output.
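
As a rough illustration of the dtype behaviour described above, the sketch below builds two small frames (hypothetical stand-ins for ``df`` and ``df2``) and inspects the dtype that ``to_numpy`` settles on.

```python
import numpy as np
import pandas as pd

# Hypothetical frames; the tutorial's own ``df`` and ``df2`` are defined elsewhere.
numeric = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.5]}, index=["x", "y"])
mixed = numeric.assign(C=pd.date_range("2000-01-01", periods=2))

numeric_arr = numeric.to_numpy()
mixed_arr = mixed.to_numpy()

# int64 and float64 share a common NumPy dtype, so the result is float64.
print(numeric_arr.dtype)

# Numbers and datetimes have no common NumPy dtype other than object.
print(mixed_arr.dtype)

# Only the values are returned; the "x"/"y" index and column labels are dropped.
print(mixed_arr.shape)
```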

:func:`~DataFrame.describe` shows a quick statistic summary of your data:

36 changes: 24 additions & 12 deletions doc/source/basics.rst
@@ -69,7 +69,7 @@ thought of as containers for arrays, which hold the actual data and do the
actual computation. For many types, the underlying array is a
:class:`numpy.ndarray`. However, pandas and 3rd party libraries may *extend*
NumPy's type system to add support for custom arrays
(see :ref:`dsintro.data_types`).
(see :ref:`basics.dtypes`).

To get the actual data inside a :class:`Index` or :class:`Series`, use
the **array** property
@@ -1951,17 +1951,29 @@ dtypes
------

For the most part, pandas uses NumPy arrays and dtypes for Series or individual
columns of a DataFrame. The main types allowed in pandas objects are ``float``,
``int``, ``bool``, and ``datetime64[ns]`` (note that NumPy does not support
timezone-aware datetimes).

In addition to NumPy's types, pandas :ref:`extends <extending.extension-types>`
NumPy's type-system for a few cases.

* :ref:`Categorical <categorical>`
* :ref:`Datetime with Timezone <timeseries.timezone_series>`
* :ref:`Period <timeseries.periods>`
* :ref:`Interval <indexing.intervallindex>`
columns of a DataFrame. NumPy provides support for ``float``,
``int``, ``bool``, ``timedelta64[ns]`` and ``datetime64[ns]`` (note that NumPy
does not support timezone-aware datetimes).

Pandas and third-party libraries *extend* NumPy's type system in a few places.
This section describes the extensions pandas has made internally.
See :ref:`extending.extension-types` for how to write your own extension that
works with pandas. See :ref:`ecosystem.extensions` for a list of third-party
libraries that have implemented an extension.

The following table lists all of pandas' extension types. See the respective
documentation sections for more on each type.

=================== ========================= ================== ============================= =============================
Kind of Data        Data Type                 Scalar             Array                         Documentation
=================== ========================= ================== ============================= =============================
tz-aware datetime   :class:`DatetimeTZDtype`  :class:`Timestamp` :class:`arrays.DatetimeArray` :ref:`timeseries.timezone`
Categorical         :class:`CategoricalDtype` (none)             :class:`Categorical`          :ref:`categorical`
period (time spans) :class:`PeriodDtype`      :class:`Period`    :class:`arrays.PeriodArray`   :ref:`timeseries.periods`
sparse              :class:`SparseDtype`      (none)             :class:`arrays.SparseArray`   :ref:`sparse`
intervals           :class:`IntervalDtype`    :class:`Interval`  :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
nullable integer    :class:`Int64Dtype`, ...  (none)             :class:`arrays.IntegerArray`  :ref:`integer_na`
Member

where does this 'integer_na' point to? (I don't seem to find it in the docs)

Contributor Author

#23617. I'm aiming for eventual consistency on the docs :)

=================== ========================= ================== ============================= =============================

Pandas uses the ``object`` dtype for storing strings.
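
To make the distinction between NumPy dtypes and pandas extension dtypes concrete, here is a small sketch. The series are invented for illustration, and sparse data is left out to keep it short.

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype

# One illustrative Series per kind of data listed in the table.
examples = {
    "tz-aware datetime": pd.Series(pd.date_range("2000-01-01", periods=2, tz="UTC")),
    "categorical": pd.Series(["a", "b"], dtype="category"),
    "period": pd.Series(pd.period_range("2000-01", periods=2, freq="M")),
    "interval": pd.Series(pd.interval_range(start=0, end=2)),
    "nullable integer": pd.Series([1, None], dtype="Int64"),
}

# Each of these reports a pandas extension dtype rather than a NumPy dtype.
for kind, ser in examples.items():
    print(kind, ser.dtype, isinstance(ser.dtype, ExtensionDtype))

# A plain float Series is backed by a NumPy dtype instead.
plain = pd.Series([1.0, 2.0])
print(plain.dtype, isinstance(plain.dtype, ExtensionDtype))
```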

36 changes: 4 additions & 32 deletions doc/source/dsintro.rst
@@ -142,7 +142,7 @@ However, operations such as slicing will also slice the index.
We will address array-based indexing like ``s[[4, 3, 1]]``
in :ref:`section <indexing>`.

Like a NumPy array, a pandas Series as a :attr:`Series.dtype`.
Like a NumPy array, a pandas Series has a :attr:`~Series.dtype`.

.. ipython:: python

@@ -151,7 +151,8 @@ Like a NumPy array, a pandas Series as a :attr:`Series.dtype`.
This is often a NumPy dtype. However, pandas and 3rd-party libraries
extend NumPy's type system in a few places, in which case the dtype would
be a :class:`~pandas.api.extensions.ExtensionDtype`. Some examples within
pandas are :ref:`categorical` and :ref:`integer_na`. See :ref:`dsintro.data_type` for more.
pandas are :ref:`categorical` and :ref:`integer_na`. See :ref:`basics.dtypes`
for more.

If you need the actual array backing a ``Series``, use :attr:`Series.array`.

@@ -160,7 +161,7 @@ If you need the actual array backing a ``Series``, use :attr:`Series.array`.
s.array

Again, this is often a NumPy array, but may instead be a
:class:`~pandas.api.extensions.ExtensionArray`. See :ref:`dsintro.data_type` for more.
:class:`~pandas.api.extensions.ExtensionArray`. See :ref:`basics.dtypes` for more.
Accessing the array can be useful when you need to do some operation without the
index (to disable :ref:`automatic alignment <dsintro.alignment>`, for example).
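
The effect of dropping the index can be sketched as follows. The two series are made up for illustration, and the positional arithmetic uses ``to_numpy()`` so that the operands are plain NumPy arrays.

```python
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Same labels, opposite order.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "b", "a"])

# Series arithmetic aligns on labels: "a" pairs 1 with 30.
aligned = s1 + s2
print(aligned["a"])  # 31

# Without the index, values are combined by position: 1 pairs with 10.
positional = s1.to_numpy() + s2.to_numpy()
print(positional.tolist())  # [11, 22, 33]

# ``.array`` exposes the backing array object itself.
print(isinstance(s1.array, ExtensionArray))
```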

@@ -859,35 +860,6 @@ completion mechanism so they can be tab-completed:
In [5]: df.fo<TAB>
df.foo1 df.foo2

.. _dsintro.data_type:

Data Types
----------

Pandas type system is mostly built on top of `NumPy's <https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.dtypes.html>`__.
NumPy provides the basic arrays and data types for numeric
string, *tz-naive* datetime, and others types of data.

Pandas and third-party libraries *extend* NumPy's type system in a few places.
This section describes the extensions pandas has made internally.
See :ref:`extending.extension-types` for how to write your own extension that
works with pandas. See :ref:`ecosystem.extensions` for a list of third-party
libraries that have implemented an extension.

The following table lists all of pandas extension types. See the respective
documentation sections for more on each type.

=================== ========================= ================== ============================= =============================
Kind of Data        Data Type                 Scalar             Array                         Documentation
=================== ========================= ================== ============================= =============================
tz-aware datetime   :class:`DatetimeTZDtype`  :class:`Timestamp` :class:`arrays.DatetimeArray` :ref:`timeseries.timezone`
Categorical         :class:`CategoricalDtype` (none)             :class:`Categorical`          :ref:`categorical`
period (time spans) :class:`PeriodDtype`      :class:`Period`    :class:`arrays.PeriodArray`   :ref:`timeseries.periods`
sparse              :class:`SparseDtype`      (none)             :class:`arrays.SparseArray`   :ref:`sparse`
intervals           :class:`IntervalDtype`    :class:`Interval`  :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
nullable integer    :class:`Int64Dtype`, ...  (none)             :class:`arrays.IntegerArray`  :ref:`integer_na`
=================== ========================= ================== ============================= =============================

.. _basics.panel:

Panel
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.24.0.rst
@@ -67,6 +67,8 @@ as ``.values``).
ser.array
ser.to_numpy()

See :ref:`basics.dtypes` and :ref:`dsintro.attrs` for more.

.. _whatsnew_0240.enhancements.extension_array_operators:

``ExtensionArray`` operator support
2 changes: 1 addition & 1 deletion pandas/core/base.py
@@ -778,7 +778,7 @@ def array(self):
Union[ndarray, ExtensionArray]
This is the actual array stored within this object. This differs
from ``.values`` which may require converting the data
to a different form. We recommend using :
to a different form.

Notes
-----
2 changes: 1 addition & 1 deletion pandas/core/frame.py
@@ -1144,7 +1144,7 @@ def to_numpy(self):
>>> df = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.5]})
>>> df.to_numpy()

When numeric and non-numeric types, the output array will
For a mix of numeric and non-numeric types, the output array will
have object dtype.

>>> df['C'] = pd.date_range('2000', periods=2)
4 changes: 4 additions & 0 deletions pandas/core/generic.py
@@ -4928,6 +4928,10 @@ def values(self):
"""
Return a Numpy representation of the DataFrame.

.. warning::

We recommend using :meth:`DataFrame.to_numpy` instead.

Only the values in the DataFrame will be returned, the axes labels
will be removed.

5 changes: 3 additions & 2 deletions pandas/core/indexes/base.py
@@ -724,8 +724,9 @@ def values(self):

.. warning::

We recommend you use :attr:`Index.array` or
:meth:`Index.to_numpy` instead of ``.values``.
We recommend using :attr:`Index.array` or
:meth:`Index.to_numpy`, depending on whether you need
a reference to the underlying data or a NumPy array.

Returns
-------
9 changes: 7 additions & 2 deletions pandas/core/series.py
@@ -410,8 +410,13 @@ def ftypes(self):
@property
def values(self):
"""
Return Series as ndarray or ndarray-like
depending on the dtype
Return Series as ndarray or ndarray-like depending on the dtype.

.. warning::

We recommend using :attr:`Series.array` or
:meth:`Series.to_numpy`, depending on whether you need
a reference to the underlying data or a NumPy array.

Returns
-------
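
The choice the ``Series.values`` and ``Index.values`` warnings describe can be sketched with a categorical example, where the two accessors return visibly different objects. The data below is illustrative only.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["a", "b", "a"], dtype="category")
idx = pd.CategoricalIndex(["x", "y", "x"])

# ``.array`` hands back the array that backs the container, here a Categorical.
print(type(ser.array).__name__)
print(type(idx.array).__name__)

# ``.to_numpy()`` always produces a NumPy ndarray, converting if it has to.
ser_values = ser.to_numpy()
idx_values = idx.to_numpy()
print(type(ser_values).__name__, list(ser_values))
```

For NumPy-backed data the difference is smaller, but ``.array`` still avoids any conversion while ``.to_numpy()`` guarantees an ``ndarray``.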