Commit 1d75816

GH-33980: [Docs][Python] Document DataFrame Interchange Protocol implementation and usage (#35835)
_edit: just added something_

* Closes: #33980

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
1 parent dd26757 commit 1d75816

5 files changed: +154 -0 lines changed
docs/source/conf.py

Lines changed: 1 addition & 0 deletions
@@ -79,6 +79,7 @@
 # Show members for classes in .. autosummary
 autodoc_default_options = {
     'members': None,
+    'special-members': '__dataframe__',
     'undoc-members': None,
     'show-inheritance': None,
     'inherited-members': None

docs/source/python/api/tables.rst

Lines changed: 8 additions & 0 deletions
@@ -46,6 +46,14 @@ Classes
    TableGroupBy
    RecordBatchReader
 
+Dataframe Interchange Protocol
+------------------------------
+
+.. autosummary::
+   :toctree: ../generated/
+
+   interchange.from_dataframe
+
 .. _api.tensor:
 
 Tensors

docs/source/python/index.rst

Lines changed: 1 addition & 0 deletions
@@ -47,6 +47,7 @@ files into Arrow structures.
    filesystems_deprecated
    numpy
    pandas
+   interchange_protocol
    timestamps
    orc
    csv

docs/source/python/interchange_protocol.rst

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Dataframe Interchange Protocol
+==============================
+
+The interchange protocol is implemented for ``pa.Table`` and
+``pa.RecordBatch`` and is used to interchange data between
+PyArrow and other dataframe libraries that also have the
+protocol implemented. The data structures that are supported
+in the protocol are primitive data types plus the dictionary
+data type. The protocol also has missing data support and
+it supports chunking, meaning accessing the
+data in “batches” of rows.
+
+
+The Python dataframe interchange protocol is designed by the
+`Consortium for Python Data API Standards <https://data-apis.org/>`_
+in order to enable data interchange between dataframe
+libraries in the Python ecosystem. See more about the
+standard in the
+`protocol documentation <https://data-apis.org/dataframe-protocol/latest/index.html>`_.
+
+From pyarrow to other libraries: ``__dataframe__()`` method
+-----------------------------------------------------------
+
+The ``__dataframe__()`` method creates a new exchange object that
+the consumer library can take and construct an object of its own.
+
+.. code-block::
+
+    >>> import pyarrow as pa
+    >>> table = pa.table({"n_atendees": [100, 10, 1]})
+    >>> table.__dataframe__()
+    <pyarrow.interchange.dataframe._PyArrowDataFrame object at ...>
+
+This is meant to be used by the consumer library when calling
+the ``from_dataframe()`` function and is not meant to be used manually
+by the user.
+
+From other libraries to pyarrow: ``from_dataframe()``
+-----------------------------------------------------
+
+With the ``from_dataframe()`` function, we can construct a :class:`pyarrow.Table`
+from any dataframe object that implements the
+``__dataframe__()`` method via the dataframe interchange
+protocol.
+
+We can for example take a pandas dataframe and construct a
+pyarrow table with the use of the interchange protocol:
+
+.. code-block::
+
+    >>> import pyarrow
+    >>> from pyarrow.interchange import from_dataframe
+
+    >>> import pandas as pd
+    >>> df = pd.DataFrame({
+    ...     "n_atendees": [100, 10, 1],
+    ...     "country": ["Italy", "Spain", "Slovenia"],
+    ... })
+    >>> df
+       n_atendees   country
+    0         100     Italy
+    1          10     Spain
+    2           1  Slovenia
+    >>> from_dataframe(df)
+    pyarrow.Table
+    n_atendees: int64
+    country: large_string
+    ----
+    n_atendees: [[100,10,1]]
+    country: [["Italy","Spain","Slovenia"]]
+
+We can do the same with a polars dataframe:
+
+.. code-block::
+
+    >>> import polars as pl
+    >>> from datetime import datetime
+    >>> arr = [datetime(2023, 5, 20, 10, 0),
+    ...        datetime(2023, 5, 20, 11, 0),
+    ...        datetime(2023, 5, 20, 13, 30)]
+    >>> df = pl.DataFrame({
+    ...     'Talk': ['About Polars','Intro into PyArrow','Coding in Rust'],
+    ...     'Time': arr,
+    ... })
+    >>> df
+    shape: (3, 2)
+    ┌────────────────────┬─────────────────────┐
+    │ Talk               ┆ Time                │
+    │ ---                ┆ ---                 │
+    │ str                ┆ datetime[μs]        │
+    ╞════════════════════╪═════════════════════╡
+    │ About Polars       ┆ 2023-05-20 10:00:00 │
+    │ Intro into PyArrow ┆ 2023-05-20 11:00:00 │
+    │ Coding in Rust     ┆ 2023-05-20 13:30:00 │
+    └────────────────────┴─────────────────────┘
+    >>> from_dataframe(df)
+    pyarrow.Table
+    Talk: large_string
+    Time: timestamp[us]
+    ----
+    Talk: [["About Polars","Intro into PyArrow","Coding in Rust"]]
+    Time: [[2023-05-20 10:00:00.000000,2023-05-20 11:00:00.000000,2023-05-20 13:30:00.000000]]

python/pyarrow/interchange/from_dataframe.py

Lines changed: 25 additions & 0 deletions
@@ -74,6 +74,31 @@ def from_dataframe(df: DataFrameObject, allow_copy=True) -> pa.Table:
     Returns
     -------
     pa.Table
+
+    Examples
+    --------
+    >>> import pyarrow
+    >>> from pyarrow.interchange import from_dataframe
+
+    Convert a pandas dataframe to a pyarrow table:
+
+    >>> import pandas as pd
+    >>> df = pd.DataFrame({
+    ...     "n_atendees": [100, 10, 1],
+    ...     "country": ["Italy", "Spain", "Slovenia"],
+    ... })
+    >>> df
+       n_atendees   country
+    0         100     Italy
+    1          10     Spain
+    2           1  Slovenia
+    >>> from_dataframe(df)
+    pyarrow.Table
+    n_atendees: int64
+    country: large_string
+    ----
+    n_atendees: [[100,10,1]]
+    country: [["Italy","Spain","Slovenia"]]
     """
     if isinstance(df, pa.Table):
         return df
