GH-33980: [Docs][Python] Document DataFrame Interchange Protocol implementation and usage #35835
Merged
jorisvandenbossche
merged 9 commits into
apache:main
from
AlenkaF:gh-33980-df-protocol-docs
Jun 7, 2023
c91f385
Add __dataframe__ to the API docs for pa.Table and pa.RecordBatch
AlenkaF 09c0ddb
Add from_dataframe to the API docs and add an example
AlenkaF 3d35cfe
Add a page to the Python User Guide
AlenkaF c3abb8e
Apply suggestions from code review - Joris
AlenkaF f0d5917
Remove polars ex from the docstring
AlenkaF c5afa7b
Remove the shifting of currentmodules in tables.rst
AlenkaF 913ca86
Change the titles
AlenkaF ebc6ac1
Add note about the dunder method and calling it manually
AlenkaF 81f80dd
Apply suggestions from code review
jorisvandenbossche
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,119 @@ | ||
| .. Licensed to the Apache Software Foundation (ASF) under one | ||
| .. or more contributor license agreements. See the NOTICE file | ||
| .. distributed with this work for additional information | ||
| .. regarding copyright ownership. The ASF licenses this file | ||
| .. to you under the Apache License, Version 2.0 (the | ||
| .. "License"); you may not use this file except in compliance | ||
| .. with the License. You may obtain a copy of the License at | ||
|
|
||
| .. http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| .. Unless required by applicable law or agreed to in writing, | ||
| .. software distributed under the License is distributed on an | ||
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| .. KIND, either express or implied. See the License for the | ||
| .. specific language governing permissions and limitations | ||
| .. under the License. | ||
|
|
||
| Dataframe Interchange Protocol | ||
| ============================== | ||
|
|
||
| The interchange protocol is implemented for ``pa.Table`` and | ||
| ``pa.RecordBatch`` and is used to exchange data between | ||
| PyArrow and other dataframe libraries that also implement the | ||
| protocol. The protocol supports primitive data types plus the | ||
| dictionary data type. It also supports missing data and | ||
| chunking, meaning that the data can be accessed in | ||
| “batches” of rows. | ||
|
|
||
|
|
||
| The Python dataframe interchange protocol was designed by the | ||
| `Consortium for Python Data API Standards <https://data-apis.org/>`_ | ||
| in order to enable data interchange between dataframe | ||
| libraries in the Python ecosystem. See more about the | ||
| standard in the | ||
| `protocol documentation <https://data-apis.org/dataframe-protocol/latest/index.html>`_. | ||
|
|
||
| From pyarrow to other libraries: ``__dataframe__()`` method | ||
| ----------------------------------------------------------- | ||
|
|
||
| The ``__dataframe__()`` method creates a new exchange object that | ||
| the consumer library can take and use to construct its own object. | ||
|
|
||
| .. code-block:: | ||
|
|
||
| >>> import pyarrow as pa | ||
| >>> table = pa.table({"n_atendees": [100, 10, 1]}) | ||
| >>> table.__dataframe__() | ||
| <pyarrow.interchange.dataframe._PyArrowDataFrame object at ...> | ||
|
|
||
| This is meant to be used by the consumer library when calling | ||
| the ``from_dataframe()`` function and is not meant to be used manually | ||
| by the user. | ||
|
|
||
| From other libraries to pyarrow: ``from_dataframe()`` | ||
| ----------------------------------------------------- | ||
|
|
||
| With the ``from_dataframe()`` function, we can construct a :class:`pyarrow.Table` | ||
| from any dataframe object that implements the | ||
| ``__dataframe__()`` method via the dataframe interchange | ||
| protocol. | ||
|
|
||
| For example, we can take a pandas dataframe and construct a | ||
| pyarrow table using the interchange protocol: | ||
|
|
||
| .. code-block:: | ||
|
|
||
| >>> import pyarrow | ||
| >>> from pyarrow.interchange import from_dataframe | ||
|
|
||
| >>> import pandas as pd | ||
| >>> df = pd.DataFrame({ | ||
| ... "n_atendees": [100, 10, 1], | ||
| ... "country": ["Italy", "Spain", "Slovenia"], | ||
| ... }) | ||
| >>> df | ||
| n_atendees country | ||
| 0 100 Italy | ||
| 1 10 Spain | ||
| 2 1 Slovenia | ||
| >>> from_dataframe(df) | ||
| pyarrow.Table | ||
| n_atendees: int64 | ||
| country: large_string | ||
| ---- | ||
| n_atendees: [[100,10,1]] | ||
| country: [["Italy","Spain","Slovenia"]] | ||
|
|
||
| We can do the same with a polars dataframe: | ||
|
|
||
| .. code-block:: | ||
|
|
||
| >>> import polars as pl | ||
| >>> from datetime import datetime | ||
| >>> arr = [datetime(2023, 5, 20, 10, 0), | ||
| ... datetime(2023, 5, 20, 11, 0), | ||
| ... datetime(2023, 5, 20, 13, 30)] | ||
| >>> df = pl.DataFrame({ | ||
| ... 'Talk': ['About Polars','Intro into PyArrow','Coding in Rust'], | ||
| ... 'Time': arr, | ||
| ... }) | ||
| >>> df | ||
| shape: (3, 2) | ||
| ┌────────────────────┬─────────────────────┐ | ||
| │ Talk ┆ Time │ | ||
| │ --- ┆ --- │ | ||
| │ str ┆ datetime[μs] │ | ||
| ╞════════════════════╪═════════════════════╡ | ||
| │ About Polars ┆ 2023-05-20 10:00:00 │ | ||
| │ Intro into PyArrow ┆ 2023-05-20 11:00:00 │ | ||
| │ Coding in Rust ┆ 2023-05-20 13:30:00 │ | ||
| └────────────────────┴─────────────────────┘ | ||
| >>> from_dataframe(df) | ||
| pyarrow.Table | ||
| Talk: large_string | ||
| Time: timestamp[us] | ||
| ---- | ||
| Talk: [["About Polars","Intro into PyArrow","Coding in Rust"]] | ||
| Time: [[2023-05-20 10:00:00.000000,2023-05-20 11:00:00.000000,2023-05-20 13:30:00.000000]] | ||