|
| 1 | +.. Licensed to the Apache Software Foundation (ASF) under one |
| 2 | +.. or more contributor license agreements. See the NOTICE file |
| 3 | +.. distributed with this work for additional information |
| 4 | +.. regarding copyright ownership. The ASF licenses this file |
| 5 | +.. to you under the Apache License, Version 2.0 (the |
| 6 | +.. "License"); you may not use this file except in compliance |
| 7 | +.. with the License. You may obtain a copy of the License at |
| 8 | +
|
| 9 | +.. http://www.apache.org/licenses/LICENSE-2.0 |
| 10 | +
|
| 11 | +.. Unless required by applicable law or agreed to in writing, |
| 12 | +.. software distributed under the License is distributed on an |
| 13 | +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 14 | +.. KIND, either express or implied. See the License for the |
| 15 | +.. specific language governing permissions and limitations |
| 16 | +.. under the License. |
| 17 | +
|
| 18 | +Dataframe Interchange Protocol |
| 19 | +============================== |
| 20 | + |
| 21 | +The interchange protocol is implemented for ``pa.Table`` and |
| 22 | +``pa.RecordBatch``. It is used to exchange data between |
| 23 | +PyArrow and other dataframe libraries that also implement |
| 24 | +the protocol. The data types supported by the protocol are |
| 25 | +the primitive data types plus the dictionary data type. |
| 26 | +The protocol also supports missing data and chunking, |
| 27 | +meaning that the data can be accessed in |
| 28 | +“batches” of rows. |
| 29 | + |
| 30 | + |
| 31 | +The Python dataframe interchange protocol was designed by the |
| 32 | +`Consortium for Python Data API Standards <https://data-apis.org/>`_ |
| 33 | +in order to enable data interchange between dataframe |
| 34 | +libraries in the Python ecosystem. See more about the |
| 35 | +standard in the |
| 36 | +`protocol documentation <https://data-apis.org/dataframe-protocol/latest/index.html>`_. |
| 37 | + |
| 38 | +From pyarrow to other libraries: ``__dataframe__()`` method |
| 39 | +----------------------------------------------------------- |
| 40 | + |
| 41 | +The ``__dataframe__()`` method creates a new interchange object that |
| 42 | +the consumer library can use to construct an object of its own. |
| 43 | + |
| 44 | +.. code-block:: |
| 45 | +
|
| 46 | + >>> import pyarrow as pa |
| 47 | + >>> table = pa.table({"n_attendees": [100, 10, 1]}) |
| 48 | + >>> table.__dataframe__() |
| 49 | + <pyarrow.interchange.dataframe._PyArrowDataFrame object at ...> |
| 50 | +
|
| 51 | +This method is meant to be called by the consumer library inside |
| 52 | +its ``from_dataframe()`` function and is not meant to be called |
| 53 | +directly by the user. |
| 54 | + |
| 55 | +From other libraries to pyarrow: ``from_dataframe()`` |
| 56 | +----------------------------------------------------- |
| 57 | + |
| 58 | +With the ``from_dataframe()`` function, we can construct a :class:`pyarrow.Table` |
| 59 | +from any dataframe object that implements the |
| 60 | +``__dataframe__()`` method via the dataframe interchange |
| 61 | +protocol. |
| 62 | + |
| 63 | +For example, we can take a pandas dataframe and construct a |
| 64 | +pyarrow table using the interchange protocol: |
| 65 | + |
| 66 | +.. code-block:: |
| 67 | +
|
| 68 | + >>> import pyarrow |
| 69 | + >>> from pyarrow.interchange import from_dataframe |
| 70 | +
|
| 71 | + >>> import pandas as pd |
| 72 | + >>> df = pd.DataFrame({ |
| 73 | + ... "n_attendees": [100, 10, 1], |
| 74 | + ... "country": ["Italy", "Spain", "Slovenia"], |
| 75 | + ... }) |
| 76 | + >>> df |
| 77 | + n_attendees country |
| 78 | + 0 100 Italy |
| 79 | + 1 10 Spain |
| 80 | + 2 1 Slovenia |
| 81 | + >>> from_dataframe(df) |
| 82 | + pyarrow.Table |
| 83 | + n_attendees: int64 |
| 84 | + country: large_string |
| 85 | + ---- |
| 86 | + n_attendees: [[100,10,1]] |
| 87 | + country: [["Italy","Spain","Slovenia"]] |
| 88 | +
|
| 89 | +We can do the same with a polars dataframe: |
| 90 | + |
| 91 | +.. code-block:: |
| 92 | +
|
| 93 | + >>> import polars as pl |
| 94 | + >>> from datetime import datetime |
| 95 | + >>> arr = [datetime(2023, 5, 20, 10, 0), |
| 96 | + ... datetime(2023, 5, 20, 11, 0), |
| 97 | + ... datetime(2023, 5, 20, 13, 30)] |
| 98 | + >>> df = pl.DataFrame({ |
| 99 | + ... 'Talk': ['About Polars','Intro into PyArrow','Coding in Rust'], |
| 100 | + ... 'Time': arr, |
| 101 | + ... }) |
| 102 | + >>> df |
| 103 | + shape: (3, 2) |
| 104 | + ┌────────────────────┬─────────────────────┐ |
| 105 | + │ Talk ┆ Time │ |
| 106 | + │ --- ┆ --- │ |
| 107 | + │ str ┆ datetime[μs] │ |
| 108 | + ╞════════════════════╪═════════════════════╡ |
| 109 | + │ About Polars ┆ 2023-05-20 10:00:00 │ |
| 110 | + │ Intro into PyArrow ┆ 2023-05-20 11:00:00 │ |
| 111 | + │ Coding in Rust ┆ 2023-05-20 13:30:00 │ |
| 112 | + └────────────────────┴─────────────────────┘ |
| 113 | + >>> from_dataframe(df) |
| 114 | + pyarrow.Table |
| 115 | + Talk: large_string |
| 116 | + Time: timestamp[us] |
| 117 | + ---- |
| 118 | + Talk: [["About Polars","Intro into PyArrow","Coding in Rust"]] |
| 119 | + Time: [[2023-05-20 10:00:00.000000,2023-05-20 11:00:00.000000,2023-05-20 13:30:00.000000]] |