PDEP-9: Allow third-party projects to register pandas connectors with a standard API #51799
Merged

Commits (21), all by datapythonista:

- 5dbdde9: PDEP-9: pandas I/O connectors as extensions
- 730df18: Merge remote-tracking branch 'upstream/main' into pdep9
- 23b934f: Final draft to be proposed
- de3a17b: Merge remote-tracking branch 'upstream/main' into pdep9
- da784ec: Address comments from code reviews, mostly by extending the proposal …
- f475350: Merge remote-tracking branch 'upstream/main' into pdep9
- 4a8ba96: Keep current I/O API and allow pandas as an interface
- 6ad6a9d: Merge remote-tracking branch 'upstream/main' into pdep9
- 5cb47d9: Rejecting
- 68ca3de: Reorder interfaces
- 150d1d1: Update web/pandas/pdeps/0009-io-extensions.md
- 6eea8a8: Use dataframe interchange protocol
- 5665dc7: Merge branch 'pdep9' of github.com:datapythonista/pandas into pdep9
- 40ebacc: typo
- aed569f: Merge branch 'main' into pdep9
- eb7c6f0: Make users load modules explicitly
- 14a2f4a: Merge branch 'pdep9' of github.com:datapythonista/pandas into pdep9
- 8050853: Update web/pandas/pdeps/0009-io-extensions.md
- 5cb23dd: Add limitations section
- 2af8577: Merge remote-tracking branch 'upstream/main' into pdep9
- ccb9674: Rejecting PDEP

Changes from commit 5dbdde9fb38c56e87c1e576eb555845c82a90754 (PDEP-9: pandas I/O connectors as extensions):

# PDEP-9: Implement pandas I/O connectors as extensions

- Created: 26 February 2023
- Status: Draft
- Discussion: [#XXXX](https://github.com/pandas-dev/pandas/pull/XXXX)
- Author: [Marc Garcia](https://github.com/datapythonista)
- Revision: 1

## Introduction

pandas supports importing and exporting data in different formats using
connectors, currently implemented in `pandas/io`. In many cases, those
connectors wrap an existing Python library, while in others pandas
implements the format logic itself.

In some cases, different engines exist for the same format. The API to use
those connectors is `pandas.read_<format>(engine='<engine-name>', ...)` to
import data, and `DataFrame.to_<format>(engine='<engine-name>', ...)` to
export data.

For objects exported to memory (like a Python dict) the API is the same as
for I/O, `DataFrame.to_<format>(...)`. For formats imported from objects in
memory, the API is different: `DataFrame.from_<format>(...)`.

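As an illustration of this mix of naming conventions, here is a short example
using only the existing public pandas API (the `engine='pyarrow'` option of
`read_csv` assumes pandas 1.4+ with PyArrow installed; file names are
placeholders):

```python
import pandas as pd

# I/O reader: a function in the pandas namespace
df = pd.read_csv("input.csv", engine="pyarrow")

# in-memory "reader": a classmethod named from_*
df = pd.DataFrame.from_dict({"a": [1, 2], "b": [3, 4]})

# I/O writer and in-memory "writer": both are DataFrame methods named to_*
df.to_csv("output.csv")
data = df.to_dict()
```
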
In some cases, the pandas API provides `DataFrame.to_*` methods that are not
used to export the data to a disk or memory object, but instead to transform
the index of a `DataFrame`: `DataFrame.to_period` and `DataFrame.to_timestamp`.

Dependencies of the I/O connectors are not loaded by default, and are only
imported when the connector is used. If the dependencies are not installed,
an `ImportError` is raised:

```python
>>> pandas.read_gbq(query)
Traceback (most recent call last):
...
ImportError: Missing optional dependency 'pandas-gbq'.
pandas-gbq is required to load data from Google BigQuery.
See the docs: https://pandas-gbq.readthedocs.io.
Use pip or conda to install pandas-gbq.
```

### Supported formats

The list of formats can be found in the [IO guide](https://pandas.pydata.org/docs/dev/user_guide/io.html).
A more detailed table, including in-memory objects, the available engines,
and their dependencies, is presented next.

| Format    | Reader | Writer | Engines                  | Dependencies |
|-----------|--------|--------|--------------------------|--------------|
| CSV       | X      | X      | `c`, `python`, `pyarrow` | `pyarrow`    |
| FWF       | X      |        |                          |              |
| JSON      | X      | X      |                          |              |
| HTML      | X      | X      |                          |              |
| LaTeX     |        | X      |                          |              |
| XML       | X      | X      |                          |              |
| Clipboard | X      | X      |                          |              |
| Excel     | X      | X      |                          |              |
| HDF5      | X      | X      |                          |              |
| Feather   | X      | X      |                          |              |
| Parquet   | X      | X      |                          |              |
| ORC       | X      | X      |                          |              |
| Stata     | X      | X      |                          |              |
| SAS       | X      |        |                          |              |
| SPSS      | X      |        |                          |              |
| Pickle    | X      | X      |                          |              |
| SQL       | X      | X      |                          |              |
| BigQuery  |        |        |                          |              |
| dict      | X      | X      |                          |              |
| records   | X      | X      |                          |              |
| string    |        | X      |                          |              |
| markdown  |        | X      |                          |              |
| xarray    |        | X      |                          |              |

### Inclusion criteria

There are no objective criteria for when a format is included
in pandas, and the list above is mostly the result of developers
being interested in implementing connectors for a certain
format in pandas.

The number of existing formats is constantly increasing, and it is
difficult for pandas to keep up to date even with popular formats.
It could probably make sense to have connectors to pyarrow,
pyspark, Iceberg, DuckDB, Polars, and others.

At the same time, some of the formats are not frequently used, as
shown in the [2019 user survey](https://pandas.pydata.org//community/blog/2019-user-survey.html).
Those less popular formats include SPSS, SAS, Google BigQuery and
Stata. Note that only I/O formats (and not memory formats like
records or xarray) were included in the survey.

## Proposal

The main proposal of this PDEP is to open the development of pandas
connectors to third parties. This would not only allow new connectors
to be developed faster and more easily, without the intervention of
the pandas team, but would also allow a number of the existing connectors
to be removed from the pandas code base, simplifying the code, the CI and
the builds. While a limited set of core connectors could live in the pandas
code base, most of the existing connectors would be moved to third-party
projects.

The user experience would remain similar to the existing one, but would make
better use of namespaces and add consistency. Any pandas connector
(whether implemented as a third-party module or not) would define
a Python entrypoint specifying the format it connects to, the operations
it supports (read and/or write) and the name of the engine to be used.
On load, pandas would access this registry of connectors and create
the corresponding import and export methods.

To use the connectors for a given format, users would install the third-party
connector package, instead of installing the required dependencies as they
need to do now.

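One possible implementation of this registration mechanism is the standard
Python entry point machinery. The sketch below is only illustrative: the entry
point group name `pandas.io_engines`, the attribute names on the connector
object, and the `pandas_duckdb` package are assumptions, not part of the
proposal, and `entry_points(group=...)` requires Python 3.10+.

```python
from importlib.metadata import entry_points


def discover_io_connectors():
    """Collect callables registered by installed third-party connector packages."""
    registry = {}
    # A hypothetical connector package would declare itself in its pyproject.toml:
    #
    #     [project.entry-points."pandas.io_engines"]
    #     duckdb = "pandas_duckdb:connector"
    #
    for ep in entry_points(group="pandas.io_engines"):
        connector = ep.load()  # object exposing the format name and read/write callables
        for operation in ("read", "write"):
            func = getattr(connector, operation, None)
            if func is not None:
                # e.g. ("duckdb", "read", "duckdb") -> pandas_duckdb.connector.read
                registry[(connector.format, operation, ep.name)] = func
    return registry
```

pandas could then attach a `read_<format>` / `write_<format>` method to the
proposed `DataFrame.io` accessor for each entry in this registry.
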
### Python API

The Python API can be improved over the current one to make better use
of namespaces and avoid inconsistencies. The proposed API is:

```python
import pandas

df = pandas.DataFrame.io.read_<format>(engine='<engine>', ...)

df.io.write_<format>(engine='<engine>', ...)
```

The `engine` parameter would only be required when more than one engine
is available for a format. This is similar to the current API, which
uses the default engine if none is specified.

For example:

```python
import pandas

df = pandas.DataFrame.io.read_hdf5('input.hdf5')

df.io.write_parquet('output.parquet')
```

All the I/O connectors would be accessed via `DataFrame.io`, significantly
reducing the number of items in the namespace of the `pandas` module and
of the `DataFrame` class. Introspection would make it fast and simple to
list the existing connectors with `dir(pandas.DataFrame.io)`.

The API is more intuitive than the current one, as it would be used for
both in-memory formats and disk formats, and it does not mix read/to (users
in general would expect read/write, from/to, import/export or input/output,
not a mix of those pairs).

### Ecosystem of connectors

In the same way Python can be extended with third-party modules, pandas
would be extendable with I/O plugins. This has some advantages:

- **Removal of the pandas maintainers' bottleneck.** Everybody would be
  able to develop and promote their own I/O connectors, without the
  approval or intervention of pandas maintainers.
- **Lower entry barrier than contributing to pandas.** Since pandas is a
  huge and mature project, writing code in pandas itself is complex.
  Several linters and autoformatters are required, policies such as adding
  release notes need to be followed, and proper testing must be implemented.
  CI is slow and takes hours to complete, and pandas needs to be compiled
  because of its C extensions. None of this would be necessary, so
  creating new I/O connectors would be faster and simpler.
- **CI and packaging simplification.** pandas currently has around 20
  dependencies required by connectors, and a significant number of
  tests, some of which require a high level of customization (such as
  an available database server to test `read_sql`, or a virtual
  clipboard to test `read_clipboard`). Moving connectors out of
  pandas would make the CI faster and reduce the number of problems
  caused by updates in dependencies.
- **Competition and alternatives for I/O operations.** Some of the
  supported formats allow for different approaches in terms of
  implementation. For example, `csv` connectors can be optimized
  for performance and reliability, or for ease of use. When
  building a production pipeline, users would often appreciate a
  loader that requires an expected schema, loads faster because of
  it, and fails if the file contains errors, while Jupyter users
  may prefer inference and magic that helps them write code faster.
- **Reusability with other projects.** In some cases it can make
  sense to load a format into, for example, Apache Arrow, and then
  convert it to a pandas `DataFrame` in the connector (see the sketch
  after this list). Once that is implemented, it could also be quite
  simple to return a Vaex or a Polars object instead. Having connectors
  as third-party packages would make this possible, unlike the current
  connectors. This reusability would not only benefit other
  dataframe projects, it would also lead to better maintained
  connectors, as they would be shared by a larger ecosystem.

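As a sketch of this reusability argument, a hypothetical third-party Parquet
connector could read through Apache Arrow and only convert to pandas at the
last step. The module name `pandas_parquet_arrow` and the `read`/`write`
signatures are invented for illustration; the `pyarrow` calls themselves are
existing API.

```python
# Hypothetical third-party connector module: pandas_parquet_arrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def read(path, columns=None):
    """Read a Parquet file into a pandas DataFrame via Apache Arrow."""
    table = pq.read_table(path, columns=columns)  # data lives in Arrow memory
    # The same Arrow table could instead be handed to Polars or Vaex here.
    return table.to_pandas()


def write(df: pd.DataFrame, path):
    """Write a pandas DataFrame to a Parquet file via Apache Arrow."""
    table = pa.Table.from_pandas(df)
    pq.write_table(table, path)
```
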
## Disadvantages

The main disadvantages to implementing this PDEP are:

- **Backward compatibility.**
- **More verbose API.**
- **Fragmented documentation.**

## Transition period

This proposal involves some important changes to user-facing
code.

The impact of implementing connectors as third-party packages is quite
small for users, who would just need to install, for example, `pandas-xarray`
instead of `xarray` to be able to use `DataFrame.to_xarray`. Also, the
`ImportError` message shown when the connector package is not properly
installed can provide the information users need to install the right
package without issues.

The part that requires more careful management and a long transition
period is the change to the Python API proposed here. The
new API does not overlap with the old one (everything would be in
the new `DataFrame.io` accessor). This makes it easy to implement
both the new and the old API in parallel, raising `FutureWarning`
in the old API, so users can gradually adapt their code
and get used to the new API. Since the changes affect all pandas
users, keeping the old behavior until at least pandas 4.0 seems
a reasonable transition period.

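A minimal sketch of this deprecation pattern, assuming the proposed
`DataFrame.io` accessor existed; the wrapped function and the warning text
are illustrative only:

```python
import warnings

import pandas as pd


def read_hdf(path, *args, **kwargs):
    """Legacy reader kept during the transition period."""
    warnings.warn(
        "pandas.read_hdf is deprecated; use pandas.DataFrame.io.read_hdf5 instead.",
        FutureWarning,
        stacklevel=2,
    )
    # Delegate to the new accessor-based API (hypothetical in current pandas).
    return pd.DataFrame.io.read_hdf5(path, *args, **kwargs)
```
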
## PDEP-9 History

- 26 February 2023: Initial version

Review comment: Related: I've recently been thinking that DataFrame/Series methods that only operate on the index/columns might make sense to put in an accessor/namespace.