CLN: Implement io modules as plugins

(following the discussion in #26710)

Currently, most of the `io` functionality (csv, json, html, pickle, stata...) lives in `pandas/io`, but the `to_*` functions are implemented in `pandas/core/[generic|frame|series].py`, and the tests live in `pandas/tests/io`. Their dependencies are not explicit anywhere, afaik: they are imported lazily, so a missing library is only reported once the function is used (some are listed in `environment.yml`, but not all).

I propose to move every io type to its own directory in `pandas/contrib/io`, which would include:
- The `read_*` and `to_*` functions
- The docstrings with the documentation
- The tests
- A file with the dependencies

So, an example structure could be:
```
pandas/contrib/io/stata/__init__.py
pandas/contrib/io/stata/reader.py
pandas/contrib/io/stata/writer.py
pandas/contrib/io/stata/tests/*
pandas/contrib/io/stata/dependencies.yml
```

To call the functionality, we could simply have something like this in `Series` and `DataFrame` (and something similar for `read_*`), but other ideas are welcome:
```python
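import importlib

def __getattr__(self, name):
    # Resolve `to_<format>` lazily from the matching io module
    if name.startswith('to_'):
        mod = importlib.import_module('pandas.contrib.io.{}'.format(name[3:]))
        return mod.export_dataframe()
    raise AttributeError(name)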
```
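
To make the dispatch idea concrete, here is a minimal, self-contained sketch that runs without pandas. Everything in it is an assumption for illustration: the `demo_contrib_io` package stands in for `pandas/contrib/io`, `Frame` stands in for `DataFrame`, and it assumes each plugin exposes an `export_dataframe(frame, ...)` function, which is then bound to the instance with `functools.partial`.

```python
import functools
import importlib
import sys
import types

# Hypothetical package name standing in for the proposed pandas.contrib.io.
PLUGIN_PREFIX = "demo_contrib_io"


def _register_demo_plugin():
    """Create an in-memory 'csv' plugin so the example needs no files on disk."""
    pkg = types.ModuleType(PLUGIN_PREFIX)
    pkg.__path__ = []  # mark it as a package so submodule lookups are allowed
    csv_mod = types.ModuleType(PLUGIN_PREFIX + ".csv")

    def export_dataframe(frame, sep=","):
        # Assumed plugin entry point: takes the frame as its first argument.
        return "\n".join(sep.join(str(v) for v in row) for row in frame.rows)

    csv_mod.export_dataframe = export_dataframe
    sys.modules[PLUGIN_PREFIX] = pkg
    sys.modules[PLUGIN_PREFIX + ".csv"] = csv_mod


class Frame:
    """Toy stand-in for DataFrame; holds a list of rows."""

    def __init__(self, rows):
        self.rows = rows

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("to_"):
            try:
                mod = importlib.import_module(f"{PLUGIN_PREFIX}.{name[3:]}")
            except ImportError:
                raise AttributeError(f"no io plugin for {name!r}") from None
            # Bind the frame so the result is called like a regular method.
            return functools.partial(mod.export_dataframe, self)
        raise AttributeError(name)


_register_demo_plugin()
frame = Frame([[1, 2], [3, 4]])
csv_text = frame.to_csv()
```

With this shape, `frame.to_csv(sep=";")` forwards keyword arguments to the plugin, and asking for a format with no plugin raises `AttributeError`, so `hasattr(frame, "to_parquet")` is `False` in this toy setup.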

I see several advantages here:
- A clearer (more modular and more uniform) code structure: `generic.py` has more than 11k lines of code, a significant part of them io related, and the same goes for `frame.py` with 8k.
- We can manage the dependencies better: they would be explicit, and we wouldn't need lazy dependency imports if every io module is itself imported lazily. From what I've seen, around two thirds of our optional dependencies are for IO modules.
- We can easily and explicitly decide in every build which io modules we want to test, and avoid all the `skip_if_no` checks, which cause problems because tests stop being run without anyone noticing.
- Third-party packages can be developed following a similar structure, so we could add them to pandas if new formats become popular, or easily move io packages that we no longer consider worth maintaining ourselves to a third-party project.

Not part of this proposal, but I think in the future we could also move other non-IO parts that would benefit from being decoupled, like plotting or the extension arrays, to `contrib`. That's why I think `contrib/io/` makes sense: we could have `contrib/plotting` and so on later.

CC: @pandas-dev/pandas-core 

CLN: Implement io modules as plugins #26804
