Description
(following the discussion in #26710)
Currently, most of the io functionality (csv, json, html, pickle, stata...) lives in `pandas/io`. But the `to_*` functions are implemented in `pandas/core/[generic|frame|series].py`. The tests live in `pandas/tests/io`. And their dependencies are not explicit anywhere afaik: they are imported lazily, so the needed libraries are reported only once the function is used (some are listed in `environment.yml` but not all).
I propose to move every io type to a directory in `pandas/contrib/io` that will include:
- The `read_*` and `to_*` functions
- The docstrings with the documentation
- The tests
- A file with the dependencies
So, an example structure could be:
pandas/contrib/io/stata/__init__.py
pandas/contrib/io/stata/reader.py
pandas/contrib/io/stata/writer.py
pandas/contrib/io/stata/tests/*
pandas/contrib/io/stata/dependencies.yml
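To make the idea of a per-format dependencies file a bit more concrete, here is a minimal sketch of how such a declaration could be checked up front. The format of `dependencies.yml` is not defined in this proposal, so the sketch assumes it boils down to a list of importable module names; `missing_dependencies` and the names in `declared` are hypothetical.

```python
from importlib.util import find_spec


def missing_dependencies(required):
    """Return the names in `required` that cannot be imported in this environment."""
    return [name for name in required if find_spec(name) is None]


# Assumption: the parsed contents of a format's dependencies file are a plain
# list of top-level module names. Both names below are only for illustration.
declared = ["json", "hypothetical_missing_lib_xyz"]

print(missing_dependencies(declared))
```

With something like this, a missing library could be reported when the io module is selected, instead of when the function is first used.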
To call the functionality we could simply have something like this in `Series`, `DataFrame` (also something similar for `read_*`), but other ideas welcome:
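import importlib

def __getattr__(self, name):
    # Resolve to_<format> lazily from the matching io module
    if name.startswith('to_'):
        mod = importlib.import_module('pandas.contrib.io.{}'.format(name[3:]))
        return mod.export_dataframe()
    raise AttributeError(name)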
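The dispatch idea above can be tried without touching pandas. The following is only a sketch under stated assumptions: `Frame` is a toy stand-in for `DataFrame`, the `demo_contrib.io.text` module is a fake registered in `sys.modules` (it plays the role of a `pandas/contrib/io/<format>` package), and binding the object with `functools.partial` is one possible choice, not something this proposal settles.

```python
import functools
import importlib
import sys
import types


def _make_fake_io_module(fmt):
    """Build a hypothetical stand-in for an io package exposing export_dataframe."""
    mod = types.ModuleType("demo_contrib.io." + fmt)

    def export_dataframe(frame, sep=","):
        # Join every row with `sep`, one row per line
        return "\n".join(sep.join(str(v) for v in row) for row in frame.rows)

    mod.export_dataframe = export_dataframe
    return mod


# Register the fake package hierarchy so import_module can find it
for pkg in ("demo_contrib", "demo_contrib.io"):
    sys.modules.setdefault(pkg, types.ModuleType(pkg))
sys.modules["demo_contrib.io.text"] = _make_fake_io_module("text")


class Frame:
    """Toy container, used here instead of a real DataFrame."""

    def __init__(self, rows):
        self.rows = rows

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails
        if name.startswith("to_"):
            try:
                mod = importlib.import_module("demo_contrib.io." + name[3:])
            except ImportError:
                raise AttributeError(name) from None
            # Bind the object so that frame.to_text(...) behaves like a method
            return functools.partial(mod.export_dataframe, self)
        raise AttributeError(name)


frame = Frame([[1, 2], [3, 4]])
print(frame.to_text(sep=";"))
```

Raising `AttributeError` for unknown formats keeps `hasattr` and similar introspection working as usual.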
I see several advantages here:
- A clearer (more modular and more uniform) structure of the code (`generic.py` has more than 11k lines of code, a significant part of which is io related, same for `frame.py` with 8k...)
- We can better manage the dependencies (they would be explicit, and we don't need to have lazy dependencies if everything is imported in a lazy way). Around two thirds of our optional dependencies are for IO modules, from what I've seen.
- We can easily and explicitly decide in every build which io modules we want to test, and avoid having all the `skip_if_no` calls that cause problems, where tests stop being run without anyone noticing
- Third party packages can be developed following a similar structure, so we can potentially add them to pandas if new formats become popular, or we can easily move io packages that we consider not worth maintaining ourselves anymore to a third party project.
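As a rough illustration of the per-build test selection mentioned in the list above, a build could list the io modules it wants to test and fail loudly when one of them cannot run, rather than skipping silently. This is a sketch only: `io_test_paths`, the `declared_deps` mapping and the way a build would pass its list of formats are all assumptions, not part of the proposal.

```python
from importlib.util import find_spec


def io_test_paths(enabled, declared_deps, root="pandas/contrib/io"):
    """Return the test directories for the enabled formats.

    Raises RuntimeError if an enabled format has a dependency that is not
    installed, so the tests are never dropped without anyone noticing.
    """
    paths = []
    for fmt in enabled:
        missing = [dep for dep in declared_deps[fmt] if find_spec(dep) is None]
        if missing:
            raise RuntimeError(
                "build enables {!r} but lacks: {}".format(fmt, ", ".join(missing))
            )
        paths.append("{}/{}/tests".format(root, fmt))
    return paths


# Hypothetical declarations, one entry per io module
declared_deps = {"json": ["json"], "exotic": ["hypothetical_missing_lib_xyz"]}

print(io_test_paths(["json"], declared_deps))
```

The returned paths could then be handed to the test runner for that build.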
Not part of this proposal, but I think in the future we could also move to `contrib` other parts that are not IO but could also benefit from being decoupled, like plotting or the extension arrays (that's why I think `contrib/io/` makes sense, so we can have `contrib/plotting`... in the future).
CC: @pandas-dev/pandas-core