CLN: Implement io modules as plugins #26804

Open
@datapythonista

(following the discussion in #26710)

Currently, most of the io functionality (csv, json, html, pickle, stata...) lives in pandas/io, but the to_* functions are implemented in pandas/core/[generic|frame|series].py, and the tests live in pandas/tests/io. Their dependencies are not explicit anywhere, afaik: they are imported lazily, so a missing library is only reported once the function is used (some are listed in environment.yml, but not all).

I propose to move every io type to a directory in pandas/contrib/io that will include:

  • The read_* and to_* functions
  • The docstrings with the documentation
  • The tests
  • A file with the dependencies

So, an example structure could be:

pandas/contrib/io/stata/__init__.py
pandas/contrib/io/stata/reader.py
pandas/contrib/io/stata/writer.py
pandas/contrib/io/stata/tests/*
pandas/contrib/io/stata/dependencies.yml

To call the functionality, we could simply have something like this in Series and DataFrame (and something similar for read_*), but other ideas are welcome:

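import functools
import importlib

def __getattr__(self, name):
    if name.startswith('to_'):
        mod = importlib.import_module('pandas.contrib.io.{}'.format(name[3:]))
        return functools.partial(mod.export_dataframe, self)  # bind self so df.to_x(...) works
    raise AttributeError(name)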
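
To make the lazy lookup above concrete, here is a minimal, self-contained sketch that runs without pandas. Everything in it is an assumption for illustration: the package name demo_contrib_io stands in for pandas.contrib.io, Frame stands in for DataFrame, and the plugin module is built in memory instead of living on disk. It only shows that __getattr__ can resolve to_<format> at call time and that an unknown format still surfaces as a normal AttributeError.

```python
import functools
import importlib
import sys
import types

# Hypothetical package name; the proposal would use pandas.contrib.io.
PLUGIN_PACKAGE = "demo_contrib_io"


def _register_fake_plugin(fmt, export_func):
    """Install an in-memory stand-in for <PLUGIN_PACKAGE>.<fmt>."""
    if PLUGIN_PACKAGE not in sys.modules:
        pkg = types.ModuleType(PLUGIN_PACKAGE)
        pkg.__path__ = []  # an empty __path__ marks the module as a package
        sys.modules[PLUGIN_PACKAGE] = pkg
    mod = types.ModuleType(f"{PLUGIN_PACKAGE}.{fmt}")
    mod.export_dataframe = export_func
    sys.modules[mod.__name__] = mod


class Frame:
    """Tiny stand-in for DataFrame that holds a list of rows."""

    def __init__(self, rows):
        self.rows = rows

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        if name.startswith("to_"):
            try:
                mod = importlib.import_module(f"{PLUGIN_PACKAGE}.{name[3:]}")
            except ImportError:
                # Unknown format: behave like any other missing attribute.
                raise AttributeError(name) from None
            return functools.partial(mod.export_dataframe, self)
        raise AttributeError(name)


def export_tsv(frame, sep="\t"):
    """Hypothetical exporter that a 'tsv' plugin might provide."""
    return "\n".join(sep.join(str(v) for v in row) for row in frame.rows)


_register_fake_plugin("tsv", export_tsv)
frame = Frame([[1, 2], [3, 4]])
tsv_text = frame.to_tsv()
```

One trade-off of this design is that to_* methods resolved through __getattr__ do not show up in dir() or tab completion unless __dir__ is also overridden.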

I see several advantages here:

  • A clearer (more modular and more uniform) structure of the code (generic.py has more than 11k lines of code, a significant part of which is io related; same for frame.py with 8k...)
  • We can manage the dependencies better (they would be explicit, and we don't need lazy dependencies if everything is imported lazily). From what I've seen, around two thirds of our optional dependencies are for IO modules.
  • We can easily and explicitly decide in every build which io modules we want to test, and avoid all the skip_if_no decorators, which cause problems: tests stop being run without anyone noticing
  • Third party packages can be developed following a similar structure, so we can potentially add them to pandas if new formats become popular, or easily move io packages that we consider no longer worth maintaining ourselves to a third party project.
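
As a rough sketch of the testing point above: if each plugin declares its dependencies, a build can list the io modules it intends to test and fail loudly when something is missing, instead of silently skipping. The mapping below is a hypothetical stand-in for the contents of the per-plugin dependencies.yml files (the plugin and package names are made up), and missing_dependencies is an illustrative helper, not an existing pandas function.

```python
import importlib.util

# Hypothetical stand-in for what the dependencies.yml files would declare.
PLUGIN_DEPENDENCIES = {
    "csv": [],
    "json": ["json"],
    "fancy": ["surely_not_installed_pkg_xyz"],
}


def missing_dependencies(requested_plugins, deps_by_plugin):
    """Return {plugin: [absent packages]} for the plugins a build asked for."""
    missing = {}
    for plugin in requested_plugins:
        absent = [
            dep
            for dep in deps_by_plugin[plugin]
            # find_spec returns None when a top-level package is not importable
            if importlib.util.find_spec(dep) is None
        ]
        if absent:
            missing[plugin] = absent
    return missing


def check_build(requested_plugins, deps_by_plugin=PLUGIN_DEPENDENCIES):
    """Raise instead of skipping when a requested plugin cannot be tested."""
    missing = missing_dependencies(requested_plugins, deps_by_plugin)
    if missing:
        raise RuntimeError(f"cannot test requested io plugins: {missing}")
    return list(requested_plugins)
```

A build that asks for ["csv", "json"] passes the check, while one that asks for "fancy" without its dependency installed stops with an error rather than quietly dropping those tests.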

Not part of this proposal, but I think in the future we could also move to contrib other parts that are not IO but could benefit from being decoupled, like plotting or the extension arrays (that's why I think contrib/io/ makes sense, so we can have contrib/plotting... in the future).

CC: @pandas-dev/pandas-core


Labels: Enhancement, IO Data, Needs Discussion, Refactor
