More control over folder structure #2553

Open
fmfreeze opened this issue May 3, 2023 · 5 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@fmfreeze

fmfreeze commented May 3, 2023

On the Slack channel, @astrojuanlu, @deepyaman and I discussed the possibility of configuring Kedro so that it knows about and works with a custom folder structure.

For example, the src folder in a Kedro repo may simply have a different name.
That turned out to be as easy as adding source_dir = "my_name" to the [tool.kedro] section of the pyproject.toml file, so Kedro runs successfully with this structure:

.
├── README.md
├── conf
│   ├── ...
├── data
│   ├── ...
├── pyproject.toml
├── setup.cfg
└── spaceflights
    ├── __init__.py
    ├── __main__.py
    ├── pipeline_registry.py
    ├── pipelines
    └── settings.py
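
As a rough illustration of why renaming the folder is cheap: assuming Kedro essentially resolves source_dir against the project root and puts the result on sys.path, the effect can be sketched as below. The helper name add_source_dir_to_path is hypothetical and is not Kedro's API.

```python
import sys
from pathlib import Path


def add_source_dir_to_path(project_path: Path, source_dir: str = "src") -> Path:
    """Hypothetical helper: make the package inside `source_dir` importable."""
    source_path = (project_path / source_dir).resolve()
    if not source_path.is_dir():
        raise NotADirectoryError(f"Source directory not found: {source_path}")
    # Prepend so the project's package wins over same-named installed ones.
    if str(source_path) not in sys.path:
        sys.path.insert(0, str(source_path))
    return source_path
```

With this in place, the folder name is just a value; nothing else in the import machinery depends on it being called src.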

But it would be even better to have more configuration control over the folder structure, so that Kedro can work with a layout like the following, where a kedro_src folder bundles settings.py and pipeline_registry.py (and maybe even __main__.py?):

.
├── README.md
├── conf
│   ├── ...
├── data
│   ├── ...
├── pyproject.toml
├── setup.cfg
└── spaceflights
    ├── __init__.py
    ├── __main__.py
    ├── pipelines
    └── kedro_src
        ├── settings.py 
        └── pipeline_registry.py

But at the moment those paths are hardcoded:

settings_module = f"{package_name}.settings"
settings.configure(settings_module)
pipelines_module = f"{package_name}.pipeline_registry"
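
The consequence of that convention can be reproduced without Kedro. The sketch below builds a throwaway package that mimics the proposed layout (demo_flights is a made-up stand-in for the real package name, and CONF_SOURCE is just an example attribute) and shows that the top-level lookup fails while the nested module is importable under its full dotted path.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package that mimics the proposed nested layout.
package_name = "demo_flights"  # made-up stand-in for the real package
tmp = tempfile.mkdtemp()
nested = Path(tmp) / package_name / "kedro_src"
nested.mkdir(parents=True)
(nested.parent / "__init__.py").write_text("")
(nested / "__init__.py").write_text("")
(nested / "settings.py").write_text("CONF_SOURCE = 'conf'\n")
sys.path.insert(0, tmp)
importlib.invalidate_caches()

# The hardcoded convention only looks at the top of the package ...
try:
    importlib.import_module(f"{package_name}.settings")
    found_at_top_level = True
except ModuleNotFoundError:
    found_at_top_level = False

# ... while the nested module imports fine under its full dotted path.
nested_settings = importlib.import_module(f"{package_name}.kedro_src.settings")
print(found_at_top_level, nested_settings.CONF_SOURCE)
```

So the layout itself is not the obstacle; only the fixed module names are.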

Wouldn't that be great? :)

But why would it be great?
In organisations, there are often already established cookiecutter templates for their data projects. Kedro would be easier to integrate into those.
In a scientific environment, as a data engineer I try to help the scientists focus on their actual tasks and not on (boilerplate) infrastructure. The more easily a tool like Kedro can be integrated, the less confusion it causes among our scientists (who are not software developers, and don't have to be :)

Minor edits by @astrojuanlu

@AhdraMeraliQB AhdraMeraliQB added Issue: Feature Request New feature or improvement to existing feature Community Issue/PR opened by the open-source community labels May 16, 2023
@notniknot

I would like to piggyback on this issue, as we are trying to achieve a more "organized" project structure by separating the pipeline code from the model serving code, so that a folder structure like the following becomes possible:

...
├── pyproject.toml
└── src
    ├── requirements.txt
    └── spaceflights
        ├── __init__.py
        ├── __main__.py
        ├── pipelines
        │   ├── __init__.py
        │   ├── data_preprocessing
        │   │   ├── __init__.py
        │   │   ├── nodes.py
        │   │   └── pipeline.py
        │   ├── data_science
        │   │   ├── __init__.py
        │   │   ├── nodes.py
        │   │   └── pipeline.py
        │   └── pipeline_registry.py
        ├── serving
        │   └── ...
        └── settings.py

Customizing the paths for the project structure, such as moving pipeline_registry.py, would definitely be a welcome feature.

@astrojuanlu
Member

At the moment the only hardcoded paths are settings.py and pipeline_registry.py, which must exist and must be at the top level of the package.

One solution to remove those paths would be to use entry points so that Kedro projects advertise where their pipeline_registry.py and settings.py are. Something like:

# pyproject.toml

[project.entry-points."spaceflights.kedro"]
register_pipelines = "spaceflights.pipelines.pipeline_registry:register_pipelines"  # A function
settings = "spaceflights.settings"  # A module

And then

In [1]: from importlib.metadata import entry_points

In [2]: kedro_eps = entry_points(group="spaceflights.kedro")

In [3]: kedro_eps["register_pipelines"].load()
Out[3]: <function spaceflights.pipelines.pipeline_registry.register_pipelines()>

In [4]: kedro_eps["settings"].load()
Out[4]: <module 'spaceflights.settings' from '/private/tmp/test-eps/src/spaceflights/settings.py'>

(crazy idea, all names are bikesheddable, and probably there are implications I didn't consider)
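
The same mechanism can be tried without installing a project by constructing EntryPoint objects by hand. In this minimal sketch, standard-library targets stand in for the project's function and module, and load_advertised_objects is a hypothetical helper name.

```python
from importlib.metadata import EntryPoint

# Stand-in values: stdlib targets are used so the sketch runs anywhere.
# A real project would advertise something like
# "spaceflights.pipelines.pipeline_registry:register_pipelines".
advertised = [
    EntryPoint(name="register_pipelines", value="json:dumps", group="spaceflights.kedro"),
    EntryPoint(name="settings", value="json.decoder", group="spaceflights.kedro"),
]


def load_advertised_objects(eps):
    """Hypothetical loader: map each advertised name to the loaded object."""
    # "module:attr" loads the attribute; a bare "module" loads the module.
    return {ep.name: ep.load() for ep in eps}


loaded = load_advertised_objects(advertised)
print(loaded["register_pipelines"], loaded["settings"])
```

The framework side would only need to know the group name; the dotted paths stay entirely under the project's control.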

Another idea would be to write those directly in the [tool.kedro] section of pyproject.toml and not rely on the entry points functionality.

Another idea could be to designate a way to say in spaceflights.__init__ where both settings and the register_pipelines function are. But this "pollutes" the code anyway, so maybe it is not a huge improvement over having designated places for those files.
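
For comparison, the __init__-based variant could look roughly like this. The attribute names are hypothetical, and a bare ModuleType object simulates the package so the sketch runs on its own.

```python
import types

# Hypothetical attribute names a package could set in its __init__.py.
SETTINGS_ATTR = "KEDRO_SETTINGS_MODULE"
REGISTRY_ATTR = "KEDRO_PIPELINE_REGISTRY_MODULE"


def locate_modules(package: types.ModuleType) -> tuple[str, str]:
    """Read optional pointers from the package, else use the default names."""
    name = package.__name__
    return (
        getattr(package, SETTINGS_ATTR, f"{name}.settings"),
        getattr(package, REGISTRY_ATTR, f"{name}.pipeline_registry"),
    )


# Simulate spaceflights/__init__.py declaring a custom settings location only.
pkg = types.ModuleType("spaceflights")
pkg.KEDRO_SETTINGS_MODULE = "spaceflights.kedro_src.settings"
print(locate_modules(pkg))
```

Note that this approach requires importing the package before its settings can be located, which is part of the "pollution" mentioned above.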

And I don't think there are more possibilities, unless I'm missing something.

@noklam
Contributor

noklam commented Aug 22, 2023

@notniknot From what I see, your pipelines structure should work already. Is the only problem that you cannot move pipeline_registry.py one level down?

> In Organisations there are often already established cookiecutter templates for their whatever data project. Kedro would be easier to integrate into those.
> In a scientific environment, as a data-engineer I try to help making the scientists focus on actual tasks, and not (boilerplate) infrastructure.

@fmfreeze You mentioned two different points.
As for your first comment about cookiecutter, do you happen to have files named settings.py and pipeline_registry.py in your template? These are both very Kedro-specific files. Why would moving them inside one folder make integration easier? If this is purely for hiding Kedro configs from data scientists, then I can relate more.

@notniknot

@noklam Yes, moving pipeline_registry.py one level further down would separate the pipeline specifics from non-Kedro or "lesser-Kedro" related things.

@noklam
Contributor

noklam commented Jul 16, 2024

> with kedro_src folder bundling settings.py and pipeline_registry.py (and maybe even __main__.py?):

To communicate this better: the only fixed points are settings.py and pipeline_registry.py. __main__.py is not a Kedro-specific file; this is simply how Python works, and you are free to move it anywhere (you can point the entry point at any file). For example, python -m spaceflights will look for the package's __main__.py automatically, which is why the default location is the top level of the package.

@notniknot @fmfreeze Is this something that is still desired? On the implementation side this is quite simple, and I don't think it will break anything. We already have source_dir in the project metadata, so this requires adding two new fields to that section. Ultimately, there has to be at least ONE fixed file for Kedro to understand where to look for the pipelines / settings.

def bootstrap_project(project_path: str | Path) -> ProjectMetadata:
    """Run setup required at the beginning of the workflow
    when running in project mode, and return project metadata.
    """

    project_path = Path(project_path).expanduser().resolve()
    metadata = _get_project_metadata(project_path)
    _add_src_to_path(metadata.source_dir, project_path)
    configure_project(metadata.package_name)
    return metadata
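
The "at least one fixed file" point can be made concrete. If pyproject.toml is that anchor, everything else can be looked up from it. The sketch below assumes the anchor is recognised by the presence of a [tool.kedro] table; find_project_root is a hypothetical helper, not Kedro's API.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path) -> Optional[Path]:
    """Hypothetical helper: walk upwards to a pyproject.toml with [tool.kedro]."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        pyproject = candidate / "pyproject.toml"
        # A plain substring check keeps the sketch free of TOML parsing.
        if pyproject.is_file() and "[tool.kedro]" in pyproject.read_text(encoding="utf-8"):
            return candidate
    return None
```

Once the root is known, source_dir and any additional module locations could be read from that single file, so settings.py and pipeline_registry.py would no longer need fixed positions.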

@noklam noklam removed the Community Issue/PR opened by the open-source community label Jul 16, 2024