Spike: Investigate options to pull a pipeline and other files into a project #2758
Yes, let's go for starters

After researching options in more detail I think that the starter approach is indeed the way to go. Together with the boolean logic to let users select which utilities they want, we can leverage cookiecutter hooks and conditionally add/remove files. There's an opportunity to merge some of the starters and then conditionally pull files (e.g. …).

Concern

I have one concern about the more advanced examples that would come with adding … Similar story for …

Following from this spike we'll have to tackle:
Introduction
As part of our utilities and plugins work for a new project creation flow, we want to be able to dynamically add/edit files to a blank project based on a set of user provided parameters. The goal is to provide users with a working project, customised according to their requirements.
The parameters available to users (collected as part of the project creation CLI flow) will be:
This problem can be split into three parts, which will be discussed separately. The Miro board for this discussion can be found here.
Utilities
Problem/Goals:
Proposed Solution:
Requirements:
To enable utilities in a project, we need to be able to:
These steps should also be available after project creation, so users can add utilities to their project later on.
Proposed Implementation: This can be done using cookiecutter boolean variables, where certain parts of the code base are hidden/shown based on those variables. In this way, the base template actually contains all the code required for each utility, but it is only 'shown' when the utility is selected. With this method we can ensure that users can choose any combination of utilities, by making sure the project works with all utilities enabled. (This is basically the existing blank project template, so it shouldn't be a problem.)
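A minimal sketch of how the conditional file logic could work, using a cookiecutter post-generation hook; the variable and file names here are illustrative, not the actual template's:

```python
# post_gen_project.py -- cookiecutter post-generation hook (illustrative sketch).
# Assumes a boolean variable such as "add_linting" defined in cookiecutter.json;
# cookiecutter renders "{{ cookiecutter.add_linting }}" into this script before
# it runs, and the hook executes inside the freshly generated project directory.
import shutil
from pathlib import Path

# Rendered by cookiecutter into the string "True" or "False".
ADD_LINTING = "{{ cookiecutter.add_linting }}" == "True"

# Files/directories that only make sense when the linting utility is selected.
# These paths are hypothetical examples, not the actual template layout.
LINTING_FILES = [".flake8", "pyproject_lint_overrides.toml"]

if not ADD_LINTING:
    for name in LINTING_FILES:
        path = Path.cwd() / name
        if path.is_dir():
            shutil.rmtree(path)
        elif path.exists():
            path.unlink()
```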
Plugins
Problem/Goals:
Proposed Solution:
Plugins need to be able to:
- Add new configuration files to the project (e.g. `spark.yml`)
- Modify `settings.py` (see the sketch below)
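As an illustration of the `settings.py` edit, this is roughly what the PySpark case looks like; the package name is a placeholder, and `SparkHooks` is assumed to be a hook that reads `spark.yml` and initialises a SparkSession, as in the existing pyspark starter:

```python
# settings.py (fragment) after a plugin/utility enables PySpark support.
from my_project.hooks import SparkHooks  # "my_project" is a placeholder package name

HOOKS = (SparkHooks(),)
```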
At the moment we have the following proposed plugins:
Example Pipeline Code
Problem/Goals:
Proposed Solution:
The goal of example pipeline code is to allow users to add a functioning pipeline to their starting project, based on the parameters they selected. For example, if they selected the Databricks plugin on setup, the project should come with a pipeline that is ready to use with Databricks. Complexity arises when users select more than one plugin (e.g. viz + pyspark, or airflow + databricks).
Using Spaceflights as the base, we want to be able to add/modify the following files:
- `catalog.yml`
- `parameters.yml`
- `example_pipeline/nodes.py`
- `example_pipeline/pipeline.py`
- `companies.csv`
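A rough sketch of what the example pipeline code could look like, condensed into one snippet; the node logic is illustrative, and the real spaceflights starter has fuller preprocessing and model-training steps:

```python
# example_pipeline/nodes.py and example_pipeline/pipeline.py, condensed.
import pandas as pd
from kedro.pipeline import Pipeline, node


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate company records (illustrative preprocessing step)."""
    return companies.drop_duplicates()


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",  # dataset registered in catalog.yml
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
        ]
    )
```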
Proposed Implementation: Requires technical design/research; the specification is discussed further below.
User Journey
The user will go through the project creation CLI flow and be asked:
Using these inputs we can generate a set of parameters, which we can then use to determine how to create the project.
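For example, the collected parameters might take a shape like this; the keys are hypothetical placeholders, not decided by this spike:

```python
# Hypothetical output of the project creation CLI flow.
project_config = {
    "project_name": "my-project",
    "tools": ["lint", "test", "docs"],            # selected utilities
    "example_pipeline": True,                      # add spaceflights example code?
    "plugins": ["kedro-viz", "kedro-databricks"],  # selected plugins
}
```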
Problem
As part of the requirements for Plugins and Example Pipeline Code, we want to be able to:

Plugins:
- Add new files to the project (e.g. `spark.yml`, `hooks.py`, `databricks_run.py`)
- Update `requirements.txt`
- Modify `settings.py`

Example Pipeline Code:
- Add configuration (`catalog.yml`, `parameters.yml`)
- Add pipeline code (`nodes.py`, `pipeline.py`)
- Add example data (`companies.csv`)
- Add `tests/test_pipeline.py` to provide example unit tests
- Update `requirements.txt`
Proposed Implementation
Option 1
Described by Yetu in her comment above (TL;DR: let starters handle example code)
Starters:
- `blank`
- `blank-pyspark`
- `pandas-spaceflights`
- `pandas-spaceflights-viz`
- `pyspark-spaceflights`
- `pyspark-spaceflights-viz`
Plugins:
- `kedro-databricks`
- `kedro-airflow`
Option 2
Plugins:
- `kedro-databricks`
- `kedro-airflow`
- `kedro-viz`
If we find a way to pull in the different parts of a project, we'd seek to maintain the base project as two versions of the spaceflights pipeline:
Then for each of these pipelines we'd support making the modifications required for them to work with: airflow, viz, databricks.
List of possible output projects:
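To make the size of that list concrete, the possible output projects are roughly the cross product of the two base pipelines and every subset of the add-ons; a quick enumeration sketch, with names taken from the lists above:

```python
from itertools import combinations

BASE_PIPELINES = ["pandas-spaceflights", "pyspark-spaceflights"]
ADD_ONS = ["viz", "airflow", "databricks"]

# Every base pipeline combined with every subset of add-ons
# (including the empty subset, i.e. the plain base pipeline).
output_projects = [
    (base, subset)
    for base in BASE_PIPELINES
    for r in range(len(ADD_ONS) + 1)
    for subset in combinations(ADD_ONS, r)
]
print(len(output_projects))  # 2 bases x 2^3 add-on subsets = 16
```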
With this method, the open questions are:
In general, the wider open questions are:
- How do we update `requirements.txt` and `settings.py`, which will be necessary either way?

Where do our current approaches not stack up?
- They do not cover the `catalog.yml` file, the datasets, or updating `settings.py` (required for Kedro-Viz).

Ideas to consider
This task looks at coming up with a way to enable this for users, and you have creative freedom to list approaches and the pros and cons of each. Your solution will probably touch many parts of Kedro, e.g. plugins, starters, etc.
- Extend `kedro micropkg pull` to include `catalog.yml` files, datasets and a workflow for updating `settings.py` (using Rope, like `requirements.txt`).
- … (`src/tests`)
- `kedro-databricks` will have `kedro databricks init`, and it will add `databricks_run.py` and `logging.yml` to a project with PySpark.
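A minimal sketch of what such a `databricks_run.py` entry point could contain, assuming it only needs to bootstrap and run a Kedro session on the cluster; the actual file shipped by `kedro-databricks` may differ:

```python
# databricks_run.py -- illustrative job entry point for running a Kedro
# project on Databricks. Argument handling is deliberately simplified.
import argparse
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", default="databricks")  # hypothetical conf environment
    args = parser.parse_args()

    project_path = Path.cwd()
    bootstrap_project(project_path)  # load pyproject.toml project metadata
    with KedroSession.create(project_path=project_path, env=args.env) as session:
        session.run()


if __name__ == "__main__":
    main()
```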