
Spike: Investigate options to pull a pipeline and other files into a project #2758

Closed · amandakys opened this issue Jul 3, 2023 · 4 comments
Labels: Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

amandakys commented Jul 3, 2023

Introduction

As part of our utilities and plugins work for a new project creation flow, we want to be able to dynamically add or edit files in a blank project based on a set of user-provided parameters. The goal is to provide users with a working project, customised according to their requirements.

The parameters available to users will be as follows (these will be collected as part of the project creation CLI flow):

utilities: 
	testing: true
	linting: true 
	logging: false
	data_structure: false
	documentation: true 

example_code: true

plugins:
	databricks: true
	pyspark: false
	kedro-viz: true
	airflow: false

This problem can be split into three parts, which will be discussed separately. The Miro board for this discussion can be found here

Utilities

Problem/Goals:

  1. Simplify project template
  2. Allow project template to be customised to target the needs of different user groups

Proposed Solution:

  • allow users to add in utilities as and when they need them

Requirements:
To enable utilities in a project, we need to be able to:

  • add dependencies to requirements.txt
  • add configuration files
  • add a directory structure

These steps should also be available after project creation, so users can add utilities to their project later on.

Proposed Implementation: This can be done using Cookiecutter boolean variables, where certain parts of the code base are hidden or shown depending on the values selected. In this way, the base template actually contains all the code required for each utility, but a utility's code is only ‘shown’ when that utility is selected. With this method we can ensure that users can choose any combination of utilities, by making sure the project works with all utilities enabled. (This is basically the existing blank project template, so it shouldn’t be a problem.)
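
As a rough illustration of how this could work, a Cookiecutter post-generation hook can delete the files belonging to any utility the user did not select. This is only a sketch: the flag names and the paths they control are placeholders, not the actual template layout.

```python
# hooks/post_gen_project.py -- sketch of a Cookiecutter post-generation hook.
# Flag names and the paths they control are illustrative placeholders.
import shutil
from pathlib import Path

PROJECT_ROOT = Path.cwd()  # Cookiecutter runs this hook from inside the generated project

# (flag rendered by Cookiecutter as "True"/"False", paths that belong to that utility)
OPTIONAL_PATHS = [
    ("{{ cookiecutter.testing }}", ["tests"]),
    ("{{ cookiecutter.linting }}", [".flake8", ".pre-commit-config.yaml"]),
    ("{{ cookiecutter.documentation }}", ["docs"]),
    ("{{ cookiecutter.data_structure }}", ["data"]),
]


def remove(relative_path: str) -> None:
    path = PROJECT_ROOT / relative_path
    if path.is_dir():
        shutil.rmtree(path)
    elif path.exists():
        path.unlink()


for enabled, paths in OPTIONAL_PATHS:
    if enabled.lower() != "true":  # utility not selected -> strip its files from the project
        for relative_path in paths:
            remove(relative_path)
```

Because the base template ships with everything enabled, a hook like this only ever removes files; it never has to generate new ones.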

Plugins

Problem/Goals:

  1. Allow projects to better support third party integrations
  2. Allow projects to be adapted to support third party integrations after initial creation
  3. Allow project template to be customised to target the needs of different user groups

Proposed Solution:

  • allow users to better integrate their projects with third party tools via first party maintained plugins
  • these plugins should be available for installation after project creation so users can add them to their project at a later stage

Plugins need to be able to:

  • add configuration files (e.g. spark.yml)
  • append/edit existing configuration files
    • add dependencies to requirements.txt
    • add code to settings.py (see the sketch after this list)
  • add example pipeline code (this is discussed in more detail below)
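
To make "add code to settings.py" concrete: a pyspark-style integration, for example, needs a spark.yml in conf/base plus a hook that builds the SparkSession registered in settings.py. A minimal sketch of that end state, with placeholder package and class names:

```python
# src/my_project/hooks.py -- placeholder names; the kind of code a pyspark-oriented
# plugin would need to add to a project, alongside conf/base/spark.yml
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Create a SparkSession from the settings in conf/base/spark.yml."""
        parameters = context.config_loader["spark"]  # assumes a "spark" config pattern is registered
        spark_conf = SparkConf().setAll(parameters.items())
        SparkSession.builder.appName(context.project_path.name).config(conf=spark_conf).getOrCreate()


# src/my_project/settings.py would then gain:
#     from my_project.hooks import SparkHooks
#     HOOKS = (SparkHooks(),)
```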

At the moment we have the following proposed plugins:

  • kedro-viz
  • databricks
  • airflow
  • pyspark (TBD)

Example Pipeline Code

Problem/Goals:

  • standardise example pipelines provided/managed by the team
  • allow project template to be customised to target the needs of different user groups

Proposed Solution:

  • allow users to add an example pipeline into their project that demonstrates the requested features

The goal of example pipeline code is to allow users to add a functioning pipeline to their starting project, based on the parameters they selected. For example, if they selected the databricks plugin on setup, the project should come with a pipeline that is ready to use with Databricks. Complexity arises when users select more than one plugin (e.g. viz + pyspark, or airflow + databricks).

Using Spaceflights as the base:

  • Example pipeline code includes files such as:
    • catalog.yml
    • parameters.yml
    • example_pipeline/nodes.py
    • example_pipeline/pipeline.py
    • dataset files, e.g. companies.csv
    • example tests (unit tests for the pipeline)
  • They represent files needed to create a functioning pipeline.
  • They need to be treated slightly differently, as it might not make sense to store them in PyPI
  • Across the different plugins

Proposed Implementation: Requires technical design/research; the specification is discussed further below

User Journey

The user will go through the project creation CLI flow and be asked:

  1. What Utilities they’d like
  2. What Plugins they’d like
  3. Whether they’d like an example pipeline

Using these inputs we can generate a set of parameters like this, which we can then use to determine how to create the project (a rough sketch of the prompts follows the parameters below):

utilities: 
	testing: true
	linting: true 
	logging: false
	data_structure: false
	documentation: true 

example_code: true

plugins:
	databricks: true
	pyspark: false
	kedro-viz: true
	airflow: false
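
A minimal sketch of how the CLI flow could collect these answers, assuming click-style prompts (the prompt wording, defaults and flag names are illustrative, not the final kedro new design):

```python
# Sketch only: illustrative prompts, not the actual `kedro new` implementation.
import click


def collect_parameters() -> dict:
    utilities = {
        name: click.confirm(f"Do you want {label}?", default=True)
        for name, label in [
            ("testing", "a test setup"),
            ("linting", "linting"),
            ("logging", "custom logging"),
            ("data_structure", "the data directory structure"),
            ("documentation", "documentation"),
        ]
    }
    plugins = {
        name: click.confirm(f"Do you want the {name} plugin?", default=False)
        for name in ["databricks", "pyspark", "kedro-viz", "airflow"]
    }
    example_code = click.confirm("Do you want an example pipeline?", default=True)
    return {"utilities": utilities, "example_code": example_code, "plugins": plugins}
```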

Problem

As part of the requirements for Plugins and Example Pipeline Code we want to be able to:

Plugins:

  • add configuration files (files needed to make a Kedro project work with a third party, e.g. spark.yml, hooks.py, databricks_run.py)
  • append dependencies to requirements.txt
  • append code to settings.py

Example Pipeline Code:

  • add pipeline configuration files, e.g. catalog.yml, parameters.yml
  • add pipeline files, e.g. nodes.py, pipeline.py
  • add dataset files, e.g. companies.csv
  • add example tests, e.g. tests/test_pipeline.py, to provide example unit tests
  • append dependencies to requirements.txt

Proposed Implementation

Option 1

Described by Yetu in her comment below (TLDR: let starters handle example code)

Starters:

  • blank
  • blank-pyspark
  • pandas-spaceflights
  • pandas-spaceflights-viz
  • pyspark-spaceflights
  • pyspark-spaceflights-viz

Plugins:

  • kedro-databricks
  • kedro-airflow

Option 2

Plugins:

  • kedro-databricks
  • kedro-airflow
  • kedro-viz

If we find a way to pull in the different parts of a project, we’d seek to maintain the base project as two versions of the spaceflights pipeline:

  1. pipeline written with pandas
  2. pipeline written with pyspark

Then, for each of these pipelines, we would support making the modifications required for them to work with airflow, viz and databricks:

  1. Kedro-Viz should work out of the box with both the pandas and pyspark pipelines with no modifications, i.e. adding Viz to a pandas pipeline is no different from adding it to a pyspark pipeline
  2. Databricks cannot be added to a pandas pipeline

List of possible output projects:

| pandas | pyspark |
| --- | --- |
| spaceflights (pandas) | spaceflights-pyspark |
| | spaceflights-databricks (which builds on pyspark) |
| spaceflights-airflow | spaceflights-airflow-pyspark |
| spaceflights-viz | spaceflights |
| spaceflights-airflow-viz | spaceflights-airflow-viz-pyspark |
| | spaceflights-databricks-viz |

With this method the open questions are:

  1. Where would we store the files needed to be pulled down?
  2. How can we pull them and add them to the project?

In general, the wider open questions are:

  1. How should we handle updating existing files like requirements.txt and settings.py, which will be necessary either way? (A naive sketch follows below.)
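
One naive way to make this concrete (a sketch, not a decision): treat both files as append-only and skip lines that already exist. Anything that has to modify an existing statement, e.g. an existing HOOKS assignment in settings.py, would need something closer to the rope/AST approach the modular pipeline workflow already uses for requirements.txt. All paths, dependency names and settings lines below are illustrative:

```python
# Sketch: idempotent, append-only edits to requirements.txt and settings.py.
from pathlib import Path


def append_missing_lines(file_path: Path, new_lines: list[str]) -> None:
    """Append each line unless an identical line already exists in the file."""
    existing = file_path.read_text().splitlines() if file_path.exists() else []
    missing = [line for line in new_lines if line not in existing]
    if not missing:
        return
    text = file_path.read_text() if file_path.exists() else ""
    if text and not text.endswith("\n"):
        text += "\n"
    file_path.write_text(text + "\n".join(missing) + "\n")


# e.g. what a hypothetical pyspark integration might do inside the project root
project = Path(".")
append_missing_lines(project / "requirements.txt", ["pyspark~=3.4"])
append_missing_lines(
    project / "src" / "my_project" / "settings.py",
    ["from my_project.hooks import SparkHooks", "HOOKS = (SparkHooks(),)"],
)
```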

Where do our current approaches not stack up?

| Feature | Modular pipeline workflow | Starters |
| --- | --- | --- |
| Include nodes.py and pipeline.py | Yes, can download from PyPI | Yes |
| Include pipeline unit tests | Yes, this is default | Yes, this is default |
| Include parameters.yml | Yes, can download from PyPI | Yes |
| Include catalog.yml | Not done | Yes |
| Include companies.csv etc. | Not done | Yes |
| Update requirements.txt | Yes, uses rope but will be affected by the pyproject.toml workstream | Yes |
| Update settings.py | Not done | Yes |
| Add in new files e.g. databricks_run.py etc. | Not done | Yes |
  • Modular pipeline workflow handles some but not all of these requirements. Requirements that are not covered include the catalog.yml file, the datasets and updating settings.py (required for Kedro-Viz).
  • Starters could handle this workflow but then we'd have many starters to maintain; we can reduce the number of starters potentially by using boolean flags in the Cookiecutter starter (e.g. turn off tests, or include or exclude example)

Ideas to consider

This task looks at coming up with a way to enable this for users, and you have creative freedom to list approaches and the pros and cons of each. Your solution will probably touch a lot of Kedro, e.g. plugins, starters, etc.

@amandakys amandakys added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jul 3, 2023
@yetudada yetudada changed the title Spike: Investigate options to pull a pipeline into a project Spike: Investigate options to pull a pipeline and other files into a project Jul 11, 2023

yetudada commented Jul 11, 2023

Idea for a prototype

My idea, which involves using starters and plugins to enable this, would look like:
[screenshot: Screenshot 2023-07-11 at 16 52 00]

And if the user wants to have Databricks and Kedro-Viz as plugins it would look like:
[screenshot: Screenshot 2023-07-11 at 16 52 09]

What assumptions have I made with this design?

  • The starters will use Cookiecutter's boolean variables to control whether or not utilities are included, e.g. data or docs directories and dependencies in requirements.txt. Therefore, we would not need a kedro-utility plugin or things like kedro-test, kedro-lint, etc. because the starters' boolean logic would solve this.
  • The starters will not control whether or not there is example code for spaceflights
  • pyspark and kedro-viz support is added into the starters and not via plugins
  • Plugins would only add files or append to requirements.txt to add dependencies
  • We would not support a blank template with Kedro-Viz e.g. by adding the dependency to requirements.txt

What would this look like in Kedro?

This would mean, that we would have the following starter and plugins:

  • Starters:
    • blank
    • blank-pyspark
    • pandas-spaceflights
    • pandas-spaceflights-viz
    • pyspark-spaceflights
    • pyspark-spaceflights-viz
  • Plugins:
    • kedro-databricks with a CLI command called kedro databricks init (works with blank-pyspark, pyspark-spaceflights and pyspark-spaceflights-viz to add or modify files related to Databricks, i.e. add databricks_run.py and either add logging.yml, if the user did not select to have one, or modify it)
    • kedro-airflow with kedro airflow init (works with blank-pyspark, pyspark-spaceflights and pyspark-spaceflights-viz to add files or add dependencies to requirements.txt); a sketch of how such an init command could be wired up is shown below
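
For illustration, a plugin can expose such an init command by registering a click group under Kedro's kedro.project_commands entry-point group. Everything below (module names, template files, what init copies) is a placeholder sketch rather than a committed design:

```python
# kedro_databricks/plugin.py -- minimal sketch of a plugin-provided `kedro databricks init`.
# The plugin's pyproject.toml would register `commands` in the "kedro.project_commands"
# entry-point group so Kedro picks the command up when run inside a project.
from pathlib import Path

import click

TEMPLATE_DIR = Path(__file__).parent / "templates"  # files shipped with the plugin


@click.group(name="Kedro-Databricks")
def commands():
    """Kedro plugin providing Databricks integration."""


@commands.group()
def databricks():
    """Commands for working with Databricks."""


@databricks.command()
def init():
    """Copy Databricks-specific files (e.g. databricks_run.py) into the current project."""
    target = Path.cwd() / "databricks_run.py"
    target.write_text((TEMPLATE_DIR / "databricks_run.py").read_text())
    click.echo(f"Created {target.name}")
```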

What are the pros of this approach?

  • We do not need to make a plan to add in all pipeline-related code, data and configuration
  • We have used conditional logic before starters were introduced
  • We would maintain fewer starters/project templates (9 vs 6)
  • There's a plan for Kedro-Viz
  • Alloy uses the current pyspark starter to make Kedro projects; so we're not getting rid of it, just renaming it

What are the cons of this approach?

  • We're leaving a lot of logic for selecting utilities in the starters; I don't know if this is a problem

@merelcht merelcht self-assigned this Jul 14, 2023
merelcht commented

Yes, let's go for starters

After researching options in more detail, I think that the starter approach is indeed the way to go. Together with the boolean logic to let users select which utilities they want, we can leverage Cookiecutter hooks to conditionally add/remove files.

There's an opportunity to merge some of the starters and then conditionally pull files (e.g. pyspark and databricks are almost the same), but I'd suggest leaving that for a future iteration of this concept.

Concern

I have one concern about the more advanced examples that would come with adding databricks and airflow support. I created a spaceflights project and added all the necessary databricks files. It then required quite a lot of additional steps to actually get the project running on a cluster (e.g. authorise Git, activate the cluster, etc.). I'm guessing it's similar for deploying a project to Databricks.

Similar story for airflow, but for that project I actually didn't even succeed in running it on Airflow...
Even if we fix the airflow guide, I do feel that both of these workflows are significantly harder to get running compared to pyspark and kedro-viz, and it makes me wonder if they even fit with the project creation flow. I think any examples/code that come with the creation flow should be guaranteed to work. I think it's a fair assumption for users to make that no (or hardly any) additional steps are required when the examples are pulled in, which is true for some things (e.g. data structure, test setup, pyspark) but wouldn't be for these advanced setups.

merelcht commented

Following from this spike we'll have to tackle:

  1. CLI flow for add-ons (not advanced stuff): Add "add-ons" flow to kedro new CLI command #2850
  2. Use cookiecutter hooks to strip-out not selected add-ons from template: Use cookiecutter hooks to strip-out not selected "add-ons" from base template #2837
  3. Create new “base” starter with add-ons: Is this really necessary, or can we just keep the base with the add-ons and not strip the base template (step 4)?
  4. (0.19.0) Strip base template from add-ons: Strip project template  #2756
  5. Create all new spaceflight projects and add them to a repo: Create new spaceflight starters #2838
  6. How do we deal with overlaps in spaceflight projects?
