
Spike: Investigate options to pull a pipeline and other files into a project #2758

Closed · amandakys opened this issue Jul 3, 2023 · 4 comments
Labels: Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

amandakys commented Jul 3, 2023

Introduction

As part of our utilities and plugins work for a new project creation flow, we want to be able to dynamically add or edit files in a blank project based on a set of user-provided parameters. The goal is to provide users with a working project, customised according to their requirements.

The parameters available to users will be as follows (these will be collected as part of the project creation CLI flow):

utilities: 
	testing: true
	linting: true 
	logging: false
	data_structure: false
	documentation: true 

example_code: true

plugins:
	databricks: true
	pyspark: false
	kedro-viz: true
	airflow: false

This problem can be split into three parts, which will be discussed separately. The Miro board for this discussion can be found here

Utilities

Problem/Goals:

  1. Simplify project template
  2. Allow project template to be customised to target the needs of different user groups

Proposed Solution:

  • allow users to add in utilities as and when they need them

Requirements:
To enable utilities in a project, we need to be able to:

  • add dependencies to requirements.txt
  • add configuration files
  • add a directory structure

These steps should also be available after project creation, so users can add utilities to their project later on.

Proposed Implementation: This can be done using Cookiecutter boolean variables, where certain parts of the code base are hidden or shown depending on the values selected. In this way, the base template actually contains all the code required for each utility, but a utility's code is only ‘shown’ when that utility is selected. With this method we can ensure that users can choose any combination of utilities, by making sure the project works with all utilities enabled. (This is basically the existing blank project template, so it shouldn’t be a problem.)
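
As a rough illustration of how this could work, a Cookiecutter post-generation hook can delete the files belonging to any utility the user did not select. This is only a sketch: the flag names and the paths they control are placeholders, not the actual template layout.

```python
# hooks/post_gen_project.py -- sketch of a Cookiecutter post-generation hook.
# Flag names and the paths they control are illustrative placeholders.
import shutil
from pathlib import Path

PROJECT_ROOT = Path.cwd()  # Cookiecutter runs this hook from inside the generated project

# (flag rendered by Cookiecutter as "True"/"False", paths that belong to that utility)
OPTIONAL_PATHS = [
    ("{{ cookiecutter.testing }}", ["tests"]),
    ("{{ cookiecutter.linting }}", [".flake8", ".pre-commit-config.yaml"]),
    ("{{ cookiecutter.documentation }}", ["docs"]),
    ("{{ cookiecutter.data_structure }}", ["data"]),
]


def remove(relative_path: str) -> None:
    path = PROJECT_ROOT / relative_path
    if path.is_dir():
        shutil.rmtree(path)
    elif path.exists():
        path.unlink()


for enabled, paths in OPTIONAL_PATHS:
    if enabled.lower() != "true":  # utility not selected -> strip its files from the project
        for relative_path in paths:
            remove(relative_path)
```

Because the base template ships with everything enabled, a hook like this only ever removes files; it never has to generate new ones.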

Plugins

Problem/Goals:

  1. Allow projects to better support third party integrations
  2. Allow projects to be adapted to support third party integrations after initial creation
  3. Allow project template to be customised to target the needs of different user groups

Proposed Solution:

  • allow users to better integrate their projects with third party tools via first party maintained plugins
  • these plugins should be available for installation after project creation so users can add them to their project at a later stage

Plugins need to be able to:

  • add configuration files (e.g. spark.yml)
  • append/edit existing configuration files
    • add dependencies to requirements.txt
    • add code to settings.py (see the sketch after this list)
  • add example pipeline code (this is discussed in more detail below)
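
To make "add code to settings.py" concrete: a pyspark-style integration, for example, needs a spark.yml in conf/base plus a hook that builds the SparkSession registered in settings.py. A minimal sketch of that end state, with placeholder package and class names:

```python
# src/my_project/hooks.py -- placeholder names; the kind of code a pyspark-oriented
# plugin would need to add to a project, alongside conf/base/spark.yml
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Create a SparkSession from the settings in conf/base/spark.yml."""
        parameters = context.config_loader["spark"]  # assumes a "spark" config pattern is registered
        spark_conf = SparkConf().setAll(parameters.items())
        SparkSession.builder.appName(context.project_path.name).config(conf=spark_conf).getOrCreate()


# src/my_project/settings.py would then gain:
#     from my_project.hooks import SparkHooks
#     HOOKS = (SparkHooks(),)
```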

At the moment we have the following proposed plugins:

  • kedro-viz
  • databricks
  • airflow
  • pyspark (TBD)

Example Pipeline Code

Problem/Goals:

  • standardise example pipelines provided/managed by the team
  • allow project template to be customised to target the needs of different user groups

Proposed Solution:

  • allow users to add an example pipeline into their project that demonstrates the requested features

The goal of example pipeline code is to allow users to add a functioning pipeline to their starting project, based on the parameters they selected. For example, if they selected the databricks plugin on setup, the project should come with a pipeline that is ready to use with Databricks. Complexity arises when users select more than one plugin (e.g. viz + pyspark, or airflow + databricks).

Using Spaceflights as the base:

  • Example pipeline code includes files such as:
    • catalog.yml
    • parameters.yml
    • example_pipeline/nodes.py
    • example_pipeline/pipeline.py
    • dataset files, e.g. companies.csv
    • example tests (unit tests for the pipeline)
  • They represent files needed to create a functioning pipeline.
  • They need to be treated slightly differently, as it might not make sense to store them in PyPI
  • Across the different plugins

Proposed Implementation: Requires technical design/research; the specification is discussed further below

User Journey

The user will go through the project creation CLI flow and be asked:

  1. What Utilities they’d like
  2. What Plugins they’d like
  3. Whether they’d like an example pipeline

Using these inputs we can generate a set of parameters like this, which we can then use to determine how to create the project (a rough sketch of the prompts follows the parameters below):

utilities: 
	testing: true
	linting: true 
	logging: false
	data_structure: false
	documentation: true 

example_code: true

plugins:
	databricks: true
	pyspark: false
	kedro-viz: true
	airflow: false
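
A minimal sketch of how the CLI flow could collect these answers, assuming click-style prompts (the prompt wording, defaults and flag names are illustrative, not the final kedro new design):

```python
# Sketch only: illustrative prompts, not the actual `kedro new` implementation.
import click


def collect_parameters() -> dict:
    utilities = {
        name: click.confirm(f"Do you want {label}?", default=True)
        for name, label in [
            ("testing", "a test setup"),
            ("linting", "linting"),
            ("logging", "custom logging"),
            ("data_structure", "the data directory structure"),
            ("documentation", "documentation"),
        ]
    }
    plugins = {
        name: click.confirm(f"Do you want the {name} plugin?", default=False)
        for name in ["databricks", "pyspark", "kedro-viz", "airflow"]
    }
    example_code = click.confirm("Do you want an example pipeline?", default=True)
    return {"utilities": utilities, "example_code": example_code, "plugins": plugins}
```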

Problem

As part of the requirements for Plugins and Example Pipeline Code we want to be able to:

Plugins:

  • add configuration files (files needed to make a Kedro project work with a third party, e.g. spark.yml, hooks.py, databricks_run.py)
  • append dependencies to requirements.txt
  • append code to settings.py

Example Pipeline Code:

  • add pipeline configuration files, e.g. catalog.yml, parameters.yml
  • add pipeline files, e.g. nodes.py, pipeline.py
  • add dataset files, e.g. companies.csv
  • add example tests, e.g. tests/test_pipeline.py, to provide example unit tests
  • append dependencies to requirements.txt

Proposed Implementation

Option 1

Described by Yetu in her comment below (TLDR: let starters handle example code)

Starters:

  • blank
  • blank-pyspark
  • pandas-spaceflights
  • pandas-spaceflights-viz
  • pyspark-spaceflights
  • pyspark-spaceflights-viz

Plugins:

  • kedro-databricks
  • kedro-airflow

Option 2

Plugins:

  • kedro-databricks
  • kedro-airflow
  • kedro-viz

If we find a way to pull in the different parts of a project, we’d seek to maintain the base project as two versions of the spaceflights pipeline:

  1. pipeline written with pandas
  2. pipeline written with pyspark

Then, for each of these pipelines, we would support making the modifications required for them to work with airflow, viz and databricks:

  1. Kedro-Viz should work out of the box with both the pandas and pyspark pipelines with no modifications, i.e. adding Viz to a pandas pipeline is no different from adding it to a pyspark pipeline
  2. Databricks cannot be added to a pandas pipeline

List of possible output projects:

| pandas | pyspark |
| --- | --- |
| spaceflights (pandas) | spaceflights-pyspark |
| | spaceflights-databricks (which builds on pyspark) |
| spaceflights-airflow | spaceflights-airflow-pyspark |
| spaceflights-viz | spaceflights |
| spaceflights-airflow-viz | spaceflights-airflow-viz-pyspark |
| | spaceflights-databricks-viz |

With this method the open questions are:

  1. Where would we store the files needed to be pulled down?
  2. How can we pull them and add them to the project?

In general, the wider open questions are:

  1. How should we handle updating existing files like requirements.txt and settings.py, which will be necessary either way? (A naive sketch follows below.)
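
One naive way to make this concrete (a sketch, not a decision): treat both files as append-only and skip lines that already exist. Anything that has to modify an existing statement, e.g. an existing HOOKS assignment in settings.py, would need something closer to the rope/AST approach the modular pipeline workflow already uses for requirements.txt. All paths, dependency names and settings lines below are illustrative:

```python
# Sketch: idempotent, append-only edits to requirements.txt and settings.py.
from pathlib import Path


def append_missing_lines(file_path: Path, new_lines: list[str]) -> None:
    """Append each line unless an identical line already exists in the file."""
    existing = file_path.read_text().splitlines() if file_path.exists() else []
    missing = [line for line in new_lines if line not in existing]
    if not missing:
        return
    text = file_path.read_text() if file_path.exists() else ""
    if text and not text.endswith("\n"):
        text += "\n"
    file_path.write_text(text + "\n".join(missing) + "\n")


# e.g. what a hypothetical pyspark integration might do inside the project root
project = Path(".")
append_missing_lines(project / "requirements.txt", ["pyspark~=3.4"])
append_missing_lines(
    project / "src" / "my_project" / "settings.py",
    ["from my_project.hooks import SparkHooks", "HOOKS = (SparkHooks(),)"],
)
```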

Where do our current approaches not stack up?

| Feature | Modular pipeline workflow | Starters |
| --- | --- | --- |
| Include nodes.py and pipeline.py | Yes, can download from PyPI | Yes |
| Include pipeline unit tests | Yes, this is default | Yes, this is default |
| Include parameters.yml | Yes, can download from PyPI | Yes |
| Include catalog.yml | Not done | Yes |
| Include companies.csv etc. | Not done | Yes |
| Update requirements.txt | Yes, uses rope but will be affected by the pyproject.toml workstream | Yes |
| Update settings.py | Not done | Yes |
| Add in new files e.g. databricks_run.py etc. | Not done | Yes |
  • Modular pipeline workflow handles some but not all of these requirements. Requirements that are not covered include the catalog.yml file, the datasets and updating settings.py (required for Kedro-Viz).
  • Starters could handle this workflow but then we'd have many starters to maintain; we can reduce the number of starters potentially by using boolean flags in the Cookiecutter starter (e.g. turn off tests, or include or exclude example)

Ideas to consider

This task looks at coming up with a way to enable this for users, and you have creative freedom to list approaches and the pros and cons of each. Your solution will probably touch a lot of Kedro, e.g. plugins, starters, etc.

@amandakys amandakys added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jul 3, 2023
@yetudada yetudada changed the title Spike: Investigate options to pull a pipeline into a project Spike: Investigate options to pull a pipeline and other files into a project Jul 11, 2023

yetudada commented Jul 11, 2023

Idea for a prototype

My idea, which involves using starters and plugins to enable this, would look like:
[screenshot: Screenshot 2023-07-11 at 16 52 00]

And if the user wants to have Databricks and Kedro-Viz as plugins it would look like:
[screenshot: Screenshot 2023-07-11 at 16 52 09]

What assumptions have I made with this design?

  • The starters will use Cookiecutter's boolean variables to control whether or not utilities are included, e.g. data or docs directories and dependencies in requirements.txt. Therefore, we would not need a kedro-utility plugin or things like kedro-test, kedro-lint, etc. because the starters' boolean logic would solve this.
  • The starters will not control whether or not there is example code for spaceflights
  • pyspark and kedro-viz support is added into the starters and not via plugins
  • Plugins would only add files or append to requirements.txt to add dependencies
  • We would not support a blank template with Kedro-Viz e.g. by adding the dependency to requirements.txt

What would this look like in Kedro?

This would mean, that we would have the following starter and plugins:

  • Starters:
    • blank
    • blank-pyspark
    • pandas-spaceflights
    • pandas-spaceflights-viz
    • pyspark-spaceflights
    • pyspark-spaceflights-viz
  • Plugins:
    • kedro-databricks with a CLI command called kedro databricks init (works with blank-pyspark, pyspark-spaceflights and pyspark-spaceflights-viz to add or modify files related to Databricks, i.e. add databricks_run.py and either add logging.yml, if the user did not select to have one, or modify it)
    • kedro-airflow with kedro airflow init (works with blank-pyspark, pyspark-spaceflights and pyspark-spaceflights-viz to add files or add dependencies to requirements.txt); a sketch of how such an init command could be wired up is shown below
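
For illustration, a plugin can expose such an init command by registering a click group under Kedro's kedro.project_commands entry-point group. Everything below (module names, template files, what init copies) is a placeholder sketch rather than a committed design:

```python
# kedro_databricks/plugin.py -- minimal sketch of a plugin-provided `kedro databricks init`.
# The plugin's pyproject.toml would register `commands` in the "kedro.project_commands"
# entry-point group so Kedro picks the command up when run inside a project.
from pathlib import Path

import click

TEMPLATE_DIR = Path(__file__).parent / "templates"  # files shipped with the plugin


@click.group(name="Kedro-Databricks")
def commands():
    """Kedro plugin providing Databricks integration."""


@commands.group()
def databricks():
    """Commands for working with Databricks."""


@databricks.command()
def init():
    """Copy Databricks-specific files (e.g. databricks_run.py) into the current project."""
    target = Path.cwd() / "databricks_run.py"
    target.write_text((TEMPLATE_DIR / "databricks_run.py").read_text())
    click.echo(f"Created {target.name}")
```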

What are the pros of this approach?

  • We do not need to make a plan to add in all pipeline-related code, data and configuration
  • We have used conditional logic before starters were introduced
  • We would maintain fewer starters/project templates (9 vs 6)
  • There's a plan for Kedro-Viz
  • Alloy uses the current pyspark starter to make Kedro projects; so we're not getting rid of it, just renaming it

What are the cons of this approach?

  • We're leaving a lot of logic for selecting utilities in the starters; I don't know if this is a problem

@merelcht merelcht self-assigned this Jul 14, 2023
merelcht commented

Yes, let's go for starters

After researching options in more detail, I think that the starter approach is indeed the way to go. Together with the boolean logic to let users select which utilities they want, we can leverage Cookiecutter hooks to conditionally add/remove files.

There's an opportunity to merge some of the starters and then conditionally pull files (e.g. pyspark and databricks are almost the same), but I'd suggest leaving that for a future iteration of this concept.

Concern

I have one concern about the more advanced examples that would come with adding databricks and airflow support. I created a spaceflights project and added all the necessary databricks files. It then required quite a lot of additional steps to actually get the project running on a cluster (e.g. authorise Git, activate the cluster, etc.). I'm guessing it's similar for deploying a project to Databricks.

Similar story for airflow, but for that project I actually didn't even succeed in running it on Airflow...
Even if we fix the airflow guide, I do feel that both of these workflows are significantly harder to get running compared to pyspark and kedro-viz, and it makes me wonder if they even fit with the project creation flow. I think any examples/code that come with the creation flow should be guaranteed to work. I think it's a fair assumption for users to make that no (or hardly any) additional steps are required when the examples are pulled in, which is true for some things (e.g. data structure, test setup, pyspark) but wouldn't be for these advanced setups.

merelcht commented

Following from this spike we'll have to tackle:

  1. CLI flow for add-ons (not advanced stuff): Add "add-ons" flow to kedro new CLI command #2850
  2. Use cookiecutter hooks to strip-out not selected add-ons from template: Use cookiecutter hooks to strip-out not selected "add-ons" from base template #2837
  3. Create new “base” starter with add-ons: Is this really necessary, or can we just keep the base with the add-ons and not strip the base template (step 4)?
  4. (0.19.0) Strip base template from add-ons: Strip project template  #2756
  5. Create all new spaceflight projects and add them to a repo: Create new spaceflight starters #2838
  6. How do we deal with overlaps in spaceflight projects?
