Introducing Utility Modules to Kedro #2388

Closed
amandakys opened this issue Mar 3, 2023 · 8 comments
Labels
Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation Type: Parent Issue

Comments

@amandakys

amandakys commented Mar 3, 2023

Introduction

The desire to simplify Kedro to make it less intimidating for new users conflicts with Kedro’s goal of providing an opinionated, structured application of SWE best practices to data science code. This conflict underlies the arguments over whether users should have to opt in to (simplify) or opt out of (opinionated) ‘non-essential’ functionality.

Kedro’s user base can be broadly split into 2 groups:

  1. Beginner User
  2. Expert User

Both of these groups need to be targeted to drive adoption.

We do not want to degrade the experience of our expert users, who know and love Kedro, for the sake of driving beginner/new user adoption. However, the needs of the two groups are often contradictory. In maintaining the status quo, we do not move towards addressing our beginner user problem. Therefore, our goal should be to find solutions that improve the beginner user experience without significantly degrading the expert user experience.

Background

The discussion of whether users should opt in to or opt out of features grew out of the discussion about removing support for a project-side logging.yml [#2281], and continued in discussions of how to simplify the default project template [#2149] by removing ‘non-essential’ directories.

The status quo is opt-out. The key unknown for opt-in is how to streamline the opt-in journey. Our default approach is to put the information in the docs. This raises the question of feature discoverability. How will new users know these features exist if they aren’t indicated by the project template?

[Image: generalised opt-in and opt-out flows]

I generalised the opt-in and opt-out flows as best I could; they will be a little different for each functionality.

Related tickets

Concept 1: Simplifying project template by removing ‘unused’ or ‘unnecessary’ features

Concept 2: Improve starter journey to increase accessibility of Kedro

Design

Step 1: Revised Kedro initialisation journey

  • integrate starter selection into kedro new
    • Choose a starter from this list
    • by default none of the starters will contain the removed ‘utilities’
  • allow the configuration of utility options
    • e.g. Do you want linting? Y/N, Do you want testing? Y/N
  • For those who want to skip ‘wizard creation’, a shortcut could be used:
    • kedro new --starter=blank --add-lint --add-test --local-data --add-logs [project-name]
    • kedro new --modules=all
    • kedro new --modules=lint,test,data,logs
(my-virtual-environment) ➜  kedro new 

Project Template
=============
Choose a project template for your new project. 
- astro-airflow-iris: An Iris dataset example project with a minimal setup for deploying the pipeline on Airflow with Astronomer
- blank: A minimal project template
- pandas-iris: An Iris dataset example using Pandas
- pyspark: Configuration and initialisation for a PySpark pipeline
- pyspark-iris: An Iris dataset example using PySpark 
- spaceflights: Spaceflights tutorial example code
- standalone-datacatalog: A minimum setup to use Kedro's DataCatalog

 [Select your template]: blank

Project Utilities
===========
Here you can select which project utilities you'd like to include. 
Don't worry if you change your mind, you can always add/remove these modules later.
To read more about these utilities and what they do visit: kedro.org/ 

Would you like to include linting? [y/n]: 
Would you like to include testing? [y/n]:
Will you be storing data locally? [y/n]:
Would you like custom logging functionality? [y/n]: 

Project Name
============
Please enter a human readable name for your new project.
Spaces, hyphens, and underscores are allowed.

 [New Kedro Project]: My ML pipeline 

The project name 'My ML pipeline' has been applied to: 
- The project title in /Users/yetunde_dada/PycharmProjects/kedro-cli-redesign/my-ml-pipeline/README.md 
- The folder created for your project in /Users/yetunde_dada/PycharmProjects/kedro-cli-redesign/my-ml-pipeline 
- The project's python package in /Users/yetunde_dada/PycharmProjects/kedro-cli-redesign/my-ml-pipeline/src/my_ml_pipeline

A best-practice setup includes initialising git and creating a virtual environment before running 'pip install -r src/requirements.txt' to install project-specific dependencies. Refer to the Kedro documentation: https://kedro.readthedocs.io/

Change directory to the project generated in /Users/yetunde_dada/PycharmProjects/kedro-cli-redesign/my-ml-pipeline by entering 'cd /Users/yetunde_dada/PycharmProjects/kedro-cli-redesign/my-ml-pipeline'
  • at this point of the flow, we have a lot of options for what the default behaviour will be
  • kedro new could default to all modules or no modules, the module selection journey could be compulsory, and we could add convenience options for expert users (although, since project creation is not a frequent action, I don’t think a few extra steps would be terrible)
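
To make the shortcut idea concrete, here is a minimal sketch of how the `--starter` and `--modules` flags could be parsed. It uses `argparse` purely for illustration (it is not Kedro's actual CLI code), and the module names in `KNOWN_MODULES` are assumptions taken from the flags proposed above.

```python
import argparse
from typing import List

# Hypothetical module names, taken from the flags proposed in this issue.
KNOWN_MODULES = ("lint", "test", "data", "logs")


def parse_modules(value: str) -> List[str]:
    """Turn 'all' or a comma-separated string into a validated list of names."""
    if value == "all":
        return list(KNOWN_MODULES)
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in KNOWN_MODULES]
    if unknown:
        raise argparse.ArgumentTypeError(f"unknown modules: {', '.join(unknown)}")
    return names


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser for the proposed 'kedro new' options."""
    parser = argparse.ArgumentParser(prog="kedro new")
    parser.add_argument("--starter", default=None)
    parser.add_argument("--modules", type=parse_modules, default=None)
    parser.add_argument("project_name", nargs="?", default=None)
    return parser


# Simulates: kedro new --starter=blank --modules=lint,test my-project
args = build_parser().parse_args(["--starter=blank", "--modules=lint,test", "my-project"])
print(args.starter, args.modules, args.project_name)
```

Leaving `--modules` as `None` when it is not supplied lets the caller tell "no answer yet, run the wizard" apart from an explicit empty choice.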

Step 2: Kedro Utility Modules

Testing, linting, logging and data structure are the initial modules; the list can grow over time and gives us a way to add in new features.

  • What is a module?
  • They can be a file structure (e.g. the data folder), or a set of files + file structure + configuration settings (e.g. logging, testing, linting). Primarily, their goal is to be easily inserted into and removed from a repo. To achieve this, they should be self-sufficient and independent components.
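
One way to picture a self-sufficient module is as a small manifest of the files it owns. The sketch below is only an illustration under that assumption: the `Utility` class, its method names and the file contents are all hypothetical, not an existing Kedro API.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict


@dataclass(frozen=True)
class Utility:
    """A bundle of files that can be inserted into or removed from a project."""

    name: str
    files: Dict[str, str] = field(default_factory=dict)  # relative path -> content

    def install(self, project_root: Path) -> None:
        """Write every file in the manifest, creating parent folders as needed."""
        for relative_path, content in self.files.items():
            target = project_root / relative_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content, encoding="utf-8")

    def remove(self, project_root: Path) -> None:
        """Delete the files in the manifest; folders are left in place."""
        for relative_path in self.files:
            (project_root / relative_path).unlink(missing_ok=True)


# Hypothetical content; a real logging utility would ship Kedro's own files.
logging_utility = Utility(
    "logs", {"conf/base/logging.yml": "version: 1\n", "logs/.gitkeep": ""}
)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    logging_utility.install(root)
    installed = (root / "conf/base/logging.yml").exists()
    logging_utility.remove(root)
    removed = not (root / "conf/base/logging.yml").exists()
```

Because each utility only touches paths listed in its own manifest, insertion and removal stay independent of the other utilities.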

Step 3: Simplify Module insertion journey after initialisation

  • an initial implementation should focus on allowing users to select and ‘plug in’ modules on initialisation.

  • the later touchpoint, where users realise they want a module and plug it into an ‘unclean’ project repo, will be more complex as we will have to deal with more unknowns and code clashes.

  • at this point a try-catch approach could be a utility command that ‘tries’ to insert logging, given certain prerequisites that the command or the user is responsible for checking.

    • You already have a logging.yml file. Are you sure you want to overwrite it?
    • If it fails, users should be directed to a step-by-step walkthrough of the steps the utility command tries to perform, e.g. an article: How to add logging manually.
    kedro add --modules=logs
    You already have a logging.yml file. Are you sure you want to overwrite it? [y/n]:
    
    .
    .
    .
    
    We were unable to automatically add logging. For step-by-step instructions visit: xxx.com 
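
The ‘try’ behaviour described in this step could look roughly like the sketch below: check for clashes first, and only write files when there are none or the user has confirmed the overwrite. The function names and the file content are hypothetical; this is not how `kedro add` is implemented, since that command is only a proposal here.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def find_conflicts(project_root: Path, files: Dict[str, str]) -> List[str]:
    """Return the relative paths that already exist in the project."""
    return sorted(path for path in files if (project_root / path).exists())


def try_add(project_root: Path, files: Dict[str, str], overwrite: bool = False) -> List[str]:
    """Write the files unless that would clobber existing ones.

    Returns the conflicting paths when nothing was written, or an empty
    list when the files were written.
    """
    conflicts = find_conflicts(project_root, files)
    if conflicts and not overwrite:
        return conflicts
    for relative_path, content in files.items():
        target = project_root / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
    return []


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    files = {"conf/base/logging.yml": "version: 1\n"}  # hypothetical content
    first = try_add(root, files)                   # nothing exists yet, so it is written
    second = try_add(root, files)                  # logging.yml now exists, so it is reported
    forced = try_add(root, files, overwrite=True)  # user answered "y" to the prompt
```

A non-empty return value is the point at which the CLI would ask the overwrite question or fall back to the manual walkthrough.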
    

Step 4: Simplify Module deletion journey after initialisation

  • if we know what we supply by default, we can know whether users modify those files.
  • With some regularity, our CLI can prompt the users to delete modules they aren’t using.
We've noticed you are still using our template testing files. 
To learn more about testing with Kedro visit: kedro.org/testing
To learn more about using alternative testing libraries with Kedro visit: xxx.com 
If you no longer need the testing module run `kedro remove-testing`
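
Knowing whether users have modified the files we supply could be done by comparing content hashes against the template. The following is a minimal sketch under that assumption; the file paths and contents are made up for the example, and the helper names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def digest(text: str) -> str:
    """SHA-256 of a file's text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def unmodified_files(project_root: Path, template_digests: Dict[str, str]) -> List[str]:
    """Return the template files that still match the content we shipped."""
    untouched = []
    for relative_path, expected in sorted(template_digests.items()):
        path = project_root / relative_path
        if path.exists() and digest(path.read_text(encoding="utf-8")) == expected:
            untouched.append(relative_path)
    return untouched


# Hypothetical template content for a testing utility.
template = {
    "src/tests/__init__.py": "",
    "src/tests/test_run.py": "def test_placeholder():\n    assert True\n",
}
digests = {path: digest(content) for path, content in template.items()}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for relative_path, content in template.items():
        (root / relative_path).parent.mkdir(parents=True, exist_ok=True)
        (root / relative_path).write_text(content, encoding="utf-8")
    # The user replaces the placeholder test with a real one.
    (root / "src/tests/test_run.py").write_text(
        "def test_real():\n    assert 1 + 1 == 2\n", encoding="utf-8"
    )
    still_template = unmodified_files(root, digests)
```

If every file of a utility is still untouched after some time, that would be the cue for the CLI prompt shown above.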

Evaluative Questions & Thoughts

  • What modules should be included in Kedro by default is an open discussion
    • if telemetry could tell us what were the most commonly enabled/disabled modules this would be valuable information.
  • A simple starting journey could mean that users are comfortable starting again, maybe creating a new starter project with the utilities included then plugging in code they’d written from their simpler ‘test project’.
  • A detailed look at what we consider starters, when we should provide starters and how they differ from each other
    • this feeds into Advanced Starters, but should be considered separately to the idea of modularity.

Alternative Options

  • update/add readme files in directories, e.g. \notebooks or \logs, to direct users to the relevant part of the docs

Rollout strategy

  • phased rollout: in the first phase, we should not support module insertion/deletion beyond the initialisation step
    • if modules perform well, we can look into the post-initialisation insertion/deletion journey

Planned Research Activities

  1. Team design session aligning Kedro’s priorities and goals
  2. Technical design session to evaluate technical feasibility of potential solutions
@yetudada
Contributor

yetudada commented Mar 6, 2023

@amandakys you're killing it on this issue! 🎉 Let me provide supporting evidence and comments.

Things that we have evidence for

  • We must support beginner users better because this group is significantly larger than the expert user group; internally, we have 19 MLEs vs 427 DS. This skew is likely representative of the external industry too.
  • Current data shows that users do not necessarily leverage best practices, even when they are visible in their project template and CLI (Evaluating CLI command usage #1293); while this data is imperfect, it is the only data that we have.
  • It appears our beginner users will not leverage tools because they don't know how to; look at the oversubscription of "Software Engineering for Data Scientists" - 60 marked for attendance, 69 on the waitlist.

Broadly, I agree with a journey that makes it easy and discoverable for our users to opt-in to additional functionality because it's in line with our principles of "growing beginners into experts".

Comments and questions on the prototype

  • I agree with the direction of the prototype, with a focus on steps 1 and 2 before building out steps 3 and 4
  • Telemetry at project creation will be challenging, but I would like to see that this is explored
  • We might also need an option for documentation
  • I think we might need to call these things "utilities" instead of "modules" because we already have concepts like "modular pipelines"
  • Question: Do we have data to suggest splitting the questions on linting, testing and documentation? Could they be grouped so that users have fewer questions?

Comments on Step 1

  • Can there be consistency on the CLI flags? E.g. could kedro new --starter=blank --add-lint --add-test --local-data --add-logs [project-name] just become kedro new --starter=blank --add-utilities=lint,test,data,docs,logs --name=project-name?
  • I also really like the addition of allowing users to add in a project-name via the flags

@yetudada yetudada added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Mar 6, 2023
@merelcht
Member

merelcht commented Mar 6, 2023

This is fantastic! Thanks for scoping out this work @amandakys

I have one minor comment around testing: kedro test was always a tiny wrapper around pytest so if we add any testing structure I would prefer if we just set it up as pure pytest. kedro lint is a different story because that combined several tools and could potentially still be a useful command.

@noklam
Contributor

noklam commented Mar 6, 2023

From the backward compatibility perspective, do these utilities need to be registered in some settings file, or do they work just by recognising a certain file structure?

@amandakys
Author

amandakys commented Mar 14, 2023

I've done some additional work based on feedback.

I agree with @yetudada that Utilities is a less ambiguous name than Modules.

Step 1: Revised Initialisation Journey

The proposed new commands would include:
kedro new --template=blank which would be a direct rename of kedro new --starter (see #2422)
kedro new --project-name=my_ML_project allowing users to specify a project name inline

If the options --template and --project-name are not supplied, the CLI will go through the relevant creation wizard.

Project Name
============
Please enter a human readable name for your new project.
Spaces, hyphens, and underscores are allowed.

 [New Kedro Project]: My ML pipeline 
Project Template
=============
Choose a project template for your new project. 
- astro-airflow-iris: An Iris dataset example project with a minimal setup for deploying the pipeline on Airflow with Astronomer
- blank: A minimal project template
- pandas-iris: An Iris dataset example using Pandas
- pyspark: Configuration and initialisation for a PySpark pipeline
- pyspark-iris: An Iris dataset example using PySpark 
- spaceflights: Spaceflights tutorial example code
- standalone-datacatalog: A minimum setup to use Kedro's DataCatalog

[Select your template]: blank

Step 2: Introduce Utilities

Proposed new commands include:
kedro new --utilities=test,logs,local_data,lint,docs
kedro new --utilities=all
kedro new --utilities=none

If no --utilities is supplied, the CLI will go through the Utility selection wizard. (This has been changed to reduce the number of questions users need to answer to select modules)

Project Utilities
===========
Here you can select which Kedro utilities you'd like to include. 
Don't worry if you change your mind, you can always add/remove these utilities later.
To read more about these utilities and what they do visit: kedro.org/

Kedro Utilities
1) Linting : some description
2) Testing : xxx
3) Local Data Storage : xxx
4) Custom Logging : xxx
5) Documentation: xxx

Which utilities would you like to include in your project? [1-4/all/1,3]: 
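
Accepting answers like `1-4`, `all` or `1,3` means the wizard needs a small parser for the reply. Below is a minimal sketch of one; the function name is hypothetical, and treating an empty answer or `none` as "no utilities" is an assumption borrowed from the proposed `--utilities=none` flag.

```python
from typing import List

UTILITIES = ["Linting", "Testing", "Local Data Storage", "Custom Logging", "Documentation"]


def parse_selection(answer: str, count: int = len(UTILITIES)) -> List[int]:
    """Parse 'all', 'none', '1,3' or '1-4' into sorted 1-based indices."""
    answer = answer.strip().lower()
    if answer == "all":
        return list(range(1, count + 1))
    if answer in ("", "none"):  # assumption: empty input means no utilities
        return []
    chosen = set()
    for part in answer.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(piece) for piece in part.split("-", 1))
        else:
            start = end = int(part)
        if not 1 <= start <= end <= count:
            raise ValueError(f"selection out of range: {part!r}")
        chosen.update(range(start, end + 1))
    return sorted(chosen)


selected = [UTILITIES[index - 1] for index in parse_selection("1,3")]
print(selected)
```

Invalid input raises `ValueError`, which the wizard could catch to repeat the question rather than abort project creation.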

This would then be distinct from the command --add-utilities relevant in Step 3

Telemetry at Project Creation

Yetu mentions that telemetry consent may only be asked for and granted after the project creation step, so we might struggle to collect data about what parameters are supplied to the kedro new command. But if we know what files/folders/imports each utility adds, we can find something unique in each utility to look for after project creation is complete, to verify which utilities users chose to include.
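
As a rough sketch of the "look for something unique" idea: map each utility to one marker path and check which markers exist once the project has been created. The marker paths below are assumptions for illustration only; the real markers would depend on what each utility ends up shipping.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical marker paths, one per utility.
MARKERS: Dict[str, str] = {
    "test": "src/tests",
    "logs": "conf/base/logging.yml",
    "local_data": "data",
    "docs": "docs",
}


def detect_utilities(project_root: Path, markers: Dict[str, str] = MARKERS) -> List[str]:
    """Infer which utilities a project includes from the marker paths present."""
    return sorted(
        name for name, marker in markers.items() if (project_root / marker).exists()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "conf/base").mkdir(parents=True)
    (root / "conf/base/logging.yml").write_text("version: 1\n", encoding="utf-8")
    detected = detect_utilities(root)
```

The same check would also answer the backward compatibility question above, since it recognises utilities by file structure rather than by a settings entry.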

@deepyaman
Member

deepyaman commented Mar 27, 2023

Copying a couple comments from Slack to have them in the same place...

Randomly reading the Getting Started docs for React, I like this goal/phrasing:
“[Kedro] has been designed from the start for gradual adoption, and you can use as little or as much [Kedro] as you need.”

Fits in with/adds fuel to the whole opt-in/opt-out discussion, as well as recent requests to be able to use pieces of Kedro.

Originally posted in https://kedro-org.slack.com/archives/C03QP0NH2J2/p1678806542848929

I assume the "minimal" path to using Kedro right now is the standalone data catalog (https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_as_a_data_registry.html), but it's not the starting point for new users the way things are written--the starting point is very much the whole thing with pipelines and I/O with all the CLI.

To be honest, even this isn't very minimal, since it requires the CLI + starter + folder structure; could imagine the more minimal case would be just pip install kedro , start a notebook, and create a data catalog programmatically and use it in functions. But then we get back into the loop of, is this showing Kedro's value proposition. 😂

Originally posted in https://kedro-org.slack.com/archives/C03QP0NH2J2/p1678817179219289?thread_ts=1678806542.848929&cid=C03QP0NH2J2

@amandakys
Author

amandakys commented Apr 17, 2023

Following our discussion in tech design the next steps for this work are (in order):

@yetudada
Contributor

yetudada commented Sep 4, 2023

I'm going to close this ticket in favour of #2506

Projects
Archived in project
Status: Shipped 🚀

5 participants