
Pipeline component for papermill? #497

Closed

jlewi opened this issue Dec 7, 2018 · 19 comments · Fixed by #4578

Comments

@jlewi
Contributor

jlewi commented Dec 7, 2018

Would it be useful to have a pipeline component that could run a notebook using papermill?
https://github.com/nteract/papermill

papermill is a system from Netflix for executing notebooks.
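
For context, a minimal sketch of executing a parameterized notebook with papermill's Python API (the paths and parameter names are illustrative):

```python
import papermill as pm

# Execute input.ipynb, injecting the given parameters, and write the
# executed notebook (including cell outputs) to output.ipynb.
pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'alpha': 0.6, 'msg': 'hello'},
)
```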

@jlewi
Contributor Author

jlewi commented Dec 7, 2018

/cc @aronchick

@gaoning777
Contributor

gaoning777 commented Dec 7, 2018

We already have a notebook sample test that uses papermill to parameterize the notebooks.

papermill --prepare-only -p EXPERIMENT_NAME notebook-tfx-test -p OUTPUT_DIR ${RESULTS_GCS_DIR} -p PROJECT_NAME ml-pipeline-test \

I just merged the PR.

@aronchick

I think there's a real opportunity to do some super cool things here - for example -
https://github.com/nteract/papermill

  • Make the default notebook in Jupyter "parameterizable" via configMap (or other) namespace variables
  • Collect all statistics in a PV for sharing (don't we also have this problem with TensorBoard, etc.?)
  • Execute notebooks (particularly non-distributed ones) using native Kubernetes constructs

Am I talking crazy?

@Ark-kun
Contributor

Ark-kun commented Dec 12, 2018

I like the idea of launchers.
We have some code in the form of a command line, a container task, a notebook, or a source-code URL.
We can launch that code using different launchers: directly on a single node, on CMLE, on Kubeflow, etc.

@vicaire
Contributor

vicaire commented Mar 26, 2019

+1

This could be used to generate visualizations based on the execution of a notebook.

@LeonardAukea
Contributor

We are currently using Airflow (utilizing solely the KubernetesPodOperator) to launch jobs with great success, and notebook jobs (Docker + papermill) play a big role. Now that you support pipelines so neatly, I'm actually thinking we should switch to Kubeflow instead.

One of the great wins is that we get to look at the outputs of runs by storing the executed notebooks in an S3 bucket that https://github.com/nteract/commuter is linked to. In data science this is especially useful, since we can look at the actual plots and compare results for debugging. It would be nice if this were improved or even integrated with Kubeflow, e.g. comparing plots/notebooks between runs.

I think it would be neat if, from the DAG visualisation of a pipeline that has run, you could get linked to a rendering of the generated output notebook.

In general, I think we will end up in a similar situation with Kubeflow: multiple components wired towards specific tasks and programming languages. The same goes for specifying profiles for our JupyterHub instance.

I would like to help out on this project.

@aronchick

This is so cool! Is there anything we can do to help here?

@LeonardAukea
Contributor

LeonardAukea commented Sep 13, 2019

This dude seems to be on to something. I think it would make sense to help him out a little: https://kubeflow-kale.github.io/

e.g. provide an official JupyterLab extension for Kubeflow where you could deploy a notebook as a pipeline (papermill). Furthermore, having individual cells as lightweight components seems like a pretty good idea.

What we ended up doing was using shorter notebooks deployed with the Airflow KubernetesPodOperator & papermill. The work he has started would be more practical.

Anyway, it seems like a good alternative to the SDK.

@Ark-kun
Contributor

Ark-kun commented Sep 24, 2019

@LeonardAukea I think it would be easy for me to introduce this feature.
I've already added support for Airflow operators (via create_component_from_airflow_op) and running notebooks is much easier.

However, we first need to solve the following design problem:
How do you specify the inputs and outputs of a notebook? Components correspond to functions or entry points, but a notebook is a script; it does not have natural parameters or return values.

What would be the way to specify inputs and outputs in the notebook?
How should we support running notebooks that were created before this feature?
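
For reference, papermill's existing answer for inputs is the parameters cell: you tag a single cell with `parameters`, and at execution time papermill injects a new cell right after it containing the caller-supplied values, so the tagged cell's assignments act as defaults. A minimal sketch of such a cell (variable names are illustrative):

```python
# This cell is tagged "parameters" in the notebook metadata.
# papermill injects an overriding cell after this one at execution time,
# so these assignments only act as defaults.
alpha = 0.1
output_dir = '/tmp/outputs'
```

Outputs, however, have no equivalent convention, which is the open part of the question.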

@Ark-kun Ark-kun assigned Ark-kun and unassigned hongye-sun Sep 24, 2019
@otaviocv

otaviocv commented Dec 6, 2019

@Ark-kun, I was thinking about this design. The standard parameters for a papermill execution are the input and output notebooks and the parameters dictionary.

How reasonable is it to expect a general output file, such as a JSON or a YAML?

I think it is already expected that other outputs, such as data and models, are stored in external persistence. Maybe the correct step is to make this explicit in the documentation.

In this case it would not be ideal, but it could serve as a base component that the user can refine later.

Also, I think it would be nice to add a flag for the parameters cell, indicating whether or not to accept executing a notebook without an explicit parameters cell.

@Ark-kun
Contributor

Ark-kun commented Dec 11, 2019

@otaviocv Can you write a small mockup? How would the notebook look, and what arguments would you pass to the papermill component?

@otaviocv

Yes, I can.

@otaviocv

@Ark-kun, I have tried to develop a simple mockup but didn't have time to check integrations with external file systems. My mockup is here.

Papermill has native I/O with external file systems such as AWS S3 and GCS (and even HTTP) for notebook reads and writes.

My idea is that the component would have three inputs:

  • input notebook
  • output notebook
  • parameters dictionary (a yaml file)

All these files would be stored in an external file system, or maybe the input notebook and the parameters YAML would come from a raw path in a git repository.

And one output, maybe a JSON or YAML with more complex information, or just a text file with some simple information. This would serve as a general papermill component: to match a particular output you would only modify the component description, without having to change the implementation or the Docker image.
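
A rough sketch of how such a generic component could be invoked from the KFP v1 SDK, assuming papermill's remote I/O handles the storage side; the image, bucket paths, and pipeline name are all hypothetical:

```python
from kfp import dsl


def papermill_op(input_notebook: str, output_notebook: str, parameters_yaml: str):
    # Generic step: papermill reads and writes the notebooks and the parameters
    # file directly from external storage (s3://, gs://, http://) via its native I/O.
    return dsl.ContainerOp(
        name='run-notebook-with-papermill',
        image='python:3.7',  # placeholder; any image with papermill preinstalled works
        command=['sh', '-c'],
        arguments=[
            'pip install -q "papermill[all]" && papermill "$0" "$1" -f "$2"',
            input_notebook, output_notebook, parameters_yaml,
        ],
    )


@dsl.pipeline(name='papermill-mockup')
def notebook_pipeline(
    input_notebook: str = 'gs://my-bucket/input.ipynb',
    output_notebook: str = 'gs://my-bucket/output.ipynb',
    parameters_yaml: str = 'gs://my-bucket/params.yaml',
):
    papermill_op(input_notebook, output_notebook, parameters_yaml)
```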

What do you think?

I would love to tackle this feature if it is possible... ;)

@Ark-kun
Contributor

Ark-kun commented Jan 16, 2020

@otaviocv The mockup component does not seem to have any outputs that could be passed forward. But even if it passed out the completed notebook, would that be useful for any downstream tasks? Would some component be able to consume a completed notebook?

@ca-scribner
Contributor

Did this end up going anywhere? Seems like a nice feature to have

@stale

stale bot commented Sep 18, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Sep 18, 2020
@Ark-kun
Contributor

Ark-kun commented Sep 22, 2020

/freeze

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Sep 22, 2020
Ark-kun added a commit to Ark-kun/pipelines that referenced this issue Oct 2, 2020
@Ark-kun
Contributor

Ark-kun commented Oct 2, 2020

Good news everyone.
I've created the "Run notebook using papermill" component.
You give it a notebook, parameter values (optional) and, optionally, arbitrary input data (like a dataset or a directory with files). The component runs the notebook and outputs the executed notebook plus an optional directory with any additional output data.

Please check the sample pipeline: https://github.com/kubeflow/pipelines/blob/78a33d92dcc50dc059bb340d58c4073112028f23/components/notebooks/samples/sample_pipeline.py
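
A rough sketch of how the component might be consumed from a pipeline; the component.yaml URL and the input names below are guesses on my part, so treat the linked sample_pipeline.py as the authoritative usage:

```python
import kfp
from kfp import components

# Guessed location of the component definition; see the sample pipeline for the real path.
run_notebook_op = components.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/'
    'components/notebooks/Run_notebook_using_papermill/component.yaml'
)


@kfp.dsl.pipeline(name='run-notebook-example')
def run_notebook_pipeline(notebook_uri: str = 'https://example.com/my_notebook.ipynb'):
    # The input names ('notebook', 'parameters') are assumptions based on the
    # description above; in practice the notebook input would typically come
    # from an upstream task's output artifact.
    run_notebook_op(
        notebook=notebook_uri,
        parameters={'param1': 'value 1'},
    )
```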

k8s-ci-robot pushed a commit that referenced this issue Oct 12, 2020
Components - Added the "Run notebook using papermill" component. Fixes #497 (#4578)

* Added a notebook to be used in samples

* Added the sample pipeline
@ca-scribner
Contributor

Tried this out today and it is very handy. Much simpler than the .sh wrapper I've used.

In the past when I ran notebooks as components I had an output for each relevant file (model=some_model_file, params_consumed=params_consumed.yml, etc.). I could then pass these to downstream parts of the pipeline (myStep = create_some_step(model=upstream_step.outputs['model'])). This was useful for putting select outputs into object stores (e.g. put model --> minio), and also nice because I could take data from two different upstream sources and easily pass them both to a consumer notebook run by papermill.

The problem I'm now thinking about is that this notebook runner bends that pattern. By returning the output directory as a whole rather than files, passing pieces of outputs feels harder. Any suggestions on how to adapt? Some thoughts I had:

  • Convert downstream components to accept an input directory instead of individual files. Fine if all the inputs come from the same source (accept /dir, then pull model, data, ... from that dir), but kinda complicated if they come from different sources (now each input needs both a directory and a filename).
  • Make a custom component.yaml for each notebook I want to run (which can then have the specific file outputs instead of a single directory output). Flexible, but kills the nice generic nature of your work.
  • Write a reusable "get file from dir" component that can turn a file inside an upstream output directory into something consumable on its own at the KFP SDK level (a rough sketch follows below). This might be a nice general feature for working with directory outputs in KFP. The downside is that it could make for a really messy pipeline graph.
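
On that last point, a minimal sketch of what such a helper could look like as a function-based component; all names (including the 'output_data' output referenced in the usage comment) are hypothetical:

```python
from kfp.components import InputPath, OutputPath, create_component_from_func


def get_file_from_dir(directory_path: InputPath(), file_path: OutputPath(), relative_path: str):
    """Copy a single file out of an upstream directory output so it can be passed on its own."""
    import os
    import shutil
    shutil.copy(os.path.join(directory_path, relative_path), file_path)


get_file_from_dir_op = create_component_from_func(get_file_from_dir)

# Hypothetical usage against the papermill component's directory output:
# model = get_file_from_dir_op(
#     directory=run_notebook_task.outputs['output_data'],
#     relative_path='model.pkl',
# )
```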

Jeffwan pushed a commit to Jeffwan/pipelines that referenced this issue Dec 9, 2020
Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Jul 15, 2021
Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Jul 17, 2021
Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Jul 17, 2021
Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Aug 1, 2021
Linchin pushed a commit to Linchin/pipelines that referenced this issue Apr 11, 2023