
Pipeline component for papermill? #497

Closed

jlewi opened this issue Dec 7, 2018 · 19 comments · Fixed by #4578

Comments

@jlewi
Contributor

jlewi commented Dec 7, 2018

Would it be useful to have a pipeline component that could run a notebook using papermill?
https://github.com/nteract/papermill

papermill is a system from Netflix for executing notebooks.
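
For context, a minimal sketch of executing a parameterized notebook with papermill's Python API (the paths and parameter names are illustrative):

```python
import papermill as pm

# Execute input.ipynb, injecting the given parameters, and write the
# executed notebook (including cell outputs) to output.ipynb.
pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'alpha': 0.6, 'msg': 'hello'},
)
```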

@jlewi
Contributor Author

jlewi commented Dec 7, 2018

/cc @aronchick

@gaoning777
Contributor

gaoning777 commented Dec 7, 2018

We already have a notebook sample test that uses papermill to parameterize the notebooks.

papermill --prepare-only -p EXPERIMENT_NAME notebook-tfx-test -p OUTPUT_DIR ${RESULTS_GCS_DIR} -p PROJECT_NAME ml-pipeline-test \

I just merged the PR.

@aronchick

I think there's a real opportunity to do some super cool things here - for example -
https://github.com/nteract/papermill

  • Make the default notebook in Jupyter "parameterizable" via configMap (or other) namespace variables
  • Collect all statistics in a PV for sharing (don't we also have this problem with TensorBoard, etc.?)
  • Execute notebooks (particularly non-distributed ones) using native Kubernetes constructs

Am I talking crazy?

@Ark-kun
Contributor

Ark-kun commented Dec 12, 2018

I like the idea of launchers.
We have some code in the form of a command line, a container task, a notebook, or a source-code URL.
We can launch that code using different launchers: directly on a single node, on CMLE, on Kubeflow, etc.

@vicaire
Contributor

vicaire commented Mar 26, 2019

+1

This could be used to generate visualizations based on the execution of a notebook.

@LeonardAukea
Contributor

We are currently using Airflow (utilizing solely the KubernetesPodOperator) to launch jobs with great success, and notebook jobs (Docker + papermill) play a big role. Now that you support pipelines so neatly, I'm actually thinking we should switch to Kubeflow instead.

One of the great wins is that we get to look at the outputs of runs by storing the executed notebooks in an S3 bucket that https://github.com/nteract/commuter is linked to. In data science this is especially useful, since we can look at the actual plots and compare results for debugging. It would be nice if this were improved or even integrated with Kubeflow, e.g. comparing plots/notebooks between runs.

I think it would be neat if, from the DAG visualisation of a pipeline that has run, you could get linked to a rendering of the generated output notebook.

In general, I think we will end up in a similar situation with Kubeflow: multiple components wired towards specific tasks and programming languages. The same goes for specifying profiles for our JupyterHub instance.

I would like to help out on this project.

@aronchick

This is so cool! Is there anything we can do to help here?

@LeonardAukea
Contributor

LeonardAukea commented Sep 13, 2019

This dude seems to be on to something. I think it would make sense to help him out a little: https://kubeflow-kale.github.io/

e.g. provide an official JupyterLab extension for Kubeflow where you could deploy a notebook as a pipeline (papermill). Furthermore, having individual cells as lightweight components seems like a pretty good idea.

What we ended up doing was using shorter notebooks deployed with the Airflow KubernetesPodOperator & papermill. The work he has started would be more practical.

Anyway, it seems like a good alternative to the SDK.

@Ark-kun
Contributor

Ark-kun commented Sep 24, 2019

@LeonardAukea I think it would be easy for me to introduce this feature.
I've already added support for Airflow operators (via create_component_from_airflow_op) and running notebooks is much easier.

However, we first need to solve the following design problem:
How do you specify the inputs and outputs of a notebook? Components correspond to functions or entry points, but a notebook is a script; it does not have natural parameters or return values.

What would be the way to specify inputs and outputs in the notebook?
How should we support running notebooks that were created before this feature?
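
For reference, papermill's existing answer for inputs is the parameters cell: you tag a single cell with `parameters`, and at execution time papermill injects a new cell right after it containing the caller-supplied values, so the tagged cell's assignments act as defaults. A minimal sketch of such a cell (variable names are illustrative):

```python
# This cell is tagged "parameters" in the notebook metadata.
# papermill injects an overriding cell after this one at execution time,
# so these assignments only act as defaults.
alpha = 0.1
output_dir = '/tmp/outputs'
```

Outputs, however, have no equivalent convention, which is the open part of the question.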

@Ark-kun Ark-kun assigned Ark-kun and unassigned hongye-sun Sep 24, 2019
@otaviocv

otaviocv commented Dec 6, 2019

@Ark-kun, I was thinking about this design. The standard parameters for a papermill execution are the input and output notebooks and the parameters dictionary.

How reasonable is it to expect a general output file, such as a JSON or a YAML?

I think it is already expected that other outputs, such as data and models, are stored in external persistence. Maybe the correct step is to make this explicit in the documentation.

In this case it would not be ideal, but it could serve as a base component that the user can refine later.

Also, I think it would be nice to add a flag for the parameters cell, indicating whether or not to accept executing a notebook without an explicit parameters cell.

@Ark-kun
Contributor

Ark-kun commented Dec 11, 2019

@otaviocv Can you write a small mockup? How would the notebook look, and what arguments would you pass to the papermill component?

@otaviocv

Yes, I can.

@otaviocv

@Ark-kun, I have tried to develop a simple mockup but didn't have time to check integrations with external file systems. My mockup is here.

Papermill has native I/O with external file systems such as AWS S3 and GCS (and even HTTP) for notebook reads and writes.

My idea is that the component would have three inputs:

  • input notebook
  • output notebook
  • parameters dictionary (a yaml file)

All these files would be stored in an external file system, or maybe the input notebook and the parameters YAML would come from a raw path in a git repository.

And one output, maybe a JSON or YAML with more complex information, or just a text file with some simple information. This would serve as a general papermill component: to match a particular output you would only modify the component description, without having to change the implementation or the Docker image.
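
A rough sketch of how such a generic component could be invoked from the KFP v1 SDK, assuming papermill's remote I/O handles the storage side; the image, bucket paths, and pipeline name are all hypothetical:

```python
from kfp import dsl


def papermill_op(input_notebook: str, output_notebook: str, parameters_yaml: str):
    # Generic step: papermill reads and writes the notebooks and the parameters
    # file directly from external storage (s3://, gs://, http://) via its native I/O.
    return dsl.ContainerOp(
        name='run-notebook-with-papermill',
        image='python:3.7',  # placeholder; any image with papermill preinstalled works
        command=['sh', '-c'],
        arguments=[
            'pip install -q "papermill[all]" && papermill "$0" "$1" -f "$2"',
            input_notebook, output_notebook, parameters_yaml,
        ],
    )


@dsl.pipeline(name='papermill-mockup')
def notebook_pipeline(
    input_notebook: str = 'gs://my-bucket/input.ipynb',
    output_notebook: str = 'gs://my-bucket/output.ipynb',
    parameters_yaml: str = 'gs://my-bucket/params.yaml',
):
    papermill_op(input_notebook, output_notebook, parameters_yaml)
```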

What do you think?

I would love to tackle this feature if it is possible... ;)

@Ark-kun
Contributor

Ark-kun commented Jan 16, 2020

@otaviocv The mockup component does not seem to have any outputs that could be passed forward. But even if it passed out the completed notebook, would that be useful for any downstream tasks? Would some component be able to consume a completed notebook?

@ca-scribner
Contributor

Did this end up going anywhere? Seems like a nice feature to have

@stale

stale bot commented Sep 18, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Sep 18, 2020
@Ark-kun
Contributor

Ark-kun commented Sep 22, 2020

/freeze

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Sep 22, 2020
Ark-kun added a commit to Ark-kun/pipelines that referenced this issue Oct 2, 2020
@Ark-kun
Contributor

Ark-kun commented Oct 2, 2020

Good news everyone.
I've created the "Run notebook using papermill" component.
You give it a notebook, parameter values (optional) and, optionally, arbitrary input data (like a dataset or a directory with files). The component runs the notebook and outputs the executed notebook plus an optional directory with any additional output data.

Please check the sample pipeline: https://github.com/kubeflow/pipelines/blob/78a33d92dcc50dc059bb340d58c4073112028f23/components/notebooks/samples/sample_pipeline.py
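
A rough sketch of how the component might be consumed from a pipeline; the component.yaml URL and the input names below are guesses on my part, so treat the linked sample_pipeline.py as the authoritative usage:

```python
import kfp
from kfp import components

# Guessed location of the component definition; see the sample pipeline for the real path.
run_notebook_op = components.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/'
    'components/notebooks/Run_notebook_using_papermill/component.yaml'
)


@kfp.dsl.pipeline(name='run-notebook-example')
def run_notebook_pipeline(notebook_uri: str = 'https://example.com/my_notebook.ipynb'):
    # The input names ('notebook', 'parameters') are assumptions based on the
    # description above; in practice the notebook input would typically come
    # from an upstream task's output artifact.
    run_notebook_op(
        notebook=notebook_uri,
        parameters={'param1': 'value 1'},
    )
```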

k8s-ci-robot pushed a commit that referenced this issue Oct 12, 2020
Components - Added the "Run notebook using papermill" component. Fixes #497 (#4578)

* Added a notebook to be used in samples

* Added the sample pipeline
@ca-scribner
Contributor

Tried this out today and it is very handy. Much simpler than the .sh wrapper I've used.

In the past when I ran notebooks as components I had an output for each relevant file (model=some_model_file, params_consumed=params_consumed.yml, etc.). I could then pass these to downstream parts of the pipeline (myStep = create_some_step(model=upstream_step.outputs['model'])). This was useful for putting select outputs into object stores (e.g. put model --> minio), and also nice because I could take data from two different upstream sources and easily pass them both to a consumer notebook run by papermill.

The problem I'm now thinking about is that this notebook runner bends that pattern. By returning the output directory as a whole rather than files, passing pieces of outputs feels harder. Any suggestions on how to adapt? Some thoughts I had:

  • Convert downstream components to accept an input directory instead of individual files. Fine if all the inputs come from the same source (accept /dir, then pull model, data, ... from that dir), but kinda complicated if they come from different sources (now each input needs both a directory and a filename).
  • Make a custom component.yaml for each notebook I want to run (which can then have the specific file outputs instead of a single directory output). Flexible, but kills the nice generic nature of your work.
  • Write a reusable "get file from dir" component that can turn a file inside an upstream output directory into something consumable on its own at the KFP SDK level (a rough sketch follows below). This might be a nice general feature for working with directory outputs in KFP. The downside is that it could make for a really messy pipeline graph.
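
On that last point, a minimal sketch of what such a helper could look like as a function-based component; all names (including the 'output_data' output referenced in the usage comment) are hypothetical:

```python
from kfp.components import InputPath, OutputPath, create_component_from_func


def get_file_from_dir(directory_path: InputPath(), file_path: OutputPath(), relative_path: str):
    """Copy a single file out of an upstream directory output so it can be passed on its own."""
    import os
    import shutil
    shutil.copy(os.path.join(directory_path, relative_path), file_path)


get_file_from_dir_op = create_component_from_func(get_file_from_dir)

# Hypothetical usage against the papermill component's directory output:
# model = get_file_from_dir_op(
#     directory=run_notebook_task.outputs['output_data'],
#     relative_path='model.pkl',
# )
```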

Jeffwan pushed a commit to Jeffwan/pipelines that referenced this issue Dec 9, 2020
Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Jul 15, 2021
Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Jul 17, 2021
Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Jul 17, 2021
Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Aug 1, 2021
Linchin pushed a commit to Linchin/pipelines that referenced this issue Apr 11, 2023