Pipeline component for papermill? #497
/cc @aronchick
We already have a notebook sample test that uses papermill to parameterize the notebooks: pipelines/test/sample-test/run_test.sh, Line 282 in ad1950b.
I just merged the PR.
I think there's a real opportunity to do some super cool things here - for example -
Am I talking crazy?
I like the idea of launchers.
+1. This could be used to generate visualizations based on the execution of a notebook.
We are currently using Airflow (utilizing solely the KubernetesPodOperator). One of the great wins is that we get to look at the outputs of the runs by storing the notebooks in an S3 bucket which https://github.com/nteract/commuter is linked to. In data science this is specifically useful, since we can look at the actual plots as well and compare results for debugging etc. It would be nice if this were improved or even integrated with Kubeflow: comparing plots/notebooks between runs, etc. I think it would be neat if, via the DAG graph visualization of a pipeline that has run, you could get linked to a rendering of the generated output notebook. In general, I think we will end up in a similar situation with Kubeflow: we have multiple components wired towards specific tasks and programming languages, and the same goes for specifying profiles for our JupyterHub instance. I would like to help out on this project.
This is so cool! Is there anything we can do to help here?
This dude seems to be on to something; I think it would make sense to help him out a little: https://kubeflow-kale.github.io/ e.g. provide an official JupyterLab extension for Kubeflow where you could deploy a notebook as a pipeline (papermill). Furthermore, having individual cells as lightweight components seems like a pretty good idea. What we ended up doing was using shorter notebooks deployed with the Airflow KubernetesPodOperator & papermill. The work that he has started would be more practical. Anyway, it seems like a good alternative option to the SDK.
@LeonardAukea I think it would be easy for me to introduce this feature. However, we first need to solve the following design problem: what would be the way to specify inputs and outputs in the notebook?
@Ark-kun, I was thinking about this design. The standard parameters for a papermill execution are the input and output notebooks and the parameters dictionary. How reasonable is it to expect a general output file, such as a JSON or a YAML? I think it is already expected that some other outputs, such as data and models, will be stored in external persistence; maybe the correct step is to make this explicit in the documentation. In this case it will not be ideal, but it could be used as a base component that can be detailed later by the user. Also, I think it would be nice to add a flag for the parameters cell, indicating whether or not to execute a notebook that lacks an explicit parameters cell.
@otaviocv Can you write a small mockup? How does the notebook look, and what arguments do you pass to the papermill component?
Yes, I can.
@Ark-kun, I have tried to develop a simple mockup but didn't have time to check integrations with external file systems. My mockup is here. Papermill has native I/O with external file systems like AWS S3 and GCS for notebook reads and writes, even over HTTP. My idea is that the component would have three inputs:
All these files would be stored in an external file system, or maybe the input notebook and the parameters YAML could come from a raw path to a git repository. And one output: maybe a JSON or YAML with more complex information, or just a text file with some simple information. This would serve as a general papermill component: it could be modified just on the output, without having to change the implementation or the Docker image, only the component description, to match the particular output. What do you think? I would love to tackle this feature if possible... ;)
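The "one general output" idea above can be sketched in plain Python. This is a hypothetical illustration, not the actual component: the file name, keys, and helper names are made up. The notebook's last cell dumps a small JSON file, and the component surfaces that file as its single output for downstream steps.

```python
# Hypothetical sketch: the notebook's final cell writes simple run results
# to a JSON file, which the papermill component would expose as its single
# output artifact. All names and keys here are illustrative.
import json


def write_component_output(output_path: str, results: dict) -> None:
    """What the notebook's last cell would do: dump simple results."""
    with open(output_path, "w") as f:
        json.dump(results, f)


def read_component_output(output_path: str) -> dict:
    """What a downstream component would do: read the results back."""
    with open(output_path) as f:
        return json.load(f)


if __name__ == "__main__":
    import tempfile, os
    path = os.path.join(tempfile.mkdtemp(), "notebook_results.json")
    write_component_output(path, {"accuracy": 0.92,
                                  "model_uri": "s3://bucket/model.pkl"})
    print(read_component_output(path)["accuracy"])
```

The point of the sketch is that richer outputs (data, models) stay in external persistence, while the component itself only passes forward a small, structured summary.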
@otaviocv Looks like the mockup component does not have any outputs that could be passed forward. But even if it passed out the completed notebook, would that be useful for any downstream tasks? Would some component be able to consume a completed notebook?
Did this end up going anywhere? Seems like a nice feature to have |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
/freeze |
Good news everyone. Please check the sample pipeline: https://github.com/kubeflow/pipelines/blob/78a33d92dcc50dc059bb340d58c4073112028f23/components/notebooks/samples/sample_pipeline.py |
Tried this out today and it is very handy; much simpler than the .sh wrapper I've used. In the past, when I ran notebooks as components, I had an output for each relevant file ( The problem I'm now thinking about is that this notebook runner bends that pattern. By returning the output directory as a whole rather than individual files, passing pieces of the outputs forward feels harder. Any suggestions on how to adapt? Some thoughts I had:
Fixes kubeflow#497 (kubeflow#4578)
* Components - Added the Run notebook using papermill component. Fixes kubeflow#497
* Added a notebook to be used in samples
* Added the sample pipeline
Would it be useful to have a pipeline component that could run a notebook using papermill?
https://github.com/nteract/papermill
Papermill is a system from Netflix for executing notebooks.
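For context on what such a component would wrap: papermill takes an input notebook, an output notebook, and `-p name value` parameter overrides. The sketch below only builds the equivalent CLI command; the notebook paths and parameter names are placeholders, and a real component would go on to exec the command.

```python
# Build the papermill CLI invocation a component wrapper might run:
#   papermill <input.ipynb> <output.ipynb> -p name value ...
def papermill_command(input_nb: str, output_nb: str, parameters: dict) -> list:
    """Return the argv list for a parameterized papermill run."""
    cmd = ["papermill", input_nb, output_nb]
    for name, value in parameters.items():
        cmd += ["-p", name, str(value)]
    return cmd


if __name__ == "__main__":
    print(papermill_command("train.ipynb", "out/train-run.ipynb",
                            {"alpha": 0.6, "epochs": 10}))
```

The output notebook path could point at S3/GCS, which is what makes the completed notebooks browsable after the run.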