Workflows: Executing notebooks as a DAG? #468

@matthiasdv

Note: This should be tagged as question / suggestion but I don't think I can do that myself.

I have a use case where I want to enable data scientists to execute the notebooks they create as a DAG. This would be part of their development workflow, to ensure that a set of notebooks works as an integrated pipeline before it is scheduled in a production environment using something like Airflow.

My question: is there currently a way to achieve this using papermill? The README mentions workflows, but papermill engines are the closest thing I can find to a workflow (interpreted as a pipeline of notebooks).

And if there's no such functionality, what would be the best way to integrate this with papermill, bookstore, scrapbook, etc.?

@MSeal, I am particularly interested in your feedback if you have the time.

Some very high-level details about the environment this is situated in:

  • Data scientists author individual Jupyter notebooks
  • When moved to production, these are scheduled as a DAG using Airflow
  • Each Airflow task spawns its own papermill kernel on a remote compute cluster, which is responsible for executing only the notebook described by the task instance.
  • nteract-scrapbook collects logs and provides feedback to the task scheduler.

What I'm working on is a way for the data scientists to formally define their set of notebooks as a DAG and execute it during their development workflow, hopefully producing a better-integrated pipeline of notebooks without requiring the data scientist to work with (or have knowledge of) the task scheduler (Airflow) and compute cluster internals.

This is similar to this component of Metaflow by Netflix:

[Image: dag]
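To make the idea concrete, here is a rough sketch of what such a developer-side runner might look like. It is only an illustration under assumptions, not an existing papermill feature: `run_notebook_dag`, `papermill_execute`, the `PIPELINE` mapping, and the notebook file names are all hypothetical. The only papermill call used is `papermill.execute_notebook(input_path, output_path)`, and the ordering comes from the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter
from pathlib import Path


def papermill_execute(input_path, output_path):
    """Default executor: run one notebook with papermill."""
    # Imported lazily so the ordering logic can be tried without papermill installed.
    import papermill as pm

    # Make sure the folder for the executed copy exists.
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    pm.execute_notebook(input_path, output_path)


def run_notebook_dag(dag, output_dir="runs", execute=papermill_execute):
    """Run every notebook after all of its upstream notebooks.

    `dag` maps a notebook path to the set of notebook paths it depends on.
    Returns the notebooks in the order they were executed.
    """
    executed = []
    # static_order() raises graphlib.CycleError if the mapping is not a DAG.
    for notebook in TopologicalSorter(dag).static_order():
        execute(notebook, str(Path(output_dir) / Path(notebook).name))
        executed.append(notebook)
    return executed


# Hypothetical pipeline; the notebook names are placeholders.
PIPELINE = {
    "extract.ipynb": set(),
    "clean.ipynb": {"extract.ipynb"},
    "features.ipynb": {"clean.ipynb"},
    "train.ipynb": {"features.ipynb"},
    "report.ipynb": {"train.ipynb", "clean.ipynb"},
}
```

Calling `run_notebook_dag(PIPELINE)` would execute the notebooks sequentially in dependency order and stop at the first failure, since papermill raises when a cell errors. The `execute` argument is injectable so that the same DAG definition could be dry-run locally or, in principle, translated into Airflow tasks later.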
