I want to be able to process/query big datasets reproducibly without needing to download to my local machine #308

Open
@petebachant

Description

Concepts

  1. Users can define a distributed processing environment (e.g., Spark, Dask, or Trino), and commands could run in that environment.
  2. We could spin up a large machine for them, clone the project onto it, run a stage or the whole pipeline there, then commit and push so they could pull down the result. This wouldn't be a great interactive workflow, though, and it is probably a requirement that the pipeline can be built up interactively.
