
Support intermediate artifacts #683

@PertuyF

Description


Hi all, thank you so much for developing LineaPy. It looks great and I'm really excited about it!

Is your feature request related to a problem? Please describe.

When I develop a pipeline, I may want to split it into semantic steps that build up my refined dataset table. As an illustration, master_data would be the data loaded and assembled from a relational DB, whereas dataset would be the same table refined with some feature engineering.

Currently, if I try to do this, I save both master_data and dataset as artifacts, then create a pipeline like:

lineapy.to_pipeline(artifacts=[master_data.name, dataset.name], 
                    dependencies={dataset.name: {master_data.name}},
                    framework='AIRFLOW', pipeline_name='my_great_airflow_pipeline', output_dir='airflow')
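
For context, both artifacts come from notebook code along these lines (a minimal sketch reconstructed from the generated code below; the lineapy.save calls are what make master_data and dataset available as artifacts):

import lineapy
import pandas as pd
from sklearn.datasets import load_iris

# Load and assemble the "master" table (stand-in for the relational DB load)
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
    target=[iris.target_names[i] for i in iris.target]
)
iris_agg = df.set_index("target")
master_data = lineapy.save(iris_agg, "master_data")

# Refine the master table with some feature engineering
iris_clean = iris_agg.dropna().assign(test="test")
dataset = lineapy.save(iris_clean, "dataset")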

My issue is that LineaPy then creates a step that builds master_data from scratch, and another that builds dataset from scratch as well, instead of loading master_data as a starting point. Like this:

import pickle


def master_data():

    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
        target=[iris.target_names[i] for i in iris.target]
    )
    iris_agg = df.set_index("target")
    artifact = pickle.dump(
        iris_agg, open("/home/oneai/.lineapy/linea_pickles/10dzyzx", "wb")
    )


def dataset():

    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
        target=[iris.target_names[i] for i in iris.target]
    )
    iris_agg = df.set_index("target")
    iris_clean = iris_agg.dropna().assign(test="test")
    dataset = pickle.dump(
        iris_clean, open("/home/oneai/.lineapy/linea_pickles/5Tk63gO", "wb")
    )

Describe the solution you'd like

Ideally, LineaPy would capture the dependency and generate a dataset step that loads the stored master_data artifact instead of rebuilding it from scratch. Something like:

import pickle


def master_data():

    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
        target=[iris.target_names[i] for i in iris.target]
    )
    iris_agg = df.set_index("target")
    pickle.dump(
        iris_agg, open("/home/oneai/.lineapy/linea_pickles/10dzyzx", "wb")
    )


def dataset():

    import pandas as pd

    iris_agg = pickle.load(
        open("/home/oneai/.lineapy/linea_pickles/10dzyzx", "rb")
    )
    iris_clean = iris_agg.dropna().assign(test="test")
    pickle.dump(
        iris_clean, open("/home/oneai/.lineapy/linea_pickles/5Tk63gO", "wb")
    )

Is it planned to support this behavior?
Am I missing something?
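
For reference, the closest workaround I can think of is to reload the upstream artifact explicitly before refining it, so that the lineage for dataset starts from the stored artifact rather than from the raw load. A sketch, assuming to_pipeline would slice the dataset graph at the get_value() call (I have not verified this):

import lineapy

# ... build iris_agg and save it as before ...
master_data = lineapy.save(iris_agg, "master_data")

# Reload the saved value instead of reusing iris_agg directly, hoping the
# dataset lineage then begins at the stored master_data artifact
iris_agg_reloaded = lineapy.get("master_data").get_value()
iris_clean = iris_agg_reloaded.dropna().assign(test="test")
dataset = lineapy.save(iris_clean, "dataset")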
