Description
Hi all, thank you so much for developing LineaPy; it looks great and I'm really excited about it!
Is your feature request related to a problem? Please describe.
When I develop a pipeline, I may want to break it into semantic steps that build up my refined dataset table. As an illustration, `master_data` would be data loaded and assembled from a relational DB, whereas `dataset` would be the same table refined with some feature engineering.
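For concreteness, here is a minimal sketch of the kind of notebook session I have in mind (the iris toy data is just a stand-in for the real DB load and mirrors the generated code shown further down):

```python
import lineapy
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for data loaded and assembled from a relational DB
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
    target=[iris.target_names[i] for i in iris.target]
)
iris_agg = df.set_index("target")
master_data = lineapy.save(iris_agg, "master_data")

# Feature engineering on top of the master table
iris_clean = iris_agg.dropna().assign(test="test")
dataset = lineapy.save(iris_clean, "dataset")
```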
Currently, having saved both `master_data` and `dataset` as artifacts, I would create a pipeline like:
```python
lineapy.to_pipeline(
    artifacts=[master_data.name, dataset.name],
    dependencies={dataset.name: {master_data.name}},
    framework='AIRFLOW',
    pipeline_name='my_great_airflow_pipeline',
    output_dir='airflow',
)
```
My issue is that LineaPy then generates a step that builds `master_data` from scratch, and also a step that builds `dataset` from scratch, instead of loading `master_data` as a starting point:
```python
import pickle


def master_data():
    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
        target=[iris.target_names[i] for i in iris.target]
    )
    iris_agg = df.set_index("target")
    artifact = pickle.dump(
        iris_agg, open("/home/oneai/.lineapy/linea_pickles/10dzyzx", "wb")
    )


def dataset():
    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
        target=[iris.target_names[i] for i in iris.target]
    )
    iris_agg = df.set_index("target")
    iris_clean = iris_agg.dropna().assign(test="test")
    dataset = pickle.dump(
        iris_clean, open("/home/oneai/.lineapy/linea_pickles/5Tk63gO", "wb")
    )
```
Describe the solution you'd like
Ideally, LineaPy would capture the dependency and generate something like:
```python
import pickle


def master_data():
    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
        target=[iris.target_names[i] for i in iris.target]
    )
    iris_agg = df.set_index("target")
    pickle.dump(
        iris_agg, open("/home/oneai/.lineapy/linea_pickles/10dzyzx", "wb")
    )


def dataset():
    import pandas as pd

    iris_agg = pickle.load(
        open("/home/oneai/.lineapy/linea_pickles/10dzyzx", "rb")
    )
    iris_clean = iris_agg.dropna().assign(test="test")
    dataset = pickle.dump(
        iris_clean, open("/home/oneai/.lineapy/linea_pickles/5Tk63gO", "wb")
    )
```
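One workaround I considered (not sure it is the intended usage) is to re-derive `dataset` from the stored artifact value inside the session, so that its lineage starts from an artifact load rather than from the raw sources. A minimal sketch, assuming `lineapy.get(...).get_value()` behaves the way I expect:

```python
import lineapy

# Hypothetical workaround: rebuild `dataset` on top of the stored
# `master_data` value instead of recomputing it from source
iris_agg = lineapy.get("master_data").get_value()
iris_clean = iris_agg.dropna().assign(test="test")
dataset = lineapy.save(iris_clean, "dataset")
```

But I don't know whether `to_pipeline` would then slice the `dataset` step from the `lineapy.get` call, or still inline the full upstream computation.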
Is support for this behavior planned?
Am I missing something?