-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove cloud specific prefix #37
Conversation
# existing_pipelines = client.list_pipelines(page_size=100).pipelines | ||
# for existing_pipeline in existing_pipelines: | ||
# if existing_pipeline.name == pipeline_name: | ||
# # Delete existing pipeline before uploading | ||
# logger.warning( | ||
# f"Pipeline {pipeline_name} already exists. Deleting old pipeline..." | ||
# ) | ||
# client.delete_pipeline_version(existing_pipeline.default_version.id) | ||
# client.delete_pipeline(existing_pipeline.id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a temporary fix in order to compile the pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @NielsRogge. LGTM.
# add subset prefix to columns | ||
df = df.rename( | ||
columns={ | ||
col: name + "_" + col for col in df.columns if col not in index_fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For joining all the data into 1 dataframe we will need to join on the index which gets filtered out here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The index is not filtered out, this operation only renames the non-index columns
if index is None: | ||
index_df = self._load_index() | ||
ids = index_df["id"].compute() | ||
sources = index_df["source"].compute() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there is no workaround for having to bring those in memory before filtering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to filter a dataframe based on an index, I assume that you need to have the entire index present.
Correct me if I'm wrong @GeorgesLorre @RobbeSneyders
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have to look into Dask in more detail before I can give input on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, left a few open questions that we might want to follow up on
Remove cloud specific prefix
This PR removes the need for a specific prefix, as we can just include that in the base path of the configuration.