Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove cloud specific prefix #37

Merged
merged 9 commits into from
Apr 25, 2023
Merged

Remove cloud specific prefix #37

merged 9 commits into from
Apr 25, 2023

Conversation

NielsRogge
Copy link
Contributor

This PR removes the need for a specific prefix, as we can just include that in the base path of the configuration.

Comment on lines +32 to +40
# existing_pipelines = client.list_pipelines(page_size=100).pipelines
# for existing_pipeline in existing_pipelines:
# if existing_pipeline.name == pipeline_name:
# # Delete existing pipeline before uploading
# logger.warning(
# f"Pipeline {pipeline_name} already exists. Deleting old pipeline..."
# )
# client.delete_pipeline_version(existing_pipeline.default_version.id)
# client.delete_pipeline(existing_pipeline.id)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a temporary fix in order to compile the pipeline.

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NielsRogge. LGTM.

fondant/dataset.py Show resolved Hide resolved
# add subset prefix to columns
df = df.rename(
columns={
col: name + "_" + col for col in df.columns if col not in index_fields
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For joining all the data into 1 dataframe we will need to join on the index which gets filtered out here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index is not filtered out, this operation only renames the non-index columns

if index is None:
index_df = self._load_index()
ids = index_df["id"].compute()
sources = index_df["source"].compute()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there is no workaround for having to bring those in memory before filtering?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to filter a dataframe based on an index, I assume that you need to have the entire index present.

Correct me if I'm wrong @GeorgesLorre @RobbeSneyders

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have to look into Dask in more detail before I can give input on this.

Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left a few open questions that we might want to follow up on

@NielsRogge NielsRogge merged commit 474e1ce into main Apr 25, 2023
@RobbeSneyders RobbeSneyders deleted the fix_prefix branch May 4, 2023 07:34
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
Remove cloud specific prefix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants