
Add Dataset and Component Class #33

Merged · 53 commits into main · Apr 24, 2023
Conversation

@NielsRogge (Contributor) commented Apr 20, 2023

This PR adds the `FondantDataset` wrapper class around the `Manifest`, which loads data as Dask dataframes and allows uploading Dask dataframes back to the cloud.

To test everything, the PR also includes a pipeline called "simple pipeline" consisting of 3 components: loading from the hub, image filtering, and embedding. Each component needs to subclass the `FondantComponent` class, which uses the `FondantDataset` class behind the scenes.
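As a rough illustration of the pattern, a component subclass might look like the sketch below. The module path, method name, and `run()` entrypoint are assumptions for illustration, not the exact API of this PR:

```python
import dask.dataframe as dd

from express.component import FondantComponent  # module path assumed


class ImageFilterComponent(FondantComponent):
    """Hypothetical transform component: keep images wider than a cutoff."""

    def transform(self, df: dd.DataFrame, min_width: int = 512) -> dd.DataFrame:
        # Column name is illustrative; real subsets/fields come from the
        # component spec and manifest.
        return df[df["images_width"] > min_width]


if __name__ == "__main__":
    # Hypothetical entrypoint: behind the scenes, FondantDataset reads the
    # input manifest, loads the requested subsets as Dask dataframes, calls
    # transform(), and writes the result back to cloud storage.
    ImageFilterComponent().run()
```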

To be discussed:

  • I've added project_name to the metadata of the manifest, in order to know the name of the cloud project. This is needed to load/upload data using fsspec.

To do:

  • for now I'm manually adding the `"gcs://"` prefix and `".parquet"` suffix when reading data from and writing data to the cloud; this needs a cleaner, non-hardcoded approach (see the fsspec-style sketch after this list)
  • only the first 2 components are implemented; the embedding component is still to do
  • for the moment I'm still manually creating the Kubeflow component yaml file for each component; this should be updated to generate it automatically from the Fondant spec using the [write_kubeflow_specification](https://github.com/ml6team/express/blob/db5807ae868fe36091d8d7f0061450312ab7477b/express/component_spec.py#L207) method
  • find a nicer way of creating and passing metadata; the only metadata that differs per component is the `component_id`, so ideally we get rid of `args.metadata`
  • enforce the data types defined in output subsets when creating the dataset (currently only the column names are checked)
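A possible direction for the first item, sketched below; the helper name and manifest layout are assumptions:

```python
import dask.dataframe as dd


def subset_location(base_path: str, subset_name: str) -> str:
    """Hypothetical helper: build a fully qualified parquet location from a
    base path stored in the manifest, instead of hardcoding the pieces."""
    # The URL scheme ("gcs://", "s3://", "file://", ...) travels with the
    # base path, so loading code never needs a hardcoded "gcs://" prefix.
    return f"{base_path.rstrip('/')}/{subset_name}.parquet"


location = subset_location("gcs://my-project-bucket/run-1", "images")
# Dask resolves the scheme through fsspec (gcsfs for gcs:// paths).
df = dd.read_parquet(location)
```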

@RobbeSneyders (Member) left a comment

Thanks @NielsRogge! I did a quick review, but haven't been able to go through it completely yet.

(Resolved review threads on express/dataset.py and express/manifest.py.)
@ml6team deleted a comment from PhilippeMoussalli on Apr 21, 2023
@RobbeSneyders (Member) left a comment

Let's try to get this PR merged and do follow-up work in separate PRs.

The work still to be done is:

  • Upgrade to KfP 2.X
  • Generate output manifest based on input manifest and component spec
  • Split component into load and transform subclasses
  • Optimize the load_from_hub component
  • Merge loaded subsets into a single dataframe (see the sketch after this comment)
  • Remove hardcoded "gcs://", ".parquet", and project name
  • Automatically create the Kubeflow component yaml
  • Validate data types of input and output data

I resolved all related comments. Let's address the remaining open comments in this PR as soon as possible.
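For the "merge loaded subsets" item, a minimal sketch of what that could look like; the shared index and column-prefix convention are assumptions, not this PR's behavior:

```python
import dask.dataframe as dd


def merge_subsets(subsets: dict[str, dd.DataFrame]) -> dd.DataFrame:
    """Hypothetical sketch: join per-subset dataframes into one dataframe.

    Assumes all subsets share the same id-based index; columns are prefixed
    with the subset name to avoid collisions.
    """
    merged = None
    for name, df in subsets.items():
        df = df.rename(columns={col: f"{name}_{col}" for col in df.columns})
        merged = df if merged is None else merged.join(df, how="left")
    return merged
```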

```python
import pyarrow as pa


# Mapping from Fondant type names to pyarrow types
# (contents truncated in the diff hunk shown here).
type_to_pyarrow = {
    # ...
}
```
A Member commented:

Where / how will this be used?

The Author (@NielsRogge) replied:

This was used to define a `schema` argument when uploading a Dask parquet file to the cloud.

Ideally `dd.to_parquet` works without the `schema` argument; I'm currently checking whether it does.
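For context, a sketch of how such a mapping can feed the `schema` argument of Dask's parquet writer; the mapping entries and columns are illustrative, not the PR's actual values:

```python
import dask.dataframe as dd
import pandas as pd
import pyarrow as pa

# Illustrative slice of a type-name -> pyarrow-type mapping.
type_to_pyarrow = {
    "utf8": pa.string(),
    "int64": pa.int64(),
}

df = dd.from_pandas(
    pd.DataFrame({"id": ["a", "b"], "width": [512, 640]}), npartitions=1
)

# An explicit schema prevents pyarrow from inferring conflicting types
# across (possibly empty) partitions.
schema = pa.schema(
    [("id", type_to_pyarrow["utf8"]), ("width", type_to_pyarrow["int64"])]
)

# In the PR the destination would be a gcs:// URL; a local path keeps the
# sketch self-contained.
df.to_parquet("subset.parquet", schema=schema)
```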

The Author (@NielsRogge) added:

FYI, this is the error when writing Dask subset dataframes to the cloud using `dd.to_parquet`:

[screenshot of the traceback, 2023-04-21]

"set": "Set",
}

kubeflow2python_type = {
A Member commented:

It also can't be unambiguously converted back, since "List" is duplicated as a value.
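To illustrate the ambiguity (the entries are assumed for the example):

```python
# Hypothetical python -> Kubeflow type-name mapping with a duplicated value.
python2kubeflow_type = {
    "list": "List",
    "tuple": "List",  # also maps to "List"
    "set": "Set",
}

# Naive inversion silently drops one of the duplicated keys:
kubeflow2python_type = {v: k for k, v in python2kubeflow_type.items()}
print(kubeflow2python_type)  # {'List': 'tuple', 'Set': 'set'} -- 'list' is lost
```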

(Further resolved review threads on express/dataset.py.)
@GeorgesLorre (Collaborator) commented:
> (quoting @RobbeSneyders' follow-up list above)

I added these to the Trello board so we can distribute and follow up.

@RobbeSneyders changed the title from "[WIP] Add dataset wrapper" to "Add Dataset and Component Class" on Apr 24, 2023
@RobbeSneyders marked this pull request as ready for review on April 24, 2023 13:06
@PhilippeMoussalli merged commit 8245d06 into main on Apr 24, 2023
@RobbeSneyders deleted the add_wrapper branch on May 15, 2023 16:29
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
The squashed commit message mirrors the PR description above, with co-author credits:

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Co-authored-by: Philippe Moussalli <philippe.moussalli95@gmail.com>