Add Dataset and Component Class #33
Conversation
Thanks @NielsRogge! I did a quick review, but haven't been able to go through it completely yet.
Let's try to get this PR merged and do follow-up work in separate PRs. The work still to be done is:

- Upgrade to KFP 2.x
- Generate the output manifest based on the input manifest and component spec
- Split the component into load and transform subclasses
- Optimize the `load_from_hub` component
- Merge loaded subsets into a single dataframe
- Remove the hardcoded `"gcs://"`, `".parquet"`, and project name
- Automatically create the Kubeflow component yaml
- Validate the data types of input and output data

I resolved all related comments. Let's address the open comments still in this PR as soon as possible.
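The proposed split into load and transform subclasses could look roughly like the sketch below. The class and method names here are assumptions for illustration, not the actual Fondant API:

```python
from abc import ABC, abstractmethod


class FondantComponent(ABC):
    """Shared plumbing: argument parsing, manifest handling, etc.
    (Hypothetical base class for this sketch.)"""


class LoadComponent(FondantComponent):
    """Creates an initial dataset; takes no input manifest."""

    @abstractmethod
    def load(self):
        """Return the loaded data, e.g. as a Dask dataframe."""


class TransformComponent(FondantComponent):
    """Consumes an input dataset and produces a transformed one."""

    @abstractmethod
    def transform(self, dataframe):
        """Return the transformed dataframe."""
```

Concrete components would then subclass one of the two and implement only the single method that applies to them.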
express/schema.py (Outdated)

```python
import pyarrow as pa

type_to_pyarrow = {
```
Where / how will this be used?
This was used to define a `schema` argument when uploading a Dask parquet file to the cloud. Ideally `dd.to_parquet` works without the `schema` argument; I'm trying right now to see if it works without it.
```python
    "set": "Set",
}

kubeflow2python_type = {
```
It also can't be unambiguously converted, since `"List"` is duplicated as a value.

I added these to the Trello board so we can distribute and follow up.
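The ambiguity is easy to demonstrate: inverting a dict whose values repeat silently drops entries, because later keys overwrite earlier ones. The mapping below is illustrative, not the actual one:

```python
# Illustrative forward mapping with a duplicated value ("List")
python2kubeflow_type = {
    "list": "List",
    "tuple": "List",
    "str": "String",
}

# Naive inversion: the "list" entry is silently overwritten by "tuple",
# so the round trip loses information.
inverted = {v: k for k, v in python2kubeflow_type.items()}
print(inverted)  # {'List': 'tuple', 'String': 'str'}
```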
This PR adds the `FondantDataset` wrapper class around the `Manifest`, which loads data as Dask dataframes and allows uploading Dask dataframes back to the cloud. To test everything, the PR also includes a pipeline called "simple pipeline" that consists of 3 components: loading from hub, image filtering, and embedding. Each component needs to override the `FondantComponent` class, which uses the `FondantDataset` class behind the scenes.

To be discussed:

- [ ] I've added `project_name` to the metadata of the manifest, in order to know the name of the cloud project. This is needed to load/upload data using `fsspec`.

To do:

- [ ] For now I'm manually adding the `"gcs://"` prefix and `".parquet"` suffix when loading data to and from the cloud; this needs to be addressed (we need a cleaner way that is not hardcoded).
- [ ] Only the first 2 components are implemented; the embedding component is still to do.
- [ ] For the moment I'm still manually creating the Kubeflow component yaml file for each component. This should be updated to create it automatically based on the Fondant spec, using the [write_kubeflow_specification](https://github.com/ml6team/express/blob/db5807ae868fe36091d8d7f0061450312ab7477b/express/component_spec.py#L207) method.
- [ ] Find a nicer way of creating and passing metadata. The only metadata that differs per component is the `component_id`; ideally we get rid of `args.metadata`.
- [ ] Enforce the data types defined in the output subsets when creating the dataset (currently only the column names are checked).

---------

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Co-authored-by: Philippe Moussalli <philippe.moussalli95@gmail.com>
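One way to avoid the hardcoded `"gcs://"` prefix and `".parquet"` suffix is to keep the full base URI in the manifest and derive paths from it. A minimal stdlib sketch, with hypothetical helper and bucket names:

```python
from urllib.parse import urlsplit


def subset_path(base_uri: str, subset: str) -> str:
    """Join a base URI and a subset name into a parquet path, so the
    scheme and extension are not hardcoded at every call site."""
    return f"{base_uri.rstrip('/')}/{subset}.parquet"


def protocol_of(uri: str) -> str:
    """Return the URI scheme (e.g. 'gcs'); plain paths count as 'file'."""
    return urlsplit(uri).scheme or "file"


path = subset_path("gcs://my-bucket/artifacts", "images")
print(path)  # gcs://my-bucket/artifacts/images.parquet
```

Since `fsspec` dispatches on the URI scheme, derived paths like this could be handed to it directly, and switching storage backends would only require changing the base URI in the manifest.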