
Add Dataset and Component Class #33

Merged · 53 commits into main · Apr 24, 2023
Conversation

@NielsRogge (Contributor) commented Apr 20, 2023

This PR adds the `FondantDataset` wrapper class around the `Manifest`, which loads data as Dask dataframes and allows uploading Dask dataframes back to the cloud.

To test everything, the PR also includes a pipeline called "simple pipeline" consisting of 3 components: loading from the hub, image filtering, and embedding. Each component needs to subclass the `FondantComponent` class, which uses the `FondantDataset` class behind the scenes.
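As a rough illustration of the pattern, a component subclass might look like the sketch below. The module path, method name, and `run()` entrypoint are assumptions for illustration, not the exact API of this PR:

```python
import dask.dataframe as dd

from express.component import FondantComponent  # module path assumed


class ImageFilterComponent(FondantComponent):
    """Hypothetical transform component: keep images wider than a cutoff."""

    def transform(self, df: dd.DataFrame, min_width: int = 512) -> dd.DataFrame:
        # Column name is illustrative; real subsets/fields come from the
        # component spec and manifest.
        return df[df["images_width"] > min_width]


if __name__ == "__main__":
    # Hypothetical entrypoint: behind the scenes, FondantDataset reads the
    # input manifest, loads the requested subsets as Dask dataframes, calls
    # transform(), and writes the result back to cloud storage.
    ImageFilterComponent().run()
```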

To be discussed:

  • I've added project_name to the metadata of the manifest, in order to know the name of the cloud project. This is needed to load/upload data using fsspec.

To do:

  • for now I'm manually adding the `"gcs://"` prefix and `".parquet"` suffix when reading data from and writing data to the cloud; this needs a cleaner, non-hardcoded approach (see the fsspec-style sketch after this list)
  • only the first 2 components are implemented; the embedding component is still to do
  • for the moment I'm still manually creating the Kubeflow component yaml file for each component; this should be updated to generate it automatically from the Fondant spec using the [write_kubeflow_specification](https://github.com/ml6team/express/blob/db5807ae868fe36091d8d7f0061450312ab7477b/express/component_spec.py#L207) method
  • find a nicer way of creating and passing metadata; the only metadata that differs per component is the `component_id`, so ideally we get rid of `args.metadata`
  • enforce the data types defined in output subsets when creating the dataset (currently only the column names are checked)
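A possible direction for the first item, sketched below; the helper name and manifest layout are assumptions:

```python
import dask.dataframe as dd


def subset_location(base_path: str, subset_name: str) -> str:
    """Hypothetical helper: build a fully qualified parquet location from a
    base path stored in the manifest, instead of hardcoding the pieces."""
    # The URL scheme ("gcs://", "s3://", "file://", ...) travels with the
    # base path, so loading code never needs a hardcoded "gcs://" prefix.
    return f"{base_path.rstrip('/')}/{subset_name}.parquet"


location = subset_location("gcs://my-project-bucket/run-1", "images")
# Dask resolves the scheme through fsspec (gcsfs for gcs:// paths).
df = dd.read_parquet(location)
```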

@RobbeSneyders (Member) left a comment

Thanks @NielsRogge! I did a quick review, but haven't been able to go through it completely yet.

(Resolved review threads on express/dataset.py and express/manifest.py.)
@ml6team deleted a comment from PhilippeMoussalli on Apr 21, 2023
@RobbeSneyders (Member) left a comment

Let's try to get this PR merged and do follow-up work in separate PRs.

The work still to be done is:

  • Upgrade to KfP 2.X
  • Generate output manifest based on input manifest and component spec
  • Split component into load and transform subclasses
  • Optimize the load_from_hub component
  • Merge loaded subsets into a single dataframe (see the sketch after this comment)
  • Remove hardcoded "gcs://", ".parquet", and project name
  • Automatically create the Kubeflow component yaml
  • Validate data types of input and output data

I resolved all related comments. Let's address the remaining open comments in this PR as soon as possible.
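For the "merge loaded subsets" item, a minimal sketch of what that could look like; the shared index and column-prefix convention are assumptions, not this PR's behavior:

```python
import dask.dataframe as dd


def merge_subsets(subsets: dict[str, dd.DataFrame]) -> dd.DataFrame:
    """Hypothetical sketch: join per-subset dataframes into one dataframe.

    Assumes all subsets share the same id-based index; columns are prefixed
    with the subset name to avoid collisions.
    """
    merged = None
    for name, df in subsets.items():
        df = df.rename(columns={col: f"{name}_{col}" for col in df.columns})
        merged = df if merged is None else merged.join(df, how="left")
    return merged
```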

```python
import pyarrow as pa


# Mapping from Fondant type names to pyarrow types
# (contents truncated in the diff hunk shown here).
type_to_pyarrow = {
    # ...
}
```
A Member commented:

Where / how will this be used?

The Author (@NielsRogge) replied:

This was used to define a `schema` argument when uploading a Dask parquet file to the cloud.

Ideally `dd.to_parquet` works without the `schema` argument; I'm currently checking whether it does.
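For context, a sketch of how such a mapping can feed the `schema` argument of Dask's parquet writer; the mapping entries and columns are illustrative, not the PR's actual values:

```python
import dask.dataframe as dd
import pandas as pd
import pyarrow as pa

# Illustrative slice of a type-name -> pyarrow-type mapping.
type_to_pyarrow = {
    "utf8": pa.string(),
    "int64": pa.int64(),
}

df = dd.from_pandas(
    pd.DataFrame({"id": ["a", "b"], "width": [512, 640]}), npartitions=1
)

# An explicit schema prevents pyarrow from inferring conflicting types
# across (possibly empty) partitions.
schema = pa.schema(
    [("id", type_to_pyarrow["utf8"]), ("width", type_to_pyarrow["int64"])]
)

# In the PR the destination would be a gcs:// URL; a local path keeps the
# sketch self-contained.
df.to_parquet("subset.parquet", schema=schema)
```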

The Author (@NielsRogge) added:

FYI, this is the error when writing Dask subset dataframes to the cloud using `dd.to_parquet`:

[screenshot of the traceback, 2023-04-21]

"set": "Set",
}

kubeflow2python_type = {
A Member commented:

It also can't be unambiguously converted back, since "List" is duplicated as a value.
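To illustrate the ambiguity (the entries are assumed for the example):

```python
# Hypothetical python -> Kubeflow type-name mapping with a duplicated value.
python2kubeflow_type = {
    "list": "List",
    "tuple": "List",  # also maps to "List"
    "set": "Set",
}

# Naive inversion silently drops one of the duplicated keys:
kubeflow2python_type = {v: k for k, v in python2kubeflow_type.items()}
print(kubeflow2python_type)  # {'List': 'tuple', 'Set': 'set'} -- 'list' is lost
```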

(Further resolved review threads on express/dataset.py.)
@GeorgesLorre (Collaborator) commented:
> (quoting @RobbeSneyders' follow-up list above)

I added these to the Trello board so we can distribute and follow up.

@RobbeSneyders changed the title from "[WIP] Add dataset wrapper" to "Add Dataset and Component Class" on Apr 24, 2023
@RobbeSneyders marked this pull request as ready for review on April 24, 2023 13:06
@PhilippeMoussalli merged commit 8245d06 into main on Apr 24, 2023
@RobbeSneyders deleted the add_wrapper branch on May 15, 2023 16:29
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
The squashed commit message mirrors the PR description above, with co-author credits:

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Co-authored-by: Philippe Moussalli <philippe.moussalli95@gmail.com>