
Split dataframe to subsets #58

Merged 16 commits into main on May 4, 2023

Conversation

PhilippeMoussalli (Contributor):

This PR optimizes the execution of Dask transformations and writes by creating delayed tasks that are executed in parallel. The main goal is to speed up writing many different subsets at the same time.

I also added more documentation for the `Dataset` class. The changes required some refactoring.
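A minimal sketch of the delayed-write pattern (the dataframe contents, path, and subset mapping are illustrative assumptions, not the actual fondant code):

```python
import dask
import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(
    pd.DataFrame({"id": [1, 2], "images_data": ["a", "b"], "captions_text": ["c", "d"]}),
    npartitions=1,
)
base_path = "/tmp/dataset"  # hypothetical output location
subsets = {"images": ["images_data"], "captions": ["captions_text"]}

# One lazy write task per subset; with compute=False, to_parquet returns
# a delayed task instead of executing immediately.
write_tasks = []
for subset_name, columns in subsets.items():
    task = dd.to_parquet(df[columns], f"{base_path}/{subset_name}", compute=False)
    write_tasks.append(task)

# Execute all writes in parallel on a single Dask graph.
dask.compute(*write_tasks)
```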

RobbeSneyders linked an issue on May 2, 2023 that may be closed by this pull request.
RobbeSneyders added this to the Alpha milestone on May 2, 2023.
Comment on lines 227 to 231
# Add the output subset to the manifest
manifest_fields = [
(field.name, Type[field.type.name]) for field in subset.fields.values()
]
self.manifest.add_subset(subset_name, fields=manifest_fields)
Contributor:
The `Dataset` class should be a wrapper around an immutable `Manifest` (see the earlier discussion).

The `evolve` method currently takes care of updating the manifest (adding subsets based on the output subsets of the component spec). cc @RobbeSneyders
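A rough sketch of the immutable-manifest pattern being suggested (the field layout and the `output_subsets` attribute are assumptions, not the actual fondant API):

```python
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class Manifest:
    # Hypothetical layout: subset name -> list of (field name, type) pairs.
    subsets: dict = field(default_factory=dict)

    def add_subset(self, name: str, fields: list) -> "Manifest":
        # Return a new Manifest rather than mutating self.
        return replace(self, subsets={**self.subsets, name: fields})

    def evolve(self, component_spec) -> "Manifest":
        # Derive the next manifest from the component spec's output subsets.
        manifest = self
        for name, subset_fields in component_spec.output_subsets.items():
            manifest = manifest.add_subset(name, subset_fields)
        return manifest
```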

Contributor (Author):

Good call, I think I'll resolve this when merging with #56. Let's keep the focus of this PR on the splitting logic with Dask.

dataset.add_index(df)
dataset.add_subsets(df, self.spec)
index_task = dataset.get_upload_index_task(df)
subset_tasks = dataset.get_upload_subsets_task(df, self.spec)
Member:

I would prefer to keep these tasks within the Dataset class. That way we only need to implement different Dataset classes if we want to support different frameworks.
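A minimal sketch of the encapsulation being proposed, where only the `Dataset` subclass varies per framework (interface names are illustrative, not the actual fondant code):

```python
from abc import ABC, abstractmethod

class Dataset(ABC):
    """Framework-agnostic interface for writing a dataset."""

    @abstractmethod
    def upload_index_task(self, df):
        """Return a lazy task that writes the index."""

    @abstractmethod
    def upload_subsets_tasks(self, df, spec):
        """Return lazy tasks that write each output subset."""

class DaskDataset(Dataset):
    def upload_index_task(self, df):
        ...  # build a delayed to_parquet task for the index columns

    def upload_subsets_tasks(self, df, spec):
        ...  # build one delayed write task per subset in the spec
```

The component then only calls these methods; supporting another framework means implementing a new `Dataset` subclass rather than changing component code.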

Collaborator:

Do we need to turn the merging of subsets into a dataframe into a task as well? That way we can include it in the task list and let Dask figure out how to handle it.

Contributor (Author):

Makes sense, I've updated it accordingly. The only small downside is that we're now handling the writing of the index and all of the subsets separately, but that shouldn't introduce a major performance hit.

@GeorgesLorre (Collaborator):

For testing, you can use https://docs.pytest.org/en/7.1.x/how-to/tmp_path.html to write out the subsets temporarily.
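As an illustration, a small test using the `tmp_path` fixture (the dataframe contents are made up, not the actual fondant test data):

```python
import dask.dataframe as dd
import pandas as pd

def test_write_subset(tmp_path):
    # tmp_path is a unique temporary pathlib.Path provided per test by pytest.
    df = dd.from_pandas(
        pd.DataFrame({"id": ["a", "b"], "source": ["s1", "s2"]}),
        npartitions=1,
    )
    out_dir = tmp_path / "index"
    dd.to_parquet(df, str(out_dir))

    # Read back and verify the roundtrip.
    result = dd.read_parquet(str(out_dir)).compute()
    assert list(result["id"]) == ["a", "b"]
```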

@RobbeSneyders (Member):

#56 has been merged. Can you rebase or merge so the conflicts are resolved? That will make it easier to review.

@RobbeSneyders (Member) left a comment:

Thanks @PhilippeMoussalli! General approach looks good to me. Left some comments :)


Collaborator:

What exactly changed in the parquet files?

Contributor (Author):

I think the operation you defined returns a lazy dataframe, so there's no need to define a task there.

I changed the source data type from integer to string, since that's what we currently expect.

Args:
    df: The output Dask dataframe returned by the user.
"""
remote_path = self.manifest.index.location
index_columns = list(self.manifest.index.fields.keys())

# load index dataframe
index_df = df[index_columns]
Collaborator:

Does this `index_df` have an index? Maybe we need to call `.set_index()`?

@PhilippeMoussalli (Contributor, Author) commented on May 4, 2023:

We're using both the id and the source as the index. It doesn't seem like Dask supports a multi-index, though: https://dask.discourse.group/t/everything-about-multiindex-in-dask/593/2. We might want to reconsider having the index be a single string that contains both the source and the id.

Collaborator:

We could just index on id, since that's better than no index (this is also what happens in the test data; see split.py). I'll create a ticket to solve multi-index.
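A small sketch of the two options discussed, indexing on `id` alone versus a single combined string key (column names assumed from the snippets above):

```python
import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(
    pd.DataFrame({"id": [1, 2], "source": ["s1", "s2"], "data": ["x", "y"]}),
    npartitions=1,
)

# Option 1: index on id only; better than no index at all.
indexed = df.set_index("id")

# Option 2: work around the missing multi-index by combining source and id
# into one string key.
df["key"] = df["source"] + "_" + df["id"].astype(str)
combined = df.set_index("key")
```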


Contributor (Author):

Alright, added it to both the index and the subsets.

@GeorgesLorre (Collaborator) left a comment:

Nice work!

PhilippeMoussalli merged commit 47a6b97 into main on May 4, 2023.
RobbeSneyders deleted the split-dataframe-to-subsets branch on May 15, 2023.
Hakimovich99 pushed a commit that referenced this pull request on Oct 16, 2023.
Successfully merging this pull request may close these issues.

Split dataframe to write different subsets
4 participants