Add Fondant Dataset and Component class (#33)
This PR adds the `FondantDataset` wrapper class around the `Manifest`,
which loads data as Dask dataframes and allows uploading Dask dataframes
back to the cloud.

To test everything, the PR also includes a pipeline called "simple
pipeline" consisting of 3 components: loading from the hub, image
filtering and embedding. Each component subclasses the
`FondantComponent` class, which uses the `FondantDataset` class behind
the scenes.
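The subclassing pattern can be sketched as follows; note that the base class here is a self-contained, hypothetical stand-in for the real `FondantComponent` (which lives in `express.dataset` and handles manifest loading and uploading via `FondantDataset`), and the toy list-of-dicts dataset stands in for a Dask dataframe:

```python
# Hypothetical, self-contained sketch of the subclassing pattern.
# The real FondantComponent handles manifest I/O via FondantDataset;
# here run() just forwards data to process().

class FondantComponent:
    """Stand-in base class: run() loads data, calls process(), uploads."""

    def run(self, dataset, args):
        # The real class would build `dataset` as a Dask dataframe
        # from the manifest; this sketch passes data straight through.
        return self.process(dataset, args)


class MyFilterComponent(FondantComponent):
    """Example subclass: only process() needs to be implemented."""

    def process(self, dataset, args):
        # Component-specific transformation goes here.
        return [row for row in dataset if row["width"] > args["min_width"]]


component = MyFilterComponent()
result = component.run([{"width": 100}, {"width": 300}], {"min_width": 200})
print(result)
# [{'width': 300}]
```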

To be discussed:

- [ ] I've added `project_name` to the metadata of the manifest, in
order to know the name of the cloud project. This is needed to
load/upload data using `fsspec`.

To do:

- [ ] for now I'm manually adding the `"gcs://"` prefix and `".parquet"`
suffix when loading data to and from the cloud; this needs to be
addressed (we need a cleaner way that is not hardcoded)
- [ ] only the first 2 components are implemented; the embedding
component is still to do
- [ ] for the moment I'm still manually creating the Kubeflow component
yaml file for each component. This should be updated to automatically
generate it from the Fondant spec using the
[write_kubeflow_specification](https://github.com/ml6team/express/blob/db5807ae868fe36091d8d7f0061450312ab7477b/express/component_spec.py#L207)
method
- [ ] nicer way of creating and passing metadata. The only metadata that
is different per component is the `component_id`. Ideally getting rid of
`args.metadata`
- [ ] enforce usage of data types defined in output subsets when
creating the dataset (currently only the column names are checked)
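For the first item, one non-hardcoded direction could be a small helper that derives the remote path from configuration; the function and parameter names below are illustrative, not part of this PR:

```python
# Illustrative sketch: derive remote parquet paths from configuration
# instead of hardcoding the "gcs://" prefix and ".parquet" suffix.
# All names here are hypothetical.

def remote_path(bucket: str, key: str,
                scheme: str = "gcs", ext: str = "parquet") -> str:
    """Build a cloud storage path such as gcs://my-bucket/subset.parquet."""
    return f"{scheme}://{bucket}/{key}.{ext}"


print(remote_path("my-project-bucket", "images/data"))
# gcs://my-project-bucket/images/data.parquet
```

Keeping the scheme and extension as parameters would also make it easier to support other filesystems through `fsspec` later.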

---------

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Co-authored-by: Philippe Moussalli <philippe.moussalli95@gmail.com>
3 people authored Apr 24, 2023
1 parent db5807a commit 8245d06
Showing 29 changed files with 788 additions and 422 deletions.
2 changes: 1 addition & 1 deletion examples/pipelines/config.py
Original file line number Diff line number Diff line change
@@ -30,4 +30,4 @@ class KubeflowConfig(GeneralConfig):
ARTIFACT_BUCKET = f"{GeneralConfig.GCP_PROJECT_ID}-kfp-output"
CLUSTER_NAME = "kfp-express"
CLUSTER_ZONE = "europe-west4-a"
HOST = "https://472c61c751ab9be9-dot-europe-west1.pipelines.googleusercontent.com"
HOST = "https://52074149b1563463-dot-europe-west1.pipelines.googleusercontent.com/"
57 changes: 57 additions & 0 deletions examples/pipelines/simple_pipeline/build_images.sh
@@ -0,0 +1,57 @@
#!/bin/bash

function usage {
echo "Usage: $0 [options]"
echo "Options:"
echo "  -c, --component <value>  Set the component name. Pass the component folder name to build a specific component, or 'all' to build all components in the current directory (required)"
echo " -n, --namespace <value> Set the namespace (default: ml6team)"
echo " -r, --repo <value> Set the repo (default: express)"
echo " -t, --tag <value> Set the tag (default: latest)"
echo " -h, --help Display this help message"
}

# Parse the arguments
while [[ "$#" -gt 0 ]]; do case $1 in
-n|--namespace) namespace="$2"; shift;;
-r|--repo) repo="$2"; shift;;
-t|--tag) tag="$2"; shift;;
-c|--component) component="$2"; shift;;
-h|--help) usage; exit;;
*) echo "Unknown parameter passed: $1"; exit 1;;
esac; shift; done

# Check for required argument
if [ -z "${component}" ]; then
echo "Error: component parameter is required"
usage
exit 1
fi

# Set default values for optional arguments if not passed
[ -n "${namespace-}" ] || namespace="ml6team"
[ -n "${repo-}" ] || repo="express"
[ -n "${tag-}" ] || tag="latest"

# Get the component directory
component_dir=$(pwd)/"components"

# Loop through all subdirectories
for dir in "$component_dir"/*/; do
  cd "$dir"
  BASENAME=${dir%/}
  BASENAME=${BASENAME##*/}
  # Build all images or one image depending on the passed argument
  if [[ "$BASENAME" == "${component}" ]] || [[ "${component}" == "all" ]]; then
    full_image_name=ghcr.io/${namespace}/${BASENAME}:${tag}
    echo "$full_image_name"
    docker build -t "$full_image_name" \
      --build-arg COMMIT_SHA="$(git rev-parse HEAD)" \
      --build-arg GIT_BRANCH="$(git rev-parse --abbrev-ref HEAD)" \
      --build-arg BUILD_TIMESTAMP="$(date '+%F_%H:%M:%S')" \
      --label org.opencontainers.image.source=https://github.com/${namespace}/${repo} \
      --platform=linux/arm64 \
      .
    docker push "$full_image_name"
  fi
  cd "$component_dir"
done
@@ -0,0 +1,23 @@
name: Embedding
description: Component that embeds images using CLIP
image: embedding:latest

input_subsets:
images:
fields:
data:
type: binary

output_subsets:
embeddings:
fields:
data:
type: float

args:
model_id:
description: Model id on the Hugging Face hub
type: str
batch_size:
description: Batch size to use when embedding
type: int
@@ -0,0 +1,29 @@
FROM --platform=linux/amd64 python:3.8-slim

## System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git curl -y

# Downloading gcloud package
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz

# Installing the package
RUN mkdir -p /usr/local/gcloud \
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
&& /usr/local/gcloud/google-cloud-sdk/install.sh

# Adding the package path to local
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

# install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy over src-files of the component
COPY src /src

# Set the working directory to the source folder
WORKDIR /src

ENTRYPOINT ["python", "main.py"]
@@ -0,0 +1,35 @@
name: image_filtering
description: A component that filters images
inputs:
- name: input_manifest_path
description: Path to the input manifest
type: String

- name: min_width
description: Desired minimum width
type: Integer

- name: min_height
description: Desired minimum height
type: Integer

- name: metadata
description: Metadata arguments, passed as a json dict string
type: String


outputs:
- name: output_manifest_path
description: Path to the output manifest

implementation:
container:
image: ghcr.io/ml6team/image_filtering:latest
command: [
python3, main.py,
--input_manifest_path, {inputPath: input_manifest_path},
--min_width, {inputValue: min_width},
--min_height, {inputValue: min_height},
--metadata, {inputValue: metadata},
--output_manifest_path, {outputPath: output_manifest_path},
]
@@ -0,0 +1 @@
git+https://github.com/ml6team/express.git@b1855308ca9251da5ddd8e6b88c34bc1c082a71b#egg=express
@@ -0,0 +1,27 @@
name: Image filtering
description: Component that filters images based on desired minimum width and height
image: image_filtering:latest

input_subsets:
images:
fields:
width:
type: int16
height:
type: int16

output_subsets:
images:
fields:
width:
type: int16
height:
type: int16

args:
min_width:
description: Desired minimum width
type: int
min_height:
description: Desired minimum height
type: int
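The `args` section of such a spec maps naturally onto type-checked argument parsing; the following is an illustrative sketch of casting string CLI values with the declared types, not how the `express` package actually does it:

```python
# Illustrative sketch: cast raw string CLI values using the `args`
# section of a component spec. The dict literals and helper name are
# hypothetical stand-ins for a parsed YAML spec.

SPEC_ARGS = {
    "min_width": {"description": "Desired minimum width", "type": "int"},
    "min_height": {"description": "Desired minimum height", "type": "int"},
}

PYTHON_TYPES = {"int": int, "float": float, "str": str}


def cast_args(raw: dict) -> dict:
    """Cast raw string values to the types declared in the spec."""
    return {
        name: PYTHON_TYPES[SPEC_ARGS[name]["type"]](value)
        for name, value in raw.items()
    }


print(cast_args({"min_width": "200", "min_height": "200"}))
# {'min_width': 200, 'min_height': 200}
```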
@@ -0,0 +1,38 @@
"""
This component filters images of the dataset based on image size (minimum height and width).
"""
import logging
from typing import Dict

import dask.dataframe as dd

from express.dataset import FondantComponent
from express.logger import configure_logging

configure_logging()
logger = logging.getLogger(__name__)


class ImageFilterComponent(FondantComponent):
    """
    Component that filters images based on height and width.
    """

    def process(self, dataset: dd.DataFrame, args: Dict) -> dd.DataFrame:
        """
        Args:
            dataset: input Dask dataframe
            args: arguments parsed from the component spec
        Returns:
            the filtered Dask dataframe
        """
        logger.info("Filtering dataset...")
        min_width, min_height = args["min_width"], args["min_height"]
        # Dask dataframes support pandas-style boolean-mask filtering;
        # dd.DataFrame has no row-wise `.filter(lambda ...)` method.
        filtered_dataset = dataset[
            (dataset["images_width"] > min_width)
            & (dataset["images_height"] > min_height)
        ]
        return filtered_dataset


if __name__ == "__main__":
    component = ImageFilterComponent()
    component.run()
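The intended size filtering is a pandas-style boolean mask, which Dask dataframes also support, so the logic can be checked quickly on a plain pandas frame (column names follow the `subset_field` convention; the widths and heights are made-up sample values):

```python
import pandas as pd

# Quick check of the size-filtering logic on a plain pandas frame;
# Dask dataframes accept the same boolean-mask indexing.
df = pd.DataFrame({
    "images_width": [100, 640, 1024],
    "images_height": [100, 480, 768],
})

min_width, min_height = 200, 200
filtered = df[(df["images_width"] > min_width)
              & (df["images_height"] > min_height)]
print(filtered["images_width"].tolist())
# [640, 1024]
```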
@@ -0,0 +1,29 @@
FROM --platform=linux/amd64 python:3.8-slim

## System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git curl -y

# Downloading gcloud package
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz

# Installing the package
RUN mkdir -p /usr/local/gcloud \
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
&& /usr/local/gcloud/google-cloud-sdk/install.sh

# Adding the package path to local
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

# install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy over src-files of the component
COPY src /src

# Set the working directory to the source folder
WORKDIR /src

ENTRYPOINT ["python", "main.py"]
Empty file.
@@ -0,0 +1,30 @@
name: load_from_hub
description: A component that takes a dataset name from the 🤗 hub as input and uploads it to a GCS bucket.
inputs:
- name: dataset_name
description: Name of dataset on the hub
type: String

- name: batch_size
description: Batch size to use to create image metadata
type: Integer

- name: metadata
description: Metadata arguments, passed as a json dict string
type: String


outputs:
- name: output_manifest_path
description: Path to the output manifest

implementation:
container:
image: ghcr.io/ml6team/load_from_hub:latest
command: [
python3, main.py,
--dataset_name, {inputValue: dataset_name},
--batch_size, {inputValue: batch_size},
--metadata, {inputValue: metadata},
--output_manifest_path, {outputPath: output_manifest_path},
]
@@ -0,0 +1,4 @@
datasets==2.11.0
git+https://github.com/ml6team/express.git@8ecfb9fcaf0b8d457626179fe44347df829b8979#egg=express
Pillow==9.4.0
gcsfs==2023.4.0
@@ -0,0 +1,31 @@
name: Load from hub
description: Component that loads a dataset from the hub
image: load_from_hub:latest

input_subsets:
images:
fields:
data:
type: binary

output_subsets:
images:
fields:
data:
type: binary
width:
type: int16
height:
type: int16
captions:
fields:
data:
type: utf8

args:
dataset_name:
description: Name of dataset on the hub
type: str
batch_size:
description: Batch size to use to create image metadata
type: int