Refactor stable diffusion pipeline [part 1] #20

Merged
merged 38 commits into from Apr 4, 2023

Commits (38)
4830be4
First draft
Mar 22, 2023
511803a
More improvements
Mar 22, 2023
e26ce7c
More improvements
Mar 22, 2023
89f02ad
Update dataset name
Mar 22, 2023
0ef5e08
Add embedding component
Mar 22, 2023
b02a6e8
More improvements
Mar 23, 2023
a8bcbf0
More improvements and fixes
Mar 23, 2023
6c785b1
More fixes
Mar 23, 2023
8c3923d
Add print statement
Mar 23, 2023
a447d16
Update requirements
Mar 23, 2023
b37cfa3
Add print statements
Mar 23, 2023
24b88e7
Add print statements
Mar 23, 2023
932bb33
Apply fix
Mar 23, 2023
3b44bbf
Fix kwargs
Mar 23, 2023
8788dfb
Add print statements
Mar 23, 2023
6e7e599
Apply fix
Mar 23, 2023
9fb4061
Fix embedding component
Mar 24, 2023
e032900
Update requirements
Mar 24, 2023
2325335
Use single build_images.sh script
Mar 24, 2023
27875bd
Update image paths
Mar 24, 2023
45f8d70
Add clip retrieval component
Mar 28, 2023
d763a93
Add back the non updated files
Mar 29, 2023
d8f8fac
Add READNEs to components
Mar 29, 2023
74740f3
Address comments
Mar 29, 2023
0a07eae
Address comments
Mar 29, 2023
8196e8d
Remove type int
Mar 29, 2023
3bfc126
Add batch size argument
Mar 29, 2023
2595575
Update clip retrieval component
Mar 29, 2023
c1e03cc
Update platform
Mar 29, 2023
f828244
More improvements
Mar 30, 2023
70ed04b
Update dockerfile
Mar 30, 2023
f93774b
Run pre-commit run --all-files
Mar 30, 2023
7952f79
Add retrieval mini component
Mar 30, 2023
ccba060
Fix path
Mar 30, 2023
3cc6ed0
Fix pipeline
Mar 30, 2023
ab322d6
More improvements
Mar 30, 2023
7e047a1
Update CLIP retrieval component
Mar 31, 2023
1567a36
Use old implementation of clip retrieval
Apr 4, 2023
20 changes: 18 additions & 2 deletions .pre-commit-config.yaml
@@ -6,7 +6,15 @@ repos:
rev: 'v0.0.254'
hooks:
- id: ruff
files: "^express/"
files: |
(?x)^(
express/.*|
examples/pipelines/hf_dataset_pipeline/.*|
examples/pipelines/finetune_stable_diffusion/components/load_from_hub_component/.*|
examples/pipelines/finetune_stable_diffusion/components/image_filter_component/.*|
examples/pipelines/finetune_stable_diffusion/components/embedding_component/.*|
examples/pipelines/finetune_stable_diffusion/dataset_creation_pipeline.py
)$
args: [--fix, --exit-non-zero-on-fix]


@@ -24,4 +32,12 @@ repos:
hooks:
- id: black
name: black
files: "^express/"
files: |
(?x)^(
express/.*|
examples/pipelines/hf_dataset_pipeline/.*|
examples/pipelines/finetune_stable_diffusion/components/load_from_hub_component/.*|
examples/pipelines/finetune_stable_diffusion/components/image_filter_component/.*|
examples/pipelines/finetune_stable_diffusion/components/embedding_component/.*|
examples/pipelines/finetune_stable_diffusion/dataset_creation_pipeline.py
)$
60 changes: 59 additions & 1 deletion docs/README.md
@@ -99,4 +99,62 @@ After transforming the input data (see below), an **ExpressDatasetDraft** create
### 1.b) Transforms and Loaders
The most common type of component in Express is an **ExpressTransformComponent**, which takes an `ExpressDataset` and an optional dict of arguments as input and returns an `ExpressDatasetDraft` of transformed output data.

However, at the start of a new pipeline, you won't yet have any express datasets to transform. Instead, an express pipeline can use an **ExpressLoaderComponent** as an entry point, which only takes the optional dict of arguments to construct an ExpressDatasetDraft. For example, the arguments could specify an external data location and how to interpret it, after which a loader job can create a first `ExpressDataset`.
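
To make this concrete, here is a minimal, self-contained sketch of the transform pattern; the stub class names and the caption-filtering logic are illustrative assumptions, not the actual express API:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

import pandas as pd

# Stand-in types for illustration only; the real ExpressDataset and
# ExpressDatasetDraft classes differ.
@dataclass
class DatasetStub:
    data_sources: Dict[str, pd.DataFrame]

@dataclass
class DatasetDraftStub:
    data_sources: Dict[str, pd.DataFrame] = field(default_factory=dict)

def transform(dataset: DatasetStub, args: Optional[dict] = None) -> DatasetDraftStub:
    """Hypothetical transform: drop captions shorter than a minimum length."""
    args = args or {}
    captions = dataset.data_sources["caption"]
    kept = captions[captions["text"].str.len() >= args.get("min_length", 10)]
    return DatasetDraftStub(data_sources={"caption": kept})

draft = transform(
    DatasetStub({"caption": pd.DataFrame({"text": ["a cat", "a detailed photo of a dog"]})}),
    {"min_length": 10},
)
print(draft.data_sources["caption"])  # only the longer caption survives
```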

## **Data Manifest: a common approach to simplify different steps throughout the pipeline**
In order to keep track of the different data sources, we opt for a manifest-centered approach where
a manifest is simply a JSON file that is passed and modified throughout the different steps of the pipeline.

```json
{
"dataset_id":"<run_id>-<component_name>",
"index":"<path to the index parquet file>",
"associated_data":{
"dataset":{
"namespace_1":"<path to the dataset (metadata) parquet file of the datasets associated with `namespace_1`>",
"...":""
},
"caption":{
"namespace_1":"<path to the caption parquet file associated with `namespace_1`>",
"...":""
},
"embedding":{
"namespace_1":"<remote path to the directory containing the embeddings associated with `namespace_1`",
"...":""
}
},
"metadata":{
"branch":"<the name of the branch associated with the component>",
"commit_hash":"<the commit of the component>",
"creation_date":"<the creation date of the manifest>",
"run_id":"<a unique identifier associated with the kfp pipeline run>"
}
}
```
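
For illustration, a component can treat the manifest as plain JSON; the file name and the `seed` namespace below are assumptions for the example:

```python
import json

# Load a manifest produced by a previous component (hypothetical file name).
with open("manifest.json") as f:
    manifest = json.load(f)

# Resolve the index and the metadata parquet of one namespace
# ("seed" is an assumed namespace; see the notes below).
index_path = manifest["index"]
seed_metadata_path = manifest["associated_data"]["dataset"]["seed"]
run_id = manifest["metadata"]["run_id"]

print(index_path, seed_metadata_path, run_id)
```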
A deeper dive into some of these notions:

* **namespace:** the namespace is used to identify the different data sources. For example, you can give
your seed images a specific namespace (e.g. `seed`). The images retrieved with clip-retrieval will then
have a different namespace (e.g. `knn`, `centroid`).

* **index**: the index denotes a unique identifier for each image, with the format `<namespace>_<uid>` (e.g. `seed_00010`).
It indexes all the data sources in `associated_data`.
**Note**: the index keeps track of all namespaces (e.g. [`seed_00010`, `centroid_0001`, ...]).

* **dataset**: a set of parquet files for each namespace that contain relevant metadata
(image size, location, ...) as well as the index.

* **caption**: a set of parquet files for each namespace that contain image captions
as well as the index.

* **metadata**: helps keep track of the step that generated the manifest, the code version, and the pipeline run id.

The Express pipeline consists of multiple steps, defined as **Express steps**, that are repeated
throughout the pipeline. The manifest pattern offers the flexibility needed to promote reuse and avoid
duplication of data sources. For example:

* **Data filtering** (e.g. filtering on image size): add new indices to the `index` but retain the associated data (see the sketch after this list).

* **Data creation** (e.g. clip retrieval): add new indices to the `index` and another source of data under associated data with a new namespace.

* **Data transformation** (e.g. image formatting): retain indices but replace dataset source in `dataset`.
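
As a sketch of the filtering case, using pandas with made-up paths and column names (the real components also handle remote storage and write an updated manifest):

```python
import pandas as pd

# Read the current index and the metadata of the "seed" namespace
# (paths and column names are made up for the example).
index = pd.read_parquet("run_1/index.parquet")
dataset = pd.read_parquet("run_1/dataset_seed.parquet")

# Keep only sufficiently large images, then write a new index for the
# next component.
kept = dataset[(dataset["width"] >= 512) & (dataset["height"] >= 512)]["index"]
new_index = index[index["index"].isin(kept)]
new_index.to_parquet("run_2/index.parquet")

# The new manifest keeps pointing at the existing dataset/caption/embedding
# files under associated_data, so no data is duplicated.
```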
60 changes: 0 additions & 60 deletions examples/pipelines/finetune_stable_diffusion/README.md
NielsRogge marked this conversation as resolved.
@@ -101,63 +101,3 @@ bash build_images.sh

This will build all the components located in the `components` folder. You can also opt to build a specific
component by passing the `--build-dir` flag with the folder name of the component you want to build.
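
For example, to build only one component (the folder name below is illustrative):

```bash
bash build_images.sh --build-dir clip_retrieval_component
```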


(The deleted lines are the `#TODO: move those docs elsewhere` note and the Data Manifest documentation, moved verbatim to docs/README.md as shown above.)
build_images.sh
@@ -49,6 +49,7 @@ for dir in $component_dir/*/; do
--build-arg GIT_BRANCH=$(git rev-parse --abbrev-ref HEAD) \
--build-arg BUILD_TIMESTAMP=$(date '+%F_%H:%M:%S') \
--label org.opencontainers.image.source=https://github.com/${namespace}/${repo} \
--platform=linux/arm64 \
.
docker push "$full_image_name"
fi
Dockerfile
@@ -1,14 +1,31 @@
FROM europe-west1-docker.pkg.dev/storied-landing-366912/storied-landing-366912-default-repository/mlpipelines/kubeflow/components/base_component:latest
FROM --platform=linux/amd64 python:3.8-slim
Contributor:
Can we remove --platform=linux/amd64 and only have it as a flag in the bash script?
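
A minimal sketch of that suggestion (hypothetical flag handling; the actual build_images.sh may differ):

```bash
#!/bin/bash
# Sketch: accept the target platform as a flag with a default,
# instead of hard-coding --platform in the Dockerfile.
platform="linux/amd64"

while [[ "$#" -gt 0 ]]; do
  case "$1" in
    --platform) platform="$2"; shift 2 ;;
    *) shift ;;
  esac
done

# $full_image_name is assumed to be set earlier in the script, as in the
# original; the platform then only appears in the docker build invocation.
docker build --platform="$platform" -t "$full_image_name" .
```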


# System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git curl -y

# Downloading gcloud package
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz

# Installing the package
RUN mkdir -p /usr/local/gcloud \
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
&& /usr/local/gcloud/google-cloud-sdk/install.sh

# Adding the package path to local
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

# Copy over the requirements file of the component
COPY requirements.txt .

# Install packages
RUN pip3 install -r requirements.txt

# Copy over src-files of the component
COPY src /src

# Set the working directory to the source folder
WORKDIR /src

ENTRYPOINT ["python", "main.py"]
clip_retrieval_component/component.yaml
@@ -1,58 +1,29 @@
name: clip_retrieval_component
Contributor:
Can we remove all clip retrieval related stuff from this PR?

description: A component that takes a dataset manifest and returns an output data manifest with an extended dataset
by retrieving similar images from the laion dataset using different retrieval strategies (knn, centroid)
inputs:
- name: run_id
description: The run id of the pipeline
type: String
- name: artifact_bucket
description: The GCS bucket used to store the artifact
type: String
- name: component_name
description: the name of the component (used to create gcs artefact path)
type: String
- name: project_id
description: The id of the gcp-project
type: String
- name: laion_index_url
description: contains the indices of the metadata. Those indices need to be transformed in case you decide to use only a subset of the dataset
type: String
- name: laion_metadata_url
description: url to the metadata of laion dataset metadata (arrow format). It can either contain a subset of the laion 5b metadata (e.g. laion-en) or all of the metadata
type: String
- name: nb_images_knn
description: The number of images to return via the knn strategy (per image)
type: Integer
- name: nb_images_centroid
description: The number of images to return via the centroid strategy
type: Integer
- name: data_manifest_path
description: The previous component manifest path
type: String
description: A component that retrieves similar images from the LAION dataset.
inputs:
- name: extra_args
description: Additional arguments passed to the component, as a json dict string
type: String

- name: metadata
description: Metadata arguments, passed as a json dict string
type: String

- name: input_manifest
description: Path to the input manifest
type: String

outputs:
- name: data_manifest_path_clip_retrieval_component
description: Path to the local file containing the gcs path where the output has been stored
- name: parquet_path_clip_centroid_retrieval
description: The path to the parquet file containing the urls from centroid retrieval
- name: parquet_path_clip_knn_retrieval
description: The path to the parquet file containing the urls from knn retrieval
- name: output_manifest
description: Path to the output manifest

implementation:
container:
image: europe-west1-docker.pkg.dev/storied-landing-366912/storied-landing-366912-default-repository/mlpipelines/kubeflow/components/clip_retrieval_component:latest
command: [
python3, main.py,
--run-id, { inputValue: run_id },
--artifact-bucket, { inputValue: artifact_bucket },
--component-name, { inputValue: component_name },
--project-id, { inputValue: project_id, },
--laion-index-url, { inputValue: laion_index_url },
--laion-metadata-url, { inputValue: laion_metadata_url },
--nb-images-knn, { inputValue: nb_images_knn },
--nb-images-centroid, { inputValue: nb_images_centroid },
--data-manifest-path, { inputPath: data_manifest_path },
--data-manifest-path-clip-retrieval-component, { outputPath: data_manifest_path_clip_retrieval_component },
--parquet-path-clip-centroid-retrieval, { outputPath: parquet_path_clip_centroid_retrieval },
--parquet-path-clip-knn-retrieval, { outputPath: parquet_path_clip_knn_retrieval },
]
container:
image: ghcr.io/ml6team/clip_retrieval_component:latest
command: [
python3, main.py,
--input-manifest, {inputPath: input_manifest},
--metadata, {inputValue: metadata},
--extra-args, {inputValue: extra_args},
--output-manifest, {outputPath: output_manifest},
]
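
For illustration, a KFP v1 pipeline could load and wire this spec roughly as follows; the file path, the argument values, and the upstream `previous_task` are assumptions for the example:

```python
import json

from kfp import components

# Load the component spec (path is assumed for the example).
clip_retrieval_op = components.load_component_from_file(
    "components/clip_retrieval_component/component.yaml"
)

# Inside a pipeline function: extra_args and metadata are passed as JSON
# dict strings, as the spec describes; input_manifest comes from an
# upstream component's output_manifest.
clip_retrieval_task = clip_retrieval_op(
    input_manifest=previous_task.outputs["output_manifest"],
    metadata=json.dumps({"run_id": "test-run"}),
    extra_args=json.dumps({"nb_images_knn": 500, "nb_images_centroid": 1000}),
)
```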
requirements.txt
@@ -1,5 +1,7 @@
clip-retrieval==2.34.2
numpy>=1.19.5,<2
pandas==1.3.5
tqdm==4.64.1
git+https://github.com/ml6team/express.git@3cc6ed0c2c1d21777ab32d21f3d96f0c58e36090
datasets==2.11.0
numpy==1.24.2
clip-retrieval==2.36.1
tqdm==4.65.0
Pillow==9.3.0