Refactor stable diffusion pipeline [part 1] #20

Merged
merged 38 commits into from Apr 4, 2023

Commits (38)
4830be4
First draft
Mar 22, 2023
511803a
More improvements
Mar 22, 2023
e26ce7c
More improvements
Mar 22, 2023
89f02ad
Update dataset name
Mar 22, 2023
0ef5e08
Add embedding component
Mar 22, 2023
b02a6e8
More improvements
Mar 23, 2023
a8bcbf0
More improvements and fixes
Mar 23, 2023
6c785b1
More fixes
Mar 23, 2023
8c3923d
Add print statement
Mar 23, 2023
a447d16
Update requirements
Mar 23, 2023
b37cfa3
Add print statements
Mar 23, 2023
24b88e7
Add print statements
Mar 23, 2023
932bb33
Apply fix
Mar 23, 2023
3b44bbf
Fix kwargs
Mar 23, 2023
8788dfb
Add print statements
Mar 23, 2023
6e7e599
Apply fix
Mar 23, 2023
9fb4061
Fix embedding component
Mar 24, 2023
e032900
Update requirements
Mar 24, 2023
2325335
Use single build_images.sh script
Mar 24, 2023
27875bd
Update image paths
Mar 24, 2023
45f8d70
Add clip retrieval component
Mar 28, 2023
d763a93
Add back the non updated files
Mar 29, 2023
d8f8fac
Add READNEs to components
Mar 29, 2023
74740f3
Address comments
Mar 29, 2023
0a07eae
Address comments
Mar 29, 2023
8196e8d
Remove type int
Mar 29, 2023
3bfc126
Add batch size argument
Mar 29, 2023
2595575
Update clip retrieval component
Mar 29, 2023
c1e03cc
Update platform
Mar 29, 2023
f828244
More improvements
Mar 30, 2023
70ed04b
Update dockerfile
Mar 30, 2023
f93774b
Run pre-commit run --all-files
Mar 30, 2023
7952f79
Add retrieval mini component
Mar 30, 2023
ccba060
Fix path
Mar 30, 2023
3cc6ed0
Fix pipeline
Mar 30, 2023
ab322d6
More improvements
Mar 30, 2023
7e047a1
Update CLIP retrieval component
Mar 31, 2023
1567a36
Use old implementation of clip retrieval
Apr 4, 2023
20 changes: 18 additions & 2 deletions .pre-commit-config.yaml
@@ -6,7 +6,15 @@ repos:
rev: 'v0.0.254'
hooks:
- id: ruff
files: "^express/"
files: |
(?x)^(
express/.*|
examples/pipelines/hf_dataset_pipeline/.*|
examples/pipelines/finetune_stable_diffusion/components/load_from_hub_component/.*|
examples/pipelines/finetune_stable_diffusion/components/image_filter_component/.*|
examples/pipelines/finetune_stable_diffusion/components/embedding_component/.*|
examples/pipelines/finetune_stable_diffusion/dataset_creation_pipeline.py
)$
args: [--fix, --exit-non-zero-on-fix]


@@ -24,4 +32,12 @@ repos:
hooks:
- id: black
name: black
files: "^express/"
files: |
(?x)^(
express/.*|
examples/pipelines/hf_dataset_pipeline/.*|
examples/pipelines/finetune_stable_diffusion/components/load_from_hub_component/.*|
examples/pipelines/finetune_stable_diffusion/components/image_filter_component/.*|
examples/pipelines/finetune_stable_diffusion/components/embedding_component/.*|
examples/pipelines/finetune_stable_diffusion/dataset_creation_pipeline.py
)$
60 changes: 59 additions & 1 deletion docs/README.md
@@ -99,4 +99,62 @@ After transforming the input data (see below), an **ExpressDatasetDraft** create
### 1.b) Transforms and Loaders
The most common type of component in Express is an **ExpressTransformComponent**, which takes an `ExpressDataset` and an optional dict of arguments as input and returns an `ExpressDatasetDraft` of transformed output data.

However, at the start of a new pipeline, you won't yet have any express datasets to transform. Instead, an express pipeline can use an **ExpressLoaderComponent** as an entry point, which only takes the optional dict of arguments to construct an ExpressDatasetDraft. For example, the arguments could specify an external data location and how to interpret it, after which a loader job can create a first `ExpressDataset`.
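
To make this concrete, here is a minimal, self-contained sketch of the transform pattern; the stub class names and the caption-filtering logic are illustrative assumptions, not the actual express API:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

import pandas as pd

# Stand-in types for illustration only; the real ExpressDataset and
# ExpressDatasetDraft classes differ.
@dataclass
class DatasetStub:
    data_sources: Dict[str, pd.DataFrame]

@dataclass
class DatasetDraftStub:
    data_sources: Dict[str, pd.DataFrame] = field(default_factory=dict)

def transform(dataset: DatasetStub, args: Optional[dict] = None) -> DatasetDraftStub:
    """Hypothetical transform: drop captions shorter than a minimum length."""
    args = args or {}
    captions = dataset.data_sources["caption"]
    kept = captions[captions["text"].str.len() >= args.get("min_length", 10)]
    return DatasetDraftStub(data_sources={"caption": kept})

draft = transform(
    DatasetStub({"caption": pd.DataFrame({"text": ["a cat", "a detailed photo of a dog"]})}),
    {"min_length": 10},
)
print(draft.data_sources["caption"])  # only the longer caption survives
```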

## **Data Manifest: a common approach to simplify different steps throughout the pipeline**
In order to keep track of the different data sources, we opt for a manifest-centered approach where
a manifest is simply a JSON file that is passed and modified throughout the different steps of the pipeline.

```json
{
"dataset_id":"<run_id>-<component_name>",
"index":"<path to the index parquet file>",
"associated_data":{
"dataset":{
"namespace_1":"<path to the dataset (metadata) parquet file of the datasets associated with `namespace_1`>",
"...":""
},
"caption":{
"namespace_1":"<path to the caption parquet file associated with `namespace_1`>",
"...":""
},
"embedding":{
"namespace_1":"<remote path to the directory containing the embeddings associated with `namespace_1`",
"...":""
}
},
"metadata":{
"branch":"<the name of the branch associated with the component>",
"commit_hash":"<the commit of the component>",
"creation_date":"<the creation date of the manifest>",
"run_id":"<a unique identifier associated with the kfp pipeline run>"
}
}
```
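
For illustration, a component can treat the manifest as plain JSON; the file name and the `seed` namespace below are assumptions for the example:

```python
import json

# Load a manifest produced by a previous component (hypothetical file name).
with open("manifest.json") as f:
    manifest = json.load(f)

# Resolve the index and the metadata parquet of one namespace
# ("seed" is an assumed namespace; see the notes below).
index_path = manifest["index"]
seed_metadata_path = manifest["associated_data"]["dataset"]["seed"]
run_id = manifest["metadata"]["run_id"]

print(index_path, seed_metadata_path, run_id)
```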
A deeper dive into some of these notions:

* **namespace:** the namespace is used to identify the different data sources. For example, you can give
your seed images a specific namespace (e.g. `seed`). The images retrieved with clip-retrieval will then
have a different namespace (e.g. `knn`, `centroid`).

* **index**: the index denotes a unique identifier for each image, with the format `<namespace>_<uid>` (e.g. `seed_00010`).
It indexes all the data sources in `associated_data`.
**Note**: the index keeps track of all namespaces (e.g. [`seed_00010`, `centroid_0001`, ...]).

* **dataset**: a set of parquet files for each namespace that contain relevant metadata
(image size, location, ...) as well as the index.

* **caption**: a set of parquet files for each namespace that contain image captions
as well as the index.

* **metadata**: helps keep track of the step that generated the manifest, the code version, and the pipeline run id.

The Express pipeline consists of multiple steps, defined as **Express steps**, that are repeated
throughout the pipeline. The manifest pattern offers the flexibility needed to promote reuse and avoid
duplication of data sources. For example:

* **Data filtering** (e.g. filtering on image size): add new indices to the `index` but retain the associated data (see the sketch after this list).

* **Data creation** (e.g. clip retrieval): add new indices to the `index` and another source of data under associated data with a new namespace.

* **Data transformation** (e.g. image formatting): retain indices but replace dataset source in `dataset`.
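
As a sketch of the filtering case, using pandas with made-up paths and column names (the real components also handle remote storage and write an updated manifest):

```python
import pandas as pd

# Read the current index and the metadata of the "seed" namespace
# (paths and column names are made up for the example).
index = pd.read_parquet("run_1/index.parquet")
dataset = pd.read_parquet("run_1/dataset_seed.parquet")

# Keep only sufficiently large images, then write a new index for the
# next component.
kept = dataset[(dataset["width"] >= 512) & (dataset["height"] >= 512)]["index"]
new_index = index[index["index"].isin(kept)]
new_index.to_parquet("run_2/index.parquet")

# The new manifest keeps pointing at the existing dataset/caption/embedding
# files under associated_data, so no data is duplicated.
```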
60 changes: 0 additions & 60 deletions examples/pipelines/finetune_stable_diffusion/README.md
NielsRogge marked this conversation as resolved.
@@ -101,63 +101,3 @@ bash build_images.sh

This will build all the components located in the `components` folder. You can also opt to build a specific
component by passing the `--build-dir` flag with the folder name of the component you want to build.
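
For example, to build only one component (the folder name below is illustrative):

```bash
bash build_images.sh --build-dir clip_retrieval_component
```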


(The deleted lines are the `#TODO: move those docs elsewhere` note and the Data Manifest documentation, moved verbatim to docs/README.md as shown above.)
build_images.sh
@@ -49,6 +49,7 @@ for dir in $component_dir/*/; do
--build-arg GIT_BRANCH=$(git rev-parse --abbrev-ref HEAD) \
--build-arg BUILD_TIMESTAMP=$(date '+%F_%H:%M:%S') \
--label org.opencontainers.image.source=https://github.com/${namespace}/${repo} \
--platform=linux/arm64 \
.
docker push "$full_image_name"
fi
Dockerfile
@@ -1,14 +1,31 @@
FROM europe-west1-docker.pkg.dev/storied-landing-366912/storied-landing-366912-default-repository/mlpipelines/kubeflow/components/base_component:latest
FROM --platform=linux/amd64 python:3.8-slim
Contributor:
Can we remove --platform=linux/amd64 and only have it as a flag in the bash script?
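
A minimal sketch of that suggestion (hypothetical flag handling; the actual build_images.sh may differ):

```bash
#!/bin/bash
# Sketch: accept the target platform as a flag with a default,
# instead of hard-coding --platform in the Dockerfile.
platform="linux/amd64"

while [[ "$#" -gt 0 ]]; do
  case "$1" in
    --platform) platform="$2"; shift 2 ;;
    *) shift ;;
  esac
done

# $full_image_name is assumed to be set earlier in the script, as in the
# original; the platform then only appears in the docker build invocation.
docker build --platform="$platform" -t "$full_image_name" .
```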


# System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git curl -y

# Downloading gcloud package
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz

# Installing the package
RUN mkdir -p /usr/local/gcloud \
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
&& /usr/local/gcloud/google-cloud-sdk/install.sh

# Adding the package path to local
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

# Copy over the requirements file of the component
COPY requirements.txt .

# Install packages
RUN pip3 install -r requirements.txt

# Copy over src-files of the component
COPY src /src

# Set the working directory to the source folder
WORKDIR /src

ENTRYPOINT ["python", "main.py"]
clip_retrieval_component/component.yaml
@@ -1,58 +1,29 @@
name: clip_retrieval_component
Contributor:
Can we remove all clip retrieval related stuff from this PR?

description: A component that takes a dataset manifest and returns an output data manifest with an extended dataset
by retrieving similar images from the laion dataset using different retrieval strategies (knn, centroid)
inputs:
- name: run_id
description: The run id of the pipeline
type: String
- name: artifact_bucket
description: The GCS bucket used to store the artifact
type: String
- name: component_name
description: the name of the component (used to create gcs artefact path)
type: String
- name: project_id
description: The id of the gcp-project
type: String
- name: laion_index_url
description: contains the indices of the metadata. Those indices need to be transformed in case you decide to use only a subset of the dataset
type: String
- name: laion_metadata_url
description: url to the metadata of laion dataset metadata (arrow format). It can either contain a subset of the laion 5b metadata (e.g. laion-en) or all of the metadata
type: String
- name: nb_images_knn
description: The number of images to return via the knn strategy (per image)
type: Integer
- name: nb_images_centroid
description: The number of images to return via the centroid strategy
type: Integer
- name: data_manifest_path
description: The previous component manifest path
type: String
description: A component that retrieves similar images from the LAION dataset.
inputs:
- name: extra_args
description: Additional arguments passed to the component, as a json dict string
type: String

- name: metadata
description: Metadata arguments, passed as a json dict string
type: String

- name: input_manifest
description: Path to the input manifest
type: String

outputs:
- name: data_manifest_path_clip_retrieval_component
description: Path to the local file containing the gcs path where the output has been stored
- name: parquet_path_clip_centroid_retrieval
description: The path to the parquet file containing the urls from centroid retrieval
- name: parquet_path_clip_knn_retrieval
description: The path to the parquet file containing the urls from knn retrieval
- name: output_manifest
description: Path to the output manifest

implementation:
container:
image: europe-west1-docker.pkg.dev/storied-landing-366912/storied-landing-366912-default-repository/mlpipelines/kubeflow/components/clip_retrieval_component:latest
command: [
python3, main.py,
--run-id, { inputValue: run_id },
--artifact-bucket, { inputValue: artifact_bucket },
--component-name, { inputValue: component_name },
--project-id, { inputValue: project_id, },
--laion-index-url, { inputValue: laion_index_url },
--laion-metadata-url, { inputValue: laion_metadata_url },
--nb-images-knn, { inputValue: nb_images_knn },
--nb-images-centroid, { inputValue: nb_images_centroid },
--data-manifest-path, { inputPath: data_manifest_path },
--data-manifest-path-clip-retrieval-component, { outputPath: data_manifest_path_clip_retrieval_component },
--parquet-path-clip-centroid-retrieval, { outputPath: parquet_path_clip_centroid_retrieval },
--parquet-path-clip-knn-retrieval, { outputPath: parquet_path_clip_knn_retrieval },
]
container:
image: ghcr.io/ml6team/clip_retrieval_component:latest
command: [
python3, main.py,
--input-manifest, {inputPath: input_manifest},
--metadata, {inputValue: metadata},
--extra-args, {inputValue: extra_args},
--output-manifest, {outputPath: output_manifest},
]
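
For illustration, a KFP v1 pipeline could load and wire this spec roughly as follows; the file path, the argument values, and the upstream `previous_task` are assumptions for the example:

```python
import json

from kfp import components

# Load the component spec (path is assumed for the example).
clip_retrieval_op = components.load_component_from_file(
    "components/clip_retrieval_component/component.yaml"
)

# Inside a pipeline function: extra_args and metadata are passed as JSON
# dict strings, as the spec describes; input_manifest comes from an
# upstream component's output_manifest.
clip_retrieval_task = clip_retrieval_op(
    input_manifest=previous_task.outputs["output_manifest"],
    metadata=json.dumps({"run_id": "test-run"}),
    extra_args=json.dumps({"nb_images_knn": 500, "nb_images_centroid": 1000}),
)
```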
requirements.txt
@@ -1,5 +1,7 @@
clip-retrieval==2.34.2
numpy>=1.19.5,<2
pandas==1.3.5
tqdm==4.64.1
git+https://github.com/ml6team/express.git@3cc6ed0c2c1d21777ab32d21f3d96f0c58e36090
datasets==2.11.0
numpy==1.24.2
clip-retrieval==2.36.1
tqdm==4.65.0
Pillow==9.3.0