Commit 37c33f9

Merge pull request #20 from ml6team/refactor_stable_diffusion_backup

Refactor stable diffusion pipeline [part 1]

NielsRogge authored Apr 4, 2023
2 parents f39b475 + 0b3b851 commit 37c33f9
Showing 46 changed files with 813 additions and 1,377 deletions.
20 changes: 18 additions & 2 deletions .pre-commit-config.yaml
@@ -6,7 +6,15 @@ repos:
     rev: 'v0.0.254'
     hooks:
       - id: ruff
-        files: "^express/"
+        files: |
+          (?x)^(
+          express/.*|
+          examples/pipelines/hf_dataset_pipeline|
+          examples/pipelines/finetune_stable_diffusion/components/load_from_hub_component|
+          examples/pipelines/finetune_stable_diffusion/components/image_filter_component|
+          examples/pipelines/finetune_stable_diffusion/components/embedding_component|
+          examples/pipelines/finetune_stable_diffusion/dataset_creation_pipeline.py|
+          )$
         args: [--fix, --exit-non-zero-on-fix]


@@ -24,4 +32,12 @@ repos:
     hooks:
       - id: black
         name: black
-        files: "^express/"
+        files: |
+          (?x)^(
+          express/.*|
+          examples/pipelines/hf_dataset_pipeline|
+          examples/pipelines/finetune_stable_diffusion/components/load_from_hub_component|
+          examples/pipelines/finetune_stable_diffusion/components/image_filter_component|
+          examples/pipelines/finetune_stable_diffusion/components/embedding_component|
+          examples/pipelines/finetune_stable_diffusion/dataset_creation_pipeline.py|
+          )$
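
The new `files:` value is a single verbose regular expression: the `(?x)` flag makes whitespace and newlines insignificant, so each path alternative can sit on its own line. pre-commit matches this pattern against each candidate path with Python's `re.search`, so a quick way to sanity-check it is to reproduce it directly. A minimal sketch with a trimmed-down pattern; the file paths passed to it are made up for illustration:

```python
import re

# Trimmed-down version of the verbose pattern from the `files:` key above.
pattern = re.compile(
    r"""(?x)^(
    express/.*|
    examples/pipelines/hf_dataset_pipeline|
    )$"""
)

# Both paths are hypothetical, purely to exercise the pattern.
print(bool(pattern.search("express/manifest.py")))     # True: matches express/.*
print(bool(pattern.search("tests/test_manifest.py")))  # False: not listed
```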
60 changes: 59 additions & 1 deletion docs/README.md
@@ -99,4 +99,62 @@ After transforming the input data (see below), an **ExpressDatasetDraft** create
### 1.b) Transforms and Loaders
The most common type of component in Express is an **ExpressTransformComponent**, which takes an `ExpressDataset` and an optional dict of arguments as input and returns an `ExpressDatasetDraft` of transformed output data.

However, at the start of a new pipeline, you won't yet have any Express datasets to transform. Instead, an Express pipeline can use an **ExpressLoaderComponent** as its entry point, which takes only the optional dict of arguments to construct an ExpressDatasetDraft. For example, the arguments could specify an external data location and how to interpret it, after which a loader job creates a first `ExpressDataset`.
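
As a rough sketch of how these pieces could fit together, the toy transform below filters a dataset by image width. Only `ExpressDataset`, `ExpressDatasetDraft`, `ExpressTransformComponent`, and `ExpressLoaderComponent` are named in this document; the module paths, the `transform` signature, and the `load`/`extending_dataset` helpers are assumptions made purely for illustration:

```python
from typing import Dict, Optional

# Hypothetical import paths; the actual module layout may differ.
from express.components import ExpressTransformComponent
from express.dataset import ExpressDataset, ExpressDatasetDraft


class FilterSmallImages(ExpressTransformComponent):
    """Toy transform: keep only rows whose images are wide enough."""

    @classmethod
    def transform(
        cls,
        data: ExpressDataset,
        extra_args: Optional[Dict] = None,
    ) -> ExpressDatasetDraft:
        min_width = (extra_args or {}).get("min_width", 512)
        images = data.load("images")  # hypothetical accessor for one data source
        kept = images[images.width >= min_width]
        # A draft that extends the input dataset but narrows its index.
        return ExpressDatasetDraft(index=kept.index, extending_dataset=data)
```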

## **Data Manifest: a common approach to simplify different steps throughout the pipeline**
In order to keep track of the different data sources, we opt for a manifest-centered approach where
a manifest is simply a JSON file that is passed and modified throughout the different steps of the pipeline.

```json
{
  "dataset_id": "<run_id>-<component_name>",
  "index": "<path to the index parquet file>",
  "associated_data": {
    "dataset": {
      "namespace_1": "<path to the dataset (metadata) parquet file of the datasets associated with `namespace_1`>",
      "...": ""
    },
    "caption": {
      "namespace_1": "<path to the caption parquet file associated with `namespace_1`>",
      "...": ""
    },
    "embedding": {
      "namespace_1": "<remote path to the directory containing the embeddings associated with `namespace_1`>",
      "...": ""
    }
  },
  "metadata": {
    "branch": "<the name of the branch associated with the component>",
    "commit_hash": "<the commit of the component>",
    "creation_date": "<the creation date of the manifest>",
    "run_id": "<a unique identifier associated with the kfp pipeline run>"
  }
}
```
A deeper dive into some of the notation:

* **namespace**: the namespace is used to identify the different data sources. For example, you can give
your seed images a specific namespace (e.g. `seed`). Then, the images retrieved with clip-retrieval will
have a different namespace (e.g. `knn`, `centroid`).

* **index**: the index assigns a unique id to each image, with the format `<namespace>_<uid>` (e.g. `seed_00010`).
It indexes all the data sources in `associated_data`.
**Note**: the index keeps track of all namespaces (e.g. [`seed_00010`, `centroid_0001`, ...]); see the small helper after this list.

* **dataset**: a set of parquet files for each namespace that contain relevant metadata
(image size, location, ...) as well as the index.

* **caption**: a set of parquet files for each namespace that contain image captions
as well as the index.

* **metadata**: keeps track of the step that generated the manifest, the code version, and the pipeline run id.
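
Since every id carries its namespace as a prefix, the namespace can be recovered by splitting on the last underscore. A tiny helper, assuming ids always follow the `<namespace>_<uid>` shape shown above:

```python
def split_index_id(index_id: str) -> tuple[str, str]:
    """Split an id such as 'seed_00010' into (namespace, uid)."""
    namespace, _, uid = index_id.rpartition("_")
    return namespace, uid

assert split_index_id("seed_00010") == ("seed", "00010")
assert split_index_id("centroid_0001") == ("centroid", "0001")
```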

The Express pipeline consists of multiple steps, defined as **Express steps**, that are repeated
throughout the pipeline. The manifest pattern offers the flexibility to promote reuse and avoid
duplication of data sources. For example:

* **Data filtering** (e.g. filtering on image size): add new indices to the `index` but retain the associated data (see the sketch after this list).

* **Data creation** (e.g. clip retrieval): add new indices to the new `index` and add another source of data under `associated_data` with a new namespace.

* **Data transformation** (e.g. image formatting): retain indices but replace dataset source in `dataset`.
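
To make the filtering example concrete, here is a minimal sketch of how a component could derive the next manifest from an incoming one: it swaps in a new `index` and `dataset_id` while reusing `associated_data` untouched. The field names come from the manifest structure above; the helper itself, the file paths, and the run id are hypothetical:

```python
import json
from datetime import datetime, timezone


def filtered_manifest(manifest: dict, new_index_path: str,
                      component_name: str, run_id: str) -> dict:
    """Derive the next manifest for a filtering step: new index, same associated data."""
    return {
        "dataset_id": f"{run_id}-{component_name}",
        "index": new_index_path,  # only the index is replaced
        "associated_data": manifest["associated_data"],  # reused, not duplicated
        "metadata": {
            **manifest["metadata"],
            "creation_date": datetime.now(timezone.utc).isoformat(),
            "run_id": run_id,
        },
    }


# Hypothetical usage; "manifest.json" and the paths below are made up.
with open("manifest.json") as f:
    incoming = json.load(f)

outgoing = filtered_manifest(
    incoming,
    new_index_path="gs://my-bucket/run-42/image_filter/index.parquet",
    component_name="image_filter_component",
    run_id="run-42",
)
print(json.dumps(outgoing, indent=2))
```
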
60 changes: 0 additions & 60 deletions examples/pipelines/finetune_stable_diffusion/README.md
@@ -101,63 +101,3 @@ bash build_images.sh

This will build all the components located in the `components` folder. You can also build a specific
component by passing its folder name to the `--build-dir` flag.


(The 60 deleted lines, a `#TODO: move those docs elsewhere` note followed by the full Data Manifest section, are identical to the content added to docs/README.md above.)
@@ -49,6 +49,7 @@ for dir in $component_dir/*/; do
         --build-arg GIT_BRANCH=$(git rev-parse --abbrev-ref HEAD) \
         --build-arg BUILD_TIMESTAMP=$(date '+%F_%H:%M:%S') \
         --label org.opencontainers.image.source=https://github.com/${namespace}/${repo} \
+        --platform=linux/arm64 \
         .
       docker push "$full_image_name"
     fi
