Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add Fondant Dataset and Component class (#33)
This PR adds the `FondantDataset` wrapper class around the `Manifest`. It loads data as Dask dataframes and allows uploading Dask dataframes back to the cloud.

To test everything, the PR also includes a pipeline called "simple pipeline" with 3 components: loading from hub, image filtering and embedding. Each component needs to subclass the `FondantComponent` class, which uses the `FondantDataset` class behind the scenes.

To be discussed:
- [ ] I've added `project_name` to the metadata of the manifest, in order to know the name of the cloud project. This is needed to load/upload data using `fsspec`.

To do:
- [ ] For now I'm manually adding the `"gcs://"` prefix and `".parquet"` suffix when loading data from and uploading data to the cloud. This needs to be addressed (we need a cleaner way that is not hardcoded).
- [ ] Only the first 2 components are implemented; the embedding component is still to do.
- [ ] For the moment I'm still manually creating the KubeFlow component yaml file for each component. This should be updated to create it automatically from the Fondant spec using the [write_kubeflow_specification](https://github.com/ml6team/express/blob/db5807ae868fe36091d8d7f0061450312ab7477b/express/component_spec.py#L207) method.
- [ ] Find a nicer way of creating and passing metadata. The only metadata that differs per component is the `component_id`. Ideally we get rid of `args.metadata`.
- [ ] Enforce usage of the data types defined in the output subsets when creating the dataset (currently only the column names are checked).

---------

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Co-authored-by: Philippe Moussalli <philippe.moussalli95@gmail.com>
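One of the to-do items is to stop hardcoding the `"gcs://"` prefix and `".parquet"` suffix. A minimal sketch of one possible direction follows. `build_subset_uri` and its parameters are hypothetical names that do not appear in the PR, and the bucket layout in the usage line is assumed purely for illustration.

```python
from posixpath import join as url_join


def build_subset_uri(base_path: str, subset: str, protocol: str = "gcs",
                     extension: str = "parquet") -> str:
    """Build a remote URI for one subset (hypothetical helper).

    The protocol and file extension are parameters, so they are no longer
    string literals spread over the loading and uploading code.
    """
    # Drop a protocol the caller already included, so it is never doubled.
    if "://" in base_path:
        base_path = base_path.split("://", 1)[1]
    location = url_join(base_path.strip("/"), f"{subset}.{extension}")
    return f"{protocol}://{location}"


# Assumed bucket and run folder, for illustration only.
print(build_subset_uri("my-bucket/run-1", "images"))
```

Keeping the protocol as a parameter would also make it easier to point the same component at a local path or another cloud provider, since `fsspec` selects the filesystem from the protocol part of the URI.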
1 parent db5807a, commit 8245d06
Showing 29 changed files with 788 additions and 422 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
#!/bin/bash | ||
|
||
function usage { | ||
  echo "Usage: $0 [options]" | ||
  echo "Options:" | ||
  echo "  -c, --component <value>  Set the component name. Pass the component folder name to build a specific component, or 'all' to build all components in the 'components' directory (required)" | ||
  echo "  -n, --namespace <value>  Set the namespace (default: ml6team)" | ||
  echo "  -r, --repo <value>       Set the repo (default: express)" | ||
  echo "  -t, --tag <value>        Set the tag (default: latest)" | ||
  echo "  -h, --help               Display this help message" | ||
} | ||
|
||
# Parse the arguments | ||
while [[ "$#" -gt 0 ]]; do case $1 in | ||
-n|--namespace) namespace="$2"; shift;; | ||
-r|--repo) repo="$2"; shift;; | ||
-t|--tag) tag="$2"; shift;; | ||
-c|--component) component="$2"; shift;; | ||
-h|--help) usage; exit;; | ||
*) echo "Unknown parameter passed: $1"; exit 1;; | ||
esac; shift; done | ||
|
||
# Check for required argument | ||
if [ -z "${component}" ]; then | ||
echo "Error: component parameter is required" | ||
usage | ||
exit 1 | ||
fi | ||
|
||
# Set default values for optional arguments if not passed | ||
[ -n "${namespace-}" ] || namespace="ml6team" | ||
[ -n "${repo-}" ] || repo="express" | ||
[ -n "${tag-}" ] || tag="latest" | ||
|
||
# Get the component directory | ||
component_dir=$(pwd)/"components" | ||
|
||
# Loop through all subdirectories | ||
for dir in "$component_dir"/*/; do | ||
  cd "$dir" | ||
  BASENAME=${dir%/} | ||
  BASENAME=${BASENAME##*/} | ||
  # Build all images or one image depending on the passed argument | ||
  if [[ "$BASENAME" == "${component}" ]] || [[ "${component}" == "all" ]]; then | ||
    full_image_name="ghcr.io/${namespace}/${BASENAME}:${tag}" | ||
    echo "$full_image_name" | ||
    docker build -t "$full_image_name" \ | ||
      --build-arg COMMIT_SHA="$(git rev-parse HEAD)" \ | ||
      --build-arg GIT_BRANCH="$(git rev-parse --abbrev-ref HEAD)" \ | ||
      --build-arg BUILD_TIMESTAMP="$(date '+%F_%H:%M:%S')" \ | ||
      --label org.opencontainers.image.source="https://github.com/${namespace}/${repo}" \ | ||
      --platform=linux/arm64 \ | ||
      . | ||
    docker push "$full_image_name" | ||
  fi | ||
  cd "$component_dir" | ||
done |
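The selection and naming logic of the build script can be sketched in Python as follows. This is only an illustration of what the loop does, not part of the commit: `select_components` and `full_image_name` are hypothetical names, and the Docker build and push steps are left out.

```python
from pathlib import Path
from typing import List


def select_components(components_dir: Path, component: str) -> List[str]:
    """Return the component folder names to build, mirroring the shell loop."""
    # Only subdirectories count as components; plain files are ignored.
    names = sorted(p.name for p in components_dir.iterdir() if p.is_dir())
    if component == "all":
        return names
    return [name for name in names if name == component]


def full_image_name(name: str, namespace: str = "ml6team", tag: str = "latest") -> str:
    """Compose the image reference that the script builds and pushes."""
    return f"ghcr.io/{namespace}/{name}:{tag}"


print(full_image_name("image_filtering"))
```

Note that, as in the shell script, an unknown component name simply matches nothing, so no image is built and no error is raised.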
23 changes: 23 additions & 0 deletions
examples/pipelines/simple_pipeline/components/embedding/src/fondant_component.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
name: Embedding | ||
description: Component that embeds images using CLIP | ||
image: embedding:latest | ||
|
||
input_subsets: | ||
  images: | ||
    fields: | ||
      data: | ||
        type: binary | ||
|
||
output_subsets: | ||
  embeddings: | ||
    fields: | ||
      data: | ||
        type: float | ||
|
||
args: | ||
  model_id: | ||
    description: Model id on the Hugging Face hub | ||
    type: str | ||
  batch_size: | ||
    description: Batch size to use when embedding | ||
    type: int |
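The `args` section of a component spec maps naturally onto command-line flags. The sketch below shows one way this could be done with `argparse`, assuming the spec has already been loaded into a dict. `build_parser`, `TYPE_MAP` and the model id `some/clip-model` are hypothetical and used only for illustration; the commit does not show how the library actually parses these arguments.

```python
import argparse
from typing import Any, Dict

# Assumed mapping from spec type names to Python callables.
TYPE_MAP = {"str": str, "int": int, "float": float}


def build_parser(spec_args: Dict[str, Dict[str, Any]]) -> argparse.ArgumentParser:
    """Create one required CLI flag per argument declared in the spec."""
    parser = argparse.ArgumentParser()
    for name, info in spec_args.items():
        parser.add_argument(
            f"--{name}",
            type=TYPE_MAP[info["type"]],
            help=info.get("description"),
            required=True,
        )
    return parser


# Mirrors the `args` section of the embedding component spec.
embedding_args = {
    "model_id": {"description": "Model id on the Hugging Face hub", "type": "str"},
    "batch_size": {"description": "Batch size to use when embedding", "type": "int"},
}

# "some/clip-model" is a placeholder, not a real model id.
parsed = build_parser(embedding_args).parse_args(
    ["--model_id", "some/clip-model", "--batch_size", "8"]
)
print(parsed.model_id, parsed.batch_size)
```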
29 changes: 29 additions & 0 deletions
examples/pipelines/simple_pipeline/components/image_filtering/Dockerfile
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
FROM --platform=linux/amd64 python:3.8-slim | ||
|
||
## System dependencies | ||
RUN apt-get update && \ | ||
apt-get upgrade -y && \ | ||
apt-get install git curl -y | ||
|
||
# Downloading gcloud package | ||
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz | ||
|
||
# Installing the package | ||
RUN mkdir -p /usr/local/gcloud \ | ||
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \ | ||
&& /usr/local/gcloud/google-cloud-sdk/install.sh | ||
|
||
# Adding the package path to local | ||
ENV PATH="$PATH:/usr/local/gcloud/google-cloud-sdk/bin" | ||
|
||
# install requirements | ||
COPY requirements.txt / | ||
RUN pip3 install --no-cache-dir -r requirements.txt | ||
|
||
# Copy over src-files of the component | ||
COPY src /src | ||
|
||
# Set the working directory to the source folder | ||
WORKDIR /src | ||
|
||
ENTRYPOINT ["python", "main.py"] |
35 changes: 35 additions & 0 deletions
examples/pipelines/simple_pipeline/components/image_filtering/kubeflow_component.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
name: image_filtering | ||
description: A component that filters images | ||
inputs: | ||
  - name: input_manifest_path | ||
    description: Path to the input manifest | ||
    type: String | ||
|
||
  - name: min_width | ||
    description: Desired minimum width | ||
    type: Integer | ||
|
||
  - name: min_height | ||
    description: Desired minimum height | ||
    type: Integer | ||
|
||
  - name: metadata | ||
    description: Metadata arguments, passed as a json dict string | ||
    type: String | ||
|
||
|
||
outputs: | ||
  - name: output_manifest_path | ||
    description: Path to the output manifest | ||
|
||
implementation: | ||
  container: | ||
    image: ghcr.io/ml6team/image_filtering:latest | ||
    command: [ | ||
      python3, main.py, | ||
      --input_manifest_path, {inputPath: input_manifest_path}, | ||
      --min_width, {inputValue: min_width}, | ||
      --min_height, {inputValue: min_height}, | ||
      --metadata, {inputValue: metadata}, | ||
      --output_manifest_path, {outputPath: output_manifest_path}, | ||
    ] |
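The commit message notes that this Kubeflow file is still written by hand and should eventually be generated from the Fondant spec. The sketch below illustrates the kind of transformation involved. It is not the `write_kubeflow_specification` method referenced in the commit message: `to_kubeflow_spec` and the `KUBEFLOW_TYPES` mapping are hypothetical, and the fixed manifest and metadata inputs are copied from the hand-written file above.

```python
from typing import Any, Dict, List

# Assumed mapping from Fondant argument types to Kubeflow input types.
KUBEFLOW_TYPES = {"str": "String", "int": "Integer", "float": "Float"}


def to_kubeflow_spec(name: str, description: str, image: str,
                     args: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Derive a Kubeflow-style component dict from Fondant-style fields."""
    inputs: List[Dict[str, str]] = [{
        "name": "input_manifest_path",
        "description": "Path to the input manifest",
        "type": "String",
    }]
    command: List[Any] = [
        "python3", "main.py",
        "--input_manifest_path", {"inputPath": "input_manifest_path"},
    ]
    # Every component-specific argument becomes an input plus a CLI flag.
    for arg_name, info in args.items():
        inputs.append({
            "name": arg_name,
            "description": info["description"],
            "type": KUBEFLOW_TYPES[info["type"]],
        })
        command += [f"--{arg_name}", {"inputValue": arg_name}]
    inputs.append({
        "name": "metadata",
        "description": "Metadata arguments, passed as a json dict string",
        "type": "String",
    })
    command += [
        "--metadata", {"inputValue": "metadata"},
        "--output_manifest_path", {"outputPath": "output_manifest_path"},
    ]
    return {
        "name": name,
        "description": description,
        "inputs": inputs,
        "outputs": [{
            "name": "output_manifest_path",
            "description": "Path to the output manifest",
        }],
        "implementation": {"container": {"image": image, "command": command}},
    }


spec = to_kubeflow_spec(
    name="image_filtering",
    description="A component that filters images",
    image="ghcr.io/ml6team/image_filtering:latest",
    args={
        "min_width": {"description": "Desired minimum width", "type": "int"},
        "min_height": {"description": "Desired minimum height", "type": "int"},
    },
)
print([item["name"] for item in spec["inputs"]])
```

Dumping the returned dict with a YAML library would give a file of the same shape as the hand-written one, which keeps the two specifications from drifting apart.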
1 change: 1 addition & 0 deletions
examples/pipelines/simple_pipeline/components/image_filtering/requirements.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
git+https://github.com/ml6team/express.git@b1855308ca9251da5ddd8e6b88c34bc1c082a71b#egg=express |
27 changes: 27 additions & 0 deletions
examples/pipelines/simple_pipeline/components/image_filtering/src/fondant_component.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
name: Image filtering | ||
description: Component that filters images based on desired minimum width and height | ||
image: image_filtering:latest | ||
|
||
input_subsets: | ||
  images: | ||
    fields: | ||
      width: | ||
        type: int16 | ||
      height: | ||
        type: int16 | ||
|
||
output_subsets: | ||
  images: | ||
    fields: | ||
      width: | ||
        type: int16 | ||
      height: | ||
        type: int16 | ||
|
||
args: | ||
  min_width: | ||
    description: Desired minimum width | ||
    type: int | ||
  min_height: | ||
    description: Desired minimum height | ||
    type: int |
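The filtering component reads columns named `images_width` and `images_height`, which suggests that each subset field is exposed as a `<subset>_<field>` column. Under that assumption, the column-name check mentioned in the commit message could look roughly like the sketch below. `expected_columns` and `missing_columns` are hypothetical helpers, not functions from the library.

```python
from typing import Dict, Iterable, List

# Mirrors the `input_subsets` section of the image filtering spec.
image_filtering_subsets = {"images": {"width": "int16", "height": "int16"}}


def expected_columns(subsets: Dict[str, Dict[str, str]]) -> List[str]:
    """List the dataframe columns implied by the subsets, as <subset>_<field>."""
    return [
        f"{subset}_{field}"
        for subset, fields in subsets.items()
        for field in fields
    ]


def missing_columns(columns: Iterable[str],
                    subsets: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the expected columns that are absent from the dataframe."""
    present = set(columns)
    return [col for col in expected_columns(subsets) if col not in present]


print(expected_columns(image_filtering_subsets))
print(missing_columns(["images_width", "images_data"], image_filtering_subsets))
```

Extending this to the open to-do item (enforcing data types) would mean comparing each column's dtype against the declared `type` as well, which this sketch deliberately leaves out.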
38 changes: 38 additions & 0 deletions
examples/pipelines/simple_pipeline/components/image_filtering/src/main.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
""" | ||
This component filters images of the dataset based on image size (minimum height and width). | ||
""" | ||
import logging | ||
from typing import Dict | ||
|
||
import dask.dataframe as dd | ||
|
||
from express.dataset import FondantComponent | ||
from express.logger import configure_logging | ||
|
||
configure_logging() | ||
logger = logging.getLogger(__name__) | ||
|
||
|
||
class ImageFilterComponent(FondantComponent): | ||
    """ | ||
    Component that filters images based on height and width. | ||
    """ | ||
    def process(self, dataset: dd.DataFrame, args: Dict) -> dd.DataFrame: | ||
        """ | ||
        Args: | ||
            dataset: Dask dataframe with `images_width` and `images_height` columns | ||
            args: parsed component arguments, providing `min_width` and `min_height` | ||
        Returns: | ||
            the dataframe restricted to rows exceeding both minimum dimensions | ||
        """ | ||
        logger.info("Filtering dataset...") | ||
        min_width, min_height = args.min_width, args.min_height | ||
        # Dask dataframes are filtered with a boolean mask, not a row-wise lambda | ||
        filtered_dataset = dataset[ | ||
            (dataset["images_width"] > min_width) & (dataset["images_height"] > min_height) | ||
        ] | ||
|
||
        return filtered_dataset | ||
|
||
|
||
if __name__ == "__main__": | ||
    component = ImageFilterComponent() | ||
    component.run() |
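Row filtering on a Dask dataframe is normally expressed as a boolean mask over columns, the same way as in pandas. The sketch below demonstrates the size filter on a small, made-up table. It uses pandas as a stand-in so it runs without a Dask cluster; the mask expression itself is the same for a `dask.dataframe.DataFrame`.

```python
import pandas as pd

# Made-up image sizes, for illustration only.
df = pd.DataFrame({
    "images_width": [64, 512, 1024],
    "images_height": [64, 128, 768],
})

min_width, min_height = 256, 256

# Combine the per-column conditions with `&`; Python's `and` does not work
# on whole columns. Each condition needs its own parentheses.
mask = (df["images_width"] > min_width) & (df["images_height"] > min_height)
filtered = df[mask]

print(filtered)
```

Only the 1024x768 row passes both conditions here; the 512x128 row is wide enough but not tall enough.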
29 changes: 29 additions & 0 deletions
examples/pipelines/simple_pipeline/components/load_from_hub/Dockerfile
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
FROM --platform=linux/amd64 python:3.8-slim | ||
|
||
## System dependencies | ||
RUN apt-get update && \ | ||
apt-get upgrade -y && \ | ||
apt-get install git curl -y | ||
|
||
# Downloading gcloud package | ||
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz | ||
|
||
# Installing the package | ||
RUN mkdir -p /usr/local/gcloud \ | ||
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \ | ||
&& /usr/local/gcloud/google-cloud-sdk/install.sh | ||
|
||
# Adding the package path to local | ||
ENV PATH="$PATH:/usr/local/gcloud/google-cloud-sdk/bin" | ||
|
||
# install requirements | ||
COPY requirements.txt / | ||
RUN pip3 install --no-cache-dir -r requirements.txt | ||
|
||
# Copy over src-files of the component | ||
COPY src /src | ||
|
||
# Set the working directory to the source folder | ||
WORKDIR /src | ||
|
||
ENTRYPOINT ["python", "main.py"] |
Empty file.
30 changes: 30 additions & 0 deletions
examples/pipelines/simple_pipeline/components/load_from_hub/kubeflow_component.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
name: load_from_hub | ||
description: A component that takes a dataset name from the 🤗 hub as input and uploads it to a GCS bucket. | ||
inputs: | ||
  - name: dataset_name | ||
    description: Name of dataset on the hub | ||
    type: String | ||
|
||
  - name: batch_size | ||
    description: Batch size to use to create image metadata | ||
    type: Integer | ||
|
||
  - name: metadata | ||
    description: Metadata arguments, passed as a json dict string | ||
    type: String | ||
|
||
|
||
outputs: | ||
  - name: output_manifest_path | ||
    description: Path to the output manifest | ||
|
||
implementation: | ||
  container: | ||
    image: ghcr.io/ml6team/load_from_hub:latest | ||
    command: [ | ||
      python3, main.py, | ||
      --dataset_name, {inputValue: dataset_name}, | ||
      --batch_size, {inputValue: batch_size}, | ||
      --metadata, {inputValue: metadata}, | ||
      --output_manifest_path, {outputPath: output_manifest_path}, | ||
    ] |
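The `metadata` input is passed as a JSON dict string on the command line. A minimal sketch of how such a flag could be decoded with the standard library is shown below. `parse_metadata` is a hypothetical helper, the other flags are omitted, and the values in the usage line are invented; only the key names `component_id` and `project_name` come from the commit message.

```python
import argparse
import json
from typing import Any, Dict, List

parser = argparse.ArgumentParser()
# `type=json.loads` turns the JSON string into a dict while parsing.
parser.add_argument(
    "--metadata",
    type=json.loads,
    required=True,
    help="Metadata arguments, passed as a json dict string",
)


def parse_metadata(argv: List[str]) -> Dict[str, Any]:
    """Parse the command line and return the decoded metadata dict."""
    return parser.parse_args(argv).metadata


# Invented values, for illustration only.
metadata = parse_metadata(
    ["--metadata", '{"component_id": "load_from_hub", "project_name": "my-project"}']
)
print(metadata["component_id"])
```

With this approach, malformed JSON is reported by `argparse` as a usage error for `--metadata` instead of surfacing later inside the component.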
4 changes: 4 additions & 0 deletions
examples/pipelines/simple_pipeline/components/load_from_hub/requirements.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
datasets==2.11.0 | ||
git+https://github.com/ml6team/express.git@8ecfb9fcaf0b8d457626179fe44347df829b8979#egg=express | ||
Pillow==9.4.0 | ||
gcsfs==2023.4.0 |
31 changes: 31 additions & 0 deletions
examples/pipelines/simple_pipeline/components/load_from_hub/src/fondant_component.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
name: Load from hub | ||
description: Component that loads a dataset from the hub | ||
image: load_from_hub:latest | ||
|
||
input_subsets: | ||
  images: | ||
    fields: | ||
      data: | ||
        type: binary | ||
|
||
output_subsets: | ||
  images: | ||
    fields: | ||
      data: | ||
        type: binary | ||
      width: | ||
        type: int16 | ||
      height: | ||
        type: int16 | ||
  captions: | ||
    fields: | ||
      data: | ||
        type: utf8 | ||
|
||
args: | ||
  dataset_name: | ||
    description: Name of dataset on the hub | ||
    type: str | ||
  batch_size: | ||
    description: Batch size to use to create image metadata | ||
    type: int |