Add Fondant Dataset and Component class (#33)
This PR adds the `FondantDataset` wrapper class around the `Manifest`,
which loads data as Dask dataframes and allows uploading Dask dataframes
back to the cloud.

To test everything, the PR also includes a pipeline called "simple
pipeline" consisting of 3 components: loading from the hub, image
filtering and embedding. Each component subclasses the
`FondantComponent` class, which uses the `FondantDataset` class behind
the scenes.
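The subclassing pattern can be sketched as follows; note that the base class here is a self-contained, hypothetical stand-in for the real `FondantComponent` (which lives in `express.dataset` and handles manifest loading and uploading via `FondantDataset`), and the toy list-of-dicts dataset stands in for a Dask dataframe:

```python
# Hypothetical, self-contained sketch of the subclassing pattern.
# The real FondantComponent handles manifest I/O via FondantDataset;
# here run() just forwards data to process().

class FondantComponent:
    """Stand-in base class: run() loads data, calls process(), uploads."""

    def run(self, dataset, args):
        # The real class would build `dataset` as a Dask dataframe
        # from the manifest; this sketch passes data straight through.
        return self.process(dataset, args)


class MyFilterComponent(FondantComponent):
    """Example subclass: only process() needs to be implemented."""

    def process(self, dataset, args):
        # Component-specific transformation goes here.
        return [row for row in dataset if row["width"] > args["min_width"]]


component = MyFilterComponent()
result = component.run([{"width": 100}, {"width": 300}], {"min_width": 200})
print(result)
# [{'width': 300}]
```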

To be discussed:

- [ ] I've added `project_name` to the metadata of the manifest, in
order to know the name of the cloud project. This is needed to
load/upload data using `fsspec`.

To do:

- [ ] for now I'm manually adding the `"gcs://"` prefix and `".parquet"`
suffix when loading data to and from the cloud; this needs to be
addressed (we need a cleaner way that is not hardcoded)
- [ ] only the first 2 components are implemented; the embedding
component is still to do
- [ ] for the moment I'm still manually creating the Kubeflow component
yaml file for each component. This should be updated to automatically
generate it from the Fondant spec using the
[write_kubeflow_specification](https://github.com/ml6team/express/blob/db5807ae868fe36091d8d7f0061450312ab7477b/express/component_spec.py#L207)
method
- [ ] nicer way of creating and passing metadata. The only metadata that
is different per component is the `component_id`. Ideally getting rid of
`args.metadata`
- [ ] enforce usage of data types defined in output subsets when
creating the dataset (currently only the column names are checked)
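For the first item, one non-hardcoded direction could be a small helper that derives the remote path from configuration; the function and parameter names below are illustrative, not part of this PR:

```python
# Illustrative sketch: derive remote parquet paths from configuration
# instead of hardcoding the "gcs://" prefix and ".parquet" suffix.
# All names here are hypothetical.

def remote_path(bucket: str, key: str,
                scheme: str = "gcs", ext: str = "parquet") -> str:
    """Build a cloud storage path such as gcs://my-bucket/subset.parquet."""
    return f"{scheme}://{bucket}/{key}.{ext}"


print(remote_path("my-project-bucket", "images/data"))
# gcs://my-project-bucket/images/data.parquet
```

Keeping the scheme and extension as parameters would also make it easier to support other filesystems through `fsspec` later.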

---------

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Co-authored-by: Philippe Moussalli <philippe.moussalli95@gmail.com>
3 people authored Apr 24, 2023
1 parent db5807a commit 8245d06
Showing 29 changed files with 788 additions and 422 deletions.
2 changes: 1 addition & 1 deletion examples/pipelines/config.py
Original file line number Diff line number Diff line change
@@ -30,4 +30,4 @@ class KubeflowConfig(GeneralConfig):
ARTIFACT_BUCKET = f"{GeneralConfig.GCP_PROJECT_ID}-kfp-output"
CLUSTER_NAME = "kfp-express"
CLUSTER_ZONE = "europe-west4-a"
HOST = "https://472c61c751ab9be9-dot-europe-west1.pipelines.googleusercontent.com"
HOST = "https://52074149b1563463-dot-europe-west1.pipelines.googleusercontent.com/"
57 changes: 57 additions & 0 deletions examples/pipelines/simple_pipeline/build_images.sh
@@ -0,0 +1,57 @@
#!/bin/bash

function usage {
echo "Usage: $0 [options]"
echo "Options:"
echo "  -c, --component <value>  Set the component name. Pass the component folder name to build a specific component, or 'all' to build all components in the current directory (required)"
echo " -n, --namespace <value> Set the namespace (default: ml6team)"
echo " -r, --repo <value> Set the repo (default: express)"
echo " -t, --tag <value> Set the tag (default: latest)"
echo " -h, --help Display this help message"
}

# Parse the arguments
while [[ "$#" -gt 0 ]]; do case $1 in
-n|--namespace) namespace="$2"; shift;;
-r|--repo) repo="$2"; shift;;
-t|--tag) tag="$2"; shift;;
-c|--component) component="$2"; shift;;
-h|--help) usage; exit;;
*) echo "Unknown parameter passed: $1"; exit 1;;
esac; shift; done

# Check for required argument
if [ -z "${component}" ]; then
echo "Error: component parameter is required"
usage
exit 1
fi

# Set default values for optional arguments if not passed
[ -n "${namespace-}" ] || namespace="ml6team"
[ -n "${repo-}" ] || repo="express"
[ -n "${tag-}" ] || tag="latest"

# Get the component directory
component_dir=$(pwd)/"components"

# Loop through all subdirectories
for dir in "$component_dir"/*/; do
  cd "$dir"
  BASENAME=${dir%/}
  BASENAME=${BASENAME##*/}
  # Build all images or one image depending on the passed argument
  if [[ "$BASENAME" == "${component}" ]] || [[ "${component}" == "all" ]]; then
    full_image_name=ghcr.io/${namespace}/${BASENAME}:${tag}
    echo "$full_image_name"
    docker build -t "$full_image_name" \
      --build-arg COMMIT_SHA="$(git rev-parse HEAD)" \
      --build-arg GIT_BRANCH="$(git rev-parse --abbrev-ref HEAD)" \
      --build-arg BUILD_TIMESTAMP="$(date '+%F_%H:%M:%S')" \
      --label org.opencontainers.image.source=https://github.com/${namespace}/${repo} \
      --platform=linux/arm64 \
      .
    docker push "$full_image_name"
  fi
  cd "$component_dir"
done
@@ -0,0 +1,23 @@
name: Embedding
description: Component that embeds images using CLIP
image: embedding:latest

input_subsets:
images:
fields:
data:
type: binary

output_subsets:
embeddings:
fields:
data:
type: float

args:
model_id:
description: Model id on the Hugging Face hub
type: str
batch_size:
description: Batch size to use when embedding
type: int
@@ -0,0 +1,29 @@
FROM --platform=linux/amd64 python:3.8-slim

## System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git curl -y

# Downloading gcloud package
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz

# Installing the package
RUN mkdir -p /usr/local/gcloud \
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
&& /usr/local/gcloud/google-cloud-sdk/install.sh

# Adding the package path to local
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

# install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy over src-files of the component
COPY src /src

# Set the working directory to the source folder
WORKDIR /src

ENTRYPOINT ["python", "main.py"]
@@ -0,0 +1,35 @@
name: image_filtering
description: A component that filters images
inputs:
- name: input_manifest_path
description: Path to the input manifest
type: String

- name: min_width
description: Desired minimum width
type: Integer

- name: min_height
description: Desired minimum height
type: Integer

- name: metadata
description: Metadata arguments, passed as a json dict string
type: String


outputs:
- name: output_manifest_path
description: Path to the output manifest

implementation:
container:
image: ghcr.io/ml6team/image_filtering:latest
command: [
python3, main.py,
--input_manifest_path, {inputPath: input_manifest_path},
--min_width, {inputValue: min_width},
--min_height, {inputValue: min_height},
--metadata, {inputValue: metadata},
--output_manifest_path, {outputPath: output_manifest_path},
]
@@ -0,0 +1 @@
git+https://github.com/ml6team/express.git@b1855308ca9251da5ddd8e6b88c34bc1c082a71b#egg=express
@@ -0,0 +1,27 @@
name: Image filtering
description: Component that filters images based on desired minimum width and height
image: image_filtering:latest

input_subsets:
images:
fields:
width:
type: int16
height:
type: int16

output_subsets:
images:
fields:
width:
type: int16
height:
type: int16

args:
min_width:
description: Desired minimum width
type: int
min_height:
description: Desired minimum height
type: int
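The `args` section of such a spec maps naturally onto type-checked argument parsing; the following is an illustrative sketch of casting string CLI values with the declared types, not how the `express` package actually does it:

```python
# Illustrative sketch: cast raw string CLI values using the `args`
# section of a component spec. The dict literals and helper name are
# hypothetical stand-ins for a parsed YAML spec.

SPEC_ARGS = {
    "min_width": {"description": "Desired minimum width", "type": "int"},
    "min_height": {"description": "Desired minimum height", "type": "int"},
}

PYTHON_TYPES = {"int": int, "float": float, "str": str}


def cast_args(raw: dict) -> dict:
    """Cast raw string values to the types declared in the spec."""
    return {
        name: PYTHON_TYPES[SPEC_ARGS[name]["type"]](value)
        for name, value in raw.items()
    }


print(cast_args({"min_width": "200", "min_height": "200"}))
# {'min_width': 200, 'min_height': 200}
```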
@@ -0,0 +1,38 @@
"""
This component filters images of the dataset based on image size (minimum height and width).
"""
import logging
from typing import Dict

import dask.dataframe as dd

from express.dataset import FondantComponent
from express.logger import configure_logging

configure_logging()
logger = logging.getLogger(__name__)


class ImageFilterComponent(FondantComponent):
    """
    Component that filters images based on height and width.
    """

    def process(self, dataset: dd.DataFrame, args: Dict) -> dd.DataFrame:
        """
        Args:
            dataset: input Dask dataframe
            args: arguments parsed from the component spec
        Returns:
            the filtered Dask dataframe
        """
        logger.info("Filtering dataset...")
        min_width, min_height = args["min_width"], args["min_height"]
        # Dask dataframes support pandas-style boolean-mask filtering;
        # dd.DataFrame has no row-wise `.filter(lambda ...)` method.
        filtered_dataset = dataset[
            (dataset["images_width"] > min_width)
            & (dataset["images_height"] > min_height)
        ]
        return filtered_dataset


if __name__ == "__main__":
    component = ImageFilterComponent()
    component.run()
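The intended size filtering is a pandas-style boolean mask, which Dask dataframes also support, so the logic can be checked quickly on a plain pandas frame (column names follow the `subset_field` convention; the widths and heights are made-up sample values):

```python
import pandas as pd

# Quick check of the size-filtering logic on a plain pandas frame;
# Dask dataframes accept the same boolean-mask indexing.
df = pd.DataFrame({
    "images_width": [100, 640, 1024],
    "images_height": [100, 480, 768],
})

min_width, min_height = 200, 200
filtered = df[(df["images_width"] > min_width)
              & (df["images_height"] > min_height)]
print(filtered["images_width"].tolist())
# [640, 1024]
```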
@@ -0,0 +1,29 @@
FROM --platform=linux/amd64 python:3.8-slim

## System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git curl -y

# Downloading gcloud package
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz

# Installing the package
RUN mkdir -p /usr/local/gcloud \
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
&& /usr/local/gcloud/google-cloud-sdk/install.sh

# Adding the package path to local
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

# install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy over src-files of the component
COPY src /src

# Set the working directory to the source folder
WORKDIR /src

ENTRYPOINT ["python", "main.py"]
Empty file.
@@ -0,0 +1,30 @@
name: load_from_hub
description: A component that takes a dataset name from the 🤗 hub as input and uploads it to a GCS bucket.
inputs:
- name: dataset_name
description: Name of dataset on the hub
type: String

- name: batch_size
description: Batch size to use to create image metadata
type: Integer

- name: metadata
description: Metadata arguments, passed as a json dict string
type: String


outputs:
- name: output_manifest_path
description: Path to the output manifest

implementation:
container:
image: ghcr.io/ml6team/load_from_hub:latest
command: [
python3, main.py,
--dataset_name, {inputValue: dataset_name},
--batch_size, {inputValue: batch_size},
--metadata, {inputValue: metadata},
--output_manifest_path, {outputPath: output_manifest_path},
]
@@ -0,0 +1,4 @@
datasets==2.11.0
git+https://github.com/ml6team/express.git@8ecfb9fcaf0b8d457626179fe44347df829b8979#egg=express
Pillow==9.4.0
gcsfs==2023.4.0
@@ -0,0 +1,31 @@
name: Load from hub
description: Component that loads a dataset from the hub
image: load_from_hub:latest

input_subsets:
images:
fields:
data:
type: binary

output_subsets:
images:
fields:
data:
type: binary
width:
type: int16
height:
type: int16
captions:
fields:
data:
type: utf8

args:
dataset_name:
description: Name of dataset on the hub
type: str
batch_size:
description: Batch size to use to create image metadata
type: int