Add Dataset and Component Class #33

Merged
merged 53 commits into from
Apr 24, 2023
Changes from 26 commits
b5d1688
First draft
Apr 19, 2023
c67047b
More improvements
Apr 19, 2023
3f258d5
More improvements
Apr 19, 2023
a5f17bf
More improvements
Apr 19, 2023
573fdda
More improvements
Apr 19, 2023
a773496
Update pipeline
Apr 19, 2023
d7de4ea
More improvements
Apr 19, 2023
9e0b30a
Add comments
Apr 19, 2023
325ead2
Remove spec_path argument
Apr 19, 2023
42613a3
Remove comment
Apr 19, 2023
5f4e103
Automatically add args
Apr 19, 2023
8009d82
Simplify pipeline
Apr 19, 2023
2ca256c
Add add_index
Apr 20, 2023
a151d96
Add print statements
Apr 20, 2023
cd474cc
Add more print statements
Apr 20, 2023
4725981
Fix locations
Apr 20, 2023
63ac065
Fix paths
Apr 20, 2023
5f79090
More improvements
Apr 20, 2023
cb0344b
Update pipeline
Apr 20, 2023
1ce80bf
Fix path
Apr 20, 2023
5720358
Fix path
Apr 20, 2023
932e05e
Fix path
Apr 20, 2023
c9f69b5
Add print statement
Apr 20, 2023
b185530
Add more print statements
Apr 20, 2023
27ac896
Update to dask
Apr 20, 2023
d25ee12
More improvements
Apr 20, 2023
3ac51b0
More improvements
Apr 20, 2023
f654f35
Fix project_name
Apr 20, 2023
2b922b8
Update requirements
Apr 20, 2023
4bb78d7
Address comment
Apr 20, 2023
5675199
More fixes
Apr 20, 2023
0951d5e
More improvements
Apr 20, 2023
595a295
Update requirements
Apr 20, 2023
aecc259
Add more fixes
Apr 20, 2023
b8d908c
Add mapping to pyarrow
Apr 20, 2023
45e4a17
More fixes
Apr 21, 2023
7012700
Rename ExpressComponent to ComponentSpec
Apr 21, 2023
43ec2b1
Remove get_subset
Apr 21, 2023
01b0f22
Use regular class
Apr 21, 2023
12ebf7e
More fixes
Apr 21, 2023
7a6a108
Add typing hints
Apr 21, 2023
8ecfb9f
Update dask version
Apr 21, 2023
bfb5e52
Remove type_to_pyarrow mapping
Apr 21, 2023
b4f8e92
Address comment
Apr 21, 2023
f9c9679
Include custom_artifact in base_path
Apr 21, 2023
b2aec33
add kfp v2 todos
PhilippeMoussalli Apr 24, 2023
ead2250
remove unused gcp storage class
PhilippeMoussalli Apr 24, 2023
c20fa73
remove project name from required params
PhilippeMoussalli Apr 24, 2023
306f8f7
change default host
PhilippeMoussalli Apr 24, 2023
aafd9e3
add name and source to mandatory subset columns
PhilippeMoussalli Apr 24, 2023
9b3f339
change general config
PhilippeMoussalli Apr 24, 2023
6aa98ef
pass expected schema to dask
PhilippeMoussalli Apr 24, 2023
191310f
remove io module tests
PhilippeMoussalli Apr 24, 2023
57 changes: 57 additions & 0 deletions examples/pipelines/simple_pipeline/build_images.sh
@@ -0,0 +1,57 @@
#!/bin/bash

function usage {
  echo "Usage: $0 [options]"
  echo "Options:"
  echo "  -c, --component <value>  Set the component name. Pass the component folder name to build a specific component, or 'all' to build all components under ./components (required)"
  echo "  -n, --namespace <value>  Set the namespace (default: ml6team)"
  echo "  -r, --repo <value>       Set the repo (default: express)"
  echo "  -t, --tag <value>        Set the tag (default: latest)"
  echo "  -h, --help               Display this help message"
}

# Parse the arguments
while [[ "$#" -gt 0 ]]; do case $1 in
  -n|--namespace) namespace="$2"; shift;;
  -r|--repo) repo="$2"; shift;;
  -t|--tag) tag="$2"; shift;;
  -c|--component) component="$2"; shift;;
  -h|--help) usage; exit;;
  *) echo "Unknown parameter passed: $1"; exit 1;;
esac; shift; done

# Check for required argument
if [ -z "${component}" ]; then
  echo "Error: component parameter is required"
  usage
  exit 1
fi

# Set default values for optional arguments if not passed
[ -n "${namespace-}" ] || namespace="ml6team"
[ -n "${repo-}" ] || repo="express"
[ -n "${tag-}" ] || tag="latest"

# Get the component directory
component_dir="$(pwd)/components"

# Loop through all subdirectories
for dir in "$component_dir"/*/; do
  cd "$dir"
  # Strip the trailing slash, then everything up to the last slash
  BASENAME=${dir%/}
  BASENAME=${BASENAME##*/}
  # Build all images or one image depending on the passed argument
  if [[ "$BASENAME" == "${component}" ]] || [[ "${component}" == "all" ]]; then
    full_image_name="ghcr.io/${namespace}/${BASENAME}:${tag}"
    echo "$full_image_name"
    docker build -t "$full_image_name" \
      --build-arg COMMIT_SHA="$(git rev-parse HEAD)" \
      --build-arg GIT_BRANCH="$(git rev-parse --abbrev-ref HEAD)" \
      --build-arg BUILD_TIMESTAMP="$(date '+%F_%H:%M:%S')" \
      --label org.opencontainers.image.source="https://github.com/${namespace}/${repo}" \
      --platform=linux/arm64 \
      .
    docker push "$full_image_name"
  fi
  cd "$component_dir"
done
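The selection and naming logic of the loop above can be sketched in Python. This is an illustration only, not part of the PR: the function name `image_names` and the example directory list are hypothetical, and the sketch builds no images.

```python
from pathlib import Path
from typing import List


def image_names(
    component_dirs: List[str],
    component: str,
    namespace: str = "ml6team",
    tag: str = "latest",
) -> List[str]:
    """Return the image names the script would build for a given selection."""
    names = []
    for directory in component_dirs:
        # Path(...).name drops the trailing slash and the parent path,
        # like ${dir%/} followed by ${BASENAME##*/} in the shell script.
        basename = Path(directory).name
        if component in (basename, "all"):
            names.append(f"ghcr.io/{namespace}/{basename}:{tag}")
    return names


# Hypothetical component folders for illustration
dirs = ["components/image_filtering/", "components/load_from_hub/"]
print(image_names(dirs, "all"))
print(image_names(dirs, "load_from_hub", tag="dev"))
```

Note that the `--repo` option only affects the `org.opencontainers.image.source` label in the script, not the image name, which is why it is absent from this sketch.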
@@ -0,0 +1,23 @@
name: Embedding
description: Component that embeds images using CLIP
image: embedding:latest

input_subsets:

output_subsets:
  images:
    fields:
      data:
        type: binary
  captions:
    fields:
      data:
        type: utf8

args:
  model_id:
    description: Model id on the Hugging Face hub
    type: str
  batch_size:
    description: Batch size to use when embedding
    type: int
@@ -0,0 +1,29 @@
FROM --platform=linux/amd64 python:3.8-slim

## System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git curl -y

# Downloading gcloud package
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz

# Installing the package
RUN mkdir -p /usr/local/gcloud \
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
&& /usr/local/gcloud/google-cloud-sdk/install.sh

# Adding the package path to local
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

# install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy over src-files of the component
COPY src /src

# Set the working directory to the source folder
WORKDIR /src

ENTRYPOINT ["python", "main.py"]
@@ -0,0 +1,35 @@
name: image_filtering
description: A component that filters images
inputs:
  - name: input_manifest_path
    description: Path to the input manifest
    type: String

  - name: min_width
    description: Desired minimum width
    type: Integer

  - name: min_height
    description: Desired minimum height
    type: Integer

  - name: metadata
    description: Metadata arguments, passed as a json dict string
    type: String


outputs:
  - name: output_manifest_path
    description: Path to the output manifest

implementation:
  container:
    image: ghcr.io/ml6team/image_filtering:latest
    command: [
      python3, main.py,
      --input_manifest_path, {inputPath: input_manifest_path},
      --min_width, {inputValue: min_width},
      --min_height, {inputValue: min_height},
      --metadata, {inputValue: metadata},
      --output_manifest_path, {outputPath: output_manifest_path},
    ]
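To make the `command` placeholders above concrete, here is a minimal sketch of how such a list could be turned into an argument vector. In the Kubeflow Pipelines v1 component format, `inputValue` stands for the argument's value, while `inputPath` and `outputPath` stand for file paths chosen by the system. The resolver `resolve_command` and the example paths are hypothetical; the real substitution is done by Kubeflow Pipelines, not by code in this PR.

```python
from typing import Dict, List, Union

Placeholder = Dict[str, str]


def resolve_command(
    command: List[Union[str, Placeholder]],
    values: Dict[str, object],
    paths: Dict[str, str],
) -> List[str]:
    """Replace placeholders with concrete values and paths."""
    argv = []
    for item in command:
        if isinstance(item, str):
            argv.append(item)  # literal token, passed through unchanged
        elif "inputValue" in item:
            argv.append(str(values[item["inputValue"]]))
        elif "inputPath" in item:
            argv.append(paths[item["inputPath"]])
        elif "outputPath" in item:
            argv.append(paths[item["outputPath"]])
        else:
            raise ValueError(f"Unknown placeholder: {item}")
    return argv


command = [
    "python3", "main.py",
    "--input_manifest_path", {"inputPath": "input_manifest_path"},
    "--min_width", {"inputValue": "min_width"},
    "--output_manifest_path", {"outputPath": "output_manifest_path"},
]
# Hypothetical paths; the orchestrator decides the real ones
paths = {
    "input_manifest_path": "/tmp/inputs/input_manifest_path/data",
    "output_manifest_path": "/tmp/outputs/output_manifest_path/data",
}
argv = resolve_command(command, {"min_width": 256}, paths)
print(argv)
```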
@@ -0,0 +1 @@
git+https://github.com/ml6team/express.git@b1855308ca9251da5ddd8e6b88c34bc1c082a71b#egg=express[datasets]
@@ -0,0 +1,27 @@
name: Image filtering
description: Component that filters images based on desired minimum width and height
image: image_filtering:latest

input_subsets:
  images:
    fields:
      width:
        type: int32
      height:
        type: int32

output_subsets:
  images:
    fields:
      width:
        type: int32
      height:
        type: int32

args:
  min_width:
    description: Desired minimum width
    type: int
  min_height:
    description: Desired minimum height
    type: int
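The component code in this PR reads columns named `images_width` and `images_height`, which suggests that a field is exposed as `<subset>_<field>`. Assuming that convention holds, the column names implied by a spec could be derived as follows. The helper `column_names` is hypothetical and the spec is written as a plain dictionary to keep the sketch dependency-free.

```python
from typing import Dict, List

# The input_subsets section of the spec above, as a plain dictionary
input_subsets = {
    "images": {
        "fields": {
            "width": {"type": "int32"},
            "height": {"type": "int32"},
        }
    }
}


def column_names(subsets: Dict[str, dict]) -> List[str]:
    """Flatten subsets into '<subset>_<field>' column names (assumed convention)."""
    return [
        f"{subset}_{field}"
        for subset, body in subsets.items()
        for field in body["fields"]
    ]


columns = column_names(input_subsets)
print(columns)
```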
@@ -0,0 +1,39 @@
"""
This component filters images of the dataset based on image size (minimum height and width).
"""
import logging
from typing import Dict

from datasets import Dataset

from express.dataset import FondantComponent
from express.logger import configure_logging

configure_logging()
logger = logging.getLogger(__name__)


class ImageFilterComponent(FondantComponent):
    """
    Component that filters images based on height and width.
    """

    @classmethod
    def process(cls, dataset: Dataset, args: Dict) -> Dataset:

Review comment (Contributor): Can we switch over to using t.Dict everywhere to be consistent? Also, if you're using Dict from the typing module, it's better to define the data types inside it.

        """
        Args:
            dataset: the dataset to filter
            args: args to pass to the function

        Returns:
            the filtered dataset
        """
        logger.info("Filtering dataset...")
        min_width, min_height = args.min_width, args.min_height
        # Keep only images strictly larger than both thresholds
        filtered_dataset = dataset.filter(
            lambda example: example["images_width"] > min_width
            and example["images_height"] > min_height
        )

        return filtered_dataset


if __name__ == "__main__":
    ImageFilterComponent.run()
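The filter predicate can be tried without the `datasets` library. The sketch below applies the same comparison to a list of plain dictionaries; the rows and thresholds are made-up illustration data, and `keep` is a hypothetical helper. It shows that the comparison is strict, so an image whose width equals `min_width` is dropped.

```python
from typing import Dict, List


def keep(example: Dict[str, int], min_width: int, min_height: int) -> bool:
    """Same condition as the lambda passed to dataset.filter in the component."""
    return example["images_width"] > min_width and example["images_height"] > min_height


# Made-up rows for illustration
rows: List[Dict[str, int]] = [
    {"images_width": 512, "images_height": 512},
    {"images_width": 256, "images_height": 512},  # width equals the threshold
    {"images_width": 128, "images_height": 128},
]

kept = [row for row in rows if keep(row, min_width=256, min_height=256)]
print(kept)
```

If images exactly at the minimum size should pass, the comparison would need to be `>=` instead; the PR as shown uses `>`.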
@@ -0,0 +1,29 @@
FROM --platform=linux/amd64 python:3.8-slim

## System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git curl -y

# Downloading gcloud package
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz

# Installing the package
RUN mkdir -p /usr/local/gcloud \
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
&& /usr/local/gcloud/google-cloud-sdk/install.sh

# Adding the package path to local
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

# install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy over src-files of the component
COPY src /src

# Set the working directory to the source folder
WORKDIR /src

ENTRYPOINT ["python", "main.py"]
@@ -0,0 +1,30 @@
name: load_from_hub
description: A component that takes a dataset name from the 🤗 hub as input and uploads it to a GCS bucket.
inputs:
  - name: dataset_name
    description: Name of dataset on the hub
    type: String

  - name: batch_size
    description: Batch size to use to create image metadata
    type: Integer

  - name: metadata
    description: Metadata arguments, passed as a json dict string
    type: String


outputs:
  - name: output_manifest_path
    description: Path to the output manifest

implementation:
  container:
    image: ghcr.io/ml6team/load_from_hub:latest
    command: [
      python3, main.py,
      --dataset_name, {inputValue: dataset_name},
      --batch_size, {inputValue: batch_size},
      --metadata, {inputValue: metadata},
      --output_manifest_path, {outputPath: output_manifest_path},
    ]
@@ -0,0 +1,2 @@
git+https://github.com/ml6team/express.git@b1855308ca9251da5ddd8e6b88c34bc1c082a71b#egg=express[datasets]
Pillow==9.4.0
@@ -0,0 +1,31 @@
name: Load from hub
description: Component that loads a dataset from the hub
image: load_from_hub:latest

input_subsets:
  images:
    fields:
      data:
        type: binary

output_subsets:
  images:
    fields:
      data:
        type: binary
      width:
        type: int32
      height:
        type: int32
  captions:
    fields:
      data:
        type: utf8

args:
  dataset_name:
    description: Name of dataset on the hub
    type: str
  batch_size:
    description: Batch size to use to create image metadata
    type: int
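One way an `args` section like the one above could be turned into command-line arguments is with `argparse`. The commit list mentions "Automatically add args", but the PR's actual implementation is not shown here, so this is only a sketch under that assumption: `build_parser`, the `TYPES` mapping, and the example dataset name are hypothetical.

```python
import argparse
from typing import Dict

# Assumed mapping from spec type names to Python types
TYPES = {"str": str, "int": int}

# The args section of the spec above, as a plain dictionary
args_spec = {
    "dataset_name": {"description": "Name of dataset on the hub", "type": "str"},
    "batch_size": {
        "description": "Batch size to use to create image metadata",
        "type": "int",
    },
}


def build_parser(spec: Dict[str, Dict[str, str]]) -> argparse.ArgumentParser:
    """Create one required --<name> option per entry in the args section."""
    parser = argparse.ArgumentParser()
    for name, info in spec.items():
        parser.add_argument(
            f"--{name}",
            type=TYPES[info["type"]],
            help=info["description"],
            required=True,
        )
    return parser


parsed = build_parser(args_spec).parse_args(
    ["--dataset_name", "some-user/some-dataset", "--batch_size", "8"]
)
print(parsed.dataset_name, parsed.batch_size)
```

A parsed namespace of this kind would also match the attribute access (`args.min_width`) used in the image filtering component.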