-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add image embedding component #148
Conversation
68a523a
to
0a84ca4
Compare
Can you rebase this one @PhilippeMoussalli? |
34de964
to
2df997e
Compare
Done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Small comments, mainly about docstrings and other small stuff
|
||
|
||
@dask.delayed | ||
def load(example): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc string?
|
||
|
||
@dask.delayed | ||
def transform(image, processor, device): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc string?
embeddings_df = delayed_series.to_frame(name="embeddings_data") | ||
|
||
# add index columns | ||
embeddings_df["id"] = dataframe["id"].reset_index(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it needed to reset_index() separately on all the columns individually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point I don't think it's needed
|
||
args: | ||
model_id: | ||
description: Model id on the Hugging Face hub |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we put a default/example value here too? Some people might not know where to look to find the correct CLIP Model ID
@@ -51,7 +51,7 @@ class Type(enum.Enum): | |||
|
|||
int8_list = pa.list_(pa.int8()) | |||
|
|||
float16_list = pa.list_(pa.float16()) | |||
float32_list = pa.list_(pa.float32()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we only supporting float32_list then or do we also need float16 in some cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for now we only need that one, this is a temporary fix we're planning on handling nested data types differently at a later stage
@@ -31,7 +39,8 @@ | |||
}, | |||
) | |||
|
|||
write_to_hub_op = FondantComponentOp( | |||
|
|||
write_to_hub_op = ComponentOp( | |||
component_spec_path="components/write_to_hub/fondant_component.yaml", | |||
arguments={ | |||
"username": "philippemo", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this username stay your name? Or something like 'demo-pipeline'?
6c8afa8
to
572b494
Compare
PR that adds the image embedding component. Largely inspired by Niel's PR #111 (inference and batching with dask).
PR that adds the image embedding component. Largely inspired by Niel's PR #111 (inference and batching with dask).
PR that adds the image embedding component. Largely inspired by Niel's PR #111 (inference and batching with dask).