[#34236] Add Vertex AI Multi-Modal embedding handler #35677
Conversation
```python
uuid.uuid4().hex)
self.model_name = "multimodalembedding"
self.image_path = "gs://apache-beam-ml/testing/inputs/vertex_images/sunflowers/1008566138_6927679c8a.jpg"  # pylint: disable=line-too-long
self.video_path = "gs://cloud-samples-data/vertex-ai-vision/highway_vehicles.mp4"  # pylint: disable=line-too-long
```
Can move this to the apache-beam-ml bucket for stability
```python
image: Optional[Image] = None
video: Optional[Video] = None
contextual_text: Optional[str] = None
```
I think we probably want wrappers here so that we can get additional metadata about the content - see https://docs.google.com/document/d/1gYWeGzua3osE5pyrBpMgAMoX0GSiVIa4xiNbbkfUsM8/edit?disco=AAABpbxKVuI
I shifted video to use a relatively simple tuple wrapper to make (Video, VideoSegmentConfig) pairs based on the doc discussion. I don't know how much sense it makes to wrap every field in this object since images and text do not have configuration objects associated with them; should this change, we can always update things on our end.
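For illustration, a minimal sketch of that tuple-wrapper idea, assuming the public `vertexai.vision_models` types; the paths and segment values are placeholders, not the PR's actual test data:

```python
from vertexai.vision_models import Video, VideoSegmentConfig

# A (Video, VideoSegmentConfig) pair as described above; the config half
# can be None when no segmenting is needed.
video_input = (
    Video.load_from_file("gs://bucket/video.mp4"),  # placeholder path
    VideoSegmentConfig(start_offset_sec=0, end_offset_sec=120),
)
video, config = video_input  # the adapter can unpack the pair like this
```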
From https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body it looks like image, at least, could have an associated mimeType now or eventually (to represent whether it is e.g. a jpg, png, etc.).
> should this change, we can always update things on our end.
Could we update this in a backwards compatible way?
So for something like the mimeType being specifiable, that would likely be composed in the Image object (if you look at get_embeddings() from Vertex, those objects are how we construct the request) and not something we'd have to explicitly support. If we really wanted to support that now, we'd be rolling our own request code, which I don't think is in our best interest.

I think my main hangup on adding a bunch of wrappers around each field is the ease of use of the transform. The whole point of adding the input adapters was so users wouldn't have to compose the input dataclass themselves: just pass the fields to us and we do the rest. I do worry about overcomplicating this in anticipation of additional configuration options maybe being added to Vertex later. Admittedly, I am not as familiar with other frameworks for multimodal embeddings beyond Vertex, so maybe this is a reasonable expectation.
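For context, a sketch of how a request is typically built through the Vertex SDK's `get_embeddings()`; this mirrors the public `vertexai.vision_models` API rather than the exact code in this PR, and the paths and text are placeholders:

```python
from vertexai.vision_models import (
    Image, MultiModalEmbeddingModel, Video, VideoSegmentConfig)

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding")

# The Image/Video objects carry the content; any future per-field option
# (e.g. an image mime type) would most naturally land on these objects,
# not on the handler itself.
response = model.get_embeddings(
    image=Image.load_from_file("gs://bucket/image.jpg"),  # placeholder path
    video=Video.load_from_file("gs://bucket/video.mp4"),  # placeholder path
    contextual_text="a field of sunflowers",
    video_segment_config=VideoSegmentConfig(interval_sec=16),
)
print(response.image_embedding)   # 1408-dim vector by default
print(response.video_embeddings)  # one embedding per video segment
print(response.text_embedding)
```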
> So for something like the mimeType being specifiable, that would likely be composed in the Image object (if you look at get_embeddings() from Vertex, those objects are how we construct the request) and not something we'd have to explicitly support.
Is it part of the Image object today? I agree that is one way it could be supported; it's not clear that it would definitely be the way Vertex would choose to support it, though.

I'm also thinking this would play nicely with Chunks for text (basically, we could have wrappers for all 3 object types, which seems fairly natural).
It is not part of the Image object today, but given how get_embeddings() works, there wouldn't be a way to specify it otherwise unless they really expanded the function signature.

There was some conversation with @claudevdm around expanding Chunks to be capable of containing multimodal data and writing a separate input/output adapter pair for them, but I suppose there's separate merit in just accepting text Chunks here.
> It is not part of the Image object today, but given how get_embeddings() works, there wouldn't be a way to specify it otherwise unless they really expanded the function signature.
Yeah, I recognize that there is no way of specifying this today. The advantage of using our own dataclass to capture information like this is that we could support it in a forward-compatible way through a single config object in the future, regardless of how Vertex's client decides to support it.
> There was some conversation with @claudevdm around expanding Chunks to be capable of containing multimodal data and writing a separate input/output adapter pair for them, but I suppose there's separate merit in just accepting text Chunks here.
I would lean towards unifying this now rather than adding an extra way of doing this wrapping in the future. That will also allow this to play nicely with writing (text) to a vector DB, though we'd still need to do similar work for images/videos eventually.
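For reference, a rough sketch of what a text input looks like as a Beam RAG Chunk, assuming the types in `apache_beam.ml.rag.types`; the field usage here is illustrative:

```python
from apache_beam.ml.rag.types import Chunk, Content

# A text input expressed as a Chunk: the content holds the text to embed,
# metadata rides along with it, and the embedding slot is filled in by
# the embedding handler downstream.
chunk = Chunk(
    content=Content(text="a field of sunflowers"),
    metadata={"source": "example.txt"},
)
```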
Wound up wrapping everything (this wound up being cleaner for both videos and text; for images it's a little overkill as of now, but that's fine) and got everything documented. Started messing with the union input type, but given that we're actually interacting with column names in the function signature instead, we get some really annoying code paths (and also run into Python's native type-hinting limitations), so I dropped that idea for now. Will likely return to it later.
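A rough sketch of the resulting wrapper shape (the `VertexImage`/`VertexVideo` names appear later in this review; the field names are assumptions, not the PR's exact definitions):

```python
from dataclasses import dataclass
from typing import Optional

from vertexai.vision_models import Image, Video, VideoSegmentConfig


@dataclass
class VertexImage:
    # A thin wrapper today, but it leaves room for future per-image
    # options (e.g. a mime type) without breaking the handler signature.
    image: Image


@dataclass
class VertexVideo:
    # Pairs a video with its optional segment config.
    video: Video
    config: Optional[VideoSegmentConfig] = None
```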
Assigning reviewers: R: @shunping for label python. Note: If you would like to opt out of this review, comment with one of the bot's available commands. The PR bot will only process comments in the main thread (not review comments).
R: @damccorm |
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment with the appropriate bot command.
damccorm left a comment:
Had one remaining (minor) suggestion, otherwise this LGTM - thanks!
```python
batch: Sequence[dict[str, Any]],
embeddings: Sequence[MultiModalEmbeddingResponse]) -> list[dict[str, Any]]:
```
I think the Any typing here was tripping me up. Could we create a type for Union[VertexImage, VertexVideo, Chunk] and use it instead of Any here (and in _multimodal_dict_input_fn)? I think that would help with readability
So this is where we run into the definition of the input (and this applies to any of the embedding handlers that take dict inputs). The input and output are not necessarily solely composed of the inputs to be embedded and the result afterward; they're lists of arbitrary dictionaries, and the multimodal inputs are just specific columns within each dictionary.
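A concrete example of the shape being described; the column names here are illustrative, not fixed by the API:

```python
# One element of the batch for the dict-based handler. Only designated
# columns ("image" here) hold inputs to embed; the rest is arbitrary
# user data that must survive the transform untouched.
element = {
    "id": "row-42",                      # passthrough metadata column
    "image": "gs://bucket/image.jpg",    # column designated for embedding
    "caption": "a field of sunflowers",  # another passthrough column
}
# The output is the same arbitrary dict with embedding results written
# back into it, which is why the type is Sequence[dict[str, Any]] rather
# than something narrower.
```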
Yeah, you're right. Ok, thanks
Prototype embedding handler support for Vertex AI multi-modal embeddings (full image/video/contextual-text support) based on an existing design doc. Establishes the use of dataclasses as the convention for passing in paired inputs for each medium, with a type adapter implemented to convert between typical MLTransform columnar inputs and the dataclass.
Fixes #34236
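For readers skimming the PR, a hedged sketch of how such a handler plugs into MLTransform; the `VertexAIMultimodalEmbeddings` name and its parameters are hypothetical stand-ins for the handler added here, so consult the merged code for the real API:

```python
import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform

# Hypothetical handler name and parameters, for illustration only.
embedding_config = VertexAIMultimodalEmbeddings(
    model_name="multimodalembedding",
    columns=["image", "video", "contextual_text"],
)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{"image": ..., "video": ..., "contextual_text": ...}])
        | MLTransform(write_artifact_location="gs://bucket/artifacts")
            .with_transform(embedding_config))
```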
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.