
Conversation

@jrmccluskey jrmccluskey commented Jul 24, 2025

Prototypes embedding handler support for Vertex AI multi-modal embeddings (full image, video, and contextual string support), based on an existing design doc. Establishes dataclasses as the convention for passing paired inputs for each medium, with a type adapter to convert between typical MLTransform columnar inputs and the dataclass.

Fixes #34236
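
For illustration, a rough sketch of the shape this takes (the class, field, and function names below are approximations for this description, not the exact code in the PR):

```python
# Illustrative sketch only; names approximate the dataclass/adapter convention.
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MultiModalInput:
  # Each medium is optional; a row may carry any combination of them.
  image: Optional[Any] = None  # e.g. a Vertex Image object
  video: Optional[Any] = None  # e.g. a Vertex Video object
  contextual_text: Optional[str] = None


def row_to_multimodal_input(
    row: dict[str, Any],
    image_column: Optional[str] = None,
    video_column: Optional[str] = None,
    text_column: Optional[str] = None) -> MultiModalInput:
  """Adapts one columnar MLTransform row into the paired-input dataclass."""
  return MultiModalInput(
      image=row.get(image_column) if image_column else None,
      video=row.get(video_column) if video_column else None,
      contextual_text=row.get(text_column) if text_column else None)
```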


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch): build status badges for the Python source distribution and wheels build, Python tests, Java tests, and Go tests.

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

uuid.uuid4().hex)
self.model_name = "multimodalembedding"
self.image_path = "gs://apache-beam-ml/testing/inputs/vertex_images/sunflowers/1008566138_6927679c8a.jpg" # pylint: disable=line-too-long
self.video_path = "gs://cloud-samples-data/vertex-ai-vision/highway_vehicles.mp4" # pylint: disable=line-too-long
Contributor Author

Can move this to the apache-beam-ml bucket for stability

Comment on lines 301 to 303
image: Optional[Image] = None
video: Optional[Video] = None
contextual_text: Optional[str] = None
Contributor

I think we probably want wrappers here so that we can get additional metadata about the content - see https://docs.google.com/document/d/1gYWeGzua3osE5pyrBpMgAMoX0GSiVIa4xiNbbkfUsM8/edit?disco=AAABpbxKVuI

Contributor Author

I shifted video to use a relatively simple tuple wrapper to make (Video, VideoSegmentConfig) pairs based on the doc discussion. I don't know how much sense it makes to wrap every field in this object since images and text do not have configuration objects associated with them; should this change, we can always update things on our end.
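
Roughly what that pairing looks like (sketch only, assuming the vertexai vision_models types; the names in the actual change may differ):

```python
# Sketch of the (Video, VideoSegmentConfig) pairing; video is the only medium
# that has an associated configuration object today.
from typing import NamedTuple, Optional

from vertexai.vision_models import Video, VideoSegmentConfig


class VideoInput(NamedTuple):
  video: Video
  segment_config: Optional[VideoSegmentConfig] = None
```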

Contributor

From https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body it looks like image at least could have an associated mimeType now or eventually (to represent if it is e.g. a jpg, png, etc.).

should this change, we can always update things on our end.

Could we update this in a backwards compatible way?

Contributor Author

So for something like the mimeType being specifiable, that would likely be something composed in the Image object (if you look at get_embeddings() from Vertex, those objects are how we construct the request) and not something we'd have to explicitly support. If we really wanted to support that now we'd be rolling our own request code, which I don't think is in our best interest.

I think my main hangup on adding a bunch of wrappers around each field is largely ease of use for the transform. The whole point of adding the input adapters was so users wouldn't have to compose the input dataclass themselves; they just pass the fields to us and we do the rest. I do worry about overcomplicating this in anticipation of Vertex maybe adding additional configuration options later. Admittedly I am not as familiar with other frameworks for multimodal embeddings beyond Vertex, so maybe this is a reasonable expectation.
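
For reference, a minimal sketch of how the Vertex client composes a request from those typed objects (assuming the vertexai SDK's vision_models surface; the path and text are made up, and vertexai.init()/ADC setup is elided):

```python
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Assumes vertexai.init(project=..., location=...) or ADC has been configured.
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding")
image = Image.load_from_file("sunflowers.jpg")  # local path; GCS handling elided

# The Image/Video/text objects are what build the request body.
response = model.get_embeddings(
    image=image,
    contextual_text="sunflowers in a field",
    dimension=1408)
print(response.image_embedding)
print(response.text_embedding)
```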

Contributor

So for something like the mimeType being specifiable, that would likely be something composed in the Image object (if you look at get_embeddings() from Vertex, those objects are how we construct the request) and not something we'd have to explicitly support

Is it part of the Image object today? I agree that is one way it could be supported; it's not clear that it would definitely be the way Vertex would choose to support it, though.

I'm also thinking this would play nicely with Chunks for text (basically we could have wrappers for all 3 object types which seems fairly natural).

Contributor Author

It is not part of the image object today, but given how get_embeddings() works there wouldn't be a way to specify that otherwise unless they really expanded the function signature.

There was some conversation with @claudevdm around expanding Chunks to be capable of containing multimodal data and writing a separate input/output adapter pair for them, but I suppose there's separate merit around just accepting text Chunks here.

Contributor

It is not part of the image object today, but given how get_embeddings() works there wouldn't be a way to specify that otherwise unless they really expanded the function signature.

Yeah, I recognize that there is no way of specifying this today. The advantage of using our own dataclass to capture information like this is that we could support it in a forward-compatible way through a single config object in the future, regardless of how Vertex's client decides to support it.

There was some conversation with @claudevdm around expanding Chunks to be capable of containing multimodal data and writing a separate input/output adapter pair for them, but I suppose there's separate merit around just accepting text Chunks here.

I would lean towards unifying this now rather than trying to have an extra way of doing this wrapping in the future. That will also allow this to play nicely with writing (text) to a vector db, though we'd still need to do similar work for images/videos eventually

Contributor Author

Wound up wrapping everything (this ended up being cleaner for both videos and text; for images it's a little overkill as of now, but that's fine) and got everything documented. Started messing with the union input type, but given that we're actually interacting with column names in the function signature instead, we get some really annoying code paths (and also run into Python's native type hinting limitations), so I dropped that idea for now. Will likely return to it later.
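
Rough shape of what wrapping everything looks like (a sketch; the field layouts are approximations of the PR, and text goes through the existing Chunk type rather than a new wrapper):

```python
# Sketch of the wrapper dataclasses; field names approximate the PR's shape.
from dataclasses import dataclass
from typing import Optional

from vertexai.vision_models import Image, Video, VideoSegmentConfig


@dataclass
class VertexImage:
  # Only the raw Image today; leaves room for e.g. a future mime_type or
  # config field without breaking callers.
  image: Image


@dataclass
class VertexVideo:
  video: Video
  segment_config: Optional[VideoSegmentConfig] = None

# Text inputs reuse the existing Chunk type (apache_beam.ml.rag.types.Chunk),
# so no separate text wrapper is needed here.
```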

@jrmccluskey jrmccluskey changed the title [DO NOT MERGE] Prototype Vertex MultiModal embedding handler [#34236]Add Vertex AI Multi-Modal embedding handler Sep 4, 2025
@jrmccluskey jrmccluskey changed the title [#34236]Add Vertex AI Multi-Modal embedding handler [#34236] Add Vertex AI Multi-Modal embedding handler Sep 4, 2025
@jrmccluskey jrmccluskey marked this pull request as ready for review September 4, 2025 17:37

github-actions bot commented Sep 4, 2025

Assigning reviewers:

R: @shunping for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@jrmccluskey
Contributor Author

R: @damccorm


github-actions bot commented Sep 4, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

Contributor

@damccorm damccorm left a comment

Had one remaining (minor) suggestion, otherwise this LGTM - thanks!

Comment on lines +406 to +407
batch: Sequence[dict[str, Any]],
embeddings: Sequence[MultiModalEmbeddingResponse]) -> list[dict[str, Any]]:
Contributor

I think the Any typing here was tripping me up. Could we create a type for Union[VertexImage, VertexVideo, Chunk] and use it instead of Any here (and in _multimodal_dict_input_fn)? I think that would help with readability

Contributor Author

So this is where we run into the definition of the input here (and this applies to any of the embedding handlers that take dict inputs). The input and output are not necessarily composed solely of the inputs to be embedded and the result afterward; they're lists of arbitrary dictionaries, and the multimodal inputs are just specific columns within each dictionary.
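
A made-up example of one batch element, which is why the values stay Any:

```python
# Example row; only the user-named columns carry multimodal inputs, and
# everything else is passthrough data that must be preserved in the output.
# The string values stand in for the actual wrapper objects.
row = {
    "id": "item-42",                 # arbitrary passthrough column
    "source": "catalog",             # arbitrary passthrough column
    "img": "<VertexImage object>",   # image column the user pointed us at
    "caption": "<Chunk object>",     # text column the user pointed us at
}
# The output row keeps all of these columns and adds the embedding alongside.
```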

Contributor

Yeah, you're right. Ok, thanks

@jrmccluskey jrmccluskey merged commit fe188e3 into apache:master Sep 8, 2025
91 checks passed
Successfully merging this pull request may close these issues:

[Feature Request]: VertexAIImageEmbeddings adding the ability to send in contextual text too