
Conversation

@jrmccluskey jrmccluskey commented Jul 24, 2025

Prototypes embedding handler support for Vertex AI multi-modal embeddings (full image, video, and contextual string support), based on an existing design doc. Establishes dataclasses as the convention for passing paired inputs for each medium, with a type adapter to convert between typical MLTransform columnar inputs and the dataclass.

Fixes #34236
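
For illustration, a rough sketch of the shape this takes (the class, field, and function names below are approximations for this description, not the exact code in the PR):

```python
# Illustrative sketch only; names approximate the dataclass/adapter convention.
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MultiModalInput:
  # Each medium is optional; a row may carry any combination of them.
  image: Optional[Any] = None  # e.g. a Vertex Image object
  video: Optional[Any] = None  # e.g. a Vertex Video object
  contextual_text: Optional[str] = None


def row_to_multimodal_input(
    row: dict[str, Any],
    image_column: Optional[str] = None,
    video_column: Optional[str] = None,
    text_column: Optional[str] = None) -> MultiModalInput:
  """Adapts one columnar MLTransform row into the paired-input dataclass."""
  return MultiModalInput(
      image=row.get(image_column) if image_column else None,
      video=row.get(video_column) if video_column else None,
      contextual_text=row.get(text_column) if text_column else None)
```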


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch): build status badges for the Python source distribution and wheels build, Python tests, Java tests, and Go tests.

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

uuid.uuid4().hex)
self.model_name = "multimodalembedding"
self.image_path = "gs://apache-beam-ml/testing/inputs/vertex_images/sunflowers/1008566138_6927679c8a.jpg" # pylint: disable=line-too-long
self.video_path = "gs://cloud-samples-data/vertex-ai-vision/highway_vehicles.mp4" # pylint: disable=line-too-long
Contributor Author

Can move this to the apache-beam-ml bucket for stability

Comment on lines 301 to 303
image: Optional[Image] = None
video: Optional[Video] = None
contextual_text: Optional[str] = None
Contributor

I think we probably want wrappers here so that we can get additional metadata about the content - see https://docs.google.com/document/d/1gYWeGzua3osE5pyrBpMgAMoX0GSiVIa4xiNbbkfUsM8/edit?disco=AAABpbxKVuI

Contributor Author

I shifted video to use a relatively simple tuple wrapper to make (Video, VideoSegmentConfig) pairs based on the doc discussion. I don't know how much sense it makes to wrap every field in this object since images and text do not have configuration objects associated with them; should this change, we can always update things on our end.
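
Roughly what that pairing looks like (sketch only, assuming the vertexai vision_models types; the names in the actual change may differ):

```python
# Sketch of the (Video, VideoSegmentConfig) pairing; video is the only medium
# that has an associated configuration object today.
from typing import NamedTuple, Optional

from vertexai.vision_models import Video, VideoSegmentConfig


class VideoInput(NamedTuple):
  video: Video
  segment_config: Optional[VideoSegmentConfig] = None
```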

Contributor

From https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body it looks like image at least could have an associated mimeType now or eventually (to represent if it is e.g. a jpg, png, etc.).

should this change, we can always update things on our end.

Could we update this in a backwards compatible way?

Contributor Author

So for something like the mimeType being specifiable, that would likely be something composed in the Image object (if you look at get_embeddings() from Vertex, those objects are how we construct the request) and not something we'd have to explicitly support. If we really wanted to support that now we'd be rolling our own request code, which I don't think is in our best interest.

I think my main hangup on adding a bunch of wrappers around each field is largely ease of use for the transform. The whole point of adding the input adapters was so users wouldn't have to compose the input dataclass themselves; they just pass the fields to us and we do the rest. I do worry about overcomplicating this in anticipation of Vertex maybe adding additional configuration options later. Admittedly I am not as familiar with other frameworks for multimodal embeddings beyond Vertex, so maybe this is a reasonable expectation.
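
For reference, a minimal sketch of how the Vertex client composes a request from those typed objects (assuming the vertexai SDK's vision_models surface; the path and text are made up, and vertexai.init()/ADC setup is elided):

```python
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Assumes vertexai.init(project=..., location=...) or ADC has been configured.
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding")
image = Image.load_from_file("sunflowers.jpg")  # local path; GCS handling elided

# The Image/Video/text objects are what build the request body.
response = model.get_embeddings(
    image=image,
    contextual_text="sunflowers in a field",
    dimension=1408)
print(response.image_embedding)
print(response.text_embedding)
```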

Contributor

So for something like the mimeType being specifiable, that would likely be something composed in the Image object (if you look at get_embeddings() from Vertex, those objects are how we construct the request) and not something we'd have to explicitly support

Is it part of the Image object today? I agree that is one way it could be supported; it's not clear that it would definitely be the way Vertex would choose to support it, though.

I'm also thinking this would play nicely with Chunks for text (basically we could have wrappers for all 3 object types which seems fairly natural).

Contributor Author

It is not part of the image object today, but given how get_embeddings() works there wouldn't be a way to specify that otherwise unless they really expanded the function signature.

There was some conversation with @claudevdm around expanding Chunks to be capable of containing multimodal data and writing a separate input/output adapter pair for them, but I suppose there's separate merit around just accepting text Chunks here.

Contributor

It is not part of the image object today, but given how get_embeddings() works there wouldn't be a way to specify that otherwise unless they really expanded the function signature.

Yeah, I recognize that there is no way of specifying this today. The advantage of using our own dataclass to capture information like this is that we could support it in a forward-compatible way through a single config object in the future, regardless of how Vertex's client decides to support it.

There was some conversation with @claudevdm around expanding Chunks to be capable of containing multimodal data and writing a separate input/output adapter pair for them, but I suppose there's separate merit around just accepting text Chunks here.

I would lean towards unifying this now rather than trying to have an extra way of doing this wrapping in the future. That will also allow this to play nicely with writing (text) to a vector db, though we'd still need to do similar work for images/videos eventually

Contributor Author

Wound up wrapping everything (this ended up being cleaner for both videos and text; for images it's a little overkill as of now, but that's fine) and got everything documented. Started messing with the union input type, but given that we're actually interacting with column names in the function signature instead, we get some really annoying code paths (and also run into Python's native type hinting limitations), so I dropped that idea for now. Will likely return to it later.
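
Rough shape of what wrapping everything looks like (a sketch; the field layouts are approximations of the PR, and text goes through the existing Chunk type rather than a new wrapper):

```python
# Sketch of the wrapper dataclasses; field names approximate the PR's shape.
from dataclasses import dataclass
from typing import Optional

from vertexai.vision_models import Image, Video, VideoSegmentConfig


@dataclass
class VertexImage:
  # Only the raw Image today; leaves room for e.g. a future mime_type or
  # config field without breaking callers.
  image: Image


@dataclass
class VertexVideo:
  video: Video
  segment_config: Optional[VideoSegmentConfig] = None

# Text inputs reuse the existing Chunk type (apache_beam.ml.rag.types.Chunk),
# so no separate text wrapper is needed here.
```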

@jrmccluskey jrmccluskey changed the title [DO NOT MERGE] Prototype Vertex MultiModal embedding handler [#34236]Add Vertex AI Multi-Modal embedding handler Sep 4, 2025
@jrmccluskey jrmccluskey changed the title [#34236]Add Vertex AI Multi-Modal embedding handler [#34236] Add Vertex AI Multi-Modal embedding handler Sep 4, 2025
@jrmccluskey jrmccluskey marked this pull request as ready for review September 4, 2025 17:37

github-actions bot commented Sep 4, 2025

Assigning reviewers:

R: @shunping for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@jrmccluskey
Contributor Author

R: @damccorm


github-actions bot commented Sep 4, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

Contributor

@damccorm damccorm left a comment

Had one remaining (minor) suggestion, otherwise this LGTM - thanks!

Comment on lines +406 to +407
batch: Sequence[dict[str, Any]],
embeddings: Sequence[MultiModalEmbeddingResponse]) -> list[dict[str, Any]]:
Contributor

I think the Any typing here was tripping me up. Could we create a type for Union[VertexImage, VertexVideo, Chunk] and use it instead of Any here (and in _multimodal_dict_input_fn)? I think that would help with readability

Contributor Author

So this is where we run into the definition of the input here (and this applies to any of the embedding handlers that take dict inputs). The input and output are not necessarily composed solely of the inputs to be embedded and the result afterward; they're lists of arbitrary dictionaries, and the multimodal inputs are just specific columns within each dictionary.
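
A made-up example of one batch element, which is why the values stay Any:

```python
# Example row; only the user-named columns carry multimodal inputs, and
# everything else is passthrough data that must be preserved in the output.
# The string values stand in for the actual wrapper objects.
row = {
    "id": "item-42",                 # arbitrary passthrough column
    "source": "catalog",             # arbitrary passthrough column
    "img": "<VertexImage object>",   # image column the user pointed us at
    "caption": "<Chunk object>",     # text column the user pointed us at
}
# The output row keeps all of these columns and adds the embedding alongside.
```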

Contributor

Yeah, you're right. Ok, thanks

@jrmccluskey jrmccluskey merged commit fe188e3 into apache:master Sep 8, 2025
91 checks passed
Successfully merging this pull request may close these issues:

[Feature Request]: VertexAIImageEmbeddings adding the ability to send in contextual text too