Preserve pane index through reshuffle. #34348

claudevdm · 2025-03-19T18:28:18Z

Should fix #32636.


github-actions · 2025-03-25T15:07:28Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @shunping for label python.
R: @jrmccluskey for label go.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

kennknowles · 2025-03-26T16:08:03Z

sdks/go/pkg/beam/runners/prism/internal/urns/urns.go

@@ -124,6 +124,7 @@ var (
 	CoderTimer              = cdrUrn(pipepb.StandardCoders_TIMER)

 	CoderKV                  = cdrUrn(pipepb.StandardCoders_KV)
+	CoderTuple               = "beam:coder:tuple:v1"


This seems suspiciously out of place, being a magic string and also not a standard coder. What is the story behind it?

I placed it here because Python switches between the KV coder and the tuple coder depending on the number of elements, which causes Python tests to fail with the Prism runner

beam/sdks/python/apache_beam/coders/coders.py

Line 1232 in 4fe34db

if self.is_kv_coder():

For example, with the pane index change there ends up being a tuple coder with 3 elements (because pane info is now included)

coders { key: "ref_Coder_TupleCoder_6" value { spec { urn: "beam:coder:tuple:v1" } component_coder_ids: "ref_Coder_BytesCoder_1" component_coder_ids: "ref_Coder_NullableCoder_7" component_coder_ids: "ref_Coder_FastPrimitivesCoder_8" } }

But without pane info it falls back to using KV coder which is supported by prism

coders { key: "ref_Coder_TupleCoder_6" value { spec { urn: "beam:coder:kv:v1" } component_coder_ids: "ref_Coder_BytesCoder_1" component_coder_ids: "ref_Coder_NullableCoder_7" } }

Prism error without this change:
ERROR:root:prism error building stage stage-002:
unknown coder urn key: beam:coder:tuple:v1
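The switch described above can be sketched in a few lines. This is a minimal illustration, not the code in coders.py: `tuple_coder_urn` is a hypothetical helper, and the rule that exactly two components count as a KV is an assumption based on the `is_kv_coder()` check and the two coder protos quoted above.

```python
# Hypothetical stand-in for the URN choice; not the real TupleCoder.
KV_URN = "beam:coder:kv:v1"
TUPLE_URN = "beam:coder:tuple:v1"


def tuple_coder_urn(num_components: int) -> str:
    """Return the URN a Python tuple coder would advertise (assumed rule)."""
    # Exactly two components look like a key-value pair, so the
    # standard KV URN is used and every runner understands it.
    if num_components == 2:
        return KV_URN
    # Any other arity falls back to the Python-specific tuple URN.
    return TUPLE_URN


# (value, timestamp) has two components; adding pane_info makes three.
print(tuple_coder_urn(2))
print(tuple_coder_urn(3))
```

Under this assumption, adding pane info to the reified value is enough to flip the advertised URN from the KV one to the tuple one, which is the URN Prism rejects in the error above.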

This seems like something we probably need to fix another way. The actual shuffle needs KVs, with everything we want to preserve reified into the value component.

The actual shuffle needs KVs

What do you mean by this? The way I understand it is

Reshuffle adds random keys (k, v)

ReifyMetadata maintains a kv, with a nested tuple as a value (value, timestamp, pane_info)

beam/sdks/python/apache_beam/transforms/util.py

Line 966 in db0aa82

return key, (value, timestamp, pane_info)

I guess if we want to avoid this we can have a nested kv in ReifyMetadata so it is

key, (value, (timestamp, pane_info))
Then the regular kv coder should work?
Or can we also use windowed_value as the value in the reify output instead of a tuple with the metadata?
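To make the two shapes concrete, here is a small sketch of the flat and nested forms. `PaneInfo`, `reify_flat`, `reify_nested`, and `restore_nested` are hypothetical names for illustration only; the real logic lives in transforms/util.py and uses Beam's own pane info type.

```python
from typing import Any, NamedTuple, Tuple


class PaneInfo(NamedTuple):
    """Hypothetical, simplified stand-in for Beam's pane info."""
    index: int
    is_last: bool


def reify_flat(key: Any, value: Any, timestamp: int, pane: PaneInfo) -> Tuple:
    # Shape used in this PR: the value side is a 3-element tuple.
    return key, (value, timestamp, pane)


def reify_nested(key: Any, value: Any, timestamp: int, pane: PaneInfo) -> Tuple:
    # Alternative floated above: every tuple has exactly two elements.
    return key, (value, (timestamp, pane))


def restore_nested(element: Tuple) -> Tuple:
    # Undo reify_nested after the shuffle, recovering all metadata.
    key, (value, (timestamp, pane)) = element
    return key, value, timestamp, pane


pane = PaneInfo(index=3, is_last=False)
shuffled = reify_nested("k", "v", 10, pane)
print(restore_nested(shuffled))
```

In the nested form each level is a pair, so a coder that treats two-element tuples as KVs would never need the three-element tuple coder. Whether that is preferable to teaching the runner about the tuple URN is the question being discussed here.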

The original reify just used a kv as the value in the reify function

beam/sdks/python/apache_beam/transforms/util.py

Line 972 in 57d1c35

return key, (value, timestamp)

We are now including pane info as mentioned above so it becomes a tuple

This only happens for global window case, in the custom window case the value for reify is a windowed value

beam/sdks/python/apache_beam/transforms/util.py

Line 996 in 57d1c35

return key, windowed_value.WindowedValue(value, timestamp, [window])

I see. Yea, KV(key, (value, timestamp, pane_info)) should be fine. The runner should not need to have any understanding of the coder for (value, timestamp, pane_info), since that is in general a user type / coder.

OK... @lostluck does know Prism best. But it isn't in line with my take on the model. Tuple is a language-specific esoteric coder that isn't part of the Beam model and shouldn't be explicitly understood by anything outside the Python SDK.

Oh and of course big follow-up question: this coder is not new... presumably it already works, so why does this change require it to become runner-understood?

It's as Claude says.

Basically the Python Tuple coder is an outlier: It pretends to be a standard beam coder, with an arbitrary number of components, and Python plays fast and loose with the notion of coder types. No other "custom coder" uses exposed sub components, essentially. Custom coders are usually fully opaque.

The problem here is I tried to avoid needing to enumerate all Beam coders with sub components that needed processing. We already had a "leaf" list, why do we also need a "composite" list? That means there are two approaches:

This current PR's approach: Promote the janky python approach to be a known thing for all runners/SDKs. Since the URN is already pretending to be a standard URN, this isn't too hard, and it permits other SDKs to interoperate with that coder. AKA, turning the exception to be part of the standard.

We are forced to specify known composite coders to avoid Length Prefixing them unnecessarily.
So instead of just the set of Known Leaf Coders, we would have the set of Known Composite Coders, that don't need length prefixing. Anything else should just be length prefixed.

Eg. We add a list of the known Composite URNs that should not be length prefixed by the existing Leaf list:

beam/sdks/go/pkg/beam/runners/prism/internal/coders.go

Line 38 in f1bc509

var leafCoders = map[string]struct{}{

Where the check should go, so it's the same logic for wrapping unknowns. eg.

beam/sdks/go/pkg/beam/runners/prism/internal/coders.go

Line 129 in f1bc509

if len(c.GetComponentCoderIds()) == 0 && !leaf {

if (len(c.GetComponentCoderIds()) == 0 && !leaf) || !isKnownCompositeCoder(c) {

This has two risks though:

Changed length prefixing behavior may mean tests that are currently passing will fail. It'll be important to run the Java suite locally. (I don't trust the GitHub action to run uncached when it's just a Prism-side change. Only noticed that after I got re-orged. If it takes less than 20m to run, it was cached and can't be trusted.)

The converse issue: What if the Python SDK doesn't know how to deal with a runner side wrapped length prefixed tuple coder? Then a Python fix would be needed. This would hopefully be evident in the test suite uses of tuple coder.

(There's an issue with Java Row coders failing deep in the Java SDK when they're wrapped in a length prefix: the introspection doesn't know how to skip the LP wrapper. See #32931.)
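The second approach above, a known-composite list alongside the known-leaf list, can be sketched as follows. This is one reading of the proposal written in Python for illustration, not a port of the Go condition quoted earlier. The function name `needs_length_prefix` is hypothetical, and the two URN sets are small illustrative subsets rather than the contents of Prism's coders.go.

```python
# Illustrative subsets only; the real lists live in Prism's coders.go.
KNOWN_LEAF_URNS = {
    "beam:coder:bytes:v1",
    "beam:coder:varint:v1",
}
KNOWN_COMPOSITE_URNS = {
    "beam:coder:kv:v1",
    "beam:coder:iterable:v1",
}


def needs_length_prefix(urn: str) -> bool:
    """Decide whether the runner should wrap a coder in a length prefix."""
    # Known leaves can be decoded by the runner directly.
    if urn in KNOWN_LEAF_URNS:
        return False
    # Known composites are not wrapped themselves; their components
    # would be checked one by one by the caller.
    if urn in KNOWN_COMPOSITE_URNS:
        return False
    # Everything else is opaque to the runner, including coders that
    # expose components, such as the Python tuple coder.
    return True


for urn in ("beam:coder:kv:v1", "beam:coder:tuple:v1"):
    print(urn, needs_length_prefix(urn))
```

With this rule the tuple coder is treated like any other custom coder, so the runner never has to know its URN. The two risks listed above still apply, since coders that were previously passed through unwrapped would now be length prefixed.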

Added link to the tracking issue for the prism tuple coder issue to the PR description: #32636

Oh and of course big follow-up question: this coder is not new... presumably it already works, so why does this change require it to become runner-understood?

And specifically for this question: If it's not in the validates runner suite it's extremely hard to test when one doesn't use the SDK. In this case, there was exactly one test for it. There's only ~70 validates runner tests for Python.

sdks/python/apache_beam/transforms/util.py

github-actions bot added the python label Mar 19, 2025

claudevdm force-pushed the shuffle-paneinfo branch from 81d6f1d to 6159384 Compare March 20, 2025 15:31

github-actions bot added go runners prism labels Mar 20, 2025

claudevdm force-pushed the shuffle-paneinfo branch from 6159384 to 2952071 Compare March 20, 2025 15:40

claudevdm added 3 commits March 21, 2025 18:45

Preserve pane index through reshuffle.

146518a

Fix coders.

4de2674

Add coder test.

f30477d

claudevdm force-pushed the shuffle-paneinfo branch from 2952071 to f30477d Compare March 21, 2025 22:45

claudevdm added 6 commits March 22, 2025 19:23

Fix lint error.

45748bd

Change compat version.

182b684

refactor.

0b8fc76

Fix typehint issue.

d30bd76

Use fn api runner.

a8ce918

Fix lint error.

db0aa82

claudevdm requested a review from kennknowles March 24, 2025 17:30

claudevdm marked this pull request as ready for review March 25, 2025 14:53

github-actions bot added the Next Action: Reviewers label Mar 25, 2025

kennknowles requested changes Mar 26, 2025


claudevdm added 2 commits March 26, 2025 19:40

Refactor.

ee64c2c

Remove strange duplciation.

f1bc509
