Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task]: Merge Flink runner translations ? #28617

Open
1 of 15 tasks
jto opened this issue Sep 22, 2023 · 0 comments
Open
1 of 15 tasks

[Task]: Merge Flink runner translations ? #28617

jto opened this issue Sep 22, 2023 · 0 comments

Comments

@jto
Copy link
Contributor

jto commented Sep 22, 2023

What needs to happen?

Currently the Flink runner has 4 alternative implementations to translate a pipeline.

  • FlinkStreamingTransformTranslators
    Based on DataStream API. Only support streaming workflows. It can also support batch workflows once [Flink Runner] Add UseDataStreamForBatch option to Flink runner to enable batch execution on DataStream API #28614 is merged. It uses the "native" org.apache.beam.sdk.Pipeline class and the implementation is based on Flink's DataStream API.
  • FlinkStreamingPortablePipelineTranslator, used for portable streaming pipelines. It has a lot of "overlap" (copy/pasted code) with FlinkStreamingTransformTranslators but supports a slightly different set of transforms. Based on Flink DataStream API too.
  • FlinkBatchPortablePipelineTranslator -> Used for batch portable pipelines. Based on the deprecated DataSet API.
  • FlinkBatchTransformTranslators -> Used for batch "native" java pipelines. . Based on the deprecated DataSet API.

FlinkBatchPortablePipelineTranslator and FlinkBatchTransformTranslators should both be deprecated since the are implemented using the deprecated DataSet API which Flink will eventually remove.

FlinkBatchTransformTranslators can be replaced by FlinkStreamingTransformTranslators (#28614).

Given the similarities between the classes, I think FlinkStreamingTransformTranslators and FlinkStreamingPortablePipelineTranslator could be merged to support both portable and non portable pipelines with a unique translation layer.

Since we can easily convert instances of org.apache.beam.sdk.Pipeline to RunnerAPI.Pipeline, it should be possible to support all types of pipelines (native streaming, portable streaming, native batch, portable batch) with the same translation implementation.

The goal of this task is to introduce this new unique implementation and to eventually remove all the other alternatives.

Assuming this proposal is of interest, my colleagues and myself could implement it.

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant