Flatten of Bounded and Unbounded repeats the union with the RDD for each micro-batch. #18144

Description

@kennknowles

Flatten of BOUNDED and UNBOUNDED PCollections in the Spark runner is implemented by applying SparkContext#union(RDD...) inside DStream.transform(). This unions the same bounded RDD into every micro-batch, so its contents are multiplied in the resulting stream (once per batch).
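
A minimal sketch of the problematic pattern (not the runner's actual code; the socket source and local setup are just for illustration, and `rdd.union(bounded)` stands in for the SparkContext#union(RDD...) call):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FlattenRepeatRepro {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("flatten-repeat");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
    JavaSparkContext jsc = jssc.sparkContext();

    // The BOUNDED side of the Flatten: a fixed RDD.
    JavaRDD<String> bounded = jsc.parallelize(Arrays.asList("a", "b", "c"));

    // Some UNBOUNDED side, here a socket source for illustration.
    JavaDStream<String> unbounded = jssc.socketTextStream("localhost", 9999);

    // The problematic pattern: the same `bounded` RDD is unioned into the
    // RDD of *every* micro-batch, so its elements appear once per batch
    // instead of once in the whole stream.
    JavaDStream<String> flattened = unbounded.transform(rdd -> rdd.union(bounded));

    flattened.print();
    jssc.start();
    jssc.awaitTermination();
  }
}
```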

Spark does not seem to provide an out-of-the-box way to emit a bounded RDD exactly once within a stream.

One approach I tried was to create a stream from a Queue (a single-RDD queueStream), but this is not an option since queue streams do not support checkpointing.
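
The attempt looked roughly like this (a sketch, names are mine). The queue stream emits the bounded RDD in exactly one micro-batch, but Spark's QueueInputDStream fails once checkpointing is enabled on the context:

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class QueueStreamAttempt {

  // Wrap the bounded RDD in a single-RDD queue stream so it is emitted in
  // exactly one micro-batch, then union it with the unbounded stream.
  static JavaDStream<String> flattenOnce(
      JavaStreamingContext jssc, JavaRDD<String> bounded, JavaDStream<String> unbounded) {
    Queue<JavaRDD<String>> queue = new LinkedList<>(Arrays.asList(bounded));
    // oneAtATime = true: one RDD from the queue per batch; once the queue
    // drains, batches from this stream are empty.
    JavaDStream<String> boundedOnce = jssc.queueStream(queue, true);
    // NOTE: QueueInputDStream is not serializable, so this breaks as soon
    // as checkpointing is enabled -- which is why the approach was dropped.
    return unbounded.union(boundedOnce);
  }
}
```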

Another approach would be to create a custom InputDStream that does this.
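
A rough sketch of what such an InputDStream might look like (entirely hypothetical, including the class name; not proposed code). The open question, per the note below, is whether the emitted/not-emitted state behaves correctly across checkpoint recovery:

```java
import org.apache.spark.rdd.RDD;
import org.apache.spark.streaming.StreamingContext;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.dstream.InputDStream;

import scala.Option;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

// Hypothetical: an InputDStream that emits the given RDD in its first
// micro-batch and empty RDDs afterwards.
class SingleEmitInputDStream<T> extends InputDStream<T> {

  private final RDD<T> boundedRdd;
  private boolean emitted = false;

  @SuppressWarnings("unchecked")
  private static <T> ClassTag<T> fakeClassTag() {
    return (ClassTag<T>) ClassTag$.MODULE$.apply(Object.class);
  }

  SingleEmitInputDStream(StreamingContext ssc, RDD<T> boundedRdd) {
    super(ssc, SingleEmitInputDStream.<T>fakeClassTag());
    this.boundedRdd = boundedRdd;
  }

  @Override
  public Option<RDD<T>> compute(Time validTime) {
    if (emitted) {
      // Subsequent batches contribute nothing; return an empty RDD rather
      // than None so downstream unions still see an RDD for every batch.
      return Option.apply(
          boundedRdd.sparkContext().emptyRDD(SingleEmitInputDStream.<T>fakeClassTag()));
    }
    emitted = true;
    return Option.apply(boundedRdd);
  }

  // NOTE: `emitted` is serialized with the DStream graph at checkpoint time;
  // verifying it is restored correctly on recovery (and that the RDD itself
  // survives) is exactly the hard part this issue calls out.

  @Override
  public void start() {}

  @Override
  public void stop() {}
}
```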

An important note: the challenge is to find a solution that holds up under checkpointing and recovery from failure.

Imported from Jira BEAM-1444. Original Jira may contain additional context.
Reported by: amitsela.
