Flatten of Bounded and Unbounded repeats the union with the RDD for each micro-batch. #18144

Description

@kennknowles

Flatten of BOUNDED and UNBOUNDED PCollections in the Spark runner is implemented by applying SparkContext#union(RDD...) inside DStream.transform(). This unions the same bounded RDD into every micro-batch, so its contents are multiplied in the resulting stream (once per batch).
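
A minimal sketch of the problematic pattern (not the runner's actual code; the socket source and local setup are just for illustration, and `rdd.union(bounded)` stands in for the SparkContext#union(RDD...) call):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FlattenRepeatRepro {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("flatten-repeat");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
    JavaSparkContext jsc = jssc.sparkContext();

    // The BOUNDED side of the Flatten: a fixed RDD.
    JavaRDD<String> bounded = jsc.parallelize(Arrays.asList("a", "b", "c"));

    // Some UNBOUNDED side, here a socket source for illustration.
    JavaDStream<String> unbounded = jssc.socketTextStream("localhost", 9999);

    // The problematic pattern: the same `bounded` RDD is unioned into the
    // RDD of *every* micro-batch, so its elements appear once per batch
    // instead of once in the whole stream.
    JavaDStream<String> flattened = unbounded.transform(rdd -> rdd.union(bounded));

    flattened.print();
    jssc.start();
    jssc.awaitTermination();
  }
}
```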

Spark does not seem to provide an out-of-the-box way to emit a bounded RDD exactly once within a stream.

One approach I tried was to create a stream from a Queue (a single-RDD queueStream), but this is not an option since queue streams do not support checkpointing.
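
The attempt looked roughly like this (a sketch, names are mine). The queue stream emits the bounded RDD in exactly one micro-batch, but Spark's QueueInputDStream fails once checkpointing is enabled on the context:

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class QueueStreamAttempt {

  // Wrap the bounded RDD in a single-RDD queue stream so it is emitted in
  // exactly one micro-batch, then union it with the unbounded stream.
  static JavaDStream<String> flattenOnce(
      JavaStreamingContext jssc, JavaRDD<String> bounded, JavaDStream<String> unbounded) {
    Queue<JavaRDD<String>> queue = new LinkedList<>(Arrays.asList(bounded));
    // oneAtATime = true: one RDD from the queue per batch; once the queue
    // drains, batches from this stream are empty.
    JavaDStream<String> boundedOnce = jssc.queueStream(queue, true);
    // NOTE: QueueInputDStream is not serializable, so this breaks as soon
    // as checkpointing is enabled -- which is why the approach was dropped.
    return unbounded.union(boundedOnce);
  }
}
```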

Another approach would be to create a custom InputDStream that does this.
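
A rough sketch of what such an InputDStream might look like (entirely hypothetical, including the class name; not proposed code). The open question, per the note below, is whether the emitted/not-emitted state behaves correctly across checkpoint recovery:

```java
import org.apache.spark.rdd.RDD;
import org.apache.spark.streaming.StreamingContext;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.dstream.InputDStream;

import scala.Option;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

// Hypothetical: an InputDStream that emits the given RDD in its first
// micro-batch and empty RDDs afterwards.
class SingleEmitInputDStream<T> extends InputDStream<T> {

  private final RDD<T> boundedRdd;
  private boolean emitted = false;

  @SuppressWarnings("unchecked")
  private static <T> ClassTag<T> fakeClassTag() {
    return (ClassTag<T>) ClassTag$.MODULE$.apply(Object.class);
  }

  SingleEmitInputDStream(StreamingContext ssc, RDD<T> boundedRdd) {
    super(ssc, SingleEmitInputDStream.<T>fakeClassTag());
    this.boundedRdd = boundedRdd;
  }

  @Override
  public Option<RDD<T>> compute(Time validTime) {
    if (emitted) {
      // Subsequent batches contribute nothing; return an empty RDD rather
      // than None so downstream unions still see an RDD for every batch.
      return Option.apply(
          boundedRdd.sparkContext().emptyRDD(SingleEmitInputDStream.<T>fakeClassTag()));
    }
    emitted = true;
    return Option.apply(boundedRdd);
  }

  // NOTE: `emitted` is serialized with the DStream graph at checkpoint time;
  // verifying it is restored correctly on recovery (and that the RDD itself
  // survives) is exactly the hard part this issue calls out.

  @Override
  public void start() {}

  @Override
  public void stop() {}
}
```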

An important note: the challenge is to find a solution that holds up under checkpointing and recovery from failure.

Imported from Jira BEAM-1444. Original Jira may contain additional context.
Reported by: amitsela.
