Flatten of BOUNDED and UNBOUNDED PCollections in the Spark runner is implemented by applying SparkContext#union(RDD...) inside a DStream.transform(), which causes the same RDD to be unioned into each micro-batch, multiplying its content in the resulting stream by the number of batches.
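For illustration, a minimal sketch of the pattern described above, assuming a bounded JavaRDD and an unbounded JavaDStream (names here are hypothetical, not the runner's actual code):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;

class FlattenSketch {
  static JavaDStream<String> flatten(JavaDStream<String> unbounded, JavaRDD<String> bounded) {
    // transform() re-evaluates this function for every micro-batch, so the
    // same bounded RDD is unioned into each batch, duplicating its contents
    // once per batch in the resulting stream.
    return unbounded.transform(batch -> batch.union(bounded));
  }
}
```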
Spark does not seem to provide an out-of-the-box implementation for this.
One approach I tried was to create a stream from a Queue (a single-RDD stream), but this is not an option since queue streams fail checkpointing; see the sketch below.
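A minimal sketch of that attempt (hypothetical names), wrapping the bounded RDD in a one-element queue stream:

```java
import java.util.LinkedList;
import java.util.Queue;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

class QueueStreamSketch {
  static JavaDStream<String> singleRddStream(JavaStreamingContext jssc, JavaRDD<String> bounded) {
    Queue<JavaRDD<String>> queue = new LinkedList<>();
    queue.offer(bounded);
    // oneAtATime = true emits one queued RDD per batch, so the bounded RDD
    // appears in the first micro-batch only.
    return jssc.queueStream(queue, true);
  }
}
```

The problem is that Spark's QueueInputDStream is deliberately non-serializable, so once a checkpoint directory is set on the context, the job fails when the DStream graph is checkpointed.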
Another approach would be to create a custom InputDStream that does this; a rough sketch follows below. An important note: the challenge is to find a solution that holds up under checkpointing and recovery from failure.
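A minimal sketch of such an InputDStream, assuming it should emit the bounded RDD once and empty RDDs afterwards (class and member names are hypothetical). Note that the one-shot flag below lives only in driver memory, so this naive version does not survive checkpoint recovery, which is exactly the hard part:

```java
import org.apache.spark.api.java.JavaSparkContext$;
import org.apache.spark.rdd.RDD;
import org.apache.spark.streaming.StreamingContext;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.dstream.InputDStream;
import scala.Option;
import scala.reflect.ClassTag;

public class SingleEmitInputDStream<T> extends InputDStream<T> {

  private final StreamingContext streamingContext;
  private final RDD<T> boundedRdd;
  private final ClassTag<T> tag;
  private boolean emitted = false; // driver-side state only: lost on recovery

  public SingleEmitInputDStream(StreamingContext ssc, RDD<T> boundedRdd) {
    super(ssc, JavaSparkContext$.MODULE$.<T>fakeClassTag());
    this.streamingContext = ssc;
    this.boundedRdd = boundedRdd;
    this.tag = JavaSparkContext$.MODULE$.<T>fakeClassTag();
  }

  @Override
  public void start() {
    // nothing to start
  }

  @Override
  public void stop() {
    // nothing to stop
  }

  @Override
  public Option<RDD<T>> compute(Time validTime) {
    if (!emitted) {
      // First micro-batch: emit the bounded RDD exactly once.
      emitted = true;
      return Option.apply(boundedRdd);
    }
    // Subsequent micro-batches: emit an empty RDD so a union adds nothing.
    return Option.apply(streamingContext.sparkContext().emptyRDD(tag));
  }
}
```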
Imported from Jira BEAM-1444. Original Jira may contain additional context.
Reported by: amitsela.