Spark3 structured streaming micro_batch read support #1

SreeramGarlapati · 2021-06-02T06:28:08Z

This work extends the idea in issue apache#179 and the Spark2 work done in PR apache#2272; the only difference is that this PR targets Spark3.

In the current implementation:

An Iceberg Snapshot is the upper bound for a MicroBatch. A given MicroBatch only spans files within a single Snapshot; it is never composed of multiple Snapshots. BatchSize is used to limit the number of files within a given Snapshot.
The streaming reader errors out if it encounters any Snapshot whose type is not APPEND.
Handling DELETE, REPLACE and OVERWRITE Snapshots is left for future work.
Columnar reads are not enabled; this is also left for future work.
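
The batching rules above can be illustrated with a small, self-contained sketch. This is not the code in this PR and does not use Iceberg's or Spark's APIs: `Snapshot`, `MicroBatch` and `plan_micro_batches` are hypothetical names invented for illustration, and the sketch assumes that BatchSize caps the number of files per MicroBatch, so a Snapshot with more files than BatchSize is split into several MicroBatches.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a table snapshot (not Iceberg's Snapshot API)."""
    snapshot_id: int
    operation: str     # e.g. "append", "delete", "replace", "overwrite"
    files: List[str]   # data files added by this snapshot


@dataclass(frozen=True)
class MicroBatch:
    """Hypothetical micro-batch: a slice of files from exactly one snapshot."""
    snapshot_id: int
    files: List[str]


def plan_micro_batches(snapshots: List[Snapshot], batch_size: int) -> Iterator[MicroBatch]:
    """Yield micro-batches in snapshot order.

    Assumptions (illustrative only):
    - a micro-batch never crosses a snapshot boundary;
    - batch_size caps the number of files per micro-batch;
    - any snapshot that is not an append is rejected with an error.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for snap in snapshots:
        if snap.operation != "append":
            raise ValueError(
                f"snapshot {snap.snapshot_id} has unsupported operation {snap.operation!r}"
            )
        # Slice this snapshot's files into chunks of at most batch_size files.
        for start in range(0, len(snap.files), batch_size):
            yield MicroBatch(snap.snapshot_id, snap.files[start:start + batch_size])


if __name__ == "__main__":
    history = [
        Snapshot(1, "append", ["a.parquet", "b.parquet", "c.parquet"]),
        Snapshot(2, "append", ["d.parquet"]),
    ]
    for batch in plan_micro_batches(history, batch_size=2):
        print(batch.snapshot_id, batch.files)
```

In this sketch the third file of snapshot 1 forms its own MicroBatch instead of being merged with the file from snapshot 2, which mirrors the "Snapshot is the upper bound" rule described above.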


SreeramGarlapati added 28 commits May 19, 2021 00:40

Wireframe.

e024ba0

remove StreamingOffset related change - as it is handled in another PR …

61611fd

…apache#2615

Merge branch 'master' of https://github.com/apache/iceberg into spark…

31f2a03

…3.stream.read

test changes

6a2adcb

Merge branch 'master' of https://github.com/apache/iceberg into spark…

f1565cc

…3.stream.read

rudimentary implementation for spark3 streaming reads from iceberg table

def1bc0

rudimentary implementation for spark3 streaming reads from iceberg table

a96b3e6

unit test

b18cfdc

works!

993bd9e

Merge branch 'master' of https://github.com/apache/iceberg into spark…

94dd103

…3.stream.read

Unit test.

7063fd3

Merge branch 'master' of https://github.com/apache/iceberg into spark…

96fbc87

…3.stream.read

Unit test.

17f6eb8

checkpoint done!

2044d94

checkpoint done!

15e33e9

refactor

b8e5b34

Merge branch 'master' of https://github.com/apache/iceberg into spark…

fa4f2ae

…3.stream.read

test batchSize option

633afbb

refactor

16d3984

checkstyle

919e386

checkstyle

e3fb1fe

fix indent

bee1690

unit test - full coverage

daee48a

add logic for ignoring deletes and replace

0a65617

minor refactor

f9e9e66

Merge branch 'master' of https://github.com/apache/iceberg into spark…

c7658d3

…3.stream.read

minor refactor

67e2d27

remove ignoreDelete and ignoreReplace.

072c911

github-actions bot added the SPARK label Jun 2, 2021

SreeramGarlapati merged commit 41041f3 into spark3.stream.read.baseline Jun 2, 2021

SreeramGarlapati deleted the spark3.stream.read.1 branch June 2, 2021 06:30
