
Support structured streaming read for Iceberg #2272

Closed
wants to merge 6 commits

Conversation

XuQianJin-Stars
Contributor

An implementation of Spark Structured Streaming read that tracks the currently processed files of an Iceberg table. This PR is a split of PR #796, Structured streaming read for Iceberg.
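
For illustration, a minimal sketch of how a structured streaming read of an Iceberg table could be started from Spark; the table location, checkpoint path, and console sink below are hypothetical and not defined by this PR:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class IcebergStreamingReadSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-streaming-read")
        .getOrCreate();

    // "iceberg" is the DataSourceV2 short name; the table location is made up.
    Dataset<Row> stream = spark.readStream()
        .format("iceberg")
        .load("hdfs://nn:8020/warehouse/db/events");

    // Write each micro-batch to the console; the checkpoint path is also made up.
    StreamingQuery query = stream.writeStream()
        .format("console")
        .option("checkpointLocation", "/tmp/iceberg-stream-checkpoint")
        .start();

    query.awaitTermination();
  }
}
```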

@SreeramGarlapati
Collaborator

Hi @XuQianJin-Stars - is there anything pending in this PR? Please let me know if you need any help to push this. Happy to collaborate & contribute.

@XuQianJin-Stars
Contributor Author

Hi @XuQianJin-Stars - is there anything pending in this PR? Please let me know if you need any help to push this. Happy to collaborate & contribute.

Yes, thank you very much. This feature is already available in our internal work, and I want to improve it in the community.

@SreeramGarlapati
Collaborator

SreeramGarlapati commented Apr 19, 2021

@XuQianJin-Stars - this is great. Is there anything pending in this PR? Or are you waiting on any inputs?
Thanks a lot for your contribution.

@rdblue @RussellSpitzer @aokolnychyi - could you folks please add your review/inputs?
We are in need of this change - truly appreciate your help.

Contributor

@jackye1995 left a comment


Looks good to me in general given what the tests cover. I will try running this in a cluster to see if there is any issue and reply back later.

CloseableIterator<FileScanTask> taskIter = taskIterable.iterator()) {
try (CloseableIterable<FileScanTask> taskIterable = open(indexedManifests.get(idx).first(),
scanAllFiles)) {
CloseableIterator<FileScanTask> taskIter = taskIterable.iterator();
Contributor


nit: just declaring it as Iterator should be enough.
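
For illustration, a self-contained sketch of the suggestion; the helper method and counting logic are made up, and the only point is that the local variable can be the plain Iterator interface while the CloseableIterable is the resource being managed:

```java
import java.util.Iterator;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.io.CloseableIterable;

class IteratorDeclarationSketch {
  // The CloseableIterable is what needs closing; the iterator local can be declared
  // as Iterator, since nothing CloseableIterator-specific is used on it.
  long countTasks(CloseableIterable<FileScanTask> taskIterable) throws Exception {
    long count = 0;
    try (CloseableIterable<FileScanTask> tasks = taskIterable) {
      Iterator<FileScanTask> taskIter = tasks.iterator();
      while (taskIter.hasNext()) {
        taskIter.next();
        count++;
      }
    }
    return count;
  }
}
```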

@@ -434,6 +454,35 @@ private static void mergeIcebergHadoopConfs(
return tasks;
}

// An extracted method that will be overridden by StreamingReader, because the tasks generated by streaming
// are per batch and cannot be planned beforehand like in Reader.
protected boolean checkEnableBatchRead(List<CombinedScanTask> taskList) {
Contributor


with this, enableBatchRead at L331 can be simplified.

return startingOffset;
}

/**
Contributor


nit: we should provide a more meaningful doc; otherwise we should just remove it instead of repeating the method name.

* @return MicroBatch of list
*/
@VisibleForTesting
@SuppressWarnings("checkstyle:HiddenField")
Contributor


why suppress this warning?

Contributor Author


why suppress this warning?

Task :iceberg-spark2:checkstyleMain FAILED
[ant:checkstyle] [ERROR] /opt/sourcecode/iceberg-src/spark2/src/main/java/org/apache/iceberg/spark/source/StreamingReader.java:281:60: 'startOffset' hides a field. [HiddenField]

Member


I think generally we just change the field visibility
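
As a point of reference, a minimal sketch of the common alternative to suppressing checkstyle's HiddenField rule; the field and setter names are made up, not the PR's code:

```java
class HiddenFieldSketch {
  private long startOffset;

  // Violates HiddenField: the parameter name shadows the field.
  // void setStartOffset(long startOffset) { this.startOffset = startOffset; }

  // Avoids the warning without @SuppressWarnings: rename the parameter.
  void setStartOffset(long newStartOffset) {
    this.startOffset = newStartOffset;
  }
}
```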


Configuration conf = new Configuration(lazyBaseConf());
Table table = getTableAndResolveHadoopConfiguration(options, conf);
String caseSensitive = lazySparkSession().conf().get("spark.sql.caseSensitive");
Contributor


Maybe add a comment about why this is resolved from the session conf instead of the merged session conf / source options?

@@ -127,6 +130,23 @@ public StreamWriter createStreamWriter(String runId, StructType dsStruct,
return new StreamingWriter(table, io, encryptionManager, options, queryId, mode, appId, writeSchema, dsStruct);
}

@Override
public MicroBatchReader createMicroBatchReader(Optional<StructType> schema, String checkpointLocation,
Contributor


So it seems like we're not using the checkpointLocation here. Do we not need to store anything to be able to recover on failure? I think this is the case, because as we read data it doesn't really get "consumed", but I just want to make sure.
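
For what it's worth, a rough sketch of why a source built on the Spark 2.4 MicroBatchReader API can usually ignore checkpointLocation: Spark itself persists the JSON form of each Offset in the checkpoint's offset log and hands it back on restart, so the source only needs to serialize and parse its offsets. The field names below are invented for illustration and may differ from the PR's StreamingOffset:

```java
import org.apache.spark.sql.sources.v2.reader.streaming.Offset;

// Hypothetical offset shape; the real offset class in this PR may differ.
class StreamingOffsetSketch extends Offset {
  private final long snapshotId;
  private final int position;

  StreamingOffsetSketch(long snapshotId, int position) {
    this.snapshotId = snapshotId;
    this.position = position;
  }

  @Override
  public String json() {
    // Spark writes this JSON into <checkpointLocation>/offsets/<batchId> and passes it
    // back through MicroBatchReader#deserializeOffset when the query restarts.
    return String.format("{\"snapshot_id\":%d,\"position\":%d}", snapshotId, position);
  }
}
```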

}

@Override
public void commit(Offset end) {
Contributor


Would it make sense to clean up the cachedPendingBatches here, or would they already have been removed at this point?
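
For discussion, a rough sketch of the kind of cleanup being asked about, assuming the reader keeps pending batches keyed by the end of their offset range; the structure and names below are guesses for illustration, not the PR's actual code:

```java
import java.util.Map;
import java.util.TreeMap;

class PendingBatchCleanupSketch {
  // Hypothetical cache: end-offset position -> planned batch covering that range.
  private final TreeMap<Long, Object> cachedPendingBatches = new TreeMap<>();

  // Called when Spark reports that everything up to endPosition was durably processed.
  void commit(long endPosition) {
    // Drop every cached batch whose end offset is at or before the committed offset;
    // those ranges will never be requested again.
    Map<Long, Object> committed = cachedPendingBatches.headMap(endPosition, true);
    committed.clear();
  }
}
```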

@holdenk
Contributor

holdenk commented May 11, 2021

I really appreciate the time given to the test case :)

@XuQianJin-Stars
Contributor Author

Hi @holdenk, thank you very much for your review. I will reply to your comments later.

(startOffset.shouldScanAllFiles() || isAppend(table.snapshot(startOffset.snapshotId())));
}

private static void assertNoOverwrite(Snapshot snapshot) {
Collaborator

SreeramGarlapati commented May 14, 2021


Can we please take a configuration for whether or not to assert? Essentially, we want to factor in cases like:

  1. compaction, which performs harmless REPLACE operations, and
  2. GDPR deletes

neither of which will impact the structured streaming result.

https://docs.microsoft.com/en-us/azure/databricks/delta/delta-streaming#ignore-updates-and-deletes
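
A hypothetical sketch of what such a configuration gate could look like; the option semantics and class are made up for illustration, and only the DataOperations constants and Snapshot API are existing Iceberg types:

```java
import org.apache.iceberg.DataOperations;
import org.apache.iceberg.Snapshot;

class SnapshotValidationSketch {
  // Hypothetical read option: when true, REPLACE snapshots (e.g. compaction) and
  // DELETE snapshots (e.g. GDPR deletes) are skipped instead of failing the stream.
  private final boolean skipNonAppendSnapshots;

  SnapshotValidationSketch(boolean skipNonAppendSnapshots) {
    this.skipNonAppendSnapshots = skipNonAppendSnapshots;
  }

  boolean shouldProcess(Snapshot snapshot) {
    if (DataOperations.APPEND.equals(snapshot.operation())) {
      return true;
    }
    if (skipNonAppendSnapshots &&
        (DataOperations.REPLACE.equals(snapshot.operation()) ||
         DataOperations.DELETE.equals(snapshot.operation()))) {
      return false;  // ignore the snapshot, but do not fail the query
    }
    throw new IllegalStateException(
        "Cannot process snapshot with operation: " + snapshot.operation());
  }
}
```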

SreeramGarlapati added a commit to SreeramGarlapati/iceberg that referenced this pull request Jun 2, 2021
This work is an extension of the idea in issue apache#179 and the Spark2 work done in PR apache#2272, only that this is for Spark3.

**In the current implementation:**
* An Iceberg snapshot is the upper bound for a MicroBatch. A given MicroBatch will only span a single snapshot; it will not be composed of multiple snapshots. BatchSize is used to limit the number of files within a given snapshot.
* The streaming reader will error out if it encounters any snapshot whose type is not `APPEND`.
* Handling `DELETES`, `REPLACE` & `OVERWRITES` is left for the future.
* Columnar reads are not enabled; also left for the future.
@XuQianJin-Stars
Contributor Author

XuQianJin-Stars commented Jul 15, 2021

Hi @SreeramGarlapati @holdenk @jackye1995 @rdblue @RussellSpitzer, sorry it took so long to fix the problem. Do you have time to help continue reviewing this PR?

@holdenk
Contributor

holdenk commented Sep 13, 2022

Hey folks (incl. @XuQianJin-Stars & @RussellSpitzer) -- is this something that people are still open to working on? We're running into a situation with the current limited streaming support where the lack of maxFilesPerTrigger (or its equivalent), which is included in this PR, keeps us from being able to do combined historical + streaming reads from Iceberg tables.

@holdenk
Contributor

holdenk commented Sep 14, 2022

& @flyrain - what are your thoughts?

@singhpk234
Contributor

is this something that people are still open to working on?

+1, I have a PR out for supporting rate limiting in Spark 3:

cc @holdenk


This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions bot added the stale label Jul 27, 2024

github-actions bot commented Aug 3, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this Aug 3, 2024
6 participants