
Spark 3.3: Add a procedure to generate table changes #6012

Merged: 30 commits into apache:master on Mar 4, 2023

Conversation

flyrain (Contributor, Author) commented Oct 18, 2022

Add a procedure to generate table changes. Here are the changes in this PR:

  1. Defines the user interface (see the example call below).
  2. Generates the update pre-image and post-image when the user provides identifier columns.
  3. Uses a window function instead of a join for better performance.

cc @aokolnychyi @rdblue @szehon-ho @kbendick @anuragmantri @karuppayya @chenjunjiedada @RussellSpitzer
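
For reference, a hedged example of invoking the procedure from Spark SQL, written in the test-helper style used later in this thread (the procedure name matches the create_changelog_view name settled on during review; catalogName, tableName, and the snapshot IDs in the options map are placeholders):

sql(
    "CALL %s.system.create_changelog_view("
        + "table => '%s', "
        + "options => map('start-snapshot-id', '1', 'end-snapshot-id', '2'))",
    catalogName, tableName);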

functions.lit(ChangelogOperation.UPDATE_POSTIMAGE.name()));

// remove the carry-over rows
Dataset<Row> dfWithoutCarryOver = removeCarryOvers(preImageDf.union(postImageDf));
Contributor Author:

Should we make it optional since it is a heavy operation?

Contributor:

Is there another algorithm we can consider that would make it cheaper? Will something like this work?

- Load DELETEs and INSERTs as a DF
- Repartition the DF by primary key and _change_ordinal, and locally sort by primary key, _change_ordinal, _operation_type
- Call mapPartitions with a closure that looks at the previous, current, and next rows
  - If the previous, current, and next row keys are all different, output the current row as-is
  - If the next row key is the same, the current row must be a DELETE and the next row must be an INSERT (if not -> exception)
      - If the other columns beyond the key are the same, it is a copied-over row
          - Output null if unchanged rows should be ignored
          - Output the current row as-is if all rows should be produced
      - If the other columns beyond the key are different, it is an update
          - Output the current row as the pre-update image
  - If the previous row key is the same as the current one, the current row must be an INSERT and the previous row must be a DELETE
      - If the other columns beyond the key are the same, it is a copied-over row
          - Output null if unchanged rows should be ignored
          - Output the current row as-is if all rows should be produced
      - If the other columns beyond the key are different, it is an update
          - Output the current row as the post-update image

That would require reading the changes only once, doing a single hash-based shuffle to co-locate rows for the same key and change ordinal, keeping at most 3 rows in memory at a time. Seems fairly cheap?
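
A minimal sketch of this single-shuffle shape (not the PR's final implementation): one repartition by key and change ordinal, a local sort, and a single mapPartitions pass. The table name, the id key column, and the markAdjacentPairs helper are illustrative placeholders; the sort uses the _change_type metadata column so a DELETE lands right before its matching INSERT.

import java.util.Iterator;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;

public class SingleShuffleChangelogSketch {

  public static Dataset<Row> resolveChanges(SparkSession spark) {
    // load DELETEs and INSERTs from the changelog metadata table (table name is a placeholder)
    Dataset<Row> changes = spark.read().format("iceberg").load("db.tbl.changes");

    // one hash shuffle co-locates all rows with the same key and change ordinal
    Column[] keyCols = {changes.col("id"), changes.col("_change_ordinal")};
    // a local sort puts a DELETE right before its matching INSERT within each group
    Column[] sortCols = {
      changes.col("id"), changes.col("_change_ordinal"), changes.col("_change_type")
    };

    return changes
        .repartition(keyCols)
        .sortWithinPartitions(sortCols)
        .mapPartitions(
            (MapPartitionsFunction<Row, Row>) SingleShuffleChangelogSketch::markAdjacentPairs,
            RowEncoder.apply(changes.schema()));
  }

  // Placeholder for the look-ahead comparison described above: emit unrelated rows as-is,
  // turn DELETE/INSERT pairs whose non-key values differ into pre/post update images,
  // and drop pairs whose values are identical (carry-overs).
  private static Iterator<Row> markAdjacentPairs(Iterator<Row> rows) {
    return rows; // stub: the real logic compares each row with the next one
  }
}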

Contributor:

I don't think I understand why we would need the previous row and the next row. If we are iterating over rows, then the current will become the previous, so we should only look forward or backward right?

Contributor Author:

Thanks @aokolnychyi for the suggestion. Makes sense to shuffle only once. Agreed with @rdblue that just looking forward should be enough; no need to search bidirectionally. Will make the change accordingly.

Contributor Author:

Made the change, could you take a look?

Contributor (@aokolnychyi, Nov 29, 2022):

I agree we can pre-compute the required state by just looking ahead.

flyrain (Contributor, Author) commented Oct 19, 2022

The test failure is not related.

flyrain changed the title from "Add a procedure to generate table changes" to "Spark 3.3: Add a procedure to generate table changes" on Oct 19, 2022
@aokolnychyi (Contributor):

Let me take a look today.


ProcedureParameter.optional("table_change_view", DataTypes.StringType),
ProcedureParameter.optional("identifier_columns", DataTypes.StringType),
ProcedureParameter.optional("start_timestamp", DataTypes.TimestampType),
ProcedureParameter.optional("end_timestamp", DataTypes.TimestampType),
Contributor:

I am a bit worried about the number of parameters used to configure boundaries. What if we replaced all of them with generic options and passed those options along when loading the DataFrame? Then, instead of determining which snapshots match our timestamp range in the procedure, we would do that when scanning the changelog table. That way, users would be able to use timestamp boundaries not only via the procedure but also via the DataFrame. Right now, we only support snapshot ID boundaries.

Contributor Author:

Makes sense to allow the DataFrame to consume a timestamp range. Will create a follow-up PR for that. For this procedure, we still need all of these parameters, right? What do you mean by replacing all of them with generic options?

Contributor:

I am not sure. I'd consider having read_options or options as a map that would be passed while loading deletes and inserts as a DataFrame. Then users can specify boundaries directly in the map.

We already respect these options from SparkReadOptions in the changes table:

// Start snapshot ID used in incremental scans (exclusive)
public static final String START_SNAPSHOT_ID = "start-snapshot-id";

// End snapshot ID used in incremental scans (inclusive)
public static final String END_SNAPSHOT_ID = "end-snapshot-id";

We could add start-timestamp, end-timestamp, and start-snapshot-id-inclusive.
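
For context, a hedged sketch of how a caller passes those existing keys today when loading the changelog table as a DataFrame (assuming a SparkSession named spark; the snapshot IDs and table identifier are placeholders):

Dataset<Row> changes =
    spark
        .read()
        .format("iceberg")
        .option("start-snapshot-id", "1")  // exclusive, placeholder value
        .option("end-snapshot-id", "2")    // inclusive, placeholder value
        .load("db.tbl.changes");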

Contributor Author:

Changed to options in the procedure. Will add the timestamp range in another PR.

ProcedureParameter.optional("table_change_view", DataTypes.StringType),
ProcedureParameter.optional("identifier_columns", DataTypes.StringType),
ProcedureParameter.optional("start_timestamp", DataTypes.TimestampType),
ProcedureParameter.optional("end_timestamp", DataTypes.TimestampType),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure. I'd consider having read_options or options as a map that would be passed while loading deletes and inserts as DataFrame. Then users can specify boundaries directly in the map.

We already respect these options from SparkReadOptions in the changes table:

// Start snapshot ID used in incremental scans (exclusive)
public static final String START_SNAPSHOT_ID = "start-snapshot-id";

// End snapshot ID used in incremental scans (inclusive)
public static final String END_SNAPSHOT_ID = "end-snapshot-id";

We could add start-timestamp and end-timestamp, start-snapshot-id-inclusive.

@@ -53,6 +53,7 @@ private static Map<String, Supplier<ProcedureBuilder>> initProcedureBuilders() {
mapBuilder.put("ancestors_of", AncestorsOfProcedure::builder);
mapBuilder.put("register_table", RegisterTableProcedure::builder);
mapBuilder.put("publish_changes", PublishChangesProcedure::builder);
mapBuilder.put("generate_changes", GenerateChangesProcedure::builder);
Contributor:

Are there any alternative names? I am not sure the procedure actually generates changes.
Let's think a bit. It is not bad but I wonder whether we can be a bit more specific.

Contributor Author:

How about create_change_view or generate_change_view?

Contributor Author:

Other options: register_change_view, create_changelog_view.

.filter(c -> !c.equals(MetadataColumns.CHANGE_TYPE.name()))
.map(df::col)
.toArray(Column[]::new);
return transform(df, repartitionColumns);
Contributor Author:

Reused the same changelog iterator for removing carry-over rows only. I think we can optimize it here, for example, by using a window function. WDYT?

Member:

I'm not sure I understand the comment here. Don't we have the iterator so we don't need to do this?

Contributor Author:

Here we only remove carry-over rows, without computing the updated rows. The iterator is built for both: it checks whether rows are updated rows and whether they are carry-over rows. The first check is not necessary here; we only need the second one to see if two rows are identical. If they are, they are carry-over rows and we remove them.

Contributor Author:

The motivation for building the changelog iterator is to combine the two operations in one pass. But if there is only one operation, a window function seems to fit better.
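
A hedged sketch of that carry-over-only pass as a window function (not the PR's implementation; it ignores duplicate-row edge cases and assumes the change-type column is named _change_type):

import java.util.Arrays;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.lit;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class CarryoverRemovalSketch {

  // A carry-over is a DELETE and an INSERT that agree on every column except _change_type,
  // so partitioning by all other columns turns each carried-over row into one of a pair.
  static Dataset<Row> removeCarryovers(Dataset<Row> df) {
    Column[] partitionSpec =
        Arrays.stream(df.columns())
            .filter(c -> !c.equals("_change_type"))
            .map(df::col)
            .toArray(Column[]::new);

    WindowSpec window = Window.partitionBy(partitionSpec);

    return df.withColumn("_twin_count", count(lit(1)).over(window))
        .filter(col("_twin_count").equalTo(1)) // keep rows that have no identical twin
        .drop("_twin_count");
  }
}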

flyrain (Contributor, Author) commented Jan 12, 2023

Ready for another look. cc @RussellSpitzer @szehon-ho @aokolnychyi

flyrain (Contributor, Author) commented Jan 12, 2023

retest this please

df = transform(df, repartitionColumns);
} else {
LOG.warn("Cannot compute the update-rows because identifier columns are not set");
if (removeCarryoverRow) {
Member:

Can we pull the if (removeCarryoverRow) outside of this if statement?

Member:

Basically:

if (identifier) {

}

if (carryover) {

}

Contributor Author:

The branch has been removed. Let me know if there is anything I missed.

(k, v) -> {
if (k.toString().equals(SparkReadOptions.START_TIMESTAMP)
|| k.toString().equals(SparkReadOptions.END_TIMESTAMP)) {
options.put(k.toString(), toMillis(v.toString()));
Member:

I'm a little lost on our conversion here. Don't we already have code to convert this read option from String => millis within the reader itself?

Contributor Author:

The read option only accepts a string of milliseconds, not a timestamp string like 2019-02-08 03:29:51.215. Here is an example of the read option:

// time travel to October 26, 1986 at 01:21:00
spark.read
    .option("as-of-timestamp", "499162860000")
    .format("iceberg")
    .load("path/to/table")

Contributor:

We shouldn't do any conversion here. If we want to support timestamps, we should add support to the reader. It is possible by adding new functionality to SparkReadConf. For now, I'd avoid any transformations to options in this PR.

Contributor Author:

Removed the conversion. To support both formats, I will file a separate PR on the reader side.

@aokolnychyi (Contributor):

Getting to this PR soon.

* <li>(id=1, data='b', op='UPDATE_AFTER')
* </ul>
*/
public class GenerateChangesProcedure extends BaseProcedure {
Contributor:

I am not sure about the name. I don't have a great alternative but it does not seem to me like we generate changes in this procedure. It seems more like we register a changelog view. Any alternatives?

private static final ProcedureParameter[] PARAMETERS =
new ProcedureParameter[] {
ProcedureParameter.required("table", DataTypes.StringType),
ProcedureParameter.optional("table_change_view", DataTypes.StringType),
Contributor:

What about changelog_view, changelog_view_name or similar?

private static final ProcedureParameter[] PARAMETERS =
new ProcedureParameter[] {
ProcedureParameter.required("table", DataTypes.StringType),
ProcedureParameter.optional("table_change_view", DataTypes.StringType),
Contributor:

Defaulting the name will only work in the session catalog but I am not sure we have to do anything about it. Other catalogs will not support views.

Member:

We can add in a precondition: "catalog blah does not support views".

Contributor Author:

Hi @aokolnychyi, could you elaborate a bit? We pass the view name in the output row. I assume it is the same whether the user gives the view name or the procedure uses a default name.

Contributor Author:

Got it by checking this comment #6012 (comment). Let me put a precondition check on the catalog type.

Contributor Author:

Looking a bit more: not only does SparkSessionCatalog work, SparkCatalog works as well. My test TestGenerateChangesProcedure inherits from SparkCatalogTestBase, which covers 3 types of catalogs. They all work well. Am I missing something here?

private static final StructType OUTPUT_TYPE =
new StructType(
new StructField[] {
new StructField("view_name", DataTypes.StringType, false, Metadata.empty())
Contributor:

This name should probably match the input arg name.

return identifierColumns;
}

private Dataset<Row> changelogRecords(String tableName, InternalRow args) {
Contributor:

nit: To me, it would be better to pull the needed arguments before calling this method rather than passing the row here and doing the extraction within the method itself.

Contributor Author (@flyrain, Feb 20, 2023):

Good idea. Fixed in the new commit.


private String[] identifierColumns(InternalRow args, String tableName) {
String[] identifierColumns = new String[0];
if (!args.isNullAt(5) && !args.getString(5).isEmpty()) {
Contributor:

What if the provided identifier columns don't match with identifier columns defined on one of the scan snapshots?

if (identifierColumns.length == 0) {
Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name());
Table table = loadSparkTable(tableIdent).table();
identifierColumns = table.schema().identifierFieldNames().toArray(new String[0]);
Contributor:

What if some older snapshots have another set of identifier fields? We don't have to support that, but I wonder whether we can validate in the reader that all snapshots being scanned have the expected identifier columns or that those are undefined. Because if we use a set of identifier columns that is different from the real ones, it will become a problem.

Contributor Author:

Combined with comment #6012 (comment), we could push it down to the reader to validate the identifier columns of each snapshot.

Column[] repartitionColumns = getRepartitionExpr(df, identifierColumns);
df = transform(df, repartitionColumns);
} else {
LOG.warn("Cannot compute the update-rows because identifier columns are not set");
Contributor:

I don't think we should proceed with the execution if the user asked to compute pre/post images but that's not possible.

Contributor Author:

Made the change.

.mapPartitions(
(MapPartitionsFunction<Row, Row>)
rowIterator ->
ChangelogIterator.iterator(rowIterator, changeTypeIdx, repartitionIdx),
Contributor:

I had a few questions about ChangelogIterator, which I left on #6344.

sql("CALL %s.system.generate_changes(table => '%s')", catalogName, tableName);

String viewName = (String) returns.get(0)[0];
assertEquals(
Member:

Would this output be different without the identifier columns?

Contributor Author:

Yes, it will. I also made the change that computing updates is off by default, so the user has to explicitly set it to true to honor the identifier columns; otherwise, they are not used.

* data='a', op='DELETE') and (id=1, data='a', op='INSERT'), despite it not being an actual change
* to the table. The iterator finds the carry-over rows and removes them from the result.
*
* <p>An update-row is converted from a pair of delete row and insert row. Identifier columns are
Member:

pair of a delete row and an insert row

* to the table. The iterator finds the carry-over rows and removes them from the result.
*
* <p>An update-row is converted from a pair of delete row and insert row. Identifier columns are
* needed for identifying whether they refer to the same row. You can either set Identifier Field
Member:

needed => used?

Identifier columns are used for determining whether an insert and delete record refer to the same row. If the two records share the same values for the identity columns they are considered to be before and after states of the same row.

?

Contributor Author:

Made the change accordingly.

df = removeCarryoverRows(df);
}

String viewName = viewName(args, tableName);
Member:

We could be checking earlier whether or not the catalog specified by this view name is allowed to create views

Member (@RussellSpitzer) left a comment:

I think all of my questions are in the comments now; I don't see any major blockers on this for me.

flyrain (Contributor, Author) commented Feb 22, 2023

Resolved comments. Ready for another look. cc @aokolnychyi @RussellSpitzer

flyrain (Contributor, Author) commented Feb 24, 2023

Thanks a lot for the detailed review, @aokolnychyi ! Resolved them all and ready for another look.

Contributor (@aokolnychyi) left a comment:

Final set of comments and should be good to go.

@@ -173,6 +174,10 @@ stringMap
: MAP '(' constant (',' constant)* ')'
;

stringArray
Contributor:

Looks good.
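
Presumably this stringArray rule is what allows identifier columns to be passed as an array literal in a CALL statement. A hedged usage example in the same test-helper style (the exact parameter name and array syntax are inferred from this thread; the column name is a placeholder):

sql(
    "CALL %s.system.create_changelog_view("
        + "table => '%s', "
        + "identifier_columns => array('id'))",
    catalogName, tableName);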


String viewName = (String) returns.get(0)[0];

// the carry-over rows (2, 'e', 12, 'DELETE', 1), (2, 'e', 12, 'INSERT', 1) are removed, even
Contributor:

Is this comment accurate? I thought we were supposed to keep carryovers in this case.

Contributor Author:

It is still accurate. The default behavior is NOT computing updates but removing carry-overs. I can change the command to this, so it is clearer that we test the default behavior here:

"CALL %s.system.create_changelog_view(table => '%s')",

Contributor:

But this test calls remove_carryovers = false and the carryovers are not removed as far as I see in the check below?

Contributor Author:

Oh, sorry, I was looking at a different test. I have removed the comment in the new commit. Thanks for catching it.


@Test
public void testNotRemoveCarryOvers() {
removeTables();
Contributor:

Why do we explicitly call this if we have an @After method to clean up tables? Is it because we always create a default table first? If so, can we remove the @Before init method and just call the correct create method in each test?

Contributor Author:

That's a good idea. Made the change in the next commit.

}

@After
public void removeTables() {
Contributor:

nit: We usually have these init methods at the top.

import scala.runtime.BoxedUnit;

/**
* A procedure that creates a view for changed rows.
Contributor:

Looks accurate now, thanks for updating!

}

@NotNull
private static Column[] getRepartitionExpr(Dataset<Row> df, String[] identifiers) {
Contributor:

If we decide to add computeUpdateImages, I would put this logic there directly, like you did for carryovers.

: args.getString(CHANGELOG_VIEW_NAME_ORDINAL);
if (viewName == null) {
String shortTableName =
tableName.contains(".") ? tableName.substring(tableName.lastIndexOf(".") + 1) : tableName;
Contributor:

We already pass ident.name() to this method. Instead of checking for dots in the name, I think we can use the approach from the snippet above and escape the name using backticks.

args.isNullAt(CHANGELOG_VIEW_NAME_ORDINAL)
? null
: args.getString(CHANGELOG_VIEW_NAME_ORDINAL);
if (viewName == null) {
Contributor:

What about having if/else instead of an extra var and ternary operator above?

if (args.isNullAt(CHANGELOG_VIEW_NAME_ORDINAL)) {
  return String.format("`%s_changes`", tableName);
} else {
  return args.getString(CHANGELOG_VIEW_NAME_ORDINAL);
}

.toArray(String[]::new);
}

if (identifierColumns.length == 0) {
Contributor:

I think it should be if/else. If someone provides empty identifier columns, we should complain.

if (!args.isNullAt(IDENTIFIER_COLUMNS_ORDINAL)) {
  return ...;
} else {
  return ...;
}

.mapPartitions(
(MapPartitionsFunction<Row, Row>)
rowIterator -> ChangelogIterator.create(rowIterator, schema, identifierFields),
RowEncoder.apply(df.schema()));
Contributor:

nit: Use schema var defined above?

aokolnychyi added this to the Iceberg 1.2.0 milestone on Mar 2, 2023
@aokolnychyi (Contributor):

I also added this to our 1.2 milestone. I think we should be able to merge it tomorrow.
Thanks for making this happen, @flyrain!

flyrain (Contributor, Author) commented Mar 3, 2023

Thanks a lot for the review @aokolnychyi. Resolved all of them and ready for another look.

flyrain (Contributor, Author) commented Mar 3, 2023

retest this please

flyrain (Contributor, Author) commented Mar 3, 2023

The pipeline error is not related.

> A failure occurred while executing org.gradle.api.plugins.quality.internal.CheckstyleAction
   > An unexpected error occurred configuring and executing Checkstyle.
      > java.lang.Error: Error was thrown while processing /home/runner/work/iceberg/iceberg/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java

flyrain (Contributor, Author) commented Mar 3, 2023

retest this please

Contributor (@aokolnychyi) left a comment:

I had a few minor nits, but nothing that would stop us from getting this PR in. It is already in pretty good shape. I'll merge it now. We can address the last feedback in a follow-up PR later.

@@ -144,6 +148,12 @@ protected SparkTable loadSparkTable(Identifier ident) {
}
}

protected Dataset<Row> loadDataSetFromTable(Identifier tableIdent, Map<String, String> options) {
Contributor:

Is there a shorter yet descriptive name? Like loadRows, loadContent, etc?

@@ -144,6 +148,12 @@ protected SparkTable loadSparkTable(Identifier ident) {
}
}

protected Dataset<Row> loadDataSetFromTable(Identifier tableIdent, Map<String, String> options) {
String tableName = Spark3Util.quotedFullIdentifier(tableCatalog().name(), tableIdent);
// no need to validate the read options here since the reader will validate them
Contributor:

I don't think we need this comment anymore since it is a pretty generic method now.

private Dataset<Row> computeUpdateImages(String[] identifierColumns, Dataset<Row> df) {
Preconditions.checkArgument(
identifierColumns.length > 0,
"Cannot compute the update-rows because identifier columns are not set");
Contributor:

nit: update-rows -> update images?

identifierColumns.length > 0,
"Cannot compute the update-rows because identifier columns are not set");

Column[] repartitionColumns = new Column[identifierColumns.length + 1];
Contributor:

nit: We sometimes call it repartitionColumns and sometimes repartitionSpec.
I'd probably use repartitionSpec everywhere since it is shorter (this statement would fit on 1 line?) and matches sortSpec used in other methods.

aokolnychyi merged commit 9cf9ca2 into apache:master on Mar 4, 2023
@aokolnychyi (Contributor):

Thanks, @flyrain! I am excited to test this out in real use cases.

flyrain (Contributor, Author) commented Mar 4, 2023

Thanks a lot @aokolnychyi! I will address these comments in a follow-up PR. It is a milestone. I'm also excited to see how people use it. Thanks everybody for the review, @RussellSpitzer @chenjunjiedada @hililiwei @rdblue!

slfan1989 pushed a commit to slfan1989/iceberg that referenced this pull request Mar 4, 2023
flyrain added a commit to flyrain/iceberg that referenced this pull request Mar 8, 2023
aokolnychyi pushed a commit that referenced this pull request Mar 8, 2023
krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 10, 2023