
[#2541] feat(spark-connector): support basic DDL and DML operations to iceberg catalog #2544

Merged
merged 92 commits into apache:main on Apr 2, 2024

Conversation

caican00
Collaborator

@caican00 caican00 commented Mar 15, 2024

What changes were proposed in this pull request?

  1. Support DDL operations for the Iceberg catalog.
  2. Support read and write operations for Iceberg tables.

Why are the changes needed?

Support basic DDL and DML operations for Iceberg tables using Spark SQL.

Fix: #2541

Does this PR introduce any user-facing change?

Yes, users can use Spark SQL to run Iceberg table DDL and read/write operations.

How was this patch tested?

New Iceberg ITs.
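As a rough illustration of what this PR enables, the statements below sketch the basic DDL and DML a user could run through Spark SQL against a Gravitino-managed Iceberg catalog. The catalog, schema, and table names (`iceberg_catalog`, `db`, `employee`) are made up for illustration and are not taken from the PR itself.

```python
# Hypothetical Spark SQL statements of the kind this PR supports; the
# catalog, schema, and table names below are illustrative, not from the PR.
DDL_AND_DML = [
    "CREATE DATABASE IF NOT EXISTS iceberg_catalog.db",
    "CREATE TABLE iceberg_catalog.db.employee (id INT, name STRING) USING iceberg",
    "ALTER TABLE iceberg_catalog.db.employee ADD COLUMN age INT",
    "INSERT INTO iceberg_catalog.db.employee VALUES (1, 'Alice', 30)",
    "SELECT * FROM iceberg_catalog.db.employee",
    "DROP TABLE iceberg_catalog.db.employee",
]

def run_all(spark):
    """Run each statement through an existing SparkSession configured
    with the Gravitino spark-connector (session creation not shown)."""
    return [spark.sql(stmt) for stmt in DDL_AND_DML]
```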

@caican00 caican00 marked this pull request as draft March 15, 2024 07:22
@FANNG1
Contributor

FANNG1 commented Mar 15, 2024

Thanks for your work! I'd prefer to delay the Iceberg merge work until the basic Hive features, like partition/bucket properties, are finished in the coming weeks. To support Iceberg catalog integration tests, there is a prerequisite task: refactor SparkIT into a SparkCommonIT that contains the common tests shared by all catalogs, then add a new SparkHiveCatalogIT for Hive-specific tests and a SparkIcebergCatalogIT for Iceberg-specific tests, with both SparkXXCatalogIT classes extending SparkCommonIT. Would you like to work on this first?
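The proposed test split can be pictured as a class hierarchy. The sketch below uses Python's unittest purely to illustrate the shape (the real Gravitino ITs are Java/JUnit); only the class names come from the comment above, and the test bodies are placeholders.

```python
import unittest

class SparkCommonIT(unittest.TestCase):
    """Tests shared by all catalogs; subclasses supply the catalog name."""
    catalog_name = None

    @classmethod
    def setUpClass(cls):
        # The base class itself is not meant to run as a suite.
        if cls is SparkCommonIT:
            raise unittest.SkipTest("run the catalog-specific subclasses")

    def test_create_table(self):
        # A shared DDL test would issue CREATE TABLE via self.catalog_name.
        self.assertIsNotNone(self.catalog_name)

class SparkHiveCatalogIT(SparkCommonIT):
    catalog_name = "hive_catalog"

    def test_hive_specific(self):
        # Hive-only behavior (e.g. table properties) would be tested here.
        self.assertTrue(self.catalog_name.startswith("hive"))

class SparkIcebergCatalogIT(SparkCommonIT):
    catalog_name = "iceberg_catalog"

    def test_iceberg_specific(self):
        # Iceberg-only behavior (e.g. partition transforms) would go here.
        self.assertTrue(self.catalog_name.startswith("iceberg"))
```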

@FANNG1
Contributor

FANNG1 commented Mar 15, 2024

Or you could do a POC in your local environment first; after the integration-test split work is done, you could propose your PR.

@caican00
Collaborator Author

caican00 commented Mar 15, 2024

> Thanks for your work! I'd prefer to delay the Iceberg merge work until the basic Hive features are finished … Would you like to work on this first?

Sure, I'll try to separate the integration tests for the different data sources first.

@caican00
Collaborator Author

> Sure, I'll try to separate the integration tests for the different data sources first.

By the way, would it be better for the spark-connector to use the Iceberg REST catalog to interact with Iceberg in the first PR? cc @FANNG1

@FANNG1
Contributor

FANNG1 commented Mar 15, 2024

> By the way, would it be better for the spark-connector to use the Iceberg REST catalog to interact with Iceberg in the first PR?

I prefer to support the Iceberg Hive catalog because it's the most used, and the environment is already set up.

@caican00
Collaborator Author

> I prefer to support the Iceberg Hive catalog because it's the most used, and the environment is already set up.

If we stop using HMS in the future and use the Gravitino backend storage directly, we would still need the REST catalog, right?

@caican00
Collaborator Author

caican00 commented Mar 15, 2024

Hi @FANNG1, I would like to know if I can proceed in this way:

  1. first, separate the integration tests for the different data sources
  2. then help solve some Hive-related issues
  3. finally, continue to work on the Iceberg-related issues

@FANNG1
Contributor

FANNG1 commented Mar 15, 2024

> Hi @FANNG1, I would like to know if I can proceed in this way: …

I think it's OK; please note that the Iceberg issues have the risk of not being merged in 0.5.

@caican00
Collaborator Author

> I think it's OK; please note that the Iceberg issues have the risk of not being merged in 0.5.

OK, got it, thank you @FANNG1. May I ask why the Iceberg issues are not planned for 0.5?

@FANNG1
Contributor

FANNG1 commented Mar 16, 2024

> May I ask why the Iceberg issues are not planned for 0.5?

Because we didn't have enough time to support it in 0.5. If you could do most of the work, we may change the plan; we will discuss it on Monday.

@caican00
Collaborator Author

caican00 commented Mar 18, 2024

> Because we didn't have enough time to support it in 0.5. If you could do most of the work, we may change the plan; we will discuss it on Monday.

Yes, I could specialize in this. cc @FANNG1

@caican00
Collaborator Author

> Yes, I could specialize in this.

Hi @FANNG1, may I kindly ask whether there are any conclusions about supporting Iceberg in 0.5?

@FANNG1
Contributor

FANNG1 commented Mar 19, 2024

> Hi @FANNG1, may I kindly ask whether there are any conclusions about supporting Iceberg in 0.5?

I want us to stay aligned on what to do to support Iceberg:

  1. refactor SparkIT to support SparkIcebergCatalogIT: [#2566] Improvement(spark-connector): Refactoring integration tests for spark-connector #2578
  2. support basic DDL and DML operations for Iceberg in this PR, which should pass SparkIcebergCatalogIT
  3. Iceberg partition and distribution support, like month(timestamp)
  4. advanced features like row-level update or delete: [Subtask] support row-level operations to iceberg Table #2543, [Subtask] support delete operation to Iceberg Table #2542

Could you finish at least the first three features before 4.12, which is the code-freeze date?
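Items 3 and 4 in the plan above can be sketched with Spark SQL strings. The table names below are hypothetical, and the partition transforms (`months`, `bucket`) follow Iceberg's Spark DDL syntax; this is an illustration of the planned features, not code from the PR.

```python
# Hypothetical DDL/DML for items 3 and 4; table names are made up.
PARTITIONED_DDL = (
    "CREATE TABLE iceberg_catalog.db.events ("
    "  id BIGINT, ts TIMESTAMP, category STRING) "
    "USING iceberg "
    "PARTITIONED BY (months(ts), bucket(16, category))"
)

# Row-level operations planned as follow-ups (#2542, #2543):
ROW_LEVEL_DML = [
    "DELETE FROM iceberg_catalog.db.events WHERE category = 'test'",
    "UPDATE iceberg_catalog.db.events SET category = 'archived' WHERE id < 100",
]
```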

@caican00
Collaborator Author

> Could you finish at least the first three features before 4.12, which is the code-freeze date?

OK, I will give 100% effort to finish them.
Regarding the first point, I have a question: should we support SparkIcebergCatalogIT directly in PR #2578? Currently Iceberg's basic DDL and DML operations are not supported in the spark-connector.

@FANNG1
Contributor

FANNG1 commented Mar 19, 2024

> should we support SparkIcebergCatalogIT directly in PR #2578?

We needn't support SparkIcebergCatalogIT directly in #2578; #2578 aims to make adding SparkIcebergCatalogIT easy, e.g. by just extending SparkCommonIT.

@caican00
Collaborator Author

> We needn't support SparkIcebergCatalogIT directly in #2578; #2578 aims to make adding SparkIcebergCatalogIT easy, e.g. by just extending SparkCommonIT.

Got it.

@caican00 caican00 changed the title [#2541] feat(spark-connector): support DDL and readWrite operations to iceberg catalog [#2541] feat(spark-connector): support basic DDL and DML operations to iceberg catalog Mar 19, 2024
@jerryshao
Contributor

Can you please check the CI exception here (https://github.com/datastrato/gravitino/actions/runs/8502545002/job/23286861918?pr=2544)? I'm not sure whether it is related to your changes.

@jerryshao
Contributor

@FANNG1 would you please help to review again?

@caican00
Collaborator Author

caican00 commented Apr 1, 2024

> Can you please check the CI exception here? I'm not sure whether it is related to your changes.

Fixed. Thanks for your review.

@caican00
Collaborator Author

caican00 commented Apr 2, 2024

Hi @FANNG1 could you help review this again? Thanks

@FANNG1
Contributor

FANNG1 commented Apr 2, 2024

> Hi @FANNG1, could you help review this again?

Just a few comments, could you fix them?

@caican00
Collaborator Author

caican00 commented Apr 2, 2024

> Just a few comments, could you fix them?

All comments have been addressed.

@FANNG1
Contributor

FANNG1 commented Apr 2, 2024

LGTM, let's wait for CI to finish.

@FANNG1 FANNG1 merged commit 46ebaf6 into apache:main Apr 2, 2024
19 checks passed
@FANNG1
Contributor

FANNG1 commented Apr 2, 2024

@caican00 thanks for your work, merged to main.

@caican00
Collaborator Author

caican00 commented Apr 2, 2024

> @caican00 thanks for your work, merged to main.

@FANNG1 @jerryshao @qqqttt123 Thank you for your reviews.

yuqi1129 added a commit to yuqi1129/gravitino that referenced this pull request Apr 3, 2024
…L operations to iceberg catalog (apache#2544)"

This reverts commit 46ebaf6.
Successfully merging this pull request may close these issues.

[Subtask] support DDL, read and write operations to Iceberg catalog
4 participants