
Conversation

rdblue (Contributor) commented Jul 1, 2020

This adds tests for Spark SQL operations.

The tests primarily focus on the way that query info is passed to Iceberg. They are not intended to be exhaustive, either for the ways changes can be expressed in SQL or for what Iceberg supports. For example, they only test updating an int column to long to validate that TableChange.UpdateColumnType is handled.
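
A minimal sketch (not the PR's test code) of how that int-to-long promotion can be exercised through Spark SQL; the catalog and table names are placeholders and the assertion is only illustrative:

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.junit.Assert;
import org.junit.Test;

public class TestUpdateColumnTypeSketch {
  private final SparkSession spark = SparkSession.active();

  @Test
  public void testWidenIntToLong() {
    spark.sql("CREATE TABLE catalog.db.tbl (id int, data string) USING iceberg");
    // Spark turns this DDL into a column-type table change for the catalog to apply
    spark.sql("ALTER TABLE catalog.db.tbl ALTER COLUMN id TYPE bigint");

    // observe the widened type through Spark's view of the table
    StructType schema = spark.table("catalog.db.tbl").schema();
    Assert.assertEquals("id should be promoted to long",
        DataTypes.LongType, schema.apply("id").dataType());
  }
}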

These tests have exposed a few bugs:

  • The Spark session catalog doesn't support metadata tables because Spark validates the namespace independently of the catalog. We could add a suffix convention to make the metadata tables accessible, since many people will want to use this catalog. (The session catalog can delegate to Spark's v1 catalog, so it can load any existing table as well as Iceberg tables from the same Hive namespace.)
  • Spark has a bug ([SPARK-32168][SQL] Fix hidden partitioning correctness bug in SQL overwrite spark#28993) in determining when an overwrite should be static or dynamic. If the number of static values in an INSERT OVERWRITE equals the number of (identity) partition columns, Spark uses a static overwrite, which doesn't account for hidden partition columns. This should be fixed in Spark 3.0.1 and is a Spark bug, so the test for this case is disabled for now (a minimal sketch of the scenario follows this list).
  • Spark passes the catalog name in the identifier given to renameTable. It looks like there is no validation on the destination (to) identifier.
  • The replace table transaction discarded the old table state. This has been fixed so table history is kept.
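
A sketch of the overwrite case from the second bullet (hypothetical table names, and assuming spark is an active SparkSession wired to an Iceberg catalog). The table has only a hidden (transform) partition, so Spark counts zero static values and zero identity partition columns, concludes the overwrite is static, and truncates the whole table instead of replacing only the days touched by the new rows:

spark.sql("CREATE TABLE catalog.db.logs (ts timestamp, msg string) " +
    "USING iceberg PARTITIONED BY (days(ts))");
spark.sql("INSERT INTO catalog.db.logs VALUES " +
    "(timestamp '2020-07-01 00:00:00', 'a'), (timestamp '2020-07-02 00:00:00', 'b')");
// expected: replace only the 2020-07-02 day partition; affected Spark versions replace everything
spark.sql("INSERT OVERWRITE TABLE catalog.db.logs " +
    "VALUES (timestamp '2020-07-02 01:00:00', 'c')");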

rdblue added this to the Spark 3 milestone Jul 1, 2020
rdblue marked this pull request as a draft July 1, 2020 19:26
rdblue force-pushed the add-spark-3-ddl-tests branch 2 times, most recently from 09ceccd to b88ec46 on July 1, 2020 22:51
rdblue changed the title from "Add Spark 3 DDL tests" to "Add Spark 3 SQL tests" Jul 1, 2020
rdblue force-pushed the add-spark-3-ddl-tests branch from 061a124 to d2bc1da on July 2, 2020 00:33
rdblue marked this pull request as ready for review July 2, 2020 00:33

Table table = catalog.loadTable(TABLE_IDENTIFIER);
Assert.assertEquals("Partition spec should match", PartitionSpec.unpartitioned(), table.spec());
Assert.assertEquals("Partition spec should be unpartitioned", 0, table.spec().fields().size());
Contributor

Is any PartitionSpec with 0 fields the unpartitioned spec, or should it match a specific spec (like in the removed version)? It seems like the equals method takes fields and specId into the contract.

Contributor Author

Any partition spec with 0 fields is considered unpartitioned.

Contributor Author

I should note that this needed to be updated because the spec's ID didn't match, causing the test to fail.

PartitionSpec.unpartitioned() doesn't necessarily have the right ID for a given table. We primarily use it when creating tables, where the spec gets rebuilt and assigned the right ID.
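
A minimal sketch of the distinction (assuming an Iceberg Table named table is in scope): equals() also compares the spec ID, which the table assigns when the spec is rebuilt, so comparing against PartitionSpec.unpartitioned() can fail even when the table genuinely has no partition fields.

// robust check: a spec with no fields is unpartitioned, regardless of its ID
PartitionSpec spec = table.spec();
Assert.assertTrue("Spec should have no partition fields", spec.fields().isEmpty());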

Contributor

Yeah, I supposed that was the reason for the change (spec's ID not matching). Thanks for the context!

}

@Test
public void testCreateTableUsingParquet() {
Contributor

Should we add tests for the other file formats?

Contributor Author

This is just a test that the provider is passed through correctly when using a specific catalog. I probably wouldn't add a test for other formats here in the SQL tests, but I would in unit tests for the SparkCatalog.
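
A sketch of the pass-through being described (not the PR's exact assertions; catalog here is an Iceberg Catalog over the same warehouse, and the write.format.default property key is assumed to be where the provider lands):

spark.sql("CREATE TABLE catalog.db.parquet_tbl (id bigint, data string) USING parquet");

// hypothetical load through the Iceberg API to inspect the created table
Table table = catalog.loadTable(TableIdentifier.of("db", "parquet_tbl"));
Assert.assertEquals("USING clause should set the default file format",
    "parquet", table.properties().get("write.format.default"));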

edgarRd (Contributor) left a comment

Thanks for the changes, LGTM.


// spark_catalog does not use an atomic replace, so the table history and old spec are dropped
// the other catalogs do use atomic replace, so the spec id is incremented
boolean isAtomic = !"spark_catalog".equals(catalogName);
Contributor Author

This behavior is fixed in #1183. If this goes in first, I'll add the fix to that PR before merging.

TableMetadata metadata = TableMetadata.newTableMetadata(schema, spec, baseLocation, tableProperties);

TableMetadata metadata;
if (ops.current() != null) {
Contributor

I'm not clear on whether this really should be the right behavior. Basically, we're saying that a replace table will keep the existing location (as opposed to using defaults). I suspect we don't have create-or-replace-with-location semantics, but this is making some assumptions that a replacement is somehow the same as the old table. If we were to go with an id-based pathing convention, this wouldn't work.

I don't think this is an issue at this point, but it might make sense to push this down to the location provider.

Contributor Author

I think this is correct because the REPLACE TABLE doesn't completely wipe out the old table. In most ways, it is the same table.

Calling buildReplacement will replace the schema and partition spec so that the transaction can add a new snapshot. Table history, old snapshots, and existing table properties are kept so that you can inspect the table and don't need to add table configuration every time you run the SQL.

We could add a flag to turn off this behavior and wipe out the old by default, but I don't think that's what users really want. It makes sense for things like default format and other settings to persist across replace operations, so that table configuration and table operations are orthogonal.
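
A sketch of the behavior described above (placeholder names; spark is an active SparkSession and catalog an Iceberg Catalog over the same tables):

spark.sql("CREATE TABLE catalog.db.events (id bigint) USING iceberg " +
    "TBLPROPERTIES ('write.format.default'='avro')");
spark.sql("INSERT INTO catalog.db.events VALUES (1)");

// REPLACE swaps in the new schema and spec through a new snapshot...
spark.sql("REPLACE TABLE catalog.db.events (id bigint, data string) USING iceberg");

Table table = catalog.loadTable(TableIdentifier.of("db", "events"));
// ...while prior history and existing properties are expected to survive
Assert.assertFalse("Table history should be preserved", table.history().isEmpty());
Assert.assertEquals("avro", table.properties().get("write.format.default"));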

Contributor

Ok, I guess that makes sense. I assumed the semantics to be the same as a transactional drop and create, but based on a little searching, it's much less clear than that. For example, with DB2, create or replace will actually retain all data (assuming it aligns with the new table definition) by default.

I agree that preserving table properties makes sense, but wouldn't this wipe out comments (which we may also want to preserve)?

Contributor Author

It would probably wipe out column comments. I'll have to fix that.

danielcweeks (Contributor)

+1 Lots of great tests. One minor comment.

rdblue merged commit 115a145 into apache:master Jul 8, 2020
cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020