
Conversation

rdblue (Contributor) commented Jul 1, 2020

This adds tests for Spark SQL operations.

The tests primarily focus on the way that query info is passed to Iceberg. They are not intended to be exhaustive, either for the ways changes can be expressed in SQL or for what Iceberg supports. For example, they only test updating an int column to long to validate that TableChange.UpdateColumnType is handled.
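
A minimal sketch (not the PR's test code) of how that int-to-long promotion can be exercised through Spark SQL; the catalog and table names are placeholders and the assertion is only illustrative:

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.junit.Assert;
import org.junit.Test;

public class TestUpdateColumnTypeSketch {
  private final SparkSession spark = SparkSession.active();

  @Test
  public void testWidenIntToLong() {
    spark.sql("CREATE TABLE catalog.db.tbl (id int, data string) USING iceberg");
    // Spark turns this DDL into a column-type table change for the catalog to apply
    spark.sql("ALTER TABLE catalog.db.tbl ALTER COLUMN id TYPE bigint");

    // observe the widened type through Spark's view of the table
    StructType schema = spark.table("catalog.db.tbl").schema();
    Assert.assertEquals("id should be promoted to long",
        DataTypes.LongType, schema.apply("id").dataType());
  }
}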

These tests have exposed a few bugs:

  • The Spark session catalog doesn't support metadata tables because Spark validates the namespace independently of the catalog. We could add a suffix convention to make the metadata tables accessible, since many people will want to use this catalog. (The session catalog can delegate to Spark's v1 catalog, so it can load any existing table as well as Iceberg tables from the same Hive namespace.)
  • Spark has a bug ([SPARK-32168][SQL] Fix hidden partitioning correctness bug in SQL overwrite spark#28993) in determining when an overwrite should be static or dynamic. If the number of static values in an INSERT OVERWRITE equals the number of (identity) partition columns, Spark uses a static overwrite, which doesn't account for hidden partition columns. This should be fixed in Spark 3.0.1 and is a Spark bug, so the test for this case is disabled for now (a minimal sketch of the scenario follows this list).
  • Spark passes the catalog name in the identifier given to renameTable. It looks like there is no validation on the destination (to) identifier.
  • The replace table transaction discarded the old table state. This has been fixed so table history is kept.
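
A sketch of the overwrite case from the second bullet (hypothetical table names, and assuming spark is an active SparkSession wired to an Iceberg catalog). The table has only a hidden (transform) partition, so Spark counts zero static values and zero identity partition columns, concludes the overwrite is static, and truncates the whole table instead of replacing only the days touched by the new rows:

spark.sql("CREATE TABLE catalog.db.logs (ts timestamp, msg string) " +
    "USING iceberg PARTITIONED BY (days(ts))");
spark.sql("INSERT INTO catalog.db.logs VALUES " +
    "(timestamp '2020-07-01 00:00:00', 'a'), (timestamp '2020-07-02 00:00:00', 'b')");
// expected: replace only the 2020-07-02 day partition; affected Spark versions replace everything
spark.sql("INSERT OVERWRITE TABLE catalog.db.logs " +
    "VALUES (timestamp '2020-07-02 01:00:00', 'c')");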

rdblue added this to the Spark 3 milestone Jul 1, 2020
rdblue marked this pull request as a draft July 1, 2020 19:26
rdblue force-pushed the add-spark-3-ddl-tests branch 2 times, most recently from 09ceccd to b88ec46 on July 1, 2020 22:51
rdblue changed the title from "Add Spark 3 DDL tests" to "Add Spark 3 SQL tests" Jul 1, 2020
rdblue force-pushed the add-spark-3-ddl-tests branch from 061a124 to d2bc1da on July 2, 2020 00:33
rdblue marked this pull request as ready for review July 2, 2020 00:33

Table table = catalog.loadTable(TABLE_IDENTIFIER);
Assert.assertEquals("Partition spec should match", PartitionSpec.unpartitioned(), table.spec());
Assert.assertEquals("Partition spec should be unpartitioned", 0, table.spec().fields().size());
Contributor

Is any PartitionSpec with 0 fields the unpartitioned spec, or should it match a specific spec (like in the removed version)? It seems like the equals method takes fields and specId into the contract.

Contributor Author

Any partition spec with 0 fields is considered unpartitioned.

Contributor Author

I should note that this needed to be updated because the spec's ID didn't match, causing the test to fail.

PartitionSpec.unpartitioned() doesn't necessarily have the right ID for a given table. We primarily use it when creating tables, where the spec gets rebuilt and assigned the right ID.
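
A minimal sketch of the distinction (assuming an Iceberg Table named table is in scope): equals() also compares the spec ID, which the table assigns when the spec is rebuilt, so comparing against PartitionSpec.unpartitioned() can fail even when the table genuinely has no partition fields.

// robust check: a spec with no fields is unpartitioned, regardless of its ID
PartitionSpec spec = table.spec();
Assert.assertTrue("Spec should have no partition fields", spec.fields().isEmpty());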

Contributor

Yeah, I supposed that was the reason for the change (spec's ID not matching). Thanks for the context!

}

@Test
public void testCreateTableUsingParquet() {
Contributor

Should we add tests for the other file formats?

Contributor Author

This is just a test that the provider is passed through correctly when using a specific catalog. I probably wouldn't add a test for other formats here in the SQL tests, but I would in unit tests for the SparkCatalog.
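
A sketch of the pass-through being described (not the PR's exact assertions; catalog here is an Iceberg Catalog over the same warehouse, and the write.format.default property key is assumed to be where the provider lands):

spark.sql("CREATE TABLE catalog.db.parquet_tbl (id bigint, data string) USING parquet");

// hypothetical load through the Iceberg API to inspect the created table
Table table = catalog.loadTable(TableIdentifier.of("db", "parquet_tbl"));
Assert.assertEquals("USING clause should set the default file format",
    "parquet", table.properties().get("write.format.default"));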

edgarRd (Contributor) left a comment

Thanks for the changes, LGTM.


// spark_catalog does not use an atomic replace, so the table history and old spec are dropped
// the other catalogs do use atomic replace, so the spec id is incremented
boolean isAtomic = !"spark_catalog".equals(catalogName);
Contributor Author

This behavior is fixed in #1183. If this goes in first, I'll add the fix to that PR before merging.

TableMetadata metadata = TableMetadata.newTableMetadata(schema, spec, baseLocation, tableProperties);

TableMetadata metadata;
if (ops.current() != null) {
Contributor

I'm not clear on whether this really should be the right behavior. Basically, we're saying that a replace table will keep the existing location (as opposed to using defaults). I suspect we don't have create-or-replace-with-location semantics, but this is making some assumptions that a replacement is somehow the same as the old table. If we were to go with an id-based pathing convention, this wouldn't work.

I don't think this is an issue at this point, but it might make sense to push this down to the location provider.

Contributor Author

I think this is correct because the REPLACE TABLE doesn't completely wipe out the old table. In most ways, it is the same table.

Calling buildReplacement will replace the schema and partition spec so that the transaction can add a new snapshot. Table history, old snapshots, and existing table properties are kept so that you can inspect the table and don't need to add table configuration every time you run the SQL.

We could add a flag to turn off this behavior and wipe out the old by default, but I don't think that's what users really want. It makes sense for things like default format and other settings to persist across replace operations, so that table configuration and table operations are orthogonal.
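
A sketch of the behavior described above (placeholder names; spark is an active SparkSession and catalog an Iceberg Catalog over the same tables):

spark.sql("CREATE TABLE catalog.db.events (id bigint) USING iceberg " +
    "TBLPROPERTIES ('write.format.default'='avro')");
spark.sql("INSERT INTO catalog.db.events VALUES (1)");

// REPLACE swaps in the new schema and spec through a new snapshot...
spark.sql("REPLACE TABLE catalog.db.events (id bigint, data string) USING iceberg");

Table table = catalog.loadTable(TableIdentifier.of("db", "events"));
// ...while prior history and existing properties are expected to survive
Assert.assertFalse("Table history should be preserved", table.history().isEmpty());
Assert.assertEquals("avro", table.properties().get("write.format.default"));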

Contributor

Ok, I guess that makes sense. I assumed the semantics to be the same as a transactional drop and create, but based on a little searching, it's much less clear than that. For example, with DB2, create or replace will actually retain all data (assuming it aligns with the new table definition) by default.

I agree that preserving table properties makes sense, but wouldn't this wipe out comments (which we may also want to preserve)?

Contributor Author

It would probably wipe out column comments. I'll have to fix that.

danielcweeks (Contributor)

+1 Lots of great tests. One minor comment.

rdblue merged commit 115a145 into apache:master Jul 8, 2020
cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020