Add Spark 3 SQL tests #1156
Conversation
Table table = catalog.loadTable(TABLE_IDENTIFIER);
Assert.assertEquals("Partition spec should match", PartitionSpec.unpartitioned(), table.spec());
Assert.assertEquals("Partition spec should be unpartitioned", 0, table.spec().fields().size());
Is any PartitionSpec with 0 fields the unpartitioned spec, or should it match a certain spec (like in the removed version)? Seems like the equals method takes fields and specId into the contract.
Any partition spec with 0 fields is considered unpartitioned.
I should note that this needed to be updated because the spec's ID didn't match, causing the test to fail. PartitionSpec.unpartitioned() doesn't necessarily have the right ID for a given table. We primarily use it when creating tables, where the spec gets rebuilt and assigned the right ID.
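As a hedged illustration of why the equality check was replaced with a field-count check: the sketch below uses standard Iceberg and JUnit calls, but the class name, schema, and the specific spec ID are made up for this example and are not part of the PR.

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.junit.Assert;
import org.junit.Test;

public class UnpartitionedSpecSketch {
  private static final Schema SCHEMA = new Schema(
      Types.NestedField.required(1, "id", Types.LongType.get()));

  @Test
  public void emptySpecIsUnpartitionedButNotEqual() {
    // A spec with zero partition fields is unpartitioned, which is what the new
    // assertion checks via table.spec().fields().size().
    PartitionSpec empty = PartitionSpec.builderFor(SCHEMA).withSpecId(5).build();
    Assert.assertEquals(0, empty.fields().size());

    // Equality with PartitionSpec.unpartitioned() is stricter: equals() also
    // compares the spec ID, so a loaded table's spec can fail that check even
    // though it has no partition fields.
    Assert.assertNotEquals(PartitionSpec.unpartitioned(), empty);
  }
}
```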
Yeah, I supposed that was the reason for the change (spec's ID not matching). Thanks for the context!
}

@Test
public void testCreateTableUsingParquet() {
Should we add tests for the other file formats?
This is just a test that the provider is passed through correctly when using a specific catalog. I probably wouldn't add a test for other formats here in the SQL tests, but I would in unit tests for the SparkCatalog.
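For context, a rough sketch of the kind of provider pass-through check being described; the class name, table identifier, and the abstract spark()/catalog() accessors are illustrative stand-ins for the shared test base, while catalog.loadTable and the write.format.default property come from Iceberg itself.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.spark.sql.SparkSession;
import org.junit.Assert;
import org.junit.Test;

// Sketch only: the real tests get their SparkSession and Catalog from a shared
// test base; here they are abstracted away so the assertion logic stands alone.
public abstract class CreateTableProviderSketch {
  protected abstract SparkSession spark();
  protected abstract Catalog catalog();

  @Test
  public void testCreateTableUsingParquet() {
    // Create a table through SQL with an explicit provider...
    spark().sql("CREATE TABLE db.parquet_table (id BIGINT, data STRING) USING parquet");

    // ...then check that the provider was passed through to Iceberg's
    // default file format table property (write.format.default).
    Table table = catalog().loadTable(TableIdentifier.of("db", "parquet_table"));
    Assert.assertEquals("Provider should set the default format",
        "parquet", table.properties().get("write.format.default"));
  }
}
```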
Thanks for the changes, LGTM.
// spark_catalog does not use an atomic replace, so the table history and old spec are dropped
// the other catalogs do use atomic replace, so the spec id is incremented
boolean isAtomic = !"spark_catalog".equals(catalogName);
This behavior is fixed in #1183. If this goes in first, I'll add the fix to that PR before merging.
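A hedged sketch of how an assertion might branch on that flag; catalogName and table are assumed to come from the parameterized test, and the concrete spec ID values are illustrative only, not taken from the PR.

```java
// spark_catalog does not replace atomically, so the old metadata is dropped and
// spec IDs start over; the other catalogs keep history, so the spec ID increments.
boolean isAtomic = !"spark_catalog".equals(catalogName);

int expectedSpecId = isAtomic ? 1 : 0;  // illustrative values
Assert.assertEquals("Spec ID should reflect the catalog's replace semantics",
    expectedSpecId, table.spec().specId());
```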
TableMetadata metadata = TableMetadata.newTableMetadata(schema, spec, baseLocation, tableProperties);

TableMetadata metadata;
if (ops.current() != null) {
I'm not clear on whether this really is the right behavior. Basically we're saying that a replace table will keep the existing location (as opposed to using defaults). I suspect we don't have create-or-replace-with-location semantics, but this is making some assumptions that a replacement is somehow the same as the old table. If we were to go with an ID-based pathing convention, this wouldn't work.
I don't think this is an issue at this point, but it might make sense to push this down to the location provider.
I think this is correct because REPLACE TABLE doesn't completely wipe out the old table. In most ways, it is the same table.
Calling buildReplacement will replace the schema and partition spec so that the transaction can add a new snapshot. Table history, old snapshots, and existing table properties are kept so that you can inspect the table and don't need to add table configuration every time you run the SQL.
We could add a flag to turn off this behavior and wipe out the old by default, but I don't think that's what users really want. It makes sense for things like default format and other settings to persist across replace operations, so that table configuration and table operations are orthogonal.
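A hedged sketch of the branch under discussion, reusing the local names from the diff snippet above; the buildReplacement parameter list shown here is an assumption for illustration, not the exact method signature.

```java
TableMetadata metadata;
if (ops.current() != null) {
  // the table already exists: build a replacement that keeps the current location,
  // table history, old snapshots, and existing table properties
  metadata = ops.current().buildReplacement(
      schema, spec, ops.current().location(), tableProperties);  // illustrative parameter order
} else {
  // no current metadata: create brand-new table metadata at the default location
  metadata = TableMetadata.newTableMetadata(schema, spec, baseLocation, tableProperties);
}
```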
Ok, I guess that makes sense. I assumed the semantics would be the same as a transactional drop and create, but based on a little searching, it's much less clear than that. For example, with DB2, create or replace will actually retain all data (assuming it aligns with the new table definition) by default.
I agree that preserving table properties makes sense, but wouldn't this wipe out comments (which we may also want to preserve)?
It would probably wipe out column comments. I'll have to fix that.
+1 Lots of great tests. One minor comment.
This adds tests for Spark SQL operations.
The tests primarily focus on the way that query info is passed to Iceberg. This is not intended to be exhaustive, either for the ways to express changes as SQL or for what Iceberg supports. For example, this only tests updating an int column to long to validate that TableChange.UpdateColumnType is handled.

These tests have exposed a few bugs:
- A Spark bug ([SPARK-32168][SQL] Fix hidden partitioning correctness bug in SQL overwrite, spark#28993) in determining when an overwrite should be static or dynamic. If the number of static values in an INSERT OVERWRITE is the same as the number of (identity) partition columns, then Spark will use a static overwrite. This doesn't account for hidden partition columns. This should be fixed in Spark 3.0.1 and is a Spark bug, so the test for this case is disabled for now.
- A bug in renameTable: it looks like there is no validation on the to identifier.
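To make the first bug above concrete, here is a hedged sketch of the kind of statement that hits it; the table and column names are invented, and spark is an assumed SparkSession handle.

```java
// A table with one identity partition column (part) plus a hidden partition (days(ts)).
spark.sql("CREATE TABLE db.logs (id BIGINT, ts TIMESTAMP, part STRING) " +
    "USING iceberg PARTITIONED BY (part, days(ts))");

// Spark 3.0.0 sees one static value for the one identity partition column and treats
// this as a fully static overwrite, ignoring the hidden days(ts) partition.
// SPARK-32168 / spark#28993 addresses this for Spark 3.0.1.
spark.sql("INSERT OVERWRITE db.logs PARTITION (part = 'a') " +
    "SELECT id, ts FROM db.staged_logs");
```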