Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark] Support OPTIMIZE tbl FULL for clustered table #3793

Merged
merged 10 commits into from
Oct 30, 2024

Conversation

dabao521
Copy link
Contributor

@dabao521 dabao521 commented Oct 22, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

  1. Add new sql syntax OPTIMIZE tbl FULL
  2. Implemented OPTIMIZE tbl FULL to re-cluster all data in the table.

How was this patch tested?

new unit tests added

Does this PR introduce any user-facing changes?

Yes
Previously clustered table won't re-cluster data that was clustered against different cluster keys. With OPTIMIZE tbl FULL, they will be re-clustered against the new keys.

Copy link
Contributor

@ami7o ami7o left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. some questions

@@ -153,7 +153,7 @@ class IncrementalZCubeClusteringSuite extends QueryTest
inputZCubeFiles = ClusteringFileStats(2, SKIP_CHECK_SIZE_VALUE),
inputOtherFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE),
inputNumZCubes = 1,
mergedFiles = ClusteringFileStats(6, SKIP_CHECK_SIZE_VALUE),
mergedFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to comment https://github.com/delta-io/delta/pull/3793/files#r1815830851
After fixing the validateClusteringMetrics , this assertion starts to failing and I have to fix it since validateClusteringMetrics is used by new tests as well

@@ -73,17 +73,17 @@ class IncrementalZCubeClusteringSuite extends QueryTest
actualMetrics: ClusteringStats, expectedMetrics: ClusteringStats): Unit = {
var finalActualMetrics = actualMetrics
if (expectedMetrics.inputZCubeFiles.size == SKIP_CHECK_SIZE_VALUE) {
val stats = expectedMetrics.inputZCubeFiles
val stats = finalActualMetrics.inputZCubeFiles
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a test bug left from the commit that added this test. I have to fix this in the PR since new tests depend on validateClusteringMetrics to validate the metrics are correct. Without this fix, though this validation passed, it doesn't mean the program is correct.

@@ -281,5 +283,159 @@ class IncrementalZCubeClusteringSuite extends QueryTest
}
}
}

test("OPTIMIZE FULL") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a test case for different clusteringProvider with OPTIMIZE FULL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new test OPTIMIZE FULL - change clustering provider

@@ -230,7 +230,7 @@ class IncrementalZCubeClusteringSuite extends QueryTest
inputZCubeFiles = ClusteringFileStats(2, SKIP_CHECK_SIZE_VALUE),
inputOtherFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE),
inputNumZCubes = 1,
mergedFiles = ClusteringFileStats(6, SKIP_CHECK_SIZE_VALUE),
mergedFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im guessing this is also same reason as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is fixing the test bug introduced in https://github.com/delta-io/delta/pull/3793/files#r1815830851

Copy link
Collaborator

@rahulsmahadev rahulsmahadev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! thanks for working on this

@allisonport-db allisonport-db merged commit 959765a into delta-io:master Oct 30, 2024
16 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants