-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark] Support OPTIMIZE tbl FULL for clustered table #3793
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. some questions
spark/src/main/scala/org/apache/spark/sql/delta/commands/OptimizeTableStrategy.scala
Outdated
Show resolved
Hide resolved
@@ -153,7 +153,7 @@ class IncrementalZCubeClusteringSuite extends QueryTest | |||
inputZCubeFiles = ClusteringFileStats(2, SKIP_CHECK_SIZE_VALUE), | |||
inputOtherFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE), | |||
inputNumZCubes = 1, | |||
mergedFiles = ClusteringFileStats(6, SKIP_CHECK_SIZE_VALUE), | |||
mergedFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is related to comment https://github.com/delta-io/delta/pull/3793/files#r1815830851
After fixing the validateClusteringMetrics
, this assertion starts to failing and I have to fix it since validateClusteringMetrics is used by new tests as well
@@ -73,17 +73,17 @@ class IncrementalZCubeClusteringSuite extends QueryTest | |||
actualMetrics: ClusteringStats, expectedMetrics: ClusteringStats): Unit = { | |||
var finalActualMetrics = actualMetrics | |||
if (expectedMetrics.inputZCubeFiles.size == SKIP_CHECK_SIZE_VALUE) { | |||
val stats = expectedMetrics.inputZCubeFiles | |||
val stats = finalActualMetrics.inputZCubeFiles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a test bug left from the commit that added this test. I have to fix this in the PR since new tests depend on validateClusteringMetrics to validate the metrics are correct. Without this fix, though this validation passed, it doesn't mean the program is correct.
@@ -281,5 +283,159 @@ class IncrementalZCubeClusteringSuite extends QueryTest | |||
} | |||
} | |||
} | |||
|
|||
test("OPTIMIZE FULL") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a test case for different clusteringProvider with OPTIMIZE FULL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a new test OPTIMIZE FULL - change clustering provider
spark/src/main/scala/org/apache/spark/sql/delta/commands/OptimizeTableStrategy.scala
Show resolved
Hide resolved
@@ -230,7 +230,7 @@ class IncrementalZCubeClusteringSuite extends QueryTest | |||
inputZCubeFiles = ClusteringFileStats(2, SKIP_CHECK_SIZE_VALUE), | |||
inputOtherFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE), | |||
inputNumZCubes = 1, | |||
mergedFiles = ClusteringFileStats(6, SKIP_CHECK_SIZE_VALUE), | |||
mergedFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im guessing this is also same reason as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is fixing the test bug introduced in https://github.com/delta-io/delta/pull/3793/files#r1815830851
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! thanks for working on this
Which Delta project/connector is this regarding?
Description
How was this patch tested?
new unit tests added
Does this PR introduce any user-facing changes?
Yes
Previously clustered table won't re-cluster data that was clustered against different cluster keys. With OPTIMIZE tbl FULL, they will be re-clustered against the new keys.