-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-17980] [SQL] Fix refreshByPath for converted Hive tables #15521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #67090 has finished for PR 15521 at commit
|
@@ -343,7 +343,8 @@ abstract class Catalog { | |||
|
|||
/** | |||
* Invalidate and refresh all the cached data (and the associated metadata) for any dataframe that | |||
* contains the given data source path. | |||
* contains the given data source path. Path matching is by prefix, i.e. "/" would invalidate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we are changing the semantic of REFRESH PATH
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah
Before #14690, users can refresh the whole table(including all partitions) by REFRESH the table path right? even the partition path is not under table path. |
That's correct, refresh table has always worked. There was just a bug introduced that broke refreshByPath, since it doesn't invalidate the lazy val. |
The new prefix matching semantics makes sense to me |
@@ -49,13 +49,18 @@ class TableFileCatalog( | |||
|
|||
private val baseLocation = catalogTable.storage.locationUri | |||
|
|||
// Populated on-demand by calls to cachedAllPartitions | |||
private var allPartitions: ListingFileCatalog = null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: according to the existing name style, we should name this var cachedAllPartitions
, and name the public method allPartitions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
sql("select * from test").count() | ||
} | ||
assert(e2.getMessage.contains("FileNotFoundException")) | ||
spark.catalog.refreshByPath(dir.getAbsolutePath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: before #14690, users need to refresh one of the partition paths to invalide the cache, but now they have to refresh the table path, because TableFileCatalog.rootPaths
only contains table path while ListingFileCatalog.rootPaths
only contains partition paths.
I think it's better than before, but it's still a breaking change, should we docuement it in the 2.1 release notes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. To get the old behavior, they can also disable the feature flag.
.map(_.makeQualified(fs.getUri, fs.getWorkingDirectory)) | ||
.contains(qualifiedPath) | ||
.map(_.makeQualified(fs.getUri, fs.getWorkingDirectory).toString) | ||
.exists(_.startsWith(prefixToInvalidate)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need the prefix resolution? I think it's useful, so that users can refresh the table path to invalidate cache for partitioned data source table, but it's not related to this PR right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You actually need this when metastore partition pruning is disabled for converted hive tables. Otherwise, the unit test below would fail on that case.
(but yeah, we could also leave that alone)
Test build #67137 has finished for PR 15521 at commit
|
thanks, merging to master! |
## What changes were proposed in this pull request? There was a bug introduced in apache#14690 which broke refreshByPath with converted hive tables (though, it turns out it was very difficult to refresh converted hive tables anyways, since you had to specify the exact path of one of the partitions). This changes refreshByPath to invalidate by prefix instead of exact match, and fixes the issue. cc sameeragarwal for refreshByPath changes mallman ## How was this patch tested? Extended unit test. Author: Eric Liang <ekl@databricks.com> Closes apache#15521 from ericl/fix-caching.
## What changes were proposed in this pull request? There was a bug introduced in apache#14690 which broke refreshByPath with converted hive tables (though, it turns out it was very difficult to refresh converted hive tables anyways, since you had to specify the exact path of one of the partitions). This changes refreshByPath to invalidate by prefix instead of exact match, and fixes the issue. cc sameeragarwal for refreshByPath changes mallman ## How was this patch tested? Extended unit test. Author: Eric Liang <ekl@databricks.com> Closes apache#15521 from ericl/fix-caching.
What changes were proposed in this pull request?
There was a bug introduced in #14690 which broke refreshByPath with converted hive tables (though, it turns out it was very difficult to refresh converted hive tables anyways, since you had to specify the exact path of one of the partitions).
This changes refreshByPath to invalidate by prefix instead of exact match, and fixes the issue.
cc @sameeragarwal for refreshByPath changes
@mallman
How was this patch tested?
Extended unit test.