[SPARK-17980] [SQL] Fix refreshByPath for converted Hive tables #15521

ericl · 2016-10-17T22:08:35Z

What changes were proposed in this pull request?

There was a bug introduced in #14690 which broke refreshByPath with converted Hive tables (though it turns out it was very difficult to refresh converted Hive tables anyway, since you had to specify the exact path of one of the partitions).

This changes refreshByPath to invalidate by prefix instead of exact match, and fixes the issue.

cc @sameeragarwal for refreshByPath changes
@mallman

How was this patch tested?

Extended unit test.

SparkQA · 2016-10-18T00:09:01Z

Test build #67090 has finished for PR 15521 at commit 7415666.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-10-18T00:42:39Z

sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala

@@ -343,7 +343,8 @@ abstract class Catalog {

  /**
   * Invalidate and refresh all the cached data (and the associated metadata) for any dataframe that
-   * contains the given data source path.
+   * contains the given data source path. Path matching is by prefix, i.e. "/" would invalidate


So we are changing the semantics of REFRESH PATH, right?

cloud-fan · 2016-10-18T00:51:28Z

Before #14690, users could refresh the whole table (including all partitions) by running REFRESH on the table path, right? Even when a partition path is not under the table path.

ericl · 2016-10-18T01:00:31Z

That's correct, refresh table has always worked. There was just a bug introduced that broke refreshByPath, since it doesn't invalidate the lazy val.

sameeragarwal · 2016-10-18T01:03:55Z

The new prefix-matching semantics make sense to me.

cloud-fan · 2016-10-18T02:20:29Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/TableFileCatalog.scala

@@ -49,13 +49,18 @@ class TableFileCatalog(

  private val baseLocation = catalogTable.storage.locationUri

+  // Populated on-demand by calls to cachedAllPartitions
+  private var allPartitions: ListingFileCatalog = null


nit: according to the existing naming style, we should name this var cachedAllPartitions and name the public method allPartitions.
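To illustrate the pattern under discussion (a private var populated on demand, which a refresh can reset, unlike a lazy val), here is a minimal sketch using the names suggested in the nit. `CachedCatalogSketch`, `PartitionListing`, and the `listPartitions` parameter are hypothetical stand-ins for illustration, not Spark classes, and the sketch ignores thread safety.

```scala
// Hypothetical stand-in for the result of an expensive partition listing.
final class PartitionListing(val paths: Seq[String])

// Hypothetical catalog; listPartitions stands in for the expensive listing call.
class CachedCatalogSketch(listPartitions: () => Seq[String]) {
  // Populated on demand by allPartitions; reset to null by refresh().
  private var cachedAllPartitions: PartitionListing = null

  // Returns the cached listing, computing it on first access after a refresh.
  def allPartitions: PartitionListing = {
    if (cachedAllPartitions == null) {
      cachedAllPartitions = new PartitionListing(listPartitions())
    }
    cachedAllPartitions
  }

  // Drops the cached listing so the next access lists again.
  // A lazy val cannot be reset this way once it has been evaluated.
  def refresh(): Unit = {
    cachedAllPartitions = null
  }
}
```

With this shape, repeated calls to `allPartitions` reuse one listing until `refresh()` is called, after which the next call lists again.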

cloud-fan · 2016-10-18T02:35:47Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala

+              sql("select * from test").count()
+            }
+            assert(e2.getMessage.contains("FileNotFoundException"))
+            spark.catalog.refreshByPath(dir.getAbsolutePath)


Note: before #14690, users needed to refresh one of the partition paths to invalidate the cache, but now they have to refresh the table path, because TableFileCatalog.rootPaths only contains the table path while ListingFileCatalog.rootPaths only contains partition paths.

I think it's better than before, but it's still a breaking change. Should we document it in the 2.1 release notes?

Makes sense. To get the old behavior, they can also disable the feature flag.

cloud-fan · 2016-10-18T02:38:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala

-            .map(_.makeQualified(fs.getUri, fs.getWorkingDirectory))
-            .contains(qualifiedPath)
+            .map(_.makeQualified(fs.getUri, fs.getWorkingDirectory).toString)
+            .exists(_.startsWith(prefixToInvalidate))


Why do we need the prefix resolution? I think it's useful, so that users can refresh the table path to invalidate the cache for a partitioned data source table, but it's not related to this PR, right?

You actually need this when metastore partition pruning is disabled for converted hive tables. Otherwise, the unit test below would fail on that case.

(but yeah, we could also leave that alone)
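The difference between the two matching rules in the hunk above can be sketched on plain strings. This is a simplified illustration, assuming the root paths are already qualified and rendered as strings; `RefreshMatchSketch` and its method names are hypothetical, and the example paths are made up.

```scala
object RefreshMatchSketch {
  // Old rule in the diff: the refreshed path must equal one of the root paths.
  def matchesExactly(rootPaths: Seq[String], path: String): Boolean =
    rootPaths.contains(path)

  // New rule in the diff: any root path that starts with the prefix matches.
  def matchesByPrefix(rootPaths: Seq[String], prefixToInvalidate: String): Boolean =
    rootPaths.exists(_.startsWith(prefixToInvalidate))

  def main(args: Array[String]): Unit = {
    // A relation whose root paths are its partition directories.
    val partitionRoots = Seq("file:/warehouse/test/p=1", "file:/warehouse/test/p=2")
    val tablePath = "file:/warehouse/test"

    println(matchesExactly(partitionRoots, tablePath))  // table path is not itself a root
    println(matchesByPrefix(partitionRoots, tablePath)) // every partition root starts with it
  }
}
```

One consequence of a plain string prefix is that it also matches sibling directories sharing the same leading characters, for example `file:/warehouse/test2` for the prefix `file:/warehouse/test`.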

SparkQA · 2016-10-18T20:52:31Z

Test build #67137 has finished for PR 15521 at commit 5088ae0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-10-19T02:21:36Z

thanks, merging to master!

## What changes were proposed in this pull request?

There was a bug introduced in apache#14690 which broke refreshByPath with converted hive tables (though, it turns out it was very difficult to refresh converted hive tables anyways, since you had to specify the exact path of one of the partitions).

This changes refreshByPath to invalidate by prefix instead of exact match, and fixes the issue.

cc sameeragarwal for refreshByPath changes
mallman

## How was this patch tested?

Extended unit test.

Author: Eric Liang <ekl@databricks.com>

Closes apache#15521 from ericl/fix-caching.

Mon Oct 17 15:00:03 PDT 2016

7415666

cloud-fan reviewed Oct 18, 2016

5088ae0

ericl force-pushed the fix-caching branch from b699915 to 5088ae0 Compare October 18, 2016 18:29

asfgit closed this in 5f20ae0 Oct 19, 2016
