
[SPARK-17980] [SQL] Fix refreshByPath for converted Hive tables #15521


Closed
wants to merge 2 commits

Conversation

ericl
Contributor

@ericl ericl commented Oct 17, 2016

What changes were proposed in this pull request?

There was a bug introduced in #14690 which broke refreshByPath with converted Hive tables (though it turns out it was already very difficult to refresh converted Hive tables, since you had to specify the exact path of one of the partitions).

This changes refreshByPath to invalidate by prefix instead of exact match, and fixes the issue.
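The prefix-based invalidation described above can be sketched as follows. This is an illustrative model only, not the actual Spark code; all names and paths here are hypothetical:

```scala
// Illustrative model of prefix-based cache invalidation
// (hypothetical names, not the real Spark implementation).
object PrefixInvalidationSketch {
  // Keep only cached entries whose qualified path does NOT start with the
  // prefix being refreshed.
  def invalidate(cachedPaths: Set[String], prefix: String): Set[String] =
    cachedPaths.filterNot(_.startsWith(prefix))

  def main(args: Array[String]): Unit = {
    val cached = Set(
      "/warehouse/test/part=1",
      "/warehouse/test/part=2",
      "/warehouse/other/part=1")
    // Refreshing the table path now invalidates every partition under it,
    // instead of requiring an exact match on one partition path.
    val remaining = invalidate(cached, "/warehouse/test")
    assert(remaining == Set("/warehouse/other/part=1"))
  }
}
```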

cc @sameeragarwal for refreshByPath changes
@mallman

How was this patch tested?

Extended unit test.

@SparkQA

SparkQA commented Oct 18, 2016

Test build #67090 has finished for PR 15521 at commit 7415666.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -343,7 +343,8 @@ abstract class Catalog {

/**
* Invalidate and refresh all the cached data (and the associated metadata) for any dataframe that
* contains the given data source path.
* contains the given data source path. Path matching is by prefix, i.e. "/" would invalidate
* everything that is cached.
Contributor


So we are changing the semantics of REFRESH PATH, right?

Contributor Author


Yeah

@cloud-fan
Contributor

Before #14690, users could refresh the whole table (including all partitions) by running REFRESH on the table path, right? Even when the partition paths are not under the table path.

@ericl
Contributor Author

ericl commented Oct 18, 2016

That's correct; REFRESH TABLE has always worked. There was just a bug introduced that broke refreshByPath, since it doesn't invalidate the lazy val.
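The lazy val problem can be seen in a minimal sketch, simplified far beyond the real TableFileCatalog (names here are hypothetical):

```scala
// Why a lazy val defeats invalidation: it is computed at most once per
// instance, and nothing can reset it afterwards.
// (Hypothetical sketch, not the real Spark code.)
class LazyCatalog(list: () => Seq[String]) {
  lazy val allPartitions: Seq[String] = list()
}

object LazyValDemo {
  def main(args: Array[String]): Unit = {
    var state = Seq("p=1")
    val catalog = new LazyCatalog(() => state)
    val first = catalog.allPartitions  // forces the lazy val: Seq("p=1")
    state = Seq("p=1", "p=2")          // underlying data changes...
    val second = catalog.allPartitions // ...but the stale value is returned
    assert(second == Seq("p=1"))       // no way to invalidate and recompute
  }
}
```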

@sameeragarwal
Member

The new prefix-matching semantics make sense to me.

@@ -49,13 +49,18 @@ class TableFileCatalog(

private val baseLocation = catalogTable.storage.locationUri

// Populated on-demand by calls to cachedAllPartitions
private var allPartitions: ListingFileCatalog = null
Contributor


nit: according to the existing naming style, we should name this var cachedAllPartitions and name the public method allPartitions.

Contributor Author


Done
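The suggested convention can be sketched like this. The real class caches a ListingFileCatalog; a Seq[String] stands in here, and the surrounding class is hypothetical:

```scala
// Sketch of the suggested naming convention: the private mutable cache is
// prefixed with "cached", while the public accessor keeps the plain name.
class TableFileCatalogSketch(listAll: () => Seq[String]) {
  private var cachedAllPartitions: Seq[String] = null

  def allPartitions: Seq[String] = {
    if (cachedAllPartitions == null) cachedAllPartitions = listAll()
    cachedAllPartitions
  }

  def refresh(): Unit = { cachedAllPartitions = null } // invalidation now possible
}
```

Unlike a lazy val, this cache can be dropped by refresh() and is lazily repopulated on the next allPartitions call.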

sql("select * from test").count()
}
assert(e2.getMessage.contains("FileNotFoundException"))
spark.catalog.refreshByPath(dir.getAbsolutePath)
Contributor


Note: before #14690, users needed to refresh one of the partition paths to invalidate the cache, but now they have to refresh the table path, because TableFileCatalog.rootPaths only contains the table path while ListingFileCatalog.rootPaths only contains partition paths.

I think it's better than before, but it's still a breaking change; should we document it in the 2.1 release notes?

Contributor Author


Makes sense. To get the old behavior, they can also disable the feature flag.
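The change in which path must be refreshed can be sketched with hypothetical qualified paths:

```scala
// Hypothetical paths illustrating the change in refresh semantics.
object RootPathsDemo {
  def main(args: Array[String]): Unit = {
    val tablePath     = "/warehouse/test"     // what TableFileCatalog.rootPaths holds
    val partitionPath = "/warehouse/test/p=1" // what ListingFileCatalog.rootPaths holds

    // After this PR, refreshing the table path covers every partition under
    // it, because matching is by prefix rather than exact equality.
    assert(partitionPath.startsWith(tablePath))
    // The old exact match would have missed the table path entirely:
    assert(partitionPath != tablePath)
  }
}
```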

.map(_.makeQualified(fs.getUri, fs.getWorkingDirectory))
.contains(qualifiedPath)
.map(_.makeQualified(fs.getUri, fs.getWorkingDirectory).toString)
.exists(_.startsWith(prefixToInvalidate))
Contributor

@cloud-fan cloud-fan Oct 18, 2016


Why do we need the prefix resolution? I think it's useful, so that users can refresh the table path to invalidate the cache for a partitioned data source table, but it's not related to this PR, right?

Contributor Author

@ericl ericl Oct 18, 2016


You actually need this when metastore partition pruning is disabled for converted Hive tables. Otherwise, the unit test below would fail in that case.

(but yeah, we could also leave that alone)

@SparkQA

SparkQA commented Oct 18, 2016

Test build #67137 has finished for PR 15521 at commit 5088ae0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 5f20ae0 Oct 19, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
Author: Eric Liang <ekl@databricks.com>

Closes apache#15521 from ericl/fix-caching.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017