
[SPARK-18817] [SPARKR] [SQL] Set default warehouse dir to tempdir #16290


Closed
wants to merge 9 commits into from

Conversation

shivaram
Contributor

What changes were proposed in this pull request?

This PR sets the default warehouse dir to a temporary directory in SparkR to avoid creating directories in the working directory (see JIRA for more details).

To do this we introduce a new SQL config that is used to configure the default warehouse directory. For all other frontends, existing behavior is maintained.

How was this patch tested?

Ran unit tests locally and tested manually with the SparkR shell.
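As an illustration of the intended behavior (a minimal Scala sketch, not code from this PR; it assumes a build that includes the proposed spark.sql.default.warehouse.dir config, and the app and table names are made up):

    import java.io.File
    import java.nio.file.Files
    import org.apache.spark.sql.SparkSession

    object DefaultWarehouseExample {
      def main(args: Array[String]): Unit = {
        val tempWarehouse = Files.createTempDirectory("spark-warehouse-").toString
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("default-warehouse-example")
          // Fallback proposed by this PR; it applies only when
          // spark.sql.warehouse.dir is not set explicitly.
          .config("spark.sql.default.warehouse.dir", tempWarehouse)
          .getOrCreate()
        // A managed table created without an explicit location should now land
        // under tempWarehouse rather than ./spark-warehouse in the working directory.
        spark.range(10).write.saveAsTable("warehouse_example")
        println(new File(tempWarehouse).list().mkString(", "))
        spark.stop()
      }
    }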

To do this we introduce a new SQL config that is set to tempdir
from SparkR.
@shivaram
Contributor Author

cc @bdwyer2 - who did an initial version at #16247
cc @felixcheung @cloud-fan for review

R/pkg/R/sparkR.R Outdated
# NOTE(shivaram): Set default warehouse dir to tmpdir to meet CRAN requirements
# See SPARK-18817 for more details
if (!exists("spark.sql.default.warehouse.dir", envir = sparkConfigMap)) {
assign("spark.sql.default.warehouse.dir", tempdir(), envir = sparkConfigMap)
Member

I think we could just sparkConfigMap[["spark.sql.warehouse.default.dir"]] <- tempdir()

Member

@felixcheung felixcheung Dec 15, 2016

I think we should move this after L383 "overrideEnvs(sparkConfigMap, paramMap)" in case spark.sql.warehouse.default.dir is passed in as a named param and explicitly set to something other than the tmp dir

Contributor Author

Moved this below L383 - we still need the exists check to make sure we don't overwrite any user-provided value?

@@ -2165,6 +2165,14 @@ test_that("SQL error message is returned from JVM", {
expect_equal(grepl("blah", retError), TRUE)
})

test_that("Default warehouse dir should be set to tempdir", {
# nothing should be written outside tempdir() without explicit user permission
inital_working_directory_files <- list.files()
Member

If the warehouse dir ("spark-warehouse") is already there before running this test, then the list of files won't change?


Does Jenkins start with a new workspace every time it runs a test?

Member

I'm referring to other tests in this test file, test_sparkSQL, that call APIs that might already initialize the warehouse dir.

sparkR.session() is called at the top. Does this createOrReplaceTempView cause the warehouse dir to be created?

https://github.com/shivaram/spark-1/blob/25834109588e8e545deafb1da162958766a057e2/R/pkg/inst/tests/testthat/test_sparkSQL.R#L570

Member

@felixcheung felixcheung Dec 15, 2016

From my test, the spark-warehouse directory is created when I run a <- createDataFrame(iris)

so I think by the time this test is run this directory would already be there

Contributor Author

Yeah. I think @felixcheung's point is right - the dir should be created early on. Also, we sometimes configure hive.metastore.dir in our tests, so I don't see it come up when I run the test cases. I'll try to see if we can design a different test case.

Contributor Author

I refactored this test to recursively list all the files and check if spark-warehouse is in the list. Another option would be to check if the specific table is in it.

@SparkQA

SparkQA commented Dec 15, 2016

Test build #70171 has finished for PR 16290 at commit 2583410.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -55,14 +55,19 @@ private[sql] class SharedState(val sparkContext: SparkContext) extends Logging {
s"is set. Setting ${WAREHOUSE_PATH.key} to the value of " +
s"hive.metastore.warehouse.dir ('$hiveWarehouseDir').")
hiveWarehouseDir
} else {
} else if (sparkContext.conf.contains(WAREHOUSE_PATH.key) &&
sparkContext.conf.get(WAREHOUSE_PATH).isDefined) {
Member

Nit: indent is not right.

Contributor Author

Indented 4 spaces now

@gatorsmile
Member

gatorsmile commented Dec 15, 2016

If the default database has already been created in the metastore, any following changes of spark.sql.default.warehouse.dir can trigger an issue when we create a data source table in the default database (Here, we assume Hive support is enabled). Note, we will not hit any issue if we create a Hive serde table in the default database, or create a data source table in the non-default database.

The directory of managed data source tables is created by Hive. When creating a new data source table, the created directory is based on the current value of hive.metastore.warehouse.dir. However, the value of the table location in the metastore points to a child directory of the location of the default database. Thus, you will not hit any issue when you create such a table. However, the mismatch will cause a problem (because the expected directory does not exist) when we try to select from / insert into this table. This is a bug of the Hive metastore.

@dilipbiswal hit this issue very recently. Below shows the location of these two tables.

t11 is a Hive managed data source table we created in the default database. After we create t11, the directory /user/hive/warehouse/t11 is not created by the Hive metastore. Instead, the directory /home/cloudera/mygit/apache/spark/bin/spark-warehouse/t11 is created.

spark-sql> describe extended t11;
...
    Storage(Location: file:/user/hive/warehouse/t11, InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat, Serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Properties: [serialization.format=1]))    
Time taken: 0.105 seconds, Fetched 8 row(s)

t1 is a Hive managed data source table we created in the non-default database dilip that was created after we changed spark.sql.default.warehouse.dir.

spark-sql> use dilip;
Time taken: 0.028 seconds
spark-sql> describe extended t1;
...
    Storage(Location: file:/home/cloudera/mygit/apache/spark/bin/spark-warehouse/dilip.db/t1, InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat, Serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Properties: [serialization.format=1]))    

@SparkQA

SparkQA commented Dec 15, 2016

Test build #70181 has started for PR 16290 at commit 1d0d1d2.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Dec 15, 2016

Test build #70199 has finished for PR 16290 at commit 1d0d1d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor Author

Ugh - the failure seems to be from HiveClientSuite and I don't think it's related to this PR (as pasted below). However, I'm refactoring the SparkR test case, so let me do that and then re-trigger a test.

[info] - getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false *** FAILED *** (8 seconds, 479 milliseconds)
[info]   java.lang.RuntimeException: [unresolved dependency: org.apache.hive#hive-metastore;1.2.1: not found, unresolved dependency: org.apache.hive#hive-exec;1.2.1: not found, unresolved dependency: org.apache.hive#hive-common;1.2.1: not found, unresolved dependency: org.apache.hive#hive-serde;1.2.1: not found]

@SparkQA

SparkQA commented Dec 16, 2016

Test build #70236 has started for PR 16290 at commit 014d7e1.

.stringConf
.createWithDefault(Utils.resolveURI("spark-warehouse").toString)

val WAREHOUSE_PATH = buildConf("spark.sql.warehouse.dir")
.doc("The location for managed databases and tables.")
Member

The description is not right. spark.sql.warehouse.dir is still the default location when we create a database/table without providing the location value.

Contributor Author

That's a good point. I misunderstood the meaning of default there. Will fix this now.

@shivaram
Contributor Author

@gatorsmile Thanks for taking a look. Addressed your comments now. Let's see if Jenkins passes.

@SparkQA

SparkQA commented Dec 16, 2016

Test build #70269 has finished for PR 16290 at commit 6eec97d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor Author

@felixcheung @bdwyer2 Could you take another look?

# Create a temporary table
sql("CREATE TABLE people_warehouse_test")
# spark-warehouse should be written only to tempdir() and not the current working directory
res <- list.files(path = ".", pattern = ".*spark-warehouse.*",

should we test to make sure that no files are created during this process instead of only checking for spark-warehouse?

Contributor Author

Well - given that this PR is only changing the warehouse dir, I think it's only fair to test for that. Or in other words, adding such a test would fail right now because of derby.log etc. (per our JIRA discussion)?

@@ -221,6 +221,19 @@ class SQLConfSuite extends QueryTest with SharedSQLContext {
.sessionState.conf.warehousePath.stripSuffix("/"))
}

test("changing default value of warehouse path") {
Member

Currently, this test case only covers one of four cases: spark.sql.default.warehouse.dir is set and spark.sql.warehouse.dir is not set. We also need to check the other three cases:

  • spark.sql.default.warehouse.dir is not set and spark.sql.warehouse.dir is not set
  • spark.sql.default.warehouse.dir is set and spark.sql.warehouse.dir is set
  • spark.sql.default.warehouse.dir is not set and spark.sql.warehouse.dir is set

Contributor Author

Good point. Added tests for all 4 cases now
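For reference, the precedence covered by those four cases can be sketched as a small spark-shell snippet (illustrative only, not the SQLConfSuite code; it assumes an explicit spark.sql.warehouse.dir always wins over the new spark.sql.default.warehouse.dir, which in turn wins over the built-in spark-warehouse default):

    import org.apache.spark.SparkConf

    // Mirrors the resolution order under discussion.
    def resolveWarehouse(conf: SparkConf): String =
      conf.getOption("spark.sql.warehouse.dir")
        .orElse(conf.getOption("spark.sql.default.warehouse.dir"))
        .getOrElse("spark-warehouse")

    val onlyUser    = new SparkConf(false).set("spark.sql.warehouse.dir", "/user/dir")
    val bothSet     = new SparkConf(false).set("spark.sql.warehouse.dir", "/user/dir")
                                          .set("spark.sql.default.warehouse.dir", "/tmp/dir")
    val onlyDefault = new SparkConf(false).set("spark.sql.default.warehouse.dir", "/tmp/dir")
    val neitherSet  = new SparkConf(false)

    assert(resolveWarehouse(onlyUser) == "/user/dir")          // warehouse.dir set, default not set
    assert(resolveWarehouse(bothSet) == "/user/dir")           // both set: the explicit setting wins
    assert(resolveWarehouse(onlyDefault) == "/tmp/dir")        // only the new default is set
    assert(resolveWarehouse(neitherSet) == "spark-warehouse")  // neither set: built-in default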

@@ -819,7 +819,13 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging {

def variableSubstituteDepth: Int = getConf(VARIABLE_SUBSTITUTE_DEPTH)

def warehousePath: String = new Path(getConf(StaticSQLConf.WAREHOUSE_PATH)).toString
def warehousePath: String = {
if (contains(StaticSQLConf.WAREHOUSE_PATH.key)) {
Member

What is the reason we are not doing the same check as in the other place?

Contributor Author

Nice catch - Added the same check here as well

@@ -964,10 +970,16 @@ object StaticSQLConf {
}
}

val DEFAULT_WAREHOUSE_PATH = buildConf("spark.sql.default.warehouse.dir")
Member

Should we make it internal?

Contributor Author

I am not familiar with this part of the code base - what are the consequences of making it internal? Is it just in terms of what shows up in documentation, or does it affect how users can use it?

Member

For an internal configuration, it will not be printed out. For example, you can try something like

spark.sql("SET -v").show(numRows = 200, truncate = false)

sparkContext.conf.set("spark.sql.default.warehouse.dir", newWarehouseDefaultPath)
val spark = new SparkSession(sparkContext)
assert(newWarehouseDefaultPath.stripSuffix("/") === spark
.sessionState.conf.warehousePath.stripSuffix("/"))
Member

We also need a check for spark.sharedState.warehousePath because we made logic changes there.

Contributor Author

Done

@gatorsmile
Member

I finished my review. cc @cloud-fan

@shivaram
Contributor Author

Thanks @gatorsmile - Addressed your comments.

@SparkQA

SparkQA commented Dec 18, 2016

Test build #70314 has finished for PR 16290 at commit f7b4772.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Dec 18, 2016

Test build #70316 has finished for PR 16290 at commit f7b4772.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# Create a temporary table
sql("CREATE TABLE people_warehouse_test")
# spark-warehouse should be written only to tempdir() and not the current working directory
res <- list.files(path = ".", pattern = ".*spark-warehouse.*",
Member

I was testing this and it looked like the current directory (.) was SPARK_HOME/R/pkg/inst/tests/testthat, while spark-warehouse would have been in SPARK_HOME/R

Contributor Author

Ah I see - will check this today. I think if SPARK_HOME is accessible I can just call list.files with that as the path.

Member

@felixcheung felixcheung Dec 18, 2016

I think a couple of other test files would have called sparkR.session already (binary_function, binaryFile, broadcast), so I'd propose adding a new test explicitly named so that it runs the very first, i.e. https://github.com/apache/spark/pull/16330/files#diff-5ff1ba5d1751f3b1cc96a567e9ab25ffR18

Contributor Author

I'm not sure why it needs to run first? Since the default warehouse dir is in tempdir, even if sparkR.session is called before this test, it shouldn't create any warehouse dir in SPARK_HOME/?

Member

@felixcheung felixcheung Dec 19, 2016

You are right.
In this case we are specifically looking for spark-warehouse; I guess I was referring to a general check to make sure that, going forward, the list of files before running anything in the package equals the list of files after.

Contributor Author

I think my bigger concern is that tests are usually run all together - i.e. core, sql, hive and then python, R. And there are no guarantees that other module tests won't create files inside SPARK_HOME afaik. So while we can check some basic things with our test, I don't think verifying a global property is always possible.

Member

That I agree with completely.

@shivaram
Contributor Author

@gatorsmile The test error in HiveSparkSubmitSuite seems related. I am debugging locally

@shivaram
Contributor Author

@gatorsmile I think I figured out the problem with HiveSparkSubmitSuite but I'm not sure how to solve it. The problem is that in one of the test cases we check that the DB location is the same as the warehouse path [1].

Now the warehouse path correctly points to spark-warehouse - the catalog, though, seems to be configured with an existing metastore_db. On my machine (and I guess in Jenkins), if we have a metastore_db left over from running the SparkR tests, then the default location points to the R tempdir, which is no longer valid.

It seems to me that the right solution here is to also clear the metastore_db once a set of tests finishes. Is there a way to do this from within the tests, or should we be running rm from some of the test-running scripts?

[1]

assert(new Path(defaultDbLocation) == new Path(spark.sharedState.warehousePath))

@felixcheung
Member

felixcheung commented Dec 19, 2016

@shivaram with my PR #16330, metastore_db is moved to tempdir and is removed when the R process exits.

@gatorsmile
Member

retest this please

@gatorsmile
Member

gatorsmile commented Dec 19, 2016

I checked the two most recent failed test cases in Jenkins. They are not related to the changes in this PR.

In the local environment, I can reproduce the error you mentioned above. I assume what you said is the test case SPARK-18360: default table path of tables in default database should depend on the location of default database. This test case failed because of the following check:

assert(new Path(defaultDbLocation) == new Path(spark.sharedState.warehousePath))

I printed them out and found they are actually different.

[info]   2016-12-18 20:47:23.328 - stdout> path1: file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse
[info]   2016-12-18 20:47:23.328 - stdout> path2: file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/

The location of the default database is still pointing to the original value of hive.metastore.warehouse.dir or spark.sql.warehouse.dir that was set in the previous test case or a previous local Spark job. Ideally, our test suite should directly connect to Derby and drop the default database. Let me do more research.

Also cc @yhuai

@gatorsmile
Member

gatorsmile commented Dec 19, 2016

After some research: to avoid this flaky test case, the simplest way is to remove the contents of metastore_db (whose location is specified through javax.jdo.option.ConnectionURL) at the beginning and end of any test case that changes the value of hive.metastore.warehouse.dir or spark.sql.warehouse.dir.
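One possible shape for that cleanup (a sketch, assuming the Derby metastore lives at the default ./metastore_db location implied by javax.jdo.option.ConnectionURL; the helper name is made up):

    import java.io.File
    import scala.reflect.io.Directory

    def clearMetastoreDb(): Unit = {
      val dir = new File("metastore_db")
      if (dir.exists()) new Directory(dir).deleteRecursively()
    }

    clearMetastoreDb()     // before a test that changes the warehouse location
    try {
      // ... test body that modifies hive.metastore.warehouse.dir or spark.sql.warehouse.dir ...
    } finally {
      clearMetastoreDb()   // and again once the test finishes
    }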

@SparkQA

SparkQA commented Dec 19, 2016

Test build #70335 has finished for PR 16290 at commit f7b4772.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Generally, it looks good to me. The only concern is the external behavior change. Since Spark 2.0, users have been able to get the warehouse location in the following two ways:

      spark.sql("set spark.sql.warehouse.dir").show()
      println(spark.conf.get("spark.sql.warehouse.dir"))

After this PR, the output will be <undefined>. Do we need extra code changes to fix this? @cloud-fan

@shivaram
Contributor Author

@cloud-fan Any thoughts on this?

R/pkg/R/sparkR.R Outdated
@@ -376,6 +377,12 @@ sparkR.session <- function(
overrideEnvs(sparkConfigMap, paramMap)
}

# NOTE(shivaram): Set default warehouse dir to tmpdir to meet CRAN requirements
# See SPARK-18817 for more details
if (!exists("spark.sql.default.warehouse.dir", envir = sparkConfigMap)) {
Member

After rethinking it, we might not need to add an extra SQL conf. We just need to know whether the value of spark.sql.warehouse.dir came from the user or is the original default. If it is the default, R can simply change it.

Maybe it is a good-to-have feature for users to know whether a SQLConf value came from the user or from the default. cc @cloud-fan

Contributor

Actually we can: SessionState.conf.settings contains all the user-set entries.

Contributor Author

Ah I see - I will try to use SessionState and see if it can avoid having to create a new option.
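A rough sketch of that idea (illustrative only; SessionState is internal API, so such a check would have to live inside org.apache.spark.sql, and session is assumed to be an existing SparkSession):

    // settings holds only entries that were explicitly set, so a missing key
    // means the warehouse location is still the built-in default and SparkR
    // could safely substitute a tempdir-based value before it is first used.
    val userSetWarehouse =
      session.sessionState.conf.contains("spark.sql.warehouse.dir")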

shivaram added 2 commits March 5, 2017 22:16
Also add a unit test that checks if new table is created in tmpdir
@SparkQA

SparkQA commented Mar 6, 2017

Test build #73973 has started for PR 16290 at commit b14c302.

@shivaram
Contributor Author

shivaram commented Mar 6, 2017

@gatorsmile @cloud-fan @felixcheung I looked at the SharedState code more closely and it looks like the only time the warehousePath can be set is when the initialization of shared state happens. So I modified the code to set the temp dir for SparkR during SparkSession initialization if it has not already been set.

I verified that this works locally and with a unit test -- is there anything I might be missing with this approach?
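A sketch of what that approach amounts to (not the PR's actual code; the function name is made up, and the check has to run before the first SparkSession, and hence SharedState, is created):

    import java.nio.file.Files
    import org.apache.spark.SparkConf

    // Applied only for sessions created on behalf of SparkR.
    def applySparkRWarehouseDefault(conf: SparkConf): Unit = {
      val alreadySet = conf.contains("spark.sql.warehouse.dir") ||
        conf.contains("hive.metastore.warehouse.dir")
      if (!alreadySet) {
        // SharedState reads the warehouse location once, at initialization time,
        // so the temp dir must be in place before the session is built.
        conf.set("spark.sql.warehouse.dir",
          Files.createTempDirectory("spark-warehouse-").toString)
      }
    }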

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74074 has started for PR 16290 at commit 7a98b91.

@felixcheung
Member

So based on this comment #16330 (comment), doesn't it mean we shouldn't set the warehouse dir to be under tempdir()?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74074/
Test FAILed.

@shivaram
Contributor Author

@felixcheung Yeah I think that sounds good. I can close this PR for now and we can revisit this if we have an issue in the future. Also I guess you will update #16330 for the derby.log change. Does this sound okay?

@felixcheung
Member

felixcheung commented Mar 11, 2017 via email

@shivaram shivaram closed this Mar 11, 2017