
[SPARK-18817] [SPARKR] [SQL] Set default warehouse dir to tempdir #16290


Closed
wants to merge 9 commits into from

Conversation

shivaram
Contributor

What changes were proposed in this pull request?

This PR sets the default warehouse dir to a temporary directory in SparkR to avoid creating directories in the working directory (see JIRA for more details).

To do this we introduce a new SQL config that is used to configure the default warehouse directory. For all other frontends, existing behavior is maintained.

How was this patch tested?

Ran unit tests locally and tested manually with the SparkR shell.
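As an illustration of the intended behavior (a minimal Scala sketch, not code from this PR; it assumes a build that includes the proposed spark.sql.default.warehouse.dir config, and the app and table names are made up):

    import java.io.File
    import java.nio.file.Files
    import org.apache.spark.sql.SparkSession

    object DefaultWarehouseExample {
      def main(args: Array[String]): Unit = {
        val tempWarehouse = Files.createTempDirectory("spark-warehouse-").toString
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("default-warehouse-example")
          // Fallback proposed by this PR; it applies only when
          // spark.sql.warehouse.dir is not set explicitly.
          .config("spark.sql.default.warehouse.dir", tempWarehouse)
          .getOrCreate()
        // A managed table created without an explicit location should now land
        // under tempWarehouse rather than ./spark-warehouse in the working directory.
        spark.range(10).write.saveAsTable("warehouse_example")
        println(new File(tempWarehouse).list().mkString(", "))
        spark.stop()
      }
    }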

To do this we introduce a new SQL config that is set to tempdir
from SparkR.
@shivaram
Contributor Author

cc @bdwyer2 - who did an initial version at #16247
cc @felixcheung @cloud-fan for review

R/pkg/R/sparkR.R Outdated
# NOTE(shivaram): Set default warehouse dir to tmpdir to meet CRAN requirements
# See SPARK-18817 for more details
if (!exists("spark.sql.default.warehouse.dir", envir = sparkConfigMap)) {
assign("spark.sql.default.warehouse.dir", tempdir(), envir = sparkConfigMap)
Member

I think we could just sparkConfigMap[["spark.sql.warehouse.default.dir"]] <- tempdir()

Member

@felixcheung felixcheung Dec 15, 2016

I think we should move this after L383 "overrideEnvs(sparkConfigMap, paramMap)" in case spark.sql.warehouse.default.dir is passed in as a named param and explicitly set to something other than the tmp dir

Contributor Author

Moved this below L383 - we still need the exists check to make sure we don't overwrite any user-provided value?

@@ -2165,6 +2165,14 @@ test_that("SQL error message is returned from JVM", {
expect_equal(grepl("blah", retError), TRUE)
})

test_that("Default warehouse dir should be set to tempdir", {
# nothing should be written outside tempdir() without explicit user permission
inital_working_directory_files <- list.files()
Member

If the warehouse dir ("spark-warehouse") is already there before running this test, then the list of files won't change?


Does Jenkins start with a new workspace every time it runs a test?

Member

I'm referring to other tests in this test file, test_sparkSQL, that call APIs that might already initialize the warehouse dir.

sparkR.session() is called at the top. Does this createOrReplaceTempView cause the warehouse dir to be created?

https://github.com/shivaram/spark-1/blob/25834109588e8e545deafb1da162958766a057e2/R/pkg/inst/tests/testthat/test_sparkSQL.R#L570

Member

@felixcheung felixcheung Dec 15, 2016

From my test, the spark-warehouse directory is created when I run a <- createDataFrame(iris)

so I think by the time this test is run this directory would already be there

Contributor Author

Yeah. I think @felixcheung's point is right - the dir should be created early on. Also, we sometimes configure hive.metastore.dir in our tests, so I don't see it come up when I run the test cases. I'll try to see if we can design a different test case.

Contributor Author

I refactored this test to recursively list all the files and check if spark-warehouse is in the list. Another option would be to check if the specific table is in it.

@SparkQA

SparkQA commented Dec 15, 2016

Test build #70171 has finished for PR 16290 at commit 2583410.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -55,14 +55,19 @@ private[sql] class SharedState(val sparkContext: SparkContext) extends Logging {
s"is set. Setting ${WAREHOUSE_PATH.key} to the value of " +
s"hive.metastore.warehouse.dir ('$hiveWarehouseDir').")
hiveWarehouseDir
} else {
} else if (sparkContext.conf.contains(WAREHOUSE_PATH.key) &&
sparkContext.conf.get(WAREHOUSE_PATH).isDefined) {
Member

Nit: indent is not right.

Contributor Author

Indented 4 spaces now

@gatorsmile
Member

gatorsmile commented Dec 15, 2016

If the default database has already been created in the metastore, any following changes of spark.sql.default.warehouse.dir can trigger an issue when we create a data source table in the default database (Here, we assume Hive support is enabled). Note, we will not hit any issue if we create a Hive serde table in the default database, or create a data source table in the non-default database.

The directory of managed data source tables is created by Hive. When creating a new data source table, the created directory is based on the current value of hive.metastore.warehouse.dir. However, the value of the table location in the metastore points to a child directory of the location of the default database. Thus, you will not hit any issue when you create such a table. However, the mismatch will cause a problem (because the expected directory does not exist) when we try to select from / insert into this table. This is a bug of the Hive metastore.

@dilipbiswal hit this issue very recently. Below shows the location of these two tables.

t11 is a Hive managed data source table we created in the default database. After we create t11, the directory /user/hive/warehouse/t11 is not created by the Hive metastore. Instead, the directory /home/cloudera/mygit/apache/spark/bin/spark-warehouse/t11 is created.

spark-sql> describe extended t11;
...
    Storage(Location: file:/user/hive/warehouse/t11, InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat, Serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Properties: [serialization.format=1]))    
Time taken: 0.105 seconds, Fetched 8 row(s)

t1 is a Hive managed data source table we created in the non-default database dilip that was created after we changed spark.sql.default.warehouse.dir.

spark-sql> use dilip;
Time taken: 0.028 seconds
spark-sql> describe extended t1;
...
    Storage(Location: file:/home/cloudera/mygit/apache/spark/bin/spark-warehouse/dilip.db/t1, InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat, Serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Properties: [serialization.format=1]))    

@SparkQA

SparkQA commented Dec 15, 2016

Test build #70181 has started for PR 16290 at commit 1d0d1d2.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Dec 15, 2016

Test build #70199 has finished for PR 16290 at commit 1d0d1d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor Author

Ugh - the failure seems to be from HiveClientSuite and I don't think it's related to this PR (as pasted below). However, I'm refactoring the SparkR test case, so let me do that and then re-trigger a test.

[info] - getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false *** FAILED *** (8 seconds, 479 milliseconds)
[info]   java.lang.RuntimeException: [unresolved dependency: org.apache.hive#hive-metastore;1.2.1: not found, unresolved dependency: org.apache.hive#hive-exec;1.2.1: not found, unresolved dependency: org.apache.hive#hive-common;1.2.1: not found, unresolved dependency: org.apache.hive#hive-serde;1.2.1: not found]

@SparkQA

SparkQA commented Dec 16, 2016

Test build #70236 has started for PR 16290 at commit 014d7e1.

.stringConf
.createWithDefault(Utils.resolveURI("spark-warehouse").toString)

val WAREHOUSE_PATH = buildConf("spark.sql.warehouse.dir")
.doc("The location for managed databases and tables.")
Member

The description is not right. spark.sql.warehouse.dir is still the default location when we create a database/table without providing the location value.

Contributor Author

That's a good point. I misunderstood the meaning of default there. Will fix this now.

@shivaram
Contributor Author

@gatorsmile Thanks for taking a look. Addressed your comments now. Let's see if Jenkins passes.

@SparkQA

SparkQA commented Dec 16, 2016

Test build #70269 has finished for PR 16290 at commit 6eec97d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor Author

@felixcheung @bdwyer2 Could you take another look?

# Create a temporary table
sql("CREATE TABLE people_warehouse_test")
# spark-warehouse should be written only to tempdir() and not the current working directory
res <- list.files(path = ".", pattern = ".*spark-warehouse.*",

should we test to make sure that no files are created during this process instead of only checking for spark-warehouse?

Contributor Author

Well - given that this PR is only changing the warehouse dir, I think it's only fair to test for that. Or in other words, adding such a test would fail right now because of derby.log etc. (per our JIRA discussion)?

@@ -221,6 +221,19 @@ class SQLConfSuite extends QueryTest with SharedSQLContext {
.sessionState.conf.warehousePath.stripSuffix("/"))
}

test("changing default value of warehouse path") {
Member

Currently, this test case only covers one of four cases: spark.sql.default.warehouse.dir is set and spark.sql.warehouse.dir is not set. We also need to check the other three cases:

  • spark.sql.default.warehouse.dir is not set and spark.sql.warehouse.dir is not set
  • spark.sql.default.warehouse.dir is set and spark.sql.warehouse.dir is set
  • spark.sql.default.warehouse.dir is not set and spark.sql.warehouse.dir is set

Contributor Author

Good point. Added tests for all 4 cases now
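For reference, the precedence covered by those four cases can be sketched as a small spark-shell snippet (illustrative only, not the SQLConfSuite code; it assumes an explicit spark.sql.warehouse.dir always wins over the new spark.sql.default.warehouse.dir, which in turn wins over the built-in spark-warehouse default):

    import org.apache.spark.SparkConf

    // Mirrors the resolution order under discussion.
    def resolveWarehouse(conf: SparkConf): String =
      conf.getOption("spark.sql.warehouse.dir")
        .orElse(conf.getOption("spark.sql.default.warehouse.dir"))
        .getOrElse("spark-warehouse")

    val onlyUser    = new SparkConf(false).set("spark.sql.warehouse.dir", "/user/dir")
    val bothSet     = new SparkConf(false).set("spark.sql.warehouse.dir", "/user/dir")
                                          .set("spark.sql.default.warehouse.dir", "/tmp/dir")
    val onlyDefault = new SparkConf(false).set("spark.sql.default.warehouse.dir", "/tmp/dir")
    val neitherSet  = new SparkConf(false)

    assert(resolveWarehouse(onlyUser) == "/user/dir")          // warehouse.dir set, default not set
    assert(resolveWarehouse(bothSet) == "/user/dir")           // both set: the explicit setting wins
    assert(resolveWarehouse(onlyDefault) == "/tmp/dir")        // only the new default is set
    assert(resolveWarehouse(neitherSet) == "spark-warehouse")  // neither set: built-in default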

@@ -819,7 +819,13 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging {

def variableSubstituteDepth: Int = getConf(VARIABLE_SUBSTITUTE_DEPTH)

def warehousePath: String = new Path(getConf(StaticSQLConf.WAREHOUSE_PATH)).toString
def warehousePath: String = {
if (contains(StaticSQLConf.WAREHOUSE_PATH.key)) {
Member

What is the reason we are not doing the same check as in the other place?

Contributor Author

Nice catch - Added the same check here as well

@@ -964,10 +970,16 @@ object StaticSQLConf {
}
}

val DEFAULT_WAREHOUSE_PATH = buildConf("spark.sql.default.warehouse.dir")
Member

Should we make it internal?

Contributor Author

I am not familiar with this part of the code base - what are the consequences of making it internal? Is it just in terms of what shows up in documentation, or does it affect how users can use it?

Member

For an internal configuration, it will not be printed out. For example, you can try something like

spark.sql("SET -v").show(numRows = 200, truncate = false)

sparkContext.conf.set("spark.sql.default.warehouse.dir", newWarehouseDefaultPath)
val spark = new SparkSession(sparkContext)
assert(newWarehouseDefaultPath.stripSuffix("/") === spark
.sessionState.conf.warehousePath.stripSuffix("/"))
Member

We also need a check for spark.sharedState.warehousePath because we made logic changes there.

Contributor Author

Done

@gatorsmile
Member

I finished my review. cc @cloud-fan

@shivaram
Contributor Author

Thanks @gatorsmile - Addressed your comments.

@SparkQA

SparkQA commented Dec 18, 2016

Test build #70314 has finished for PR 16290 at commit f7b4772.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Dec 18, 2016

Test build #70316 has finished for PR 16290 at commit f7b4772.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# Create a temporary table
sql("CREATE TABLE people_warehouse_test")
# spark-warehouse should be written only to tempdir() and not the current working directory
res <- list.files(path = ".", pattern = ".*spark-warehouse.*",
Member

I was testing this and it looked like the current directory (.) was SPARK_HOME/R/pkg/inst/tests/testthat, while spark-warehouse would have been in SPARK_HOME/R

Contributor Author

Ah I see - will check this today. I think if SPARK_HOME is accessible I can just call list.files with that as the path.

Member

@felixcheung felixcheung Dec 18, 2016

I think a couple of other test files would have called sparkR.session already (binary_function, binaryFile, broadcast), so I'd propose adding a new test explicitly named so that it runs the very first, i.e. https://github.com/apache/spark/pull/16330/files#diff-5ff1ba5d1751f3b1cc96a567e9ab25ffR18

Contributor Author

I'm not sure why it needs to run first? Since the default warehouse dir is in tempdir, even if sparkR.session is called before this test, it shouldn't create any warehouse dir in SPARK_HOME/?

Member

@felixcheung felixcheung Dec 19, 2016

You are right.
In this case we are specifically looking for spark-warehouse; I guess I was referring to a general check to make sure that, going forward, the list of files before running anything in the package equals the list of files after.

Contributor Author

I think my bigger concern is that tests are usually run all together - i.e. core, sql, hive and then python, R. And there are no guarantees that other module tests won't create files inside SPARK_HOME afaik. So while we can check some basic things with our test, I don't think verifying a global property is always possible.

Member

That I agree with completely.

@shivaram
Contributor Author

@gatorsmile The test error in HiveSparkSubmitSuite seems related. I am debugging locally

@shivaram
Contributor Author

@gatorsmile I think I figured out the problem with HiveSparkSubmitSuite but I'm not sure how to solve it. The problem is that in one of the test cases we check that the DB location is the same as the warehouse path [1].

Now the warehouse path correctly points to spark-warehouse - the catalog, though, seems to be configured with an existing metastore_db. On my machine (and I guess in Jenkins), if we have a metastore_db left over from running the SparkR tests, then the default location points to the R tempdir, which is no longer valid.

It seems to me that the right solution here is to also clear the metastore_db once a set of tests finishes. Is there a way to do this from within the tests, or should we be running rm from some of the test-running scripts?

[1]

assert(new Path(defaultDbLocation) == new Path(spark.sharedState.warehousePath))

@felixcheung
Member

felixcheung commented Dec 19, 2016

@shivaram with my PR #16330, metastore_db is moved to tempdir and is removed when the R process exits.

@gatorsmile
Member

retest this please

@gatorsmile
Member

gatorsmile commented Dec 19, 2016

I checked the two most recent failed test cases in Jenkins. They are not related to the changes in this PR.

In the local environment, I can reproduce the error you mentioned above. I assume what you said is the test case SPARK-18360: default table path of tables in default database should depend on the location of default database. This test case failed because of the following check:

assert(new Path(defaultDbLocation) == new Path(spark.sharedState.warehousePath))

I printed them out and found they are actually different.

[info]   2016-12-18 20:47:23.328 - stdout> path1: file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse
[info]   2016-12-18 20:47:23.328 - stdout> path2: file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/

The location of the default database is still pointing to the original value of hive.metastore.warehouse.dir or spark.sql.warehouse.dir that was set in the previous test case or a previous local Spark job. Ideally, our test suite should directly connect to Derby and drop the default database. Let me do more research.

Also cc @yhuai

@gatorsmile
Member

gatorsmile commented Dec 19, 2016

After some research: to avoid this flaky test case, the simplest way is to remove the contents of metastore_db (whose location is specified through javax.jdo.option.ConnectionURL) at the beginning and end of any test case that changes the value of hive.metastore.warehouse.dir or spark.sql.warehouse.dir.
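One possible shape for that cleanup (a sketch, assuming the Derby metastore lives at the default ./metastore_db location implied by javax.jdo.option.ConnectionURL; the helper name is made up):

    import java.io.File
    import scala.reflect.io.Directory

    def clearMetastoreDb(): Unit = {
      val dir = new File("metastore_db")
      if (dir.exists()) new Directory(dir).deleteRecursively()
    }

    clearMetastoreDb()     // before a test that changes the warehouse location
    try {
      // ... test body that modifies hive.metastore.warehouse.dir or spark.sql.warehouse.dir ...
    } finally {
      clearMetastoreDb()   // and again once the test finishes
    }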

@SparkQA

SparkQA commented Dec 19, 2016

Test build #70335 has finished for PR 16290 at commit f7b4772.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Generally, it looks good to me. The only concern is the external behavior change. Since Spark 2.0, users have been able to get the warehouse location in the following two ways:

      spark.sql("set spark.sql.warehouse.dir").show()
      println(spark.conf.get("spark.sql.warehouse.dir"))

After this PR, the output will be <undefined>. Do we need extra code changes to fix this? @cloud-fan

@shivaram
Contributor Author

@cloud-fan Any thoughts on this?

R/pkg/R/sparkR.R Outdated
@@ -376,6 +377,12 @@ sparkR.session <- function(
overrideEnvs(sparkConfigMap, paramMap)
}

# NOTE(shivaram): Set default warehouse dir to tmpdir to meet CRAN requirements
# See SPARK-18817 for more details
if (!exists("spark.sql.default.warehouse.dir", envir = sparkConfigMap)) {
Member

After rethinking it, we might not need to add an extra SQL conf. We just need to know whether the value of spark.sql.warehouse.dir came from the user or is the original default. If it is the default, R can simply change it.

Maybe it is a good-to-have feature for users to know whether a SQLConf value came from the user or from the default. cc @cloud-fan

Contributor

Actually we can: SessionState.conf.settings contains all the user-set entries.

Contributor Author

Ah I see - I will try to use SessionState and see if it can avoid having to create a new option.
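A rough sketch of that idea (illustrative only; SessionState is internal API, so such a check would have to live inside org.apache.spark.sql, and session is assumed to be an existing SparkSession):

    // settings holds only entries that were explicitly set, so a missing key
    // means the warehouse location is still the built-in default and SparkR
    // could safely substitute a tempdir-based value before it is first used.
    val userSetWarehouse =
      session.sessionState.conf.contains("spark.sql.warehouse.dir")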

shivaram added 2 commits March 5, 2017 22:16
Also add a unit test that checks if new table is created in tmpdir
@SparkQA

SparkQA commented Mar 6, 2017

Test build #73973 has started for PR 16290 at commit b14c302.

@shivaram
Contributor Author

shivaram commented Mar 6, 2017

@gatorsmile @cloud-fan @felixcheung I looked at the SharedState code more closely and it looks like the only time the warehousePath can be set is when the initialization of shared state happens. So I modified the code to set the temp dir for SparkR during SparkSession initialization if it has not already been set.

I verified that this works locally and with a unit test -- is there anything I might be missing with this approach?
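A sketch of what that approach amounts to (not the PR's actual code; the function name is made up, and the check has to run before the first SparkSession, and hence SharedState, is created):

    import java.nio.file.Files
    import org.apache.spark.SparkConf

    // Applied only for sessions created on behalf of SparkR.
    def applySparkRWarehouseDefault(conf: SparkConf): Unit = {
      val alreadySet = conf.contains("spark.sql.warehouse.dir") ||
        conf.contains("hive.metastore.warehouse.dir")
      if (!alreadySet) {
        // SharedState reads the warehouse location once, at initialization time,
        // so the temp dir must be in place before the session is built.
        conf.set("spark.sql.warehouse.dir",
          Files.createTempDirectory("spark-warehouse-").toString)
      }
    }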

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74074 has started for PR 16290 at commit 7a98b91.

@felixcheung
Member

So based on this comment #16330 (comment), doesn't it mean we shouldn't set the warehouse dir to be under tempdir()?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74074/
Test FAILed.

@shivaram
Contributor Author

@felixcheung Yeah I think that sounds good. I can close this PR for now and we can revisit this if we have an issue in the future. Also I guess you will update #16330 for the derby.log change. Does this sound okay?

@felixcheung
Member

felixcheung commented Mar 11, 2017 via email

@shivaram shivaram closed this Mar 11, 2017