[SPARK-18817] [SPARKR] [SQL] Set default warehouse dir to tempdir #16290
Conversation
To do this we introduce a new SQL config that is set to tempdir from SparkR.
cc @bdwyer2 - who did an initial version at #16247
R/pkg/R/sparkR.R
Outdated
  # NOTE(shivaram): Set default warehouse dir to tmpdir to meet CRAN requirements
  # See SPARK-18817 for more details
  if (!exists("spark.sql.default.warehouse.dir", envir = sparkConfigMap)) {
    assign("spark.sql.default.warehouse.dir", tempdir(), envir = sparkConfigMap)
I think we could just do sparkConfigMap[["spark.sql.default.warehouse.dir"]] <- tempdir()
I think we should move this after L383 "overrideEnvs(sparkConfigMap, paramMap)", in case spark.sql.default.warehouse.dir is passed in as a named param and explicitly set to something other than the tmp dir.
Moved this below L383 - we still need the exists check to make sure we don't overwrite any user-provided value?
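A self-contained sketch of the ordering being discussed (plain R; the user-supplied value is hypothetical, and SparkR's internal overrideEnvs helper is mimicked with a simple loop):

```r
sparkConfigMap <- new.env()
paramMap <- list("spark.sql.default.warehouse.dir" = "/tmp/user-chosen-dir")  # hypothetical user value

# Apply user-supplied named parameters first (what overrideEnvs does in sparkR.R)
for (name in names(paramMap)) {
  assign(name, paramMap[[name]], envir = sparkConfigMap)
}

# Only install the CRAN-friendly tempdir() default if the user did not set one
if (!exists("spark.sql.default.warehouse.dir", envir = sparkConfigMap)) {
  assign("spark.sql.default.warehouse.dir", tempdir(), envir = sparkConfigMap)
}

get("spark.sql.default.warehouse.dir", envir = sparkConfigMap)  # "/tmp/user-chosen-dir"
```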
@@ -2165,6 +2165,14 @@ test_that("SQL error message is returned from JVM", {
  expect_equal(grepl("blah", retError), TRUE)
})

test_that("Default warehouse dir should be set to tempdir", {
  # nothing should be written outside tempdir() without explicit user permission
  initial_working_directory_files <- list.files()
If the warehouse dir ("spark-warehouse") is already there before running this test, then the list of files won't change?
Does Jenkins start with a new workspace every time it runs a test?
I'm referring to other tests in this test file, test_sparkSQL, that call APIs which might already initialize the warehouse dir. sparkR.session() is called at the top. Does createOrReplaceTempView cause the warehouse dir to be created?
From my test, the spark-warehouse directory is created when I run a <- createDataFrame(iris), so I think by the time this test is run the directory would already be there.
Yeah. I think @felixcheung's point is right - the dir should be created early on. Also, I think we sometimes configure hive.metastore.dir in our tests, so I don't see it come up when I run the test cases. I'll try to see if we can design a different test case.
I refactored this test to recursively list all the files and check whether spark-warehouse is in the list. Another thing we could do is check whether the specific table is in it.
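Roughly, the refactored check might look like the following (a testthat-style sketch; the actual test added in this PR appears in the diff further down):

```r
library(testthat)

test_that("Default warehouse dir should be set to tempdir", {
  # Recursively look for a spark-warehouse directory anywhere under the
  # current working directory; with the tempdir() default nothing should match
  res <- list.files(path = ".", pattern = ".*spark-warehouse.*",
                    recursive = TRUE, include.dirs = TRUE)
  expect_equal(length(res), 0)
})
```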
Test build #70171 has finished for PR 16290 at commit
@@ -55,14 +55,19 @@ private[sql] class SharedState(val sparkContext: SparkContext) extends Logging {
        s"is set. Setting ${WAREHOUSE_PATH.key} to the value of " +
        s"hive.metastore.warehouse.dir ('$hiveWarehouseDir').")
      hiveWarehouseDir
    } else {
    } else if (sparkContext.conf.contains(WAREHOUSE_PATH.key) &&
        sparkContext.conf.get(WAREHOUSE_PATH).isDefined) {
Nit: the indentation is not right.
Indented 4 spaces now
If the default database has already been created in the metastore, any following changes to spark.sql.warehouse.dir will not take effect on it. The directory of managed data source tables is created by Hive; when creating a new data source table, the created directory is based on the current value of spark.sql.warehouse.dir. @dilipbiswal hit this issue very recently. Below shows the location of these two tables.
Test build #70181 has started for PR 16290 at commit
retest this please
Test build #70199 has finished for PR 16290 at commit
Ugh - The failure seems to be from
Test build #70236 has started for PR 16290 at commit
    .stringConf
    .createWithDefault(Utils.resolveURI("spark-warehouse").toString)

  val WAREHOUSE_PATH = buildConf("spark.sql.warehouse.dir")
    .doc("The location for managed databases and tables.")
The description is not right. spark.sql.warehouse.dir is still the default location when we create a database/table without providing a location value.
That's a good point. I misunderstood the meaning of default there. Will fix this now.
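For context, this is the behavior being described: when no location is given, the database and table directories land under whatever spark.sql.warehouse.dir currently points to. The names below are made up for illustration:

```r
library(SparkR)
sparkR.session()

# No LOCATION clause is provided, so both directories are created under the
# current value of spark.sql.warehouse.dir
sql("CREATE DATABASE demo_db")
sql("CREATE TABLE demo_db.people (name STRING, age INT) USING parquet")
```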
@gatorsmile Thanks for taking a look. Addressed your comments now. Let's see if Jenkins passes.
Test build #70269 has finished for PR 16290 at commit
@felixcheung @bdwyer2 Could you take another look?
# Create a temporary table
sql("CREATE TABLE people_warehouse_test")
# spark-warehouse should be written only to tempdir() and not the current working directory
res <- list.files(path = ".", pattern = ".*spark-warehouse.*",
Should we test to make sure that no files are created during this process, instead of only checking for spark-warehouse?
Well - given that this PR is only changing the warehouse dir, I think it's only fair to test for that. Or in other words, adding such a test would fail now because of derby.log etc. (per our JIRA discussion)?
@@ -221,6 +221,19 @@ class SQLConfSuite extends QueryTest with SharedSQLContext {
      .sessionState.conf.warehousePath.stripSuffix("/"))
  }

test("changing default value of warehouse path") { |
Currently, this test case only covers one of the four cases: spark.sql.default.warehouse.dir is set and spark.sql.warehouse.dir is not set. We also need to check the other three cases:
- spark.sql.default.warehouse.dir is not set and spark.sql.warehouse.dir is not set
- spark.sql.default.warehouse.dir is set and spark.sql.warehouse.dir is set
- spark.sql.default.warehouse.dir is not set and spark.sql.warehouse.dir is set
Good point. Added tests for all 4 cases now
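From the SparkR user's point of view, the "explicitly set" cases boil down to the following (a hedged sketch with a hypothetical path):

```r
library(SparkR)

# An explicitly supplied spark.sql.warehouse.dir should win over the
# tempdir() default that sparkR.session() would otherwise install
sparkR.session(spark.sql.warehouse.dir = "/tmp/my-warehouse")

# Expected to show the user-provided path rather than a path under tempdir()
collect(sql("SET spark.sql.warehouse.dir"))
```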
@@ -819,7 +819,13 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging {

  def variableSubstituteDepth: Int = getConf(VARIABLE_SUBSTITUTE_DEPTH)

  def warehousePath: String = new Path(getConf(StaticSQLConf.WAREHOUSE_PATH)).toString
  def warehousePath: String = {
    if (contains(StaticSQLConf.WAREHOUSE_PATH.key)) {
What is the reason we are not doing the same check here as in the other place?
Nice catch - Added the same check here as well
@@ -964,10 +970,16 @@ object StaticSQLConf {
    }
  }

val DEFAULT_WAREHOUSE_PATH = buildConf("spark.sql.default.warehouse.dir") |
Should we make it internal?
I am not familiar with this part of the code base - what are the consequences of making it internal? Is it just in terms of what shows up in documentation, or does it affect how users can use it?
An internal configuration will not be printed out. For example, you can try something like:
spark.sql("SET -v").show(numRows = 200, truncate = false)
sparkContext.conf.set("spark.sql.default.warehouse.dir", newWarehouseDefaultPath)
val spark = new SparkSession(sparkContext)
assert(newWarehouseDefaultPath.stripSuffix("/") === spark
  .sessionState.conf.warehousePath.stripSuffix("/"))
We also need a check for spark.sharedState.warehousePath, because we made the logic changes there.
Done
I finished my review. cc @cloud-fan
Thanks @gatorsmile - Addressed your comments.
Test build #70314 has finished for PR 16290 at commit
retest this please
Test build #70316 has finished for PR 16290 at commit
# Create a temporary table
sql("CREATE TABLE people_warehouse_test")
# spark-warehouse should be written only to tempdir() and not the current working directory
res <- list.files(path = ".", pattern = ".*spark-warehouse.*",
I was testing this and it looked like the current directory (.) was SPARK_HOME/R/pkg/inst/tests/testthat, and spark-warehouse would have been in SPARK_HOME/R.
Ah I see - will check this today. I think if SPARK_HOME is accessible I can just call list.files with that as the path.
I think a couple of other test files would have called sparkR.session already (binary_function, binaryFile, broadcast), so I'd propose adding a new test explicitly named to make sure it is run the very first, i.e. https://github.com/apache/spark/pull/16330/files#diff-5ff1ba5d1751f3b1cc96a567e9ab25ffR18
I'm not sure why it needs to run first? Because the default warehouse dir is in tempdir, even if sparkR.session is called earlier it shouldn't create any warehouse dir in SPARK_HOME/, right?
You are right. In this case we are specifically looking for spark-warehouse; I guess I was referring to a general check to make sure that, going forward, the list of files before running anything in the package equals the list of files after.
I think my bigger concern is that the tests are usually run all at once - i.e. core, sql, hive and then python, R - and there are no guarantees that other module tests won't create files inside SPARK_HOME, afaik. So while we can check some basic things with our test, I don't think verifying a global property is always possible.
That I agree with completely.
@gatorsmile The test error in HiveSparkSubmitSuite seems related. I am debugging locally
@gatorsmile I think I figured out the problem. Now the warehouse path correctly points to the expected location, but it seems to me that the right solution here is to also clear the stale state in the test at [1].

[1] spark/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala, line 824 in 1e5c51f
retest this please
I checked the two most recent failed test cases in Jenkins. They are not related to the changes in this PR. In my local environment, I can reproduce the error you mentioned above. I assume what you are referring to is the test case with
assert(new Path(defaultDbLocation) == new Path(spark.sharedState.warehousePath))
I printed them out and found they are actually different.
The location of the default database is still pointing to the original warehouse path. Also cc @yhuai
After some research, the simplest way to avoid this flaky test case is to remove the contents in
Test build #70335 has finished for PR 16290 at commit
Generally, it looks good to me. The only concern is the external behavior change. Since Spark 2.0, users have been able to get the warehouse location in the following two ways:
spark.sql("set spark.sql.warehouse.dir").show()
println(spark.conf.get("spark.sql.warehouse.dir"))
After this PR, the output will be
@cloud-fan Any thoughts on this?
R/pkg/R/sparkR.R
Outdated
@@ -376,6 +377,12 @@ sparkR.session <- function(
    overrideEnvs(sparkConfigMap, paramMap)
  }

  # NOTE(shivaram): Set default warehouse dir to tmpdir to meet CRAN requirements
  # See SPARK-18817 for more details
  if (!exists("spark.sql.default.warehouse.dir", envir = sparkConfigMap)) {
After rethinking it, we might not need to add an extra SQL conf. We just need to know whether the value of spark.sql.warehouse.dir comes from the user or is the original default. If it is the default, R can simply change it.
Maybe it is a good-to-have feature for users to know whether a SQLConf value comes from the user or from the default. cc @cloud-fan
Actually we can: SessionState.conf.settings contains all the user-set entries.
Ah I see - I will try to use SessionState and see if it can avoid having to create a new option.
Also add a unit test that checks that a new table is created in tmpdir
Test build #73973 has started for PR 16290 at commit
@gatorsmile @cloud-fan @felixcheung I looked at the SharedState code more closely and it looks like the only time the warehousePath can be set is when the shared state is initialized. So I modified the code to set the temp dir for SparkR during SparkSession initialization if it has not already been set. I verified that this works locally and with a unit test -- is there anything I might be missing with this approach?
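A sketch of how this could be verified interactively from a SparkR shell (hedged; the expected results assume the approach described above):

```r
library(SparkR)
sparkR.session()

# Exercise the code paths that would normally create the warehouse directory
df <- createDataFrame(faithful)
createOrReplaceTempView(df, "faithful_tmp")

# The warehouse location should resolve to somewhere under tempdir(), and no
# spark-warehouse directory should appear in the current working directory
collect(sql("SET spark.sql.warehouse.dir"))
list.files(pattern = "spark-warehouse", include.dirs = TRUE)  # expect character(0)
```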
Test build #74074 has started for PR 16290 at commit
so based on this comment #16330 (comment)
Test FAILed. |
@felixcheung Yeah I think that sounds good. I can close this PR for now and we can revisit this if we have an issue in the future. Also I guess you will update #16330 for the derby.log change. Does this sound okay?
Yes that's the plan.
What changes were proposed in this pull request?
This PR sets the default warehouse dir to a temporary directory in SparkR to avoid creating directories in the working directory (see JIRA for more details).
To do this we introduce a new SQL config that is used to configure the default. For all other frontends, existing behavior is maintained.
How was this patch tested?
Running unit tests locally. Manually tested with the SparkR shell.