[SPARK-18817][SPARKR][SQL] change derby log output to temp dir #16330

Closed
wants to merge 7 commits
15 changes: 14 additions & 1 deletion R/pkg/R/sparkR.R
@@ -322,10 +322,19 @@ sparkRHive.init <- function(jsc = NULL) {
#' SparkSession or initializes a new SparkSession.
#' Additional Spark properties can be set in \code{...}, and these named parameters take priority
#' over values in \code{master}, \code{appName}, and named lists of \code{sparkConfig}.
#' When called in an interactive session, this checks for the Spark installation, and, if not
#'
#' When called in an interactive session, this method checks for the Spark installation, and, if not
#' found, it will be downloaded and cached automatically. Alternatively, \code{install.spark} can
#' be called manually.
#'
#' A default warehouse is created automatically in the current directory when a managed table is
#' created via a \code{sql} statement such as \code{CREATE TABLE}. To change the location of the
#' warehouse, pass the named parameter \code{spark.sql.warehouse.dir} to the SparkSession. Along with
#' the warehouse, an accompanying metastore may also be created automatically in the current
#' directory when a new SparkSession is initialized with \code{enableHiveSupport} set to
#' \code{TRUE}, which is the default. For more details, refer to the Hive configuration at
#' \url{http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables}.
#'
#' For details on how to initialize and use SparkR, refer to the SparkR programming guide at
#' \url{http://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession}.
#'
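As a usage aside (not part of this diff), a minimal sketch of how the warehouse location could be overridden when starting a session; the path below is hypothetical and any writable directory would do:

library(SparkR)
# named parameters passed to sparkR.session() become Spark properties,
# so the warehouse can be relocated away from the current directory
sparkR.session(appName = "warehouse-example",
               spark.sql.warehouse.dir = "/tmp/my-spark-warehouse",
               enableHiveSupport = TRUE)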
@@ -381,6 +390,10 @@ sparkR.session <- function(
deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
}

if (!exists("spark.r.sql.derby.temp.dir", envir = sparkConfigMap)) {
sparkConfigMap[["spark.r.sql.derby.temp.dir"]] <- tempdir()
}

if (!exists(".sparkRjsc", envir = .sparkREnv)) {
retHome <- sparkCheckInstall(sparkHome, master, deployMode)
if (!is.null(retHome)) sparkHome <- retHome
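With the hunk above, spark.r.sql.derby.temp.dir defaults to tempdir(), so the Derby log no longer lands in the working directory. A sketch of overriding that default, assuming the property is passed through like any other named parameter; the path is made up:

# hypothetical override; when omitted, the value defaults to tempdir()
sparkR.session(spark.r.sql.derby.temp.dir = "/tmp/derby-tmp",
               enableHiveSupport = TRUE)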
34 changes: 34 additions & 0 deletions R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -60,6 +60,7 @@ unsetHiveContext <- function() {

# Tests for SparkSQL functions in SparkR

filesBefore <- list.files(path = sparkRDir, all.files = TRUE)
sparkSession <- sparkR.session()
sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", "getJavaSparkContext", sparkSession)

@@ -2909,6 +2910,39 @@ test_that("Collect on DataFrame when NAs exists at the top of a timestamp column
expect_equal(class(ldf3$col3), c("POSIXct", "POSIXt"))
})

compare_list <- function(list1, list2) {
# get testthat to show the diff by first making the 2 lists equal in length
expect_equal(length(list1), length(list2))
l <- max(length(list1), length(list2))
Contributor
The lengths should be equal if we get to this line? Or am I missing something?

Member Author
@felixcheung Mar 12, 2017

the idea is to show enough information in the log without having to rerun the check manually to find out what is different.

the first check will show the numeric values, but it wouldn't say how exactly they are different.
the next check (moved into compare_list() here) will get testthat to dump the delta too, but first it must make the 2 lists the same size.

in fact, all of these are well tested in the "Check masked functions" test in test_context.R, just duplicated here.

Member Author

here's what it looks like

1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (@test_sparkSQL.R#2917)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 22 - 23 == -1


2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (@test_sparkSQL.R#2917)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
3/23 mismatches
x[21]: "unit-tests.out"
y[21]: "spark-warehouse"

x[22]: "WINDOWS.md"
y[22]: "unit-tests.out"

x[23]: NA
y[23]: "WINDOWS.md"

Contributor

Got it - that sounds good

length(list1) <- l
length(list2) <- l
expect_equal(sort(list1, na.last = TRUE), sort(list2, na.last = TRUE))
}
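To make the NA-padding trick discussed above concrete, here is a small standalone sketch (not part of the diff) with made-up vectors; equalizing the lengths lets expect_equal print the element-level mismatches rather than just the length difference:

library(testthat)
before <- c("DESCRIPTION", "NAMESPACE", "WINDOWS.md")
after <- c("DESCRIPTION", "NAMESPACE", "WINDOWS.md", "spark-warehouse")
l <- max(length(before), length(after))
length(before) <- l   # pads the shorter character vector with NA
length(after) <- l
# this deliberately fails, and testthat reports exactly which entries differ,
# similar to the failure output pasted in the review comment above
expect_equal(sort(before, na.last = TRUE), sort(after, na.last = TRUE))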

# This should always be the **very last test** in this test file.
test_that("No extra files are created in SPARK_HOME by starting session and making calls", {
# Check that it is not creating any extra file.
# Does not check the tempdir which would be cleaned up after.
filesAfter <- list.files(path = sparkRDir, all.files = TRUE)

expect_true(length(sparkRFilesBefore) > 0)
# first, ensure derby.log is not there
expect_false("derby.log" %in% filesAfter)
# second, ensure only spark-warehouse is created when calling SparkSession, enableHiveSupport = F
Contributor

I'm a little confused how these two setdiff commands map to with or without hive support. Can we make this a bit easier to understand?

Member Author

agreed.
updated, hope it's better now.

# note: currently all other test files have enableHiveSupport = F, so we capture the list of files
# before creating a SparkSession with enableHiveSupport = T at the top of this test file
# (filesBefore). The test here is to compare that (filesBefore) against the list of files before
# any test is run in run-all.R (sparkRFilesBefore).
# sparkRWhitelistSQLDirs is also defined in run-all.R and should contain only 2 whitelisted dirs.
# Here we allow the first value, spark-warehouse, in the diff; everything else should be exactly
# the same as before any test is run.
compare_list(sparkRFilesBefore, setdiff(filesBefore, sparkRWhitelistSQLDirs[[1]]))
# third, ensure only spark-warehouse and metastore_db are created when enableHiveSupport = T
# note: as noted above, after running all tests in this file with enableHiveSupport = T, we
# check the list of files again. This time we allow both whitelisted dirs to be in the diff.
compare_list(sparkRFilesBefore, setdiff(filesAfter, sparkRWhitelistSQLDirs))
})
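For reference, a tiny illustration (made-up listings) of how setdiff strips the whitelisted dirs before the comparison:

filesAfter <- c("DESCRIPTION", "NAMESPACE", "spark-warehouse", "metastore_db")
sparkRWhitelistSQLDirs <- c("spark-warehouse", "metastore_db")
setdiff(filesAfter, sparkRWhitelistSQLDirs)
# [1] "DESCRIPTION" "NAMESPACE"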

unlink(parquetPath)
unlink(orcPath)
unlink(jsonPath)
6 changes: 6 additions & 0 deletions R/pkg/tests/run-all.R
@@ -22,6 +22,12 @@ library(SparkR)
options("warn" = 2)

# Setup global test environment
sparkRDir <- file.path(Sys.getenv("SPARK_HOME"), "R")
sparkRFilesBefore <- list.files(path = sparkRDir, all.files = TRUE)
sparkRWhitelistSQLDirs <- c("spark-warehouse", "metastore_db")
invisible(lapply(sparkRWhitelistSQLDirs,
function(x) { unlink(file.path(sparkRDir, x), recursive = TRUE, force = TRUE)}))

install.spark()

test_package("SparkR")
9 changes: 9 additions & 0 deletions core/src/main/scala/org/apache/spark/api/r/RRDD.scala
@@ -17,6 +17,7 @@

package org.apache.spark.api.r

import java.io.File
import java.util.{Map => JMap}

import scala.collection.JavaConverters._
@@ -127,6 +128,14 @@ private[r] object RRDD {
sparkConf.setExecutorEnv(name.toString, value.toString)
}

if (sparkEnvirMap.containsKey("spark.r.sql.derby.temp.dir") &&
System.getProperty("derby.stream.error.file") == null) {
// This must be set before SparkContext is instantiated.
System.setProperty("derby.stream.error.file",
Seq(sparkEnvirMap.get("spark.r.sql.derby.temp.dir").toString, "derby.log")
.mkString(File.separator))
}

val jsc = new JavaSparkContext(sparkConf)
jars.foreach { jar =>
jsc.addJar(jar)
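The Scala hunk above sets derby.stream.error.file before the SparkContext is created, which is what redirects derby.log. From the R side, the effect could be checked roughly as follows (a sketch assuming a Hive-enabled session with default settings; the results are what the change is expected to produce, not verified output):

library(SparkR)
sparkR.session(enableHiveSupport = TRUE)
# touching the metastore initializes Derby
sql("CREATE TABLE IF NOT EXISTS people (name STRING, age INT)")
# derby.log should now be under the session temp dir, not the working dir
file.exists(file.path(tempdir(), "derby.log"))  # expected TRUE
file.exists(file.path(getwd(), "derby.log"))    # expected FALSE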