[SPARK-20159][SPARKR][SQL] Support all catalog API in R #17483
```r
#' @param tableName a name of the table.
#' @param path the path of files to load.
#' @param source the name of external data source.
#' @param schema the schema of the data for certain data source.
```
this is added to createExternalTable (it won't work with the json source otherwise); everything else is simply moved as-is from SQLContext.R
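As an illustration of why the schema parameter matters here (a minimal sketch; the table name, file path, and columns are hypothetical, and an active SparkR session is assumed):

```r
# Sketch only: assumes sparkR.session() has been started and a JSON file
# exists at the (hypothetical) path below.
schema <- structType(structField("name", "string"),
                     structField("age", "integer"))

# Without an explicit schema, the json source cannot lay out the table,
# so createExternalTable would fail for this source.
people <- createExternalTable("people", path = "/tmp/people.json",
                              source = "json", schema = schema)
```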
Test build #75393 has finished for PR 17483 at commit
cc @gatorsmile for any SQL-specific input.

@felixcheung I will take a look at this later today. Meanwhile, in the PR description can you note down the new functions being added in this change?
```r
dataFrame(sdf)
}

createExternalTable <- function(x, ...) {
```
Just FYI, createExternalTable is deprecated; see the PR: #16528. Let me make the corresponding changes in SQLContext too.
yes, I'm aware. I'm not sure we need to make changes to SQLContext - schema is required for the json source, but in Scala it's fine to use createTable instead.
It makes sense to add it in R here though, because:
- there is no createTable method (hmm, reviewing the PR you reference, maybe we should add it)
- createTable just sounds too generic and too much like many existing R methods (in R, table is everywhere!), so I wasn't sure it was a good idea to add it in R
- createExternalTable since 2.0 is decoupled from SQLContext and SparkSession - it doesn't take either as a parameter, and it calls into the catalog
I agree that createTable sounds very general, but I don't think it's used by base R or any popular R package?
createExternalTable is misleading now, because the table could be managed if users did not provide a value for path. Thus, we decided to rename it.
right, I was just concerned that with data.table, read.table, etc., table == data.frame in R, as opposed to a hive table or managed table, which could be fairly confusing.
anyway, I think I'll follow up with a PR for createTable.
but as of now, path is optional for createExternalTable - even though it's potentially misleading, it does work now.
If you drop a managed table, both the data and the metadata are deleted; if you drop an external table, only the metadata is deleted. An external table is a way to protect data against accidental drop commands.
Thus, it is a pretty important concept. It could be either a Hive table or a Spark-native one.
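A minimal sketch of that distinction (the table names and paths are hypothetical, and an active SparkR session is assumed): supplying path makes the table external, so dropping it leaves the files in place.

```r
# External table: data lives at the user-supplied path;
# DROP TABLE removes only the catalog metadata.
createExternalTable("logs_ext", path = "/data/logs", source = "parquet")
sql("DROP TABLE logs_ext")       # files under /data/logs survive

# Managed table (no path): Spark owns the storage;
# DROP TABLE deletes data and metadata together.
sql("CREATE TABLE logs_managed (line STRING)")
sql("DROP TABLE logs_managed")   # both data and metadata are removed
```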
updated PR description!

btw, @gatorsmile it looks like
Test build #75417 has finished for PR 17483 at commit
Test build #75418 has finished for PR 17483 at commit
Test build #75420 has finished for PR 17483 at commit
Test build #75422 has finished for PR 17483 at commit
Some comments. The change looks good otherwise.
```diff
@@ -645,16 +645,17 @@ test_that("test tableNames and tables", {
   df <- read.json(jsonPath)
   createOrReplaceTempView(df, "table1")
   expect_equal(length(tableNames()), 1)
-  tables <- tables()
+  tables <- listTables()
```
is tables() deprecated now?
right, there are some differences in the output (most notably, catalog.listTables returns a Dataset&lt;Table&gt; - but I'm converting that into a DataFrame anyway), and I thought the list* names would be more consistent with other methods like listColumns(), listDatabases().

```
> head(tables("default"))
  database tableName isTemporary
1  default      json       FALSE
Warning message:
'tables' is deprecated.
Use 'listTables' instead.
See help("Deprecated")

> head(listTables("default"))
  name database description tableType isTemporary
1 json  default        <NA>  EXTERNAL       FALSE
```

If you think it makes sense, we could make tables an alias of listTables - it's going to call slightly different code on the Scala side, and there are new columns and one different column name being returned.
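The alias idea under discussion could look roughly like this (a sketch only, not the committed code; the deprecation call is illustrative):

```r
# Route the old entry point to the catalog-backed implementation,
# warning callers that 'tables' is on its way out.
tables.default <- function(databaseName = NULL) {
  .Deprecated("listTables")
  listTables(databaseName)   # returns a SparkDataFrame of catalog tables
}
```

Note the caveat from the output above: the alias would return the new column set (name, description, tableType), not the old (database, tableName, isTemporary) layout.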
changed.
```r
             "org.apache.spark.sql.catalyst.expressions.Abs")
  expect_error(listFunctions("foo_db"),
               "Error in listFunctions : analysis error - Database 'foo_db' does not exist")
})
```
We don't have tests for recoverPartitions, refreshByPath and refreshTable?
sharp eyes :) I was planning to add tests.
I tested these manually, but the steps are more involved, and since these are only thin wrappers in R, I think we should defer to the Scala tests.
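The kind of manual verification described above might look like this (a sketch; the table name, schema, and paths are hypothetical, and an active SparkR session is assumed):

```r
# recoverPartitions: rediscovers partition directories added outside Spark.
sql("CREATE TABLE events (id INT, ds STRING) USING parquet PARTITIONED BY (ds)")
# ... copy a new partition directory, e.g. ds=2017-04-01, into the table
#     location outside of Spark ...
recoverPartitions("events")   # the new partition becomes queryable

# refreshTable / refreshByPath: invalidate cached metadata after the
# underlying files change.
refreshTable("events")
refreshByPath("/user/hive/warehouse/events")
```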
```diff
@@ -846,6 +846,24 @@ captureJVMException <- function(e, method) {
     # Extract the first message of JVM exception.
     first <- strsplit(msg[2], "\r?\n\tat")[[1]][1]
     stop(paste0(rmsg, "analysis error - ", first), call. = FALSE)
+  } else
+  if (any(grep("org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: ", stacktrace))) {
```
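The grep-on-stacktrace pattern in that hunk can be illustrated standalone (a sketch; the stack trace string below is fabricated for illustration, and the exact message wording differs from the committed code):

```r
# Map a JVM stack trace onto a friendlier R error, in the style of
# captureJVMException.
stacktrace <- paste0(
  "org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: ",
  "Database 'foo_db' not found;\n\tat org.apache.spark.sql.SomeClass.run")

if (any(grep("org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: ",
             stacktrace))) {
  # Keep only the first message line, dropping the "\n\tat ..." frames.
  msg <- strsplit(stacktrace, "\r?\n\tat")[[1]][1]
  first <- sub(".*NoSuchDatabaseException: ", "", msg)
  stop(paste0("no such database - ", first), call. = FALSE)
}
```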
Yes. I knew it. See the JIRA: https://issues.apache.org/jira/browse/SPARK-19952. @hvanhovell plans to remove it in 2.3.0
ok, thanks
…s(), test for recoverPartitions/refresh*
Test build #75447 has finished for PR 17483 at commit
@gatorsmile I added a line about recoverPartitions; I think we should also be clearer in the other language bindings?
Test build #75451 has finished for PR 17483 at commit
Test build #75452 has finished for PR 17483 at commit
merged to master. thanks for the review!
…createExternalTable

## What changes were proposed in this pull request?

Following up on apache#17483, add createTable (which is new in 2.2.0) and deprecate createExternalTable, plus a number of minor fixes.

## How was this patch tested?

manual, unit tests

Author: Felix Cheung &lt;felixcheung_m@hotmail.com&gt;

Closes apache#17511 from felixcheung/rceatetable.
```r
#' @note tables since 1.4.0
tables.default <- function(databaseName = NULL) {
  sparkSession <- getSparkSession()
  jdf <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", "getTables", sparkSession, databaseName)
```
if I am not mistaken, the method getTables() is not used any more in R. Can we remove it from r.SQLUtils?

spark/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala, lines 219 to 226 in e8982ca:

```scala
def getTables(sparkSession: SparkSession, databaseName: String): DataFrame = {
  databaseName match {
    case n: String if n != null && n.trim.nonEmpty =>
      Dataset.ofRows(sparkSession, ShowTablesCommand(Some(n), None))
    case _ =>
      Dataset.ofRows(sparkSession, ShowTablesCommand(None, None))
  }
}
```

cc @HyukjinKwon
Yeah, I saw the PR. LGTM
### What changes were proposed in this pull request?

Remove the unused method `getTables()` from `r.SQLUtils`. The method was used before the changes in #17483, but R's `tables.default` was rewritten using `listTables()`: https://github.com/apache/spark/pull/17483/files#diff-2c01472a7bcb1d318244afcd621d726e00d36cd15dffe7e44fa96c54fce4cd9aR220-R223

### Why are the changes needed?

To improve code maintenance and remove dead code.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

By R tests.

Closes #30527 from MaxGekk/remove-getTables-in-r-SQLUtils.

Authored-by: Max Gekk &lt;max.gekk@gmail.com&gt;
Signed-off-by: Dongjoon Hyun &lt;dongjoon@apache.org&gt;
### What changes were proposed in this pull request?

Add a set of catalog API in R
https://github.com/apache/spark/pull/17483/files#diff-6929e6c5e59017ff954e110df20ed7ff

### How was this patch tested?

manual tests, unit tests
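For orientation, a quick tour of the catalog API surface this change exposes in R (function names are from the PR; the database, table, and path arguments below are hypothetical, and an active SparkR session is assumed):

```r
currentDatabase()               # name of the current database
setCurrentDatabase("default")   # switch the current database
listDatabases()                 # SparkDataFrame of databases
listTables("default")           # SparkDataFrame of tables in a database
listColumns("json", "default")  # columns of a table
listFunctions()                 # registered functions
recoverPartitions("events")     # sync partition info for an external table
refreshTable("events")          # invalidate cached metadata for a table
refreshByPath("/data/events")   # invalidate cache entries under a path
```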