
[SPARK-20159][SPARKR][SQL] Support all catalog API in R #17483


Closed
felixcheung wants to merge 8 commits into master from the rcatalog branch

Conversation

felixcheung (Member) commented Mar 30, 2017

What changes were proposed in this pull request?

Add a set of catalog APIs in R:

"currentDatabase",
"listColumns",
"listDatabases",
"listFunctions",
"listTables",
"recoverPartitions",
"refreshByPath",
"refreshTable",
"setCurrentDatabase",

https://github.com/apache/spark/pull/17483/files#diff-6929e6c5e59017ff954e110df20ed7ff
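For illustration, a rough sketch of how these functions are called from SparkR once this change is in (table, database, and path names below are placeholders, not from this PR):

```r
# Illustrative only: exercising the new SparkR catalog APIs.
sparkR.session()

currentDatabase()                      # e.g. "default"
setCurrentDatabase("default")
listDatabases()                        # returns a SparkDataFrame
listTables("default")                  # tables in the given database
listColumns("people", "default")       # columns of a table
listFunctions()                        # registered functions
recoverPartitions("partitioned_tbl")   # sync partition info into the catalog
refreshTable("people")                 # invalidate cached metadata for a table
refreshByPath("/tmp/people_data")      # invalidate caches for a path
```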

How was this patch tested?

manual tests, unit tests

#' @param tableName a name of the table.
#' @param path the path of files to load.
#' @param source the name of external data source.
#' @param schema the schema of the data for certain data source.
felixcheung (author):
this is added to createExternalTable (won't work with json source otherwise)
everything else is simply moved as-is from SQLContext.R
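As a hedged example of the call this enables (the file path and fields are made up; the point is that a JSON source needs the schema argument added here):

```r
# Hypothetical usage: creating an external table over JSON files from R,
# which requires passing an explicit schema.
schema <- structType(structField("name", "string"), structField("age", "integer"))
people <- createExternalTable(tableName = "people",
                              path = "/tmp/people.json",
                              source = "json",
                              schema = schema)
```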

SparkQA commented Mar 30, 2017

Test build #75393 has finished for PR 17483 at commit e4808b6.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

shivaram (Contributor):

cc @gatorsmile for any SQL-specific inputs.

@felixcheung I will take a look at this later today. Meanwhile, in the PR description, can you note down the new functions being added in this change?

dataFrame(sdf)
}

createExternalTable <- function(x, ...) {
Member:

Just FYI, createExternalTable is deprecated. See the PR: #16528

Let me make the corresponding changes in SQLContext too.

felixcheung (author) commented Mar 31, 2017:

Yes, I'm aware. I'm not sure if we need to make changes to SQLContext - the schema is required for a JSON source, but in Scala it's fine to use createTable instead.

It makes sense to add it in R here, though, because:

  • there is no createTable method (hmm, reviewing that PR you reference, maybe we should add it)
  • createTable just sounds too generic and too much like many existing R methods (in R, table is everywhere!), so I wasn't sure it's a good idea to add it in R
  • createExternalTable since 2.0 is decoupled from SQLContext and SparkSession - it doesn't take either as a parameter, and it calls into the catalog

Contributor:

I agree that createTable sounds very general, but I don't think it's used by base R or any popular R package?

Member:

createExternalTable is misleading now, because the table could be managed if users did not provide a value for path. Thus, we decided to rename it.

felixcheung (author):

Right, I was just concerned that with data.table, read.table, etc., table == data.frame in R, as opposed to a Hive table or managed table, which could be fairly confusing.
Anyway, I think I'll follow up with a PR for createTable, but as of now path is optional for createExternalTable; even though that's potentially misleading, it does work now.

Member:

If you drop a managed table, both the data and the metadata are deleted; if you drop an external table, only the metadata is deleted. An external table is a way to protect data against accidental drop commands.

Thus, it is a pretty important concept. It can be either a Hive table or a Spark-native one.
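To make the distinction concrete, a small illustration via SparkR's sql() (table names and location are made up):

```r
# Illustration only: dropping a managed vs. an external table.
sql("CREATE TABLE managed_tbl (id INT) USING parquet")
sql("CREATE TABLE ext_tbl (id INT) USING parquet LOCATION '/tmp/ext_tbl_data'")

sql("DROP TABLE managed_tbl")  # metadata and data files are deleted
sql("DROP TABLE ext_tbl")      # only metadata is deleted; /tmp/ext_tbl_data remains
```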

felixcheung (author) commented Mar 31, 2017:

updated PR description!

felixcheung (author) commented Mar 31, 2017:

btw, @gatorsmile it looks like listColumns should throw NoSuchTableException and/or NoSuchDatabaseException instead of AnalysisException here

SparkQA commented Mar 31, 2017

Test build #75417 has finished for PR 17483 at commit 5ab5834.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 31, 2017

Test build #75418 has finished for PR 17483 at commit 28195b9.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 31, 2017

Test build #75420 has finished for PR 17483 at commit 9c768ae.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 31, 2017

Test build #75422 has finished for PR 17483 at commit 5093891.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

shivaram (Contributor) left a review:

Some comments. Change looks good otherwise

@@ -645,16 +645,17 @@ test_that("test tableNames and tables", {
   df <- read.json(jsonPath)
   createOrReplaceTempView(df, "table1")
   expect_equal(length(tableNames()), 1)
-  tables <- tables()
+  tables <- listTables()
Contributor:

Is tables() deprecated now?

felixcheung (author) commented Apr 1, 2017:

Right, there are some differences in the output (most notably, catalog.listTables returns a Dataset<Table> - but I'm converting that into a DataFrame anyway), and I thought list* would be more consistent with other methods like listColumns(), listDatabases().

> head(tables("default"))
  database tableName isTemporary
1  default      json       FALSE
Warning message:
'tables' is deprecated.
Use 'listTables' instead.
See help("Deprecated")
> head(listTables("default"))
  name database description tableType isTemporary
1 json  default        <NA>  EXTERNAL       FALSE

If you think it makes sense, we could make tables an alias of listTables - it's going to call slightly different code on the Scala side and there are new columns and one different column name being returned.
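A minimal sketch of what such an alias could look like (illustrative only, not necessarily what ends up committed):

```r
# Hypothetical alias: keep tables() working but point users at listTables(),
# matching the deprecation warning shown above.
tables <- function(databaseName = NULL) {
  .Deprecated("listTables")
  listTables(databaseName)
}
```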

felixcheung (author):

changed.

"org.apache.spark.sql.catalyst.expressions.Abs")
expect_error(listFunctions("foo_db"),
"Error in listFunctions : analysis error - Database 'foo_db' does not exist")
})
Contributor:

We don't have tests for recoverPartitions, refreshByPath, and refreshTable?

felixcheung (author):

sharp eyes :) I was planning to add tests.

I tested these manually, but the steps are more involved, and since these are only thin wrappers in R, I think we should defer to the Scala tests.
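For reference, a simplified sketch of the thin-wrapper shape being described (the actual implementations live in R/catalog.R of this PR; the internal helpers named here are assumed from elsewhere in SparkR):

```r
# Simplified sketch: the R function just forwards to the JVM-side catalog.
recoverPartitions <- function(tableName) {
  sparkSession <- getSparkSession()
  catalog <- callJMethod(sparkSession, "catalog")
  invisible(handledCallJMethod(catalog, "recoverPartitions", tableName))
}
```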


@@ -846,6 +846,24 @@ captureJVMException <- function(e, method) {
    # Extract the first message of JVM exception.
    first <- strsplit(msg[2], "\r?\n\tat")[[1]][1]
    stop(paste0(rmsg, "analysis error - ", first), call. = FALSE)
  } else
  if (any(grep("org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: ", stacktrace))) {
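For context, a hedged sketch of how the body of this new branch could look, mirroring the analysis-error branch just above (the exact message prefix may differ from what was committed):

```r
# Sketch only: translate a JVM NoSuchDatabaseException into a concise R error,
# following the same strsplit/stop pattern as the analysis-error case.
if (any(grep("org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: ", stacktrace))) {
  msg <- strsplit(stacktrace,
                  "org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: ",
                  fixed = TRUE)[[1]]
  # Extract the first message of the JVM exception.
  first <- strsplit(msg[2], "\r?\n\tat")[[1]][1]
  stop(paste0(rmsg, "no such database - ", first), call. = FALSE)
}
```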
Member:

Yes. I knew it. See the JIRA: https://issues.apache.org/jira/browse/SPARK-19952. @hvanhovell plans to remove it in 2.3.0

felixcheung (author):

ok, thanks

SparkQA commented Apr 1, 2017

Test build #75447 has finished for PR 17483 at commit 3c66930.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

felixcheung (author) commented Apr 1, 2017:

@gatorsmile I added a line about recoverPartitions; I think we should also be clearer about it in the other language bindings?

Also opened https://issues.apache.org/jira/browse/SPARK-20188.

SparkQA commented Apr 1, 2017

Test build #75451 has finished for PR 17483 at commit 2aea0cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 1, 2017

Test build #75452 has finished for PR 17483 at commit aff13a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

felixcheung (author):
merged to master. thanks for the review!

asfgit closed this in 93dbfe7 on Apr 2, 2017.
ghost pushed a commit to dbtsai/spark that referenced this pull request Apr 6, 2017
…createExternalTable

## What changes were proposed in this pull request?

Following up on apache#17483, add createTable (which is new in 2.2.0) and deprecate createExternalTable, plus a number of minor fixes

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes apache#17511 from felixcheung/rceatetable.
#' @note tables since 1.4.0
tables.default <- function(databaseName = NULL) {
  sparkSession <- getSparkSession()
  jdf <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", "getTables", sparkSession, databaseName)
MaxGekk (Member) commented Nov 28, 2020:

If I am not mistaken, the method getTables() is not used anymore in R. Can we remove it from r.SQLUtils?

def getTables(sparkSession: SparkSession, databaseName: String): DataFrame = {
  databaseName match {
    case n: String if n != null && n.trim.nonEmpty =>
      Dataset.ofRows(sparkSession, ShowTablesCommand(Some(n), None))
    case _ =>
      Dataset.ofRows(sparkSession, ShowTablesCommand(None, None))
  }
}

cc @HyukjinKwon

Member:

Yeah, I saw the PR. LGTM

dongjoon-hyun pushed a commit that referenced this pull request Nov 29, 2020
### What changes were proposed in this pull request?
Remove the unused method `getTables()` from `r.SQLUtils`. The method was used before the changes in #17483, but R's `tables.default` was rewritten using `listTables()`: https://github.com/apache/spark/pull/17483/files#diff-2c01472a7bcb1d318244afcd621d726e00d36cd15dffe7e44fa96c54fce4cd9aR220-R223

### Why are the changes needed?
To improve code maintenance and remove dead code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By R tests.

Closes #30527 from MaxGekk/remove-getTables-in-r-SQLUtils.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
felixcheung deleted the rcatalog branch on February 6, 2021.