-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-25446][R] Add schema_of_json() and schema_of_csv() to R #22939
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
cc @felixcheung and @MaxGekk |
c4a78fc
to
52bae78
Compare
This comment has been minimized.
This comment has been minimized.
R/pkg/R/functions.R
Outdated
#' the same options as the JSON/CSV data source. Additionally \code{to_json} supports | ||
#' the "pretty" option which enables pretty JSON generation. In \code{arrays_zip}, | ||
#' this contains additional Columns of arrays to be merged. | ||
#' @param ... additional argument(s). In \code{to_json}, \code{from_json} and |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
Test build #98446 has finished for PR 22939 at commit
|
retest this please |
R/pkg/R/functions.R
Outdated
#' also supported for the schema. | ||
#' \item \code{from_csv}: a DDL-formatted string | ||
#' also supported for the schema. Since Spark 3.0, \code{schema_of_json} or | ||
#' a DDL-formatted string literal can also be accepted. |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
Shall we make it as separate PR? |
Makes sense. Let me separate it tomorrow. |
Test build #98449 has finished for PR 22939 at commit
|
5f0a3b6
to
3416ac7
Compare
Will make another PR after this gets merged to allow the cases below: df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_json(df$people))
head(select(df, from_json(df$people_json, schema_of_json(head(df)$people_json))))
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_csv(df$people))
head(select(df, from_csv(df$people_json, schema_of_csv(head(df)$people_json))))
|
Test build #98457 has finished for PR 22939 at commit
|
R/pkg/R/functions.R
Outdated
#' @examples | ||
#' | ||
#' \dontrun{ | ||
#' json <- '{"name":"Bob"}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should avoid mixing '
and "
in doc
#' \item \code{to_json}, \code{from_json} and \code{schema_of_json}: this contains | ||
#' additional named properties to control how it is converted and accepts the | ||
#' same options as the JSON data source. | ||
#' \item \code{to_json}: it supports the "pretty" option which enables pretty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, how does pretty
work? is it pretty = TRUE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's there before but I'd like to suggest to give an example - doc or code example below.
it's a bit different from python/scala I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I added an example
R/pkg/R/functions.R
Outdated
#' @examples | ||
#' | ||
#' \dontrun{ | ||
#' csv <- "'Amsterdam,2018'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I"m a bit confused "'Amsterdam,2018'"
vs "Amsterdam,2018"
does the latter work?
if (class(x) == "character") { | ||
col <- callJStatic("org.apache.spark.sql.functions", "lit", x) | ||
} else { | ||
col <- x@jc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the use when x is a Column?
schema_of_csv(lit("Amsterdam,2018")))
seems a bit odd to me...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's actually related with Scala API. There are too many overridden versions of functions in function.scala
so we're trying to reduce it. Column is preferred over other specific types because Column can cover other expression cases.. in Python or R, they can be easily supported so other types and column are all supported. To cut it short, for consistency with Scala API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok but one use could be
select(df, schema_of_csv(df$schemaCol))
like an actual col not a literal string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea .. that was discussed at #22775. The usecase of schema_of_csv
or schema_of_json
will usually be like .. copying and pasting one record from the actual data manually. That's disallowed for now conservatively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are saying this select(df, schema_of_csv(df$schemaCol))
is not allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, lit
usage already works in many APIs although it looks a bit odd .. should be okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just that I thought the shortcut syntax in scala is nicer looking then lit("string")
in R
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm .. do you mind if we go ahead for this one and talk later within 3.0? I think we're going to deal with this (general) problem within 3.0 if I am not mistaken. I need to make one followup after this anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe to think about the design of API in R and Scala and else where - what does it look like when the user passes in a column that is not a literal string? probably worthwhile to follow up separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I agree. It will throw an analysis exception in that case. I also sympathize the concerns here and somewhat we're unclear about this - so I just wanted to make it restricted in general for now.
I'm going to open another PR related with this as a followup (for #22939 (comment)). While I am there, I will test when the user passes in a column that is not a literal string.
Let me clean up and deal with other comments. |
Test build #98470 has finished for PR 22939 at commit
|
Test build #98550 has finished for PR 22939 at commit
|
Hey @felixcheung, it should be ready for another look. |
gentle ping, @felixcheung. |
Sorry for the delay, will do another pass in 1 or 2 days
|
Sure! |
Hey @felixcheung, have you found some time to take a look for this please? |
if (class(x) == "character") { | ||
col <- callJStatic("org.apache.spark.sql.functions", "lit", x) | ||
} else { | ||
col <- x@jc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe to think about the design of API in R and Scala and else where - what does it look like when the user passes in a column that is not a literal string? probably worthwhile to follow up separately.
Thank you, @felixcheung for approving this. |
Merged to master. |
@felixcheung, I tested when the user passes in a column that is not a literal string, and it shows the results as below:
|
Error looks reasonable...
|
## What changes were proposed in this pull request? This PR proposes to expose `schema_of_json` and `schema_of_csv` at R side. **`schema_of_json`**: ```r json <- '{"name":"Bob"}' df <- sql("SELECT * FROM range(1)") head(select(df, schema_of_json(json))) ``` ``` schema_of_json({"name":"Bob"}) 1 struct<name:string> ``` **`schema_of_csv`**: ```r csv <- "Amsterdam,2018" df <- sql("SELECT * FROM range(1)") head(select(df, schema_of_csv(csv))) ``` ``` schema_of_csv(Amsterdam,2018) 1 struct<_c0:string,_c1:int> ``` ## How was this patch tested? Manually tested, unit tests added, documentation manually built and verified. Closes apache#22939 from HyukjinKwon/SPARK-25446. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>
@HyukjinKwon can we add support for jsons where value has null: Example: json <- '{"firstName":"Bob", "lastName":null}' I am thinking that schema_of_json function can take an argument which casts all null as string ?
|
What changes were proposed in this pull request?
This PR proposes to expose
schema_of_json
andschema_of_csv
at R side.schema_of_json
:schema_of_csv
:How was this patch tested?
Manually tested, unit tests added, documentation manually built and verified.