Skip to content

[SPARK-25446][R] Add schema_of_json() and schema_of_csv() to R #22939

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Nov 4, 2018

What changes were proposed in this pull request?

This PR proposes to expose schema_of_json and schema_of_csv at R side.

schema_of_json:

json <- '{"name":"Bob"}'
df <- sql("SELECT * FROM range(1)")
head(select(df, schema_of_json(json)))
  schema_of_json({"name":"Bob"})
1            struct<name:string>

schema_of_csv:

csv <- "Amsterdam,2018"
df <- sql("SELECT * FROM range(1)")
head(select(df, schema_of_csv(csv)))
  schema_of_csv(Amsterdam,2018)
1    struct<_c0:string,_c1:int>

How was this patch tested?

Manually tested, unit tests added, documentation manually built and verified.

@SparkQA

This comment has been minimized.

@HyukjinKwon
Copy link
Member Author

cc @felixcheung and @MaxGekk

@SparkQA

This comment has been minimized.

#' the same options as the JSON/CSV data source. Additionally \code{to_json} supports
#' the "pretty" option which enables pretty JSON generation. In \code{arrays_zip},
#' this contains additional Columns of arrays to be merged.
#' @param ... additional argument(s). In \code{to_json}, \code{from_json} and

This comment was marked as resolved.

This comment was marked as resolved.

@SparkQA
Copy link

SparkQA commented Nov 4, 2018

Test build #98446 has finished for PR 22939 at commit 5f0a3b6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

retest this please

#' also supported for the schema.
#' \item \code{from_csv}: a DDL-formatted string
#' also supported for the schema. Since Spark 3.0, \code{schema_of_json} or
#' a DDL-formatted string literal can also be accepted.

This comment was marked as resolved.

@viirya
Copy link
Member

viirya commented Nov 4, 2018

In addition, it also proposes to make from_csv and from_json accept structType, DDL-formatted string, DDL-formatted string literal, and schema_of_[csv|json] as schema so that we can utilise both schema_of_json and schema_of_csv.

Shall we make it as separate PR?

@HyukjinKwon
Copy link
Member Author

Makes sense. Let me separate it tomorrow.

@SparkQA
Copy link

SparkQA commented Nov 4, 2018

Test build #98449 has finished for PR 22939 at commit 5f0a3b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Will make another PR after this gets merged to allow the cases below:

df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_json(df$people))
head(select(df, from_json(df$people_json, schema_of_json(head(df)$people_json))))
  from_json(people_json)
1                    Bob
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_csv(df$people))
head(select(df, from_csv(df$people_json, schema_of_csv(head(df)$people_json))))
  from_csv(people_json)
1                   Bob

@SparkQA
Copy link

SparkQA commented Nov 5, 2018

Test build #98457 has finished for PR 22939 at commit c0a9384.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' @examples
#'
#' \dontrun{
#' json <- '{"name":"Bob"}'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should avoid mixing ' and " in doc

#' \item \code{to_json}, \code{from_json} and \code{schema_of_json}: this contains
#' additional named properties to control how it is converted and accepts the
#' same options as the JSON data source.
#' \item \code{to_json}: it supports the "pretty" option which enables pretty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, how does pretty work? is it pretty = TRUE?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's there before but I'd like to suggest to give an example - doc or code example below.
it's a bit different from python/scala I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I added an example

#' @examples
#'
#' \dontrun{
#' csv <- "'Amsterdam,2018'"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I"m a bit confused "'Amsterdam,2018'" vs "Amsterdam,2018"
does the latter work?

if (class(x) == "character") {
col <- callJStatic("org.apache.spark.sql.functions", "lit", x)
} else {
col <- x@jc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the use when x is a Column?
schema_of_csv(lit("Amsterdam,2018"))) seems a bit odd to me...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's actually related with Scala API. There are too many overridden versions of functions in function.scala so we're trying to reduce it. Column is preferred over other specific types because Column can cover other expression cases.. in Python or R, they can be easily supported so other types and column are all supported. To cut it short, for consistency with Scala API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok but one use could be

select(df, schema_of_csv(df$schemaCol))

like an actual col not a literal string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea .. that was discussed at #22775. The usecase of schema_of_csv or schema_of_json will usually be like .. copying and pasting one record from the actual data manually. That's disallowed for now conservatively.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are saying this select(df, schema_of_csv(df$schemaCol)) is not allowed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, lit usage already works in many APIs although it looks a bit odd .. should be okay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just that I thought the shortcut syntax in scala is nicer looking then lit("string") in R

Copy link
Member Author

@HyukjinKwon HyukjinKwon Nov 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm .. do you mind if we go ahead for this one and talk later within 3.0? I think we're going to deal with this (general) problem within 3.0 if I am not mistaken. I need to make one followup after this anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe to think about the design of API in R and Scala and else where - what does it look like when the user passes in a column that is not a literal string? probably worthwhile to follow up separately.

Copy link
Member Author

@HyukjinKwon HyukjinKwon Nov 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I agree. It will throw an analysis exception in that case. I also sympathize the concerns here and somewhat we're unclear about this - so I just wanted to make it restricted in general for now.

I'm going to open another PR related with this as a followup (for #22939 (comment)). While I am there, I will test when the user passes in a column that is not a literal string.

@HyukjinKwon
Copy link
Member Author

Let me clean up and deal with other comments.

@SparkQA
Copy link

SparkQA commented Nov 5, 2018

Test build #98470 has finished for PR 22939 at commit c582757.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 7, 2018

Test build #98550 has finished for PR 22939 at commit 7c8e226.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Hey @felixcheung, it should be ready for another look.

@HyukjinKwon
Copy link
Member Author

gentle ping, @felixcheung.

@felixcheung
Copy link
Member

felixcheung commented Nov 21, 2018 via email

@HyukjinKwon
Copy link
Member Author

Sure!

@HyukjinKwon
Copy link
Member Author

Hey @felixcheung, have you found some time to take a look for this please?

if (class(x) == "character") {
col <- callJStatic("org.apache.spark.sql.functions", "lit", x)
} else {
col <- x@jc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe to think about the design of API in R and Scala and else where - what does it look like when the user passes in a column that is not a literal string? probably worthwhile to follow up separately.

@HyukjinKwon
Copy link
Member Author

Thank you, @felixcheung for approving this.

@HyukjinKwon
Copy link
Member Author

Merged to master.

@asfgit asfgit closed this in 66b2046 Nov 30, 2018
@HyukjinKwon
Copy link
Member Author

@felixcheung, I tested when the user passes in a column that is not a literal string, and it shows the results as below:

> json <- '{"name":"Bob"}'
> df <- sql("SELECT * FROM range(1)")
> head(select(df, schema_of_json(df$id)))
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_json(`id`)' due to data type mismatch: The input json should be a string literal and not null; however, got `id`.;;
'Project [schema_of_json(id#0L) AS schema_of_json(id)#2]
+- Project [id#0L]
   +- Range (0, 1, step=1, splits=None)

	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
...
> csv <- "Amsterdam,2018"
> df <- sql("SELECT * FROM range(1)")
> head(select(df, schema_of_csv(df$id)))
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_csv(`id`)' due to data type mismatch: The input csv should be a string literal and not null; however, got `id`.;;
'Project [schema_of_csv(id#3L) AS schema_of_csv(id)#5]
+- Project [id#3L]
   +- Range (0, 1, step=1, splits=None)

	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
...

@felixcheung
Copy link
Member

felixcheung commented Nov 30, 2018 via email

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

This PR proposes to expose `schema_of_json` and `schema_of_csv` at R side.

**`schema_of_json`**:

```r
json <- '{"name":"Bob"}'
df <- sql("SELECT * FROM range(1)")
head(select(df, schema_of_json(json)))
```

```
  schema_of_json({"name":"Bob"})
1            struct<name:string>
```

**`schema_of_csv`**:

```r
csv <- "Amsterdam,2018"
df <- sql("SELECT * FROM range(1)")
head(select(df, schema_of_csv(csv)))
```

```
  schema_of_csv(Amsterdam,2018)
1    struct<_c0:string,_c1:int>
```

## How was this patch tested?

Manually tested, unit tests added, documentation manually built and verified.

Closes apache#22939 from HyukjinKwon/SPARK-25446.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
@SnchitGrover
Copy link

SnchitGrover commented Dec 10, 2019

@HyukjinKwon can we add support for jsons where value has null:

Example: json <- '{"firstName":"Bob", "lastName":null}'

I am thinking that schema_of_json function can take an argument which casts all null as string ?

def schema_of_json(json: Column, cast_null_as_string : Option[Boolean] ) = new SchemaOfJson(json.expr, cast_null_as_string.getOrElse(false))

@HyukjinKwon HyukjinKwon deleted the SPARK-25446 branch March 3, 2020 01:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants