[SPARK-25446][R] Add schema_of_json() and schema_of_csv() to R #22939

HyukjinKwon · 2018-11-04T09:16:32Z

What changes were proposed in this pull request?

This PR proposes to expose schema_of_json and schema_of_csv on the R side.

schema_of_json:

json <- '{"name":"Bob"}'
df <- sql("SELECT * FROM range(1)")
head(select(df, schema_of_json(json)))

  schema_of_json({"name":"Bob"})
1            struct<name:string>

schema_of_csv:

csv <- "Amsterdam,2018"
df <- sql("SELECT * FROM range(1)")
head(select(df, schema_of_csv(csv)))

  schema_of_csv(Amsterdam,2018)
1    struct<_c0:string,_c1:int>

How was this patch tested?

Manually tested, unit tests added, documentation manually built and verified.

HyukjinKwon · 2018-11-04T09:23:03Z

cc @felixcheung and @MaxGekk

R/pkg/R/functions.R

-#'            the same options as the JSON/CSV data source. Additionally \code{to_json} supports
-#'            the "pretty" option which enables pretty JSON generation. In \code{arrays_zip},
-#'            this contains additional Columns of arrays to be merged.
+#' @param ... additional argument(s). In \code{to_json}, \code{from_json} and


SparkQA · 2018-11-04T12:07:13Z

Test build #98446 has finished for PR 22939 at commit 5f0a3b6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-11-04T12:25:32Z

retest this please

R/pkg/R/functions.R

-#'              also supported for the schema.
-#'          \item \code{from_csv}: a DDL-formatted string
+#'              also supported for the schema. Since Spark 3.0, \code{schema_of_json} or
+#'              a DDL-formatted string literal can also be accepted.


viirya · 2018-11-04T13:58:47Z

In addition, it also proposes to make from_csv and from_json accept structType, DDL-formatted string, DDL-formatted string literal, and schema_of_[csv|json] as schema so that we can utilise both schema_of_json and schema_of_csv.

Shall we make it as separate PR?

HyukjinKwon · 2018-11-04T14:29:23Z

Makes sense. Let me separate it tomorrow.

SparkQA · 2018-11-04T16:01:22Z

Test build #98449 has finished for PR 22939 at commit 5f0a3b6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-11-05T03:09:06Z

Will make another PR after this gets merged to allow the cases below:

df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_json(df$people))
head(select(df, from_json(df$people_json, schema_of_json(head(df)$people_json))))

  from_json(people_json)
1                    Bob

df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_csv(df$people))
head(select(df, from_csv(df$people_json, schema_of_csv(head(df)$people_json))))

  from_csv(people_json)
1                   Bob

SparkQA · 2018-11-05T03:49:38Z

Test build #98457 has finished for PR 22939 at commit c0a9384.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2018-11-05T07:09:00Z

R/pkg/R/functions.R

+#' @examples
+#'
+#' \dontrun{
+#' json <- '{"name":"Bob"}'


I think we should avoid mixing ' and " in doc

felixcheung · 2018-11-05T07:11:34Z

R/pkg/R/functions.R

+#'          \item \code{to_json}, \code{from_json} and \code{schema_of_json}: this contains
+#'              additional named properties to control how it is converted and accepts the
+#'              same options as the JSON data source.
+#'          \item \code{to_json}: it supports the "pretty" option which enables pretty


actually, how does pretty work? is it pretty = TRUE?

I know it's there before but I'd like to suggest to give an example - doc or code example below.
it's a bit different from python/scala I think

OK. I added an example

felixcheung · 2018-11-05T07:12:47Z

R/pkg/R/functions.R

+#' @examples
+#'
+#' \dontrun{
+#' csv <- "'Amsterdam,2018'"


I'm a bit confused: "'Amsterdam,2018'" vs "Amsterdam,2018"
does the latter work?

felixcheung · 2018-11-05T07:15:54Z

R/pkg/R/functions.R

+            if (class(x) == "character") {
+              col <- callJStatic("org.apache.spark.sql.functions", "lit", x)
+            } else {
+              col <- x@jc


what's the use when x is a Column?
schema_of_csv(lit("Amsterdam,2018")) seems a bit odd to me...

That's actually related to the Scala API. There are too many overloaded versions of functions in functions.scala, so we're trying to reduce them. Column is preferred over other specific types because Column can cover other expression cases. In Python or R, both can be easily supported, so other types and Column are all supported. In short, it's for consistency with the Scala API.

ok but one use could be

select(df, schema_of_csv(df$schemaCol))

like an actual col not a literal string?

Yea .. that was discussed at #22775. The use case of schema_of_csv or schema_of_json will usually be like .. copying and pasting one record from the actual data manually. Passing an actual, non-literal column is disallowed for now, conservatively.

you are saying this select(df, schema_of_csv(df$schemaCol)) is not allowed?

BTW, lit usage already works in many APIs although it looks a bit odd .. should be okay.

just that I thought the shortcut syntax in Scala is nicer looking than lit("string") in R

Hmm .. do you mind if we go ahead for this one and talk later within 3.0? I think we're going to deal with this (general) problem within 3.0 if I am not mistaken. I need to make one followup after this anyway.

maybe to think about the design of API in R and Scala and elsewhere - what does it look like when the user passes in a column that is not a literal string? probably worthwhile to follow up separately.

Yea, I agree. It will throw an analysis exception in that case. I also sympathize with the concerns here, and we're somewhat unclear about this, so I just wanted to keep it restricted in general for now.

I'm going to open another PR related to this as a followup (for #22939 (comment)). While I am there, I will test what happens when the user passes in a column that is not a literal string.
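
The branch discussed in this thread (wrap a plain character value with `lit`, otherwise take the Column's underlying Java reference) can be sketched in base R without a Spark session. This is only an illustration, not SparkR's implementation: `MockColumn`, `mock_lit`, and `resolve_input` are hypothetical stand-ins, and using `is.character()` instead of `class(x) == "character"` is an assumption about a more idiomatic type check.

```r
library(methods)  # S4 support; attached explicitly so the sketch also runs under Rscript

# Hypothetical stand-in for SparkR's Column, which keeps a Java reference in @jc.
setClass("MockColumn", representation(jc = "character"))

# Hypothetical stand-in for org.apache.spark.sql.functions.lit().
mock_lit <- function(x) paste0("lit(", x, ")")

# Accept either a character string or a (mock) Column.
resolve_input <- function(x) {
  if (is.character(x)) {
    mock_lit(x)                      # plain strings become literal expressions
  } else if (is(x, "MockColumn")) {
    x@jc                             # columns are passed through unchanged
  } else {
    stop("x must be a character string or a MockColumn")
  }
}

from_string <- resolve_input("Amsterdam,2018")
from_column <- resolve_input(new("MockColumn", jc = "lit(Amsterdam,2018)"))
print(identical(from_string, from_column))
```

Both inputs resolve to the same literal expression in this mock, which mirrors why a bare string and a `lit(...)` Column are both accepted by the R wrappers described above.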

HyukjinKwon · 2018-11-05T09:55:36Z

Let me clean up and deal with other comments.

SparkQA · 2018-11-05T16:25:43Z

Test build #98470 has finished for PR 22939 at commit c582757.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-07T12:39:43Z

Test build #98550 has finished for PR 22939 at commit 7c8e226.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-11-09T03:43:03Z

Hey @felixcheung, it should be ready for another look.

HyukjinKwon · 2018-11-21T03:41:08Z

gentle ping, @felixcheung.

felixcheung · 2018-11-21T08:17:13Z

Sorry for the delay, will do another pass in 1 or 2 days

HyukjinKwon · 2018-11-21T08:26:28Z

Sure!

HyukjinKwon · 2018-11-26T16:09:31Z

Hey @felixcheung, have you found some time to take a look at this, please?

felixcheung · 2018-11-27T17:33:15Z

R/pkg/R/functions.R

+            if (class(x) == "character") {
+              col <- callJStatic("org.apache.spark.sql.functions", "lit", x)
+            } else {
+              col <- x@jc


maybe to think about the design of API in R and Scala and elsewhere - what does it look like when the user passes in a column that is not a literal string? probably worthwhile to follow up separately.

HyukjinKwon · 2018-11-30T02:25:50Z

Thank you, @felixcheung for approving this.

HyukjinKwon · 2018-11-30T02:26:01Z

Merged to master.

HyukjinKwon · 2018-11-30T03:58:35Z

@felixcheung, I tested when the user passes in a column that is not a literal string, and it shows the results as below:

> json <- '{"name":"Bob"}'
> df <- sql("SELECT * FROM range(1)")
> head(select(df, schema_of_json(df$id)))
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_json(`id`)' due to data type mismatch: The input json should be a string literal and not null; however, got `id`.;;
'Project [schema_of_json(id#0L) AS schema_of_json(id)#2]
+- Project [id#0L]
   +- Range (0, 1, step=1, splits=None)

	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
...

> csv <- "Amsterdam,2018"
> df <- sql("SELECT * FROM range(1)")
> head(select(df, schema_of_csv(df$id)))
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_csv(`id`)' due to data type mismatch: The input csv should be a string literal and not null; however, got `id`.;;
'Project [schema_of_csv(id#3L) AS schema_of_csv(id)#5]
+- Project [id#3L]
   +- Range (0, 1, step=1, splits=None)

	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
...

felixcheung · 2018-11-30T06:12:16Z

Error looks reasonable...

## What changes were proposed in this pull request?

This PR proposes to expose `schema_of_json` and `schema_of_csv` at R side.

**`schema_of_json`**:

```r
json <- '{"name":"Bob"}'
df <- sql("SELECT * FROM range(1)")
head(select(df, schema_of_json(json)))
```

```
  schema_of_json({"name":"Bob"})
1            struct<name:string>
```

**`schema_of_csv`**:

```r
csv <- "Amsterdam,2018"
df <- sql("SELECT * FROM range(1)")
head(select(df, schema_of_csv(csv)))
```

```
  schema_of_csv(Amsterdam,2018)
1    struct<_c0:string,_c1:int>
```

## How was this patch tested?

Manually tested, unit tests added, documentation manually built and verified.

Closes apache#22939 from HyukjinKwon/SPARK-25446.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

SnchitGrover · 2019-12-10T10:00:01Z

@HyukjinKwon can we add support for JSON where a value is null?

Example: json <- '{"firstName":"Bob", "lastName":null}'

I am thinking that the schema_of_json function could take an argument which casts all nulls as string?

def schema_of_json(json: Column, cast_null_as_string : Option[Boolean] ) = new SchemaOfJson(json.expr, cast_null_as_string.getOrElse(false))

HyukjinKwon force-pushed the SPARK-25446 branch from c4a78fc to 52bae78 on November 4, 2018 09:25

MaxGekk reviewed Nov 4, 2018

viirya reviewed Nov 4, 2018

HyukjinKwon added 2 commits November 5, 2018 10:35

Add schema_of_json() and schema_of_csv() to R

2e1d693

Address comments

3416ac7

HyukjinKwon force-pushed the SPARK-25446 branch from 5f0a3b6 to 3416ac7 on November 5, 2018 03:05

Remove another test

c0a9384

felixcheung reviewed Nov 5, 2018

Address Felix's comments

c582757

Address felix's comments

7c8e226

felixcheung approved these changes Nov 27, 2018

asfgit closed this in 66b2046 Nov 30, 2018

HyukjinKwon deleted the SPARK-25446 branch March 3, 2020 01:20
