
[SPARK-17699] Support for parsing JSON string columns #15274


Closed · wants to merge 3 commits

Conversation

marmbrus (Contributor)

Spark SQL has great support for reading text files that contain JSON data. However, in many cases the JSON data is just one column amongst others. This is particularly true when reading from sources such as Kafka. This PR adds a new function, from_json, that converts a string column into a nested StructType with a user-specified schema.

Example usage:

val df = Seq("""{"a": 1}""").toDS()
val schema = new StructType().add("a", IntegerType)

df.select(from_json($"value", schema) as 'json) // => [json: <a: int>]

This PR adds support for Java, Scala, and Python. I leveraged our existing JSON parsing support by moving it into Catalyst (so that we could define expressions using it). I left SQL out for now, because I'm not sure how users would specify a schema.
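For readers without a Spark session at hand, here is a Spark-free sketch of the semantics described above. The helper `parse_json_column` is hypothetical (not Spark code); the point is that each string is parsed against a declared schema, and unparseable rows become NULL:

```python
import json

# Spark-free sketch of from_json's behavior: parse a JSON string
# against a declared list of schema fields, returning None (SQL NULL)
# when the input cannot be parsed. `parse_json_column` is a
# hypothetical helper, not part of Spark's API.
def parse_json_column(value, schema_fields):
    try:
        obj = json.loads(value)
    except (ValueError, TypeError):
        return None  # unparseable input becomes NULL
    # keep only the fields named in the schema, like a StructType projection
    return {name: obj.get(name) for name in schema_fields}

rows = ['{"a": 1}', 'not json', '{"a": 2, "b": 3}']
print([parse_json_column(r, ["a"]) for r in rows])
# -> [{'a': 1}, None, {'a': 2}]
```

The real implementation reuses Spark's existing JSON parser (moved into Catalyst) rather than re-parsing per field like this sketch does.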

@SparkQA

SparkQA commented Sep 28, 2016

Test build #66016 has finished for PR 15274 at commit 62f56a7.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * Converts a JSON input string to a [[StructType]] with the specified schema.
 */
case class JsonToStruct(schema: StructType, options: Map[String, String], child: Expression)
Contributor

@hvanhovell, Sep 28, 2016

Should this implement ExpectsInputTypes?

Contributor Author

Ah, yes, it definitely should. Let me update.

@rxin
Contributor

rxin commented Sep 28, 2016

Might want to send a dev list email to solicit feedback on the API?

@marmbrus
Contributor Author

Emailed the list. Seems like a popular feature so far :)

@SparkQA

SparkQA commented Sep 28, 2016

Test build #66048 has finished for PR 15274 at commit 983def2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 28, 2016

Test build #66052 has finished for PR 15274 at commit 360b97b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@marmbrus I just wonder if adding to_json makes sense (although maybe it should be done in another PR). Just curious; I am imagining the case of writing out dataframes via data sources that do not support nested structured types.
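The suggestion above is the inverse direction: serialize a nested struct column to a JSON string so it can be written to sinks without nested-type support. A minimal Spark-free sketch (the helper name `struct_to_json` is an assumption, not Spark's API):

```python
import json

# Sketch of the proposed to_json inverse: flatten a nested struct
# (modeled here as a dict) into a JSON string, so that a sink lacking
# nested types can store it as a plain string column.
# `struct_to_json` is a hypothetical helper, not Spark's API.
def struct_to_json(row):
    return json.dumps(row, sort_keys=True)

nested = {"a": 1, "b": {"c": 2}}
print(struct_to_json(nested))  # {"a": 1, "b": {"c": 2}}
```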

@yhuai
Contributor

yhuai commented Sep 29, 2016

LGTM. Merging to master.

@asfgit asfgit closed this in fe33121 Sep 29, 2016
@marmbrus
Contributor Author

@HyukjinKwon absolutely. I actually changed the name from json_parser to from_json in anticipation of adding to_json :)

@DanielMe

DanielMe commented Oct 17, 2016

@marmbrus: Is there any workaround I can use to achieve a similar effect in 1.6?

@yhuai
Contributor

yhuai commented Oct 17, 2016

@DanielMe The best options for 1.6 are get_json_object and json_tuple (their docs can be found at https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.functions$).

@DanielMe

@yhuai thanks! My impression was that get_json_object does not convert JSON arrays to ArrayTypes; maybe I misunderstood how it is supposed to be used, though.

@yhuai
Contributor

yhuai commented Oct 18, 2016

@DanielMe oh, I see. get_json_object will not parse a JSON array; you need a UDF to do that in Spark 1.6.
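The UDF workaround mentioned above boils down to a plain parsing function that you register with the SQL context. A sketch of such a function (the registration call is only indicated in a comment, since it needs a running Spark 1.6 session):

```python
import json

# The kind of function one would register as a UDF in Spark 1.6 to
# parse a JSON-array column, since get_json_object cannot produce an
# ArrayType. Returns None (SQL NULL) for anything that is not a
# well-formed JSON array.
def parse_json_array(value):
    try:
        arr = json.loads(value)
    except (ValueError, TypeError):
        return None
    return arr if isinstance(arr, list) else None

# In PySpark 1.6 (not runnable here), roughly:
#   sqlContext.udf.register("parse_json_array", parse_json_array,
#                           ArrayType(IntegerType()))
print(parse_json_array("[1, 2, 3]"))  # [1, 2, 3]
```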

@gatorsmile
Member

gatorsmile commented Jan 30, 2017

Actually, to specify the schema in the SQL language, maybe we can use a JSON string. A little bit odd, and so far nobody has asked for it, I guess. Let us see whether users need it in SQL.

@Sazpaimon

@gatorsmile Alternatively, one can do what brickhouse's from_json Hive UDF does ( https://gist.github.com/jeromebanks/8855408#file-gistfile1-sql )

(For the record, I actually need this in SQL)

@gatorsmile
Member

Based on the comment by @marmbrus in a JIRA, we prefer to use our DDL format. For example, as we did for CREATE TABLE, we can specify the schema using a int, b string.
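To make the DDL-format idea concrete, a naive sketch of turning a CREATE TABLE-style column list into (name, type) pairs. The comma-and-whitespace splitting here is an assumption for flat schemas only; a real parser has to handle nested types such as array&lt;int&gt; (later Spark releases expose a proper parser for this format, e.g. StructType.fromDDL):

```python
# Naive sketch of parsing a DDL-style schema string like the column
# list of CREATE TABLE. Splitting on commas only works for flat
# schemas; nested types such as array<int> or struct<a:int> would
# need a real parser.
def parse_ddl_schema(ddl):
    fields = []
    for part in ddl.split(","):
        name, type_name = part.strip().split(None, 1)
        fields.append((name, type_name))
    return fields

print(parse_ddl_schema("a int, b string"))
# -> [('a', 'int'), ('b', 'string')]
```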

9 participants