
[SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD #1598


Closed
wants to merge 27 commits into from

Conversation

davies
Contributor

@davies davies commented Jul 26, 2014

Convert each Row in a JavaSchemaRDD into an Array[Any], unpickle it as a tuple in Python, then convert it into a namedtuple, so users can access fields just like attributes.

This lets nested structures be accessed as objects; it also reduces the size of the serialized data and gives better performance.

root
 |-- field1: integer (nullable = true)
 |-- field2: string (nullable = true)
 |-- field3: struct (nullable = true)
 |    |-- field4: integer (nullable = true)
 |    |-- field5: array (nullable = true)
 |    |    |-- element: integer (containsNull = false)
 |-- field6: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field7: string (nullable = true)

Then we can access them by row.field3.field5[0] or row.field6[5].field7
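As a plain-Python illustration (using collections.namedtuple to stand in for the generated Row classes; the class and field names below are hypothetical stand-ins, not part of this patch), the attribute-style access looks like:

```python
from collections import namedtuple

# Hypothetical stand-ins for the Row classes that would be
# generated from the schema above.
Field3 = namedtuple("Field3", ["field4", "field5"])
Field6Elem = namedtuple("Field6Elem", ["field7"])
Row = namedtuple("Row", ["field1", "field2", "field3", "field6"])

row = Row(1, "a", Field3(11, [10, 11]), [Field6Elem("x")])

# Nested fields are reachable through plain attribute access:
print(row.field3.field5[0])   # 10
print(row.field6[0].field7)   # x
```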

It also infers the schema in Python, converting Row/dict/namedtuple/objects into tuples before serialization, then calls applySchema in the JVM. During inferSchema(), a top-level dict in a row becomes a StructType, but any nested dictionary becomes a MapType.
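A minimal sketch of just that dict rule (a hypothetical helper, not the actual inferSchema() implementation):

```python
def infer_dict_type(obj, top_level=True):
    # Only the dict rule is modeled here: the top-level dict of a row
    # becomes a StructType; any dict nested inside it becomes a MapType.
    if isinstance(obj, dict):
        if top_level:
            fields = {k: infer_dict_type(v, top_level=False)
                      for k, v in obj.items()}
            return ("StructType", fields)
        return "MapType"
    return type(obj).__name__

print(infer_dict_type({"a": 1, "b": {"x": 2}}))
# ('StructType', {'a': 'int', 'b': 'MapType'})
```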

You can use pyspark.sql.Row to convert an unnamed structure into a Row object, making the RDD's schema inferable. For example:

ctx.inferSchema(rdd.map(lambda x: Row(a=x[0], b=x[1])))

Or you could use Row to create a class just like namedtuple, for example:

Person = Row("name", "age")
ctx.inferSchema(rdd.map(lambda x: Person(*x)))

Also, you can call applySchema to apply a schema to an RDD of tuples/lists and turn it into a SchemaRDD. The schema should be a StructType; see the API docs for details.

schema = StructType([StructField("name", StringType(), True),
                     StructField("age", IntegerType(), True)])
ctx.applySchema(rdd, schema)

PS: In order to use a namedtuple with inferSchema(), you should make the namedtuple picklable.
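As a sketch of that requirement (plain pickle, no Spark involved): a namedtuple defined at module top level round-trips through pickle, because pickle can find the class by its qualified name; namedtuples created dynamically inside a function body are the ones that fail this lookup and need extra work.

```python
import pickle
from collections import namedtuple

# Defining the namedtuple at module top level lets pickle find the
# class by name; a namedtuple created inside a function body would
# fail pickle's qualified-name lookup.
Person = namedtuple("Person", ["name", "age"])

p = Person("Alice", 30)
restored = pickle.loads(pickle.dumps(p))
print(restored)   # Person(name='Alice', age=30)
```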

@SparkQA

SparkQA commented Jul 26, 2014

QA tests have started for PR 1598. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17211/consoleFull

@marmbrus
Contributor

Can you add [SQL] to these PRs as well?

@SparkQA

SparkQA commented Jul 26, 2014

QA results for PR 1598:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class List(list):
class Dict(dict):
class Row(tuple):
class Row(tuple):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17211/consoleFull

@davies davies changed the title [WIP] [SPARK-2010] [PySpark] support nested structure in SchemaRDD [WIP] [SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD Jul 26, 2014
@yhuai
Contributor

yhuai commented Jul 29, 2014

With this PR, what does a StructType represent? namedtuple or array? Do we still keep the Row class in PySpark?

@davies
Contributor Author

davies commented Jul 29, 2014

A StructType is presented as a namedtuple in Python, which is called Row.

The Row class is generated according to the schema; there is no predefined Row class, so it's better to keep it internal.
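A rough sketch of that generation step (hypothetical names; the real implementation builds the class from the full StructType and handles nested fields):

```python
from collections import namedtuple

def create_row_cls(field_names):
    # Build a tuple subclass whose attribute names come from the
    # schema, similar in spirit to generating Row per StructType.
    return namedtuple("Row", field_names)

RowCls = create_row_cls(["f1", "f2", "f3"])
r = RowCls(1, "row1", None)
print(r.f1, r.f2)   # 1 row1
```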

Conflicts:
	python/pyspark/sql.py
	sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala
@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1598. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17397/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1598. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17399/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1598:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class List(list):
class Dict(dict):
class Row(tuple):
class Row(tuple):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17397/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1598:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class List(list):
class Dict(dict):
class Row(tuple):
class Row(tuple):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17399/consoleFull

@davies davies changed the title [WIP] [SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD [SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD (part 1) Jul 30, 2014
@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1598. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17488/consoleFull

@davies
Contributor Author

davies commented Jul 30, 2014

@yhuai @marmbrus @mateiz plz take a look at this, thx!

>>> srdd2.collect()
[Row(f1=1, f2=u'row1', f3=Row(field4=11, field5=None), f4=None), \
Row(f1=2, f2=None, f3=Row(field4=22, field5=[10, 11]), f4=[Row(field7=u'row2')]), \
Row(f1=None, f2=u'row3', f3=Row(field4=33, field5=[]), f4=None)]
Contributor

Breaking the doc comment like this is kind of weird; could you instead do a for r in srdd2.collect(): print r and get one per line?

Contributor Author

good idea!

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1598:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class DataType(object):
class PrimitiveType(DataType):
class StringType(PrimitiveType):
class BinaryType(PrimitiveType):
class BooleanType(PrimitiveType):
class TimestampType(PrimitiveType):
class DecimalType(PrimitiveType):
class DoubleType(PrimitiveType):
class FloatType(PrimitiveType):
class ByteType(PrimitiveType):
class IntegerType(PrimitiveType):
class LongType(PrimitiveType):
class ShortType(PrimitiveType):
class ArrayType(DataType):
class MapType(DataType):
class StructField(DataType):
class StructType(DataType):
class List(list):
class Dict(dict):
class Row(tuple):
class Row(tuple):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17488/consoleFull

cls = _create_cls(self.schema())
return map(cls, rows)

# convert Row in JavaSchemaRDD into namedtuple, let access fields easier
Contributor

You should expand this comment a bit, e.g. "Convert each object in the RDD to a Row with the right class for this SchemaRDD, so that fields can be accessed as attributes." Also this needs to appear in some kind of class comment at the top, e.g. say "This class receives raw tuples from Java but assigns a class to it in all its data-collection methods (mapPartitionsWithIndex, collect, take, etc) so that PySpark sees them as Row objects with named fields".

Contributor Author

thx

@mateiz
Contributor

mateiz commented Jul 30, 2014

Made some comments on it from the Python side. @JoshRosen you may also want to take a look at the named tuple / class generation stuff here.



class StructType(object):
class StructType(DataType):
"""Spark SQL StructType

The data type representing namedtuple values.
Contributor

Should we change it to "The data type representing rows."?

* Convert an RDD of serialized Python tuple to Array (no recursive conversions).
* It is only used by pyspark.sql.
*/
def pythonToJavaArray(pyRDD: JavaRDD[Array[Byte]], batched: Boolean): JavaRDD[Array[_]] = {
Contributor

private[spark]?

Contributor Author

The whole PythonRDD is private, so does it still need this?

Contributor

Ah, I did not realize that. It could still perhaps be marked protected (to prevent other Spark users from depending on it directly), but that's not as big of a deal.

... double=1.0, long=1L, boolean=True, list=[1, 2, 3],
... time=datetime(2010, 1, 1, 1, 1, 1), dict={"a": 1})])
>>> srdd = sqlCtx.inferSchema(allTypes).map(lambda x: (x.int, x.string,
... x.double, x.long, x.boolean, x.time, x.dict["a"], x.list))
Contributor

It would be great to also add a SQL test here to make sure that types are matching up with those expected in the execution engine. (though we might change the names to avoid conflict with reserved words, as we have not implemented identifier escaping). In particular the complex nested ones like dict and list. Also it would be good to add a nested Row to the input types.

Something like:

srdd.registerAsTable("pythonData")
sqlCtx.sql("SELECT dict['a'], list[0], nested.nestedField").collect() ...

@marmbrus
Contributor

marmbrus commented Aug 1, 2014

This is looking really good to me! I'm very excited to have much more complete support for SQL in pyspark. A few minor comments on docs and testing, but I think we can merge this soon.

JMapWrapper(converted)
case (c: java.util.Map[_, _], MapType(keyType, valueType, _)) => c.map {
case (key, value) => (convert(key, keyType), convert(value, valueType))
}.toMap
Contributor

Should we update the part of case (c: java.util.Map[_, _], struct: StructType) as well?

Contributor

Will case (c: java.util.Map[_, _], struct: StructType) happen with your change? How do we handle inner structs?

Contributor Author

A Row() in Python will be converted into a tuple(), so it's fine to remove this case.

@SparkQA

SparkQA commented Aug 1, 2014

QA tests have started for PR 1598. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17700/consoleFull

@SparkQA

SparkQA commented Aug 1, 2014

QA tests have started for PR 1598. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17703/consoleFull

@SparkQA

SparkQA commented Aug 1, 2014

QA results for PR 1598:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class DataType(object):
class PrimitiveType(DataType):
class StringType(PrimitiveType):
class BinaryType(PrimitiveType):
class BooleanType(PrimitiveType):
class TimestampType(PrimitiveType):
class DecimalType(PrimitiveType):
class DoubleType(PrimitiveType):
class FloatType(PrimitiveType):
class ByteType(PrimitiveType):
class IntegerType(PrimitiveType):
class LongType(PrimitiveType):
class ShortType(PrimitiveType):
class ArrayType(DataType):
class MapType(DataType):
class StructField(DataType):
class StructType(DataType):
class List(list):
class Dict(dict):
class Row(tuple):
class Row(tuple):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17700/consoleFull

@SparkQA

SparkQA commented Aug 1, 2014

QA results for PR 1598:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class DataType(object):
class PrimitiveType(DataType):
class StringType(PrimitiveType):
class BinaryType(PrimitiveType):
class BooleanType(PrimitiveType):
class TimestampType(PrimitiveType):
class DecimalType(PrimitiveType):
class DoubleType(PrimitiveType):
class FloatType(PrimitiveType):
class ByteType(PrimitiveType):
class IntegerType(PrimitiveType):
class LongType(PrimitiveType):
class ShortType(PrimitiveType):
class ArrayType(DataType):
class MapType(DataType):
class StructField(DataType):
class StructType(DataType):
class List(list):
class Dict(dict):
class Row(tuple):
class Row(tuple):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17703/consoleFull

@marmbrus
Contributor

marmbrus commented Aug 2, 2014

Thanks for working on this! I've merged it into master.

@asfgit asfgit closed this in 880eabe Aug 2, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014

Author: Davies Liu <davies.liu@gmail.com>

Closes apache#1598 from davies/nested and squashes the following commits:

f1d15b6 [Davies Liu] verify schema with the first few rows
8852aaf [Davies Liu] check type of schema
abe9e6e [Davies Liu] address comments
61b2292 [Davies Liu] add @deprecated to pythonToJavaMap
1e5b801 [Davies Liu] improve cache of classes
51aa135 [Davies Liu] use Row to infer schema
e9c0d5c [Davies Liu] remove string typed schema
353a3f2 [Davies Liu] fix code style
63de8f8 [Davies Liu] fix typo
c79ca67 [Davies Liu] fix serialization of nested data
6b258b5 [Davies Liu] fix pep8
9d8447c [Davies Liu] apply schema provided by string of names
f5df97f [Davies Liu] refactor, address comments
9d9af55 [Davies Liu] use arrry to applySchema and infer schema in Python
84679b3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into nested
0eaaf56 [Davies Liu] fix doc tests
b3559b4 [Davies Liu] use generated Row instead of namedtuple
c4ddc30 [Davies Liu] fix conflict between name of fields and variables
7f6f251 [Davies Liu] address all comments
d69d397 [Davies Liu] refactor
2cc2d45 [Davies Liu] refactor
182fb46 [Davies Liu] refactor
bc6e9e1 [Davies Liu] switch to new Schema API
547bf3e [Davies Liu] Merge branch 'master' into nested
a435b5a [Davies Liu] add docs and code refactor
2c8debc [Davies Liu] Merge branch 'master' into nested
644665a [Davies Liu] use tuple and namedtuple for schemardd
@davies davies deleted the nested branch September 15, 2014 22:18
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
Boson 0.2.5-beta includes the notIn parquet fix:

- build: Upgrade Arrow to 25.0.0 (pie/boson#599)
- feat: Support ansi mode of `sum` kernel (pie/boson#600) 
- build: Upgrade Parquet to 1.12.0.15-dev-apple (pie/boson#602)

Note this only affects Spark when Boson is enabled.

6 participants