@rozza rozza commented Jun 6, 2024

Added a new configuration: schemaHints.
Users can now supply schema hints that enforce the types of known fields when the schema is inferred.

Supports the following Spark formats:

  • DDL: value STRING,count INT
  • SQL DDL: STRUCT<value: STRING, count: INT>
  • Simple String DDL: struct<value:string,count:int>
  • JSON: {"type":"struct","fields":[ {"name":"value","type":"string","nullable":true}, {"name":"count","type":"integer","nullable":true}]}

To create DDL or JSON schema strings, use the Spark shell:

import org.apache.spark.sql.types._
// Both fields are nullable by default.
val mySchema = StructType(Seq(StructField("value", StringType), StructField("count", IntegerType)))

mySchema.toDDL         // DDL
mySchema.sql           // SQL DDL
mySchema.simpleString  // Simple String DDL
mySchema.json          // JSON

Or in PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
mySchema = StructType([StructField('value', StringType(), True), StructField('count', IntegerType(), True)])

mySchema.simpleString()  # Simple String DDL
mySchema.json()          # JSON
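
When no Spark shell is at hand, the two simplest formats listed above can also be assembled by hand. The following is a minimal sketch in plain Python, not part of Spark or the connector: the helper names `schema_hints_json` and `schema_hints_ddl` are hypothetical, and it assumes the JSON shape and the DDL shape are accepted exactly as they appear in the format list above.

```python
import json


# Hypothetical helper (not a Spark or connector API): builds a JSON schema
# string in the same shape as the JSON example in the format list above.
def schema_hints_json(fields):
    """fields: iterable of (name, type_name) pairs, e.g. ("count", "integer")."""
    return json.dumps(
        {
            "type": "struct",
            "fields": [
                {"name": name, "type": type_name, "nullable": True}
                for name, type_name in fields
            ],
        },
        separators=(",", ":"),  # compact output, no spaces after , or :
    )


# Hypothetical helper: builds the plain DDL form, e.g. "value STRING,count INT".
# It does no quoting, so it only suits simple field names.
def schema_hints_ddl(fields):
    return ",".join(f"{name} {ddl_type}" for name, ddl_type in fields)


hints_json = schema_hints_json([("value", "string"), ("count", "integer")])
hints_ddl = schema_hints_ddl([("value", "STRING"), ("count", "INT")])
print(hints_json)
print(hints_ddl)
```

Generating the strings from `mySchema` as shown above remains the safer route for nested or unusual types, since Spark then produces the exact syntax itself.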

SPARK-365

@rozza rozza requested review from a team and vbabanin and removed request for a team June 6, 2024 10:25
rozza and others added 2 commits June 24, 2024 16:29
…maTest.java

Co-authored-by: Viacheslav Babanin <frest0512@gmail.com>
@rozza rozza requested a review from vbabanin June 24, 2024 16:07
@vbabanin

LGTM!

@rozza rozza merged commit de9753b into mongodb:main Jun 25, 2024
@rozza rozza deleted the SPARK-365 branch June 25, 2024 09:00