Added Schema hints for use when inferring schemas. #118

rozza · 2024-06-06T10:24:47Z

Added a new configuration: schemaHints.
Users can now supply schema hints that enforce the types of known fields when a schema is inferred.

Supports the following Spark formats:

- DDL: `value STRING,count INT`
- SQL DDL: `STRUCT<value: STRING, count: INT>`
- Simple String DDL: `struct<value:string,count:int>`
- JSON: `{"type":"struct","fields":[ {"name":"value","type":"string","nullable":true}, {"name":"count","type":"integer","nullable":true}]}`
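For illustration, a hint string in any of these formats might be passed as a read option. This is a minimal sketch, not code from this PR: the helper name `read_with_schema_hints` is hypothetical, the option key is assumed to match the `schemaHints` configuration name, and the Spark session is assumed to already carry the connection settings.

```python
def read_with_schema_hints(spark, schema_hints):
    """Load a DataFrame, asking the connector to apply the given hints.

    `spark` is an existing SparkSession (connection settings assumed to be
    configured on it already). `schema_hints` is a string in one of the
    formats listed above, e.g. "value STRING,count INT".
    """
    return (
        spark.read.format("mongodb")
        # Assumption: the option key matches the configuration name.
        .option("schemaHints", schema_hints)
        .load()
    )
```

With a live session this would be called as `df = read_with_schema_hints(spark, "value STRING,count INT")`, after which `df.printSchema()` can be used to check how the hinted fields were typed.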

To create DDL or JSON schema strings, use the Spark shell:

import org.apache.spark.sql.types._
val mySchema = StructType(Seq(StructField("value", StringType), StructField("count", IntegerType)))

mySchema.toDDL        // DDL
mySchema.sql          // SQL DDL
mySchema.simpleString // Simple String DDL
mySchema.json         // JSON

Or in PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
mySchema = StructType([StructField('value', StringType(), True), StructField('count', IntegerType(), True)])

mySchema.simpleString()
mySchema.json()
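Where no Spark session is at hand, the DDL and JSON strings for a flat schema can also be assembled with plain Python. The sketch below is illustrative only: `FIELDS`, `DDL_NAMES`, `to_ddl` and `to_json` are hypothetical names, and only the two types used in the example above are mapped.

```python
import json

# (name, JSON type name) pairs for the example schema used above.
FIELDS = [("value", "string"), ("count", "integer")]

# DDL spells some types differently from the JSON type names.
# Only the two types from the example are covered here.
DDL_NAMES = {"string": "STRING", "integer": "INT"}


def to_ddl(fields):
    """Build a DDL string such as 'value STRING,count INT'."""
    return ",".join(f"{name} {DDL_NAMES[type_name]}" for name, type_name in fields)


def to_json(fields):
    """Build the JSON form with every field marked nullable."""
    schema = {
        "type": "struct",
        "fields": [
            {"name": name, "type": type_name, "nullable": True}
            for name, type_name in fields
        ],
    }
    # Compact separators keep the string on one line without extra spaces.
    return json.dumps(schema, separators=(",", ":"))


print(to_ddl(FIELDS))
print(to_json(FIELDS))
```

For nested or more complex types, generating the strings from a real `StructType` as shown above remains the safer route.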

SPARK-365


src/main/java/com/mongodb/spark/sql/connector/schema/InferSchema.java

src/test/java/com/mongodb/spark/sql/connector/schema/InferSchemaTest.java


vbabanin · 2024-06-24T16:11:32Z

LGTM!

rozza requested review from a team and vbabanin and removed request for a team June 6, 2024 10:25

rozza force-pushed the SPARK-365 branch from bc52f7d to b7b47d5 Compare June 6, 2024 10:27

vbabanin reviewed Jun 14, 2024


rozza and others added 2 commits June 24, 2024 16:29

Update src/test/java/com/mongodb/spark/sql/connector/schema/InferSche…

1780f68

…maTest.java Co-authored-by: Viacheslav Babanin <frest0512@gmail.com>

Code review updates

b871bfb

rozza requested a review from vbabanin June 24, 2024 16:07

vbabanin approved these changes Jun 24, 2024


rozza merged commit de9753b into mongodb:main Jun 25, 2024

rozza deleted the SPARK-365 branch June 25, 2024 09:00
