Added parse mode support when reading data from MongoDB. #119
Conversation
Adds the `mode` configuration, allowing for different parsing strategies when handling documents that don't match the expected schema during reads. The options are:

- `FAILFAST` (default): throw an exception when parsing a document that doesn't match the schema.
- `PERMISSIVE`: set any invalid fields to `null`. Combine with the `columnNameOfCorruptRecord` configuration if you want to store any invalid documents as an extended JSON string.
- `DROPMALFORMED`: ignore the whole document.

Adds the `columnNameOfCorruptRecord` configuration, which extends the `PERMISSIVE` mode. When configured, it saves the whole invalid document as extended JSON in that column, as long as it's defined in the schema. Inferred schemas will add the `columnNameOfCorruptRecord` column if set and the `mode` is `PERMISSIVE`.

Note: Names derive from the existing Spark JSON configurations, from which this feature takes inspiration.

SPARK-327
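For illustration, a minimal PySpark read using these options might look like the sketch below. It assumes the `mongodb` format and the short option keys named above (`mode`, `columnNameOfCorruptRecord`); the database and collection names are hypothetical, a read connection URI is assumed to be configured elsewhere, and the exact option keys may differ by connector version.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# The corrupt-record column must be part of the schema for it to be populated.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read.format("mongodb")
    .option("database", "test")          # hypothetical database name
    .option("collection", "people")      # hypothetical collection name
    .option("mode", "PERMISSIVE")        # or FAILFAST (default) / DROPMALFORMED
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .schema(schema)
    .load())

# Documents that don't match the schema have their invalid fields set to null,
# with the whole document stored as extended JSON in `_corrupt_record`.
df.show(truncate=False)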
  @ParameterizedTest
- @ValueSource(strings = {"SINGLE", "MULTIPLE", "ALL"})
+ @ValueSource(strings = {"SINGLE", "MULTIPLE"})
This fixes a bug where it was checking the writes against the source db and not the sink db! As you cannot test reading from all collections and writing to that db at the same time, I opted to remove the ALL step. Added SPARK-429 to track.
  final Consumer<MongoConfig> setup,
  final Consumer<MongoConfig>... consumers) {
-   testStreamingQuery(writeFormat, mongoConfig, DEFAULT_SCHEMA, null, setup, consumers);
+   testStreamingQuery(writeFormat, mongoConfig, DEFAULT_SCHEMA, null, null, setup, consumers);
The changes below are test infrastructure related, e.g. adding stream listener support and/or simplifying the testing.
  MongoCollection<BsonDocument> collection = writeConfig.withClient(client -> client
      .getDatabase(writeConfig.getDatabaseName())
-     .getCollection(collectionName(), BsonDocument.class));
+     .getCollection(writeConfig.getCollectionName(), BsonDocument.class));
Use the write config to source the collection name to write to!!
}
options.put(
    ReadConfig.READ_PREFIX + ReadConfig.COLLECTION_NAME_CONFIG, collectionsConfigOptionValue);
options.put(
Sets the write config collection name.
if (!isPermissive) {
  throw e;
}
valueMap.put(field.name(), null);
Currently, if one out of N fields in a subdocument is corrupted, the entire subdocument is marked as null. I'm not entirely sure if this behavior aligns with Spark specifications, but it might be more reasonable to mark only the corrupted fields as such, rather than the whole subdocument. What are your thoughts?
Great question. I initially thought this wasn't defined and lived in a closed-source module. However, having searched the Spark open source code, it is native to their JSON importer. So I checked the behavior in PySpark and it appears it is recursive!
>>> from pyspark.sql.types import _parse_datatype_string
>>> schema = _parse_datatype_string('struct<a:string,b:struct<c:string,d:bigint>,_corrupt_record:string>')
>>>
>>> json = ['{"a":"ok","b":{"c":"c","d":10}}','{"a":"bad","b":{"c":"c","d":"bad"}}']
>>> stringdf = sc.parallelize(json)
>>> df = spark.read.option('mode', 'PERMISSIVE').schema(schema).json(stringdf)
>>> df.show()
+---+---------+--------------------+
| a| b| _corrupt_record|
+---+---------+--------------------+
| ok| {c, 10}| NULL|
|bad|{c, NULL}|{"a":"bad","b":{"...|
+---+---------+--------------------+
I will update :)
Done and it was simpler than expected to add 👍
Looks very neat!
  DROPMALFORMED;

  static ParseMode fromString(final String userParseMode) {
    try {
If `null` is provided through the `option()` method, it currently results in a `NullPointerException`. To make this more user-friendly and consistent, we could throw a `ConfigException`, which provides clearer feedback.
Suggested change:
-     try {
+     validateConfig(userParseMode, Objects::nonNull, () -> "The userParseMode can't be null");
+     try {
I ended up reverting as this method is internal and only called via the readConfig method, where there is a default value for the parse mode.
…g.java Co-authored-by: Viacheslav Babanin <frest0512@gmail.com>
…age from ReadConfig includes a default value
      map.put(k, convertBsonValue(createFieldPath(fieldName, k), dataType.valueType(), v)));

-     return JavaScala.asScala(map);
+     return scala.collection.immutable.Map$.MODULE$.from(JavaScala.asScala(map));
Found this bug when converting a row to JSON; Spark errored because the value was not the correct (immutable) Scala map type.
In PERMISSIVE mode, I see that corrupted fields are set to null, regardless of specifying them as non-nullable.
However, in the `convertToRow()` method, when a non-nullable field defined in the schema is missing in the document, we throw an exception: `com.mongodb.spark.sql.connector.exceptions.DataException: Missing field 'fieldName'`
if (hasField || field.nullable()) {
values.add(
hasField
? convertBsonValue(fullFieldPath, field.dataType(), bsonDocument.get(field.name()))
: null);
} else {
throw missingFieldException(fullFieldPath, bsonDocument);
}
Given the possibility that users might have unstructured data in DB, where documents may vary in field counts, should we consider setting absent fields expected by the schema as null when operating in PERMISSIVE mode? I'm concerned that throwing exceptions in these situations could create issues for some users.
In similar contexts, Spark's strategy with
- CSV files sets missing fields to null when a record has fewer tokens than the schema expects.
- Avro, in PERMISSIVE mode: 'Corrupt records are processed as null result. Therefore, the data schema is forced to be fully nullable, which might be different from the one user provided,' implying a schema that is forced to full nullability.
(Note: I have not tested the above scenarios from the Spark's documentation).
As another option, we might also consider introducing another mode that allows setting nulls on missing fields without triggering failures. What are your thoughts?
Done, now force nullability. This is in line with JSON parsing in Spark.
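For reference, the JSON behavior being mirrored can be sketched in the pyspark shell along the lines of the earlier example (hypothetical field names, untested here): a field declared non-nullable but missing from a record should come back as NULL under PERMISSIVE rather than failing.
>>> from pyspark.sql.types import StructType, StructField, StringType, LongType
>>> schema = StructType([
...     StructField("a", StringType(), False),
...     StructField("b", LongType(), False)])  # declared non-nullable
>>> json = ['{"a":"ok","b":10}', '{"a":"b is missing"}']
>>> df = spark.read.option('mode', 'PERMISSIVE').schema(schema).json(sc.parallelize(json))
>>> df.printSchema()  # inspect whether nullability has been relaxed on read
>>> df.show()         # the missing b should appear as NULL instead of erroring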
LGTM!