[SPARK-55716][SQL] Fix V1 file source NOT NULL constraint enforcement #54517

Open

yaooqinn wants to merge 1 commit into apache:master from
Conversation
V1 file-based DataSource writes (parquet/orc/json) silently accept null values into NOT NULL columns. The root cause:

1. `DataSource.resolveRelation()` calls `dataSchema.asNullable` (SPARK-13738) for read safety, stripping NOT NULL recursively.
2. `CreateDataSourceTableCommand` stores this all-nullable schema in the catalog, permanently losing the NOT NULL info.
3. `PreprocessTableInsertion` never injects `AssertNotNull` because the schema is all-nullable.

Fix:

- `CreateDataSourceTableCommand`: preserve user-specified nullability via recursive merging into the resolved schema.
- `PreprocessTableInsertion`: restore nullability flags from the catalog schema before the null checks.
- Add a legacy config `spark.sql.legacy.allowNullInsertForFileSourceTables` (default `false`) for backward compatibility.

Covers top-level and nested types (array elements, struct fields, map values).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
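A minimal SQL reproduction of the reported behavior (the table name and values here are illustrative, not taken from the patch):

```sql
-- Before this fix, a V1 file source table silently accepts NULL
-- into a column declared NOT NULL:
CREATE TABLE t (id INT NOT NULL, name STRING) USING parquet;

-- Expected: a runtime error from an injected AssertNotNull check,
-- matching V2 table behavior.
-- Actual (pre-fix): the NULL row is written without any error,
-- because the catalog schema was stored as all-nullable.
INSERT INTO t VALUES (NULL, 'a');
```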
30258f5 to 8aeab4f
yaooqinn (Member, Author):

cc @dongjoon-hyun @cloud-fan @gengliangwang, is the fix direction correct? Is this a genuine bug or a design choice? I haven't found any public discussion in this area.
Member:

Hi, @yaooqinn. To me, this PR seems to introduce a new feature rather than fix a bug. cc @aokolnychyi, @peter-toth, too.
What changes were proposed in this pull request?
V1 file-based DataSource writes (parquet/orc/json) silently accept null values into NOT NULL columns. This PR fixes the issue by:
- `CreateDataSourceTableCommand`: Preserve user-specified nullability by recursively merging nullability flags from the user schema into the resolved `dataSource.schema` (which has char/varchar normalization, metadata, etc.). Previously it stored `dataSource.schema` directly, which is all-nullable due to `DataSource.resolveRelation()` calling `dataSchema.asNullable`.
- `PreprocessTableInsertion`: Restore nullability flags from the catalog schema before null checks. This ensures `AssertNotNull` is injected when needed. Gated behind a legacy config flag.
- Legacy config: `spark.sql.legacy.allowNullInsertForFileSourceTables` (default `false`) for backward compatibility.

Why are the changes needed?
The root cause has three parts:

1. `DataSource.resolveRelation()` calls `dataSchema.asNullable` (added in SPARK-13738 for read safety), stripping all NOT NULL constraints recursively.
2. `CreateDataSourceTableCommand` stores this all-nullable schema in the catalog, permanently losing the NOT NULL information.
3. `PreprocessTableInsertion` never injects `AssertNotNull` for V1 file source tables.

Note: `InsertableRelation` (e.g., `SimpleInsertSource`) does NOT have this problem because it preserves the original schema (SPARK-24583).

Does this PR introduce any user-facing change?
Yes. V1 file source tables (parquet/orc/json) will now enforce NOT NULL constraints during INSERT operations, matching the behavior of V2 tables. A legacy config is provided for backward compatibility.
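The recursive nullability merge described above can be pictured as a walk that copies the user-declared nullable flags onto the resolved (all-nullable) schema, descending into struct fields, array elements, and map values. Below is an illustrative Python sketch of that idea using plain dicts; it is a simplified model, not Spark's actual `StructType` API, and all names in it are made up:

```python
def merge_nullability(resolved, user):
    """Return the resolved field with nullability restored from the
    user-declared field, recursing into struct/array/map types.

    A "field" here is a dict like {"name": ..., "type": ..., "nullable": ...};
    complex types are dicts with a "kind" key. This mirrors the shape of the
    merge only, not Spark's real schema classes.
    """
    out = dict(resolved)
    out["nullable"] = user["nullable"]
    r_type, u_type = resolved["type"], user["type"]
    if isinstance(r_type, dict) and r_type.get("kind") == "struct":
        # Merge field-by-field, matching on field name.
        u_fields = {f["name"]: f for f in u_type["fields"]}
        out["type"] = {
            "kind": "struct",
            "fields": [
                merge_nullability(f, u_fields[f["name"]])
                if f["name"] in u_fields else f
                for f in r_type["fields"]
            ],
        }
    elif isinstance(r_type, dict) and r_type.get("kind") == "array":
        # Restore containsNull on the element.
        out["type"] = {
            "kind": "array",
            "element": merge_nullability(r_type["element"], u_type["element"]),
        }
    elif isinstance(r_type, dict) and r_type.get("kind") == "map":
        # Map keys are never nullable; only the value side is merged.
        out["type"] = {
            "kind": "map",
            "key": r_type["key"],
            "value": merge_nullability(r_type["value"], u_type["value"]),
        }
    return out
```

The same walk, run in reverse from the catalog schema, is what lets `PreprocessTableInsertion` know where to inject `AssertNotNull`.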
How was this patch tested?
Added 7 new tests in `InsertSuite`.

Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored with GitHub Copilot.