
[SPARK-55716][SQL] Fix V1 file source NOT NULL constraint enforcement #54517

Open

yaooqinn wants to merge 1 commit into apache:master from yaooqinn:SPARK-55716

Conversation


@yaooqinn (Member) commented Feb 26, 2026

What changes were proposed in this pull request?

V1 file-based DataSource writes (parquet/orc/json) silently accept null values into NOT NULL columns. This PR fixes the issue by:

  1. CreateDataSourceTableCommand: Preserve user-specified nullability by recursively merging the nullability flags from the user-provided schema into the resolved dataSource.schema (which carries char/varchar normalization, metadata, etc.). Previously the command stored dataSource.schema directly, which is all-nullable because DataSource.resolveRelation() calls dataSchema.asNullable.

  2. PreprocessTableInsertion: Restore nullability flags from the catalog schema before null checks. This ensures AssertNotNull is injected when needed. Gated behind a legacy config flag.

  3. Legacy config: spark.sql.legacy.allowNullInsertForFileSourceTables (default false) for backward compatibility.
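The recursive nullability merge in step 1 can be sketched as follows. This is an illustrative model, not Spark's actual StructType/StructField classes: the resolved schema's fields are kept as-is (metadata, normalized types), and only the user's nullable=false flags are restored, recursing into nested structs.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Struct:
    fields: List["Field"]

@dataclass
class Field:
    name: str
    dtype: Union[str, Struct]  # leaf type name, or a nested Struct
    nullable: bool

def merge_nullability(resolved: Struct, user: Struct) -> Struct:
    """Keep the resolved schema but restore the user's nullable=False flags."""
    user_by_name = {f.name: f for f in user.fields}
    merged = []
    for rf in resolved.fields:
        uf = user_by_name.get(rf.name)
        if uf is None:
            merged.append(rf)
            continue
        dtype = rf.dtype
        # Recurse into nested structs so inner NOT NULL fields survive too.
        if isinstance(dtype, Struct) and isinstance(uf.dtype, Struct):
            dtype = merge_nullability(dtype, uf.dtype)
        # A field is non-nullable if either side declares it non-nullable.
        merged.append(Field(rf.name, dtype, rf.nullable and uf.nullable))
    return Struct(merged)
```

In the real fix the same idea applies to array elements and map values as well; the sketch shows only struct fields for brevity.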

Why are the changes needed?

The root cause has two parts:

  • DataSource.resolveRelation() calls dataSchema.asNullable (added in SPARK-13738 for read safety), stripping all NOT NULL constraints recursively.
  • CreateDataSourceTableCommand stores this all-nullable schema in the catalog, permanently losing NOT NULL information.
  • As a result, PreprocessTableInsertion never injects AssertNotNull for V1 file source tables.

Note: InsertableRelation (e.g., SimpleInsertSource) does NOT have this problem because it preserves the original schema (SPARK-24583).

Does this PR introduce any user-facing change?

Yes. V1 file source tables (parquet/orc/json) will now enforce NOT NULL constraints during INSERT operations, matching the behavior of V2 tables. A legacy config is provided for backward compatibility.
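As an illustration of the behavior change, a minimal SQL repro (table and column names are hypothetical; the exact error message may differ):

```sql
CREATE TABLE t (id INT NOT NULL) USING parquet;

-- Before this fix: succeeds silently, writing NULL into a NOT NULL column.
-- After this fix: the insert fails the null check, matching V2 table behavior.
INSERT INTO t VALUES (NULL);

-- The legacy (non-enforcing) behavior can be restored via:
SET spark.sql.legacy.allowNullInsertForFileSourceTables=true;
```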

How was this patch tested?

Added 7 new tests in InsertSuite, covering top-level and nested NOT NULL columns (array elements, struct fields, map values).

Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored with GitHub Copilot.

V1 file-based DataSource writes (parquet/orc/json) silently accept null values into NOT NULL columns. The root cause:

1. `DataSource.resolveRelation()` calls `dataSchema.asNullable` (SPARK-13738) for read safety, stripping NOT NULL recursively.
2. `CreateDataSourceTableCommand` stores this all-nullable schema in the catalog, permanently losing NOT NULL info.
3. `PreprocessTableInsertion` never injects `AssertNotNull` because the schema is all-nullable.

Fix:
- `CreateDataSourceTableCommand`: preserve user-specified nullability via recursive merging into the resolved schema.
- `PreprocessTableInsertion`: restore nullability flags from catalog schema before null checks.
- Add legacy config `spark.sql.legacy.allowNullInsertForFileSourceTables` (default false) for backward compatibility.

Covers top-level and nested types (array elements, struct fields, map values).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yaooqinn

cc @dongjoon-hyun @cloud-fan @gengliangwang. Is the fix direction correct? Is this a genuine bug or a design choice? I haven't found any public discussion in this area.


dongjoon-hyun commented Feb 26, 2026

Hi, @yaooqinn

  • Apache Spark doesn't claim to support SQL CONSTRAINT for V1 yet, does it? IIUC, null handling should be done explicitly by the user in the SELECT clause of the INSERT statement.
  • SPARK-51207 (SPIP: Constraints in DSv2) is a fairly new feature, introduced only in Apache Spark 4.1.0.

To me, this PR seems to introduce a new feature rather than fix a bug.

cc @aokolnychyi , @peter-toth , too.
