-
Notifications
You must be signed in to change notification settings - Fork 495
[oneshot] SS-49 Add validation for NOT NULL columns to COPY FROM query planning #34862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[oneshot] SS-49 Add validation for NOT NULL columns to COPY FROM query planning #34862
Conversation
|
this PR is "stacked" on top of #34849, so it technically includes the changes from there, as well. |
bffa533 to
c9096b1
Compare
ggevay
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote some comments.
Also, it seems that the data that we read from the file can also have nulls, in which case we still crash. For example, adding the following to copy-from-s3-minio.td crashes (hello,,world means 3 fields, of which the middle one is null):
> CREATE TABLE t5 (a text, b text not null, c text)
$ s3-file-upload bucket=copytos3 key=csv/with_null.csv
hello,,world
$ s3-set-presigned-url bucket=copytos3 key=csv/with_null.csv var-name=nulls_in_csv_url
> COPY INTO t5 (a,b,c) FROM '${nulls_in_csv_url}' (FORMAT CSV);
So, it seems we should do some validation when we have the actual data at hand. (In which case we might not even need a validation in planning or sequencing at all, since the later validation could cover also those nulls that come from default values.)
| if let Some(expr) = source_mfp.expressions.get(expr_idx) { | ||
| // Check if the expression is a NULL literal. | ||
| // A NULL literal is represented as Literal(Ok(empty_row), _) | ||
| if matches!(expr, mz_expr::MirScalarExpr::Literal(Ok(row), _) if row.iter().next().map(|d| d.is_null()).unwrap_or(false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just checks for a null literal, but there can be a more complicated expression in the tables default for the column that evaluates to null. For example, if I modify t1_not_null's definition in copy-from-s3-minio.td to the following, then we still crash:
> CREATE TABLE t1_not_null (a text, b int, c float NOT NULL DEFAULT 5 + null);
We could maybe do constant folding on the expression. At the HIR level, this can be achieved by calling simplify_to_literal. There is an MIR expression here, not an HIR expression, but I think this validation could be done during HIR planning instead of here in sequencing, see my later comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this is true, but validation during HIR planning is a bit tricky as mentioned in my other reply. It does seem that maybe this should be done at actual decoding time, but the error handling there doesn't really allow for non-fatal errors as of now, so it would take a bit of a redesign. Is there a good way to do constant folding on a MIR expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For MIR, you can do .reduce(&[]).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem that maybe this should be done at actual decoding time, but the error handling there doesn't really allow for non-fatal errors as of now, so it would take a bit of a redesign.
Well, solving this seems unavoidable, as far as I can see. Seems like quite a serious problem if we just crash when there is a null in the input data in the wrong column. But I'm a bit out of my expertise at this point, so you might want to consult someone on the Sources & Sinks Team.
| let source_mfp = return_if_err!(source_mfp, ctx); | ||
|
|
||
| // Validate that all non-nullable columns in the target table will be populated. | ||
| // This is a runtime check to complement the planning-time validation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not do this check in HIR planning, e.g., in plan_copy_item? At first glance, it looks easier than digging out stuff from the MFP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't quite work in plan_copy_item because that code path is hit when running COPY ... FROM STDIN, where we have no way to look at a schema ahead of time, and (I think) no way to check the source of the copy data.
I originally tried to do this there, but it caused false errors on COPY FROM STDIN. Whereas sequence_copy_from is only hit when copying from url/s3.
Also I'm actually not sure if this comment is actually inaccurate, would this be considered runtime? more like coordination/sequencing I guess?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, how does COPY FROM STDIN solve the same problem?
But even if we have to special-case COPY FROM STDIN in plan_copy_item by something like an extra boolean flag, I'd still say that it's good to do validation as early as possible. (But also, it seems we'll have to do validation also when we have the actual data at hand, in which case just looking at the defaults during planning or sequencing could be superseeded.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I'm actually not sure if this comment is actually inaccurate, would this be considered runtime? more like coordination/sequencing I guess?
Yeah, I wouldn't consider this "runtime". I'd say it's sequencing, which is between planning and runtime.
| "column {} is NOT NULL but has no corresponding value in the MFP projection", | ||
| col_name | ||
| ); | ||
| return ctx.retire(Err(AdapterError::Unstructured(anyhow::anyhow!(msg)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of AdapterError::Unstructured, you could use AdapterError::ConstraintViolation.
464a325 to
9712972
Compare
Currently, if a
NOT NULLcolumn in the destination table does not have a default, and is not specified to be copied into by theCOPY FROMstatement, materialize will panic at the time of Row encoding, which is very undesirable behavior. By checking in the adapter sequencing logic, we can reject theCOPY FROMstatement earlier on with a descriptive error message.Motivation
https://github.com/MaterializeInc/database-issues/issues/9886
Checklist
$T ⇔ Proto$Tmapping (possibly in a backwards-incompatible way), then it is tagged with aT-protolabel.