Add validation for metrics config during parquet write #2048
Conversation
```diff
@@ -226,6 +226,9 @@ private CompressionCodecName codec() {
     set("parquet.avro.write-old-list-structure", "false");
     MessageType type = ParquetSchemaUtil.convert(schema, name);

+    // Check that our metrics make sense
+    metricsConfig.validateProperties(schema);
```
Is this the right place to do the validation?
If a user adds a bad property or performs some schema update that causes a validation error, that would break all writes to the table. To me, it doesn't seem like we are catching the problem early enough, possibly allowing a typo to break scheduled jobs.
What do you think about adding this validation when altering the table? `UpdateProperties` could check whether any properties starting with `write.metadata.metrics` were modified and run this. Similarly, `UpdateSchema` could run this as well, although I think that we should probably modify `UpdateSchema` to simply update the properties for column renames (if that's easily done).
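A rough sketch of the suggested prefix check; the helper name and how it would wire into `UpdateProperties` are hypothetical, not code from this PR:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper: re-run metrics validation only when a
// write.metadata.metrics* property was added, changed, or removed,
// so unrelated property updates stay cheap.
private static final String METRICS_PROPERTY_PREFIX = "write.metadata.metrics";

private static boolean touchesMetricsConfig(Map<String, String> updates, Set<String> removals) {
  return updates.keySet().stream().anyMatch(key -> key.startsWith(METRICS_PROPERTY_PREFIX))
      || removals.stream().anyMatch(key -> key.startsWith(METRICS_PROPERTY_PREFIX));
}
```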
Makes sense, I'll move this over to UpdateProperties/UpdateSchema tomorrow :)
I'd also cover table creation. I feel that we should fail table creation if the metrics config refers to non-existing columns. I have seen so many cases where people clone an example but forget to update the metrics config, and they realize the problem only after ingestion is done and file skipping is not working.
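For illustration, here is the creation-time failure mode this would catch; the catalog, schema, and column names below are made up for the example:

```java
import java.util.Map;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.types.Types;

// The column is "event_time", but the metrics property says "event_ts":
// exactly the copy-paste typo described above.
Schema schema = new Schema(
    Types.NestedField.required(1, "id", Types.LongType.get()),
    Types.NestedField.optional(2, "event_time", Types.TimestampType.withZone()));

Map<String, String> props = Map.of(
    "write.metadata.metrics.default", "none",
    "write.metadata.metrics.column.event_ts", "full"); // typo: no such column

// With validation at create time this throws immediately, instead of
// producing files whose metrics can't support skipping on event_time.
catalog.createTable(TableIdentifier.of("db", "events"), schema, PartitionSpec.unpartitioned(), props);
```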
That's a good point. Table creation is a good time to validate this as well.
I've put this in for table creation and schema/property evolution. Let me know if I hooked into the correct places.
Thanks for working on this, @holdenk! It would be great to get this gap fixed.

Thanks for the quick review :) I'll circle back tomorrow with the changes :)

I think I've addressed the comments, although it's possible the hooks I've chosen are not the best places; the tests suggest it covers the right areas. Let me know if there is a better place to put the hooks for validation.
@rdblue @shardulm94 @RussellSpitzer, not related to this PR but a general question: the metrics config relies on column names instead of column ids, since it is controlled through table properties. I was wondering whether it is safe to do so. I considered the following cases:

Overall, it seems safe to me. Any cases I missed?
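Related to the rename case: if `UpdateSchema` rewrites the metrics properties on rename, as suggested above, the name-based config stays consistent. A hypothetical sketch of that remapping (not code from this PR):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical: when a column is renamed, carry its per-column metrics
// mode over to the new name so the config keeps referencing a real column.
static Map<String, String> remapMetricsOnRename(
    Map<String, String> props, String oldName, String newName) {
  String prefix = "write.metadata.metrics.column.";
  Map<String, String> updated = new HashMap<>(props);
  String mode = updated.remove(prefix + oldName);
  if (mode != null) {
    updated.put(prefix + newName, mode);
  }
  return updated;
}
```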
… uses newTableMetaData when making new tables, rename validateProperties to validateReferencedColumns
```diff
@@ -89,6 +89,11 @@ public UpdateProperties defaultFormat(FileFormat format) {

     newProperties.putAll(updates);

+    // Validate the metrics
+    if (base != null && base.schema() != null) {
+      MetricsConfig.fromProperties(newProperties).validateReferencedColumns(base.schema());
```
I don't think this is tested. Could you add a test to the property update tests?
This is covered in the same test as the other one (the exception is on a different line though):
```
org.apache.iceberg.TestSchemaAndMappingUpdate > testModificationWithMetricsMetrics[formatVersion = 1] FAILED
    java.lang.AssertionError: No exception was thrown (Creating metrics for non-existent column fails), expected: org.apache.iceberg.exceptions.ValidationException
        at org.junit.Assert.fail(Assert.java:88)
        at org.apache.iceberg.AssertHelpers.assertThrows(AssertHelpers.java:65)
        at org.apache.iceberg.TestSchemaAndMappingUpdate.testModificationWithMetricsMetrics(TestSchemaAndMappingUpdate.java:169)

org.apache.iceberg.TestSchemaAndMappingUpdate > testModificationWithMetricsMetrics[formatVersion = 2] FAILED
    java.lang.AssertionError: No exception was thrown (Creating metrics for non-existent column fails), expected: org.apache.iceberg.exceptions.ValidationException
        at org.junit.Assert.fail(Assert.java:88)
        at org.apache.iceberg.AssertHelpers.assertThrows(AssertHelpers.java:65)
        at org.apache.iceberg.TestSchemaAndMappingUpdate.testModificationWithMetricsMetrics(TestSchemaAndMappingUpdate.java:169)
```
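For reference, the property-update half of that test plausibly looks something like the sketch below; the table setup and the `ids` typo are assumptions of mine, not code copied from the PR:

```java
import org.apache.iceberg.exceptions.ValidationException;
import org.junit.Assert;

// Setting a metrics mode for a column that is not in the schema should
// now fail when the property update is committed, not at write time.
try {
  table.updateProperties()
      .set("write.metadata.metrics.column.ids", "full") // hypothetical typo for "id"
      .commit();
  Assert.fail("Expected ValidationException for a metrics config on a missing column");
} catch (ValidationException e) {
  // expected: the new property references a column the schema does not have
}
```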
```java
      MetricsConfig.fromProperties(base.properties()).validateReferencedColumns(newSchema);
    }

    return newSchema;
```
I don't think that this path is tested. Could you add one?
It is currently covered by `testModificationWithMetricsMetrics`; commenting out the check results in the following error when running `gradle build`:
```
org.apache.iceberg.TestSchemaAndMappingUpdate > testModificationWithMetricsMetrics[formatVersion = 1] FAILED
    java.lang.AssertionError: No exception was thrown (Deleting a column with metrics fails), expected: org.apache.iceberg.exceptions.ValidationException
        at org.junit.Assert.fail(Assert.java:88)
        at org.apache.iceberg.AssertHelpers.assertThrows(AssertHelpers.java:65)
        at org.apache.iceberg.TestSchemaAndMappingUpdate.testModificationWithMetricsMetrics(TestSchemaAndMappingUpdate.java:179)

org.apache.iceberg.TestSchemaAndMappingUpdate > testModificationWithMetricsMetrics[formatVersion = 2] FAILED
    java.lang.AssertionError: No exception was thrown (Deleting a column with metrics fails), expected: org.apache.iceberg.exceptions.ValidationException
        at org.junit.Assert.fail(Assert.java:88)
        at org.apache.iceberg.AssertHelpers.assertThrows(AssertHelpers.java:65)
        at org.apache.iceberg.TestSchemaAndMappingUpdate.testModificationWithMetricsMetrics(TestSchemaAndMappingUpdate.java:179)
```
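And the schema-evolution half, sketched the same way (assuming the table already carries a metrics property for a column named `id`):

```java
import org.apache.iceberg.exceptions.ValidationException;
import org.junit.Assert;

// Deleting a column that still has an explicit metrics mode should fail,
// since the remaining config would reference a missing column.
try {
  table.updateSchema().deleteColumn("id").commit();
  Assert.fail("Expected ValidationException when deleting a column with metrics");
} catch (ValidationException e) {
  // expected: the metrics config for "id" no longer resolves against the new schema
}
```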
LGTM apart from the existing comments.
…ta constructor call change, update the error in metrics config validation to give users a better pointer to the key
Thanks, @holdenk! Really nice to have this done.
Validate that the metrics config refers to actual columns in the schema during write.
This is for issue #1548.
The discussion in the ticket suggested calling it during table creation, but I moved it to the Parquet write, since we don't parse the metrics config until the write stage rather than in the create call. I'm still new to the code base, so if that isn't the tradeoff we want, I'm more than happy to refactor.
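For context, the core of the validation amounts to checking every per-column metrics property against the schema. A minimal illustrative sketch, assuming the standard `write.metadata.metrics.column.` property prefix; this is a stand-in for what `MetricsConfig.validateReferencedColumns` does, not the PR's exact body:

```java
import java.util.Map;
import org.apache.iceberg.Schema;
import org.apache.iceberg.exceptions.ValidationException;

// Stand-in sketch: every column named by a per-column metrics property
// must resolve to a field in the current schema.
static void validateReferencedColumns(Map<String, String> props, Schema schema) {
  String prefix = "write.metadata.metrics.column.";
  for (String key : props.keySet()) {
    if (key.startsWith(prefix)) {
      String column = key.substring(prefix.length());
      ValidationException.check(
          schema.findField(column) != null,
          "Invalid metrics config, could not find column %s from table prop %s in schema",
          column, key);
    }
  }
}
```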