
Add validation for metrics config during parquet write #2048

Merged

Conversation

@holdenk (Contributor) commented Jan 8, 2021

Validate that the metrics config refers to actual columns in the schema during write.

This is for issue #1548.

The discussion in the ticket suggested calling it during table creation, but I moved it to the Parquet write path since we don't parse the metrics config until the write stage, not in the create call. I'm still new to the code base, so if that isn't the tradeoff we want, I'm happy to refactor.
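For orientation, here is a minimal sketch of what the validation amounts to, assuming a helper that walks the per-column metrics modes parsed from table properties; the class, parameter names, and message text below are illustrative rather than the PR's actual code. The diff that follows shows where the check was first hooked in.

import java.util.Map;

import org.apache.iceberg.Schema;
import org.apache.iceberg.exceptions.ValidationException;

class MetricsConfigValidation {
  // columnModes maps column name -> metrics mode, as parsed from the
  // "write.metadata.metrics.column.<name>" table properties (illustrative shape).
  static void validateReferencedColumns(Map<String, String> columnModes, Schema schema) {
    for (String column : columnModes.keySet()) {
      ValidationException.check(
          schema.findField(column) != null,
          "Invalid metrics config, could not find column %s in the table schema", column);
    }
  }
}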

@@ -226,6 +226,9 @@ private CompressionCodecName codec() {
set("parquet.avro.write-old-list-structure", "false");
MessageType type = ParquetSchemaUtil.convert(schema, name);

// Check that our metrics make sense
metricsConfig.validateProperties(schema);

Contributor:
Is this the right place to do the validation?

If a user adds a bad property or performs some schema update that causes a validation error, that would break all writes to the table. To me, it doesn't seem like we are catching the problem early enough and possibly allowing a typo to break scheduled jobs.

What do you think about adding this validation when altering the table? UpdateProperties could check whether any properties starting with write.metadata.metrics were modified and run this. Similarly, UpdateSchema could run this as well, although I think that we should probably modify UpdateSchema to simply update the properties for column renames (if that's easily done).
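To make that suggestion concrete, here is a rough sketch of the kind of gate UpdateProperties could apply, assuming access to the pending updates/removals and the merged property map; the class and parameter names are illustrative, and it assumes the validateReferencedColumns method this PR introduces is visible from this call site.

import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

import org.apache.iceberg.MetricsConfig;
import org.apache.iceberg.Schema;

class MetricsPropertyGate {
  // Only run the validation when a metrics-related key was added, changed, or removed.
  static void validateIfMetricsChanged(Map<String, String> updates, Set<String> removals,
                                       Map<String, String> newProperties, Schema schema) {
    boolean metricsChanged = Stream.concat(updates.keySet().stream(), removals.stream())
        .anyMatch(key -> key.startsWith("write.metadata.metrics"));
    if (metricsChanged && schema != null) {
      MetricsConfig.fromProperties(newProperties).validateReferencedColumns(schema);
    }
  }
}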

@holdenk (author):

Makes sense, I'll move this over to UpdateProperties/UpdateSchema tomorrow :)

@aokolnychyi (Contributor), Jan 8, 2021:

I'd also cover table creation. I feel that we should fail table creation if the metrics config refers to non-existent columns. I have seen many cases where people clone an example but forget to update the metrics config, and they only realize the problem after ingestion is done and file skipping isn't working.

Contributor:

That's a good point. Table creation is a good time to validate this as well.

@holdenk (author):

I've put this in for table creation as well as schema and property evolution. Let me know if I hooked into the correct places.

@rdblue (Contributor) commented Jan 8, 2021

Thanks for working on this, @holdenk! It would be great to get this gap fixed.

@holdenk (author) commented Jan 8, 2021

Thanks for the quick review :) I'll circle back tomorrow with the changes :)

@holdenk (author) commented Jan 11, 2021

I think I've addressed the comments. It's possible the hooks I've chosen are not the best places, but they seem to cover the right areas according to the tests. Let me know if there is a better place to put the validation hooks.

@aokolnychyi (Contributor) commented Jan 12, 2021

@rdblue @shardulm94 @RussellSpitzer, not related to this PR but a general question: the metrics config relies on column names instead of column ids as it is controlled through table properties. I was wondering whether it is safe to do so.

I considered the following cases:

  • While writing new data, it should always be OK to follow the name-based approach as long as the config is up to date.
  • While importing data (assuming we have a correct name mapping), we will assign correct ids to the columns and use those ids to find the current aliases in the schema, so this should work.
  • When fixing (re-importing) metadata (assuming we have renamed a column with a reliable id), we will just use the column id in the file to find the current name, so this should work.

Overall, it seems safe to me. Any cases I missed?
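For context on why column names come into play at all: the metrics config lives entirely in name-keyed table properties. A small usage example (the property keys and values are the documented write.metadata.metrics options; the column name is made up):

import org.apache.iceberg.Table;

class MetricsConfigExample {
  // Per-column metrics modes are keyed by column name in table properties:
  //   write.metadata.metrics.default        -> "none", "counts", "truncate(N)", or "full"
  //   write.metadata.metrics.column.<name>  -> per-column override of the default
  static void configureMetrics(Table table) {
    table.updateProperties()
        .set("write.metadata.metrics.default", "counts")
        .set("write.metadata.metrics.column.event_ts", "full")  // column name is made up
        .commit();
  }
}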

… uses newTableMetaData when making new tables, rename validateProperties to validateReferencedColumns
@@ -89,6 +89,11 @@ public UpdateProperties defaultFormat(FileFormat format) {

newProperties.putAll(updates);

// Validate the metrics
if (base != null && base.schema() != null) {
MetricsConfig.fromProperties(newProperties).validateReferencedColumns(base.schema());

Contributor:

I don't think this is tested. Could you add a test to the property update tests?

@holdenk (author):

This is covered in the same test as the other one (the exception is on a different line though):

org.apache.iceberg.TestSchemaAndMappingUpdate > testModificationWithMetricsMetrics[formatVersion = 1] FAILED
    java.lang.AssertionError: No exception was thrown (Creating metrics for non-existent column fails), expected: org.apache.iceberg.exceptions.ValidationException
        at org.junit.Assert.fail(Assert.java:88)
        at org.apache.iceberg.AssertHelpers.assertThrows(AssertHelpers.java:65)
        at org.apache.iceberg.TestSchemaAndMappingUpdate.testModificationWithMetricsMetrics(TestSchemaAndMappingUpdate.java:169)

org.apache.iceberg.TestSchemaAndMappingUpdate > testModificationWithMetricsMetrics[formatVersion = 2] FAILED
    java.lang.AssertionError: No exception was thrown (Creating metrics for non-existent column fails), expected: org.apache.iceberg.exceptions.ValidationException
        at org.junit.Assert.fail(Assert.java:88)
        at org.apache.iceberg.AssertHelpers.assertThrows(AssertHelpers.java:65)
        at org.apache.iceberg.TestSchemaAndMappingUpdate.testModificationWithMetricsMetrics(TestSchemaAndMappingUpdate.java:169)
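For reference, a dedicated property-update test could look roughly like the sketch below, written as it might appear inside a TableTestBase-style test class and reusing its table fixture and imports; the test name, column name, and the expected message fragment are guesses, not the PR's actual test.

@Test
public void testMetricsConfigForMissingColumnFails() {
  // Setting a per-column metrics mode for a column that is not in the schema
  // should be rejected when the property update is committed.
  AssertHelpers.assertThrows("Creating metrics for non-existent column fails",
      ValidationException.class,
      "no_such_column",  // assumes the error message names the missing column
      () -> table.updateProperties()
          .set("write.metadata.metrics.column.no_such_column", "full")
          .commit());
}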

MetricsConfig.fromProperties(base.properties()).validateReferencedColumns(newSchema);
}

return newSchema;

Contributor:

I don't think that this path is tested. Could you add one?

@holdenk (author):

It is currently covered by testModificationWithMetricsMetrics; commenting out the check results in the following error when running gradle build:

org.apache.iceberg.TestSchemaAndMappingUpdate > testModificationWithMetricsMetrics[formatVersion = 1] FAILED
    java.lang.AssertionError: No exception was thrown (Deleting a column with metrics fails), expected: org.apache.iceberg.exceptions.ValidationException
        at org.junit.Assert.fail(Assert.java:88)
        at org.apache.iceberg.AssertHelpers.assertThrows(AssertHelpers.java:65)
        at org.apache.iceberg.TestSchemaAndMappingUpdate.testModificationWithMetricsMetrics(TestSchemaAndMappingUpdate.java:179)

org.apache.iceberg.TestSchemaAndMappingUpdate > testModificationWithMetricsMetrics[formatVersion = 2] FAILED
    java.lang.AssertionError: No exception was thrown (Deleting a column with metrics fails), expected: org.apache.iceberg.exceptions.ValidationException
        at org.junit.Assert.fail(Assert.java:88)
        at org.apache.iceberg.AssertHelpers.assertThrows(AssertHelpers.java:65)
        at org.apache.iceberg.TestSchemaAndMappingUpdate.testModificationWithMetricsMetrics(TestSchemaAndMappingUpdate.java:179)
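The schema-evolution counterpart, matching the "Deleting a column with metrics fails" assertion above, could look roughly like this; the same caveats apply about the test fixture and the expected message fragment, and the column name assumes the standard test schema.

@Test
public void testDeleteColumnWithMetricsFails() {
  // Give a column a per-column metrics override, then try to drop it.
  table.updateProperties()
      .set("write.metadata.metrics.column.id", "counts")
      .commit();

  AssertHelpers.assertThrows("Deleting a column with metrics fails",
      ValidationException.class,
      "id",  // assumes the error message mentions the offending column
      () -> table.updateSchema().deleteColumn("id").commit());
}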

@aokolnychyi (Contributor):
LGTM apart from the existing comments.

…ta constructor call change, update the error in metrics config validation to give users a better pointer to the key
@rdblue merged commit fffdb69 into apache:master on Jan 21, 2021
@rdblue (Contributor) commented Jan 21, 2021

Thanks, @holdenk! Really nice to have this done.

XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021