-
Notifications
You must be signed in to change notification settings - Fork 1.5k
GH-3142: Add strict validation for unsigned types in ExampleParquetWriter #3271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
||
public static class Builder extends ParquetWriter.Builder<Group, Builder> { | ||
private MessageType type = null; | ||
private boolean strictUnsignedIntegerValidation = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this should be true always. At least should be true by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure will will update if there are no other changes requested, cc @shangxinli @wgtmac could you please review thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this default value to true thanks!
/** | ||
* Integration test for strict unsigned integer validation in ExampleParquetWriter. | ||
*/ | ||
public class TestStrictUnsignedIntegerValidation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test cases cover all unsigned types
and both valid and invalid scenarios. Consider adding a test case for
the maximum values of each unsigned type (255 for UINT_8, 65535 for
UINT_16, etc.) to ensure boundary conditions are properly handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated thanks!
Type currentType = getCurrentFieldType(); | ||
if (currentType != null && currentType.isPrimitive()) { | ||
LogicalTypeAnnotation logicalType = currentType.asPrimitiveType().getLogicalTypeAnnotation(); | ||
if (logicalType instanceof LogicalTypeAnnotation.IntLogicalTypeAnnotation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using LogicalTypeAnnotationVisitor
here and in validateUnsignedLong method, instead of instanceof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated if you could PTAL
* to unsigned integer fields (UINT_8, UINT_16, UINT_32, UINT_64) are within | ||
* the valid unsigned range before delegating to the wrapped consumer. | ||
*/ | ||
public class ValidatingUnsignedIntegerRecordConsumer extends RecordConsumer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it is enough to modify lines below to add the check:
WDYT? @parthchandra
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that's a good idea. Keep in mind that we will get the validation in any place where a writer is created withValidation, not just in the example. I'm in favor of it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is consistent with other validations. I just think that adding a new class for unsigned integer is an overkill, especially when we already have a ValidatingRecordConsumer.java that serves the same purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes sure, will update to piggyback on ValidatingRecordConsumer
aeb8b4f
to
566a374
Compare
cc @parthchandra @wgtmac could you PTAL thanks |
d8cc9ca
to
6beeef3
Compare
private static final int UINT_16_MAX_VALUE = 65535; | ||
|
||
private final RecordConsumer delegate; | ||
private final boolean strictUnsignedIntegerValidation; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be validated at all times? This is already the ValidatingRecordConsumer so it might need to validate everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately there are several test breaking and existing code built with the assumption of no validation. Since we put the changes directly within ValidatingRecordConsumer
it causes existing behaviour to break. I added this with a feature flag so new users can start using the validation. I'll investigate as a followup if we can cleanup/fix existing code thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you help investigate why existing test break? Are all of them failed because of unsigned integer overflow? If yes, they are actually bugs and should be fixed in this PR so we don't have to add a separate flag for unsigned integer.
private static final Logger LOG = LoggerFactory.getLogger(ValidatingRecordConsumer.class); | ||
|
||
private static final int UINT_8_MAX_VALUE = 255; | ||
private static final int UINT_16_MAX_VALUE = 65535; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only 8 and 16?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UINT_8, UINT_16, and UINT_32 annotate an int32 value, UINT_64 annotatyes a i64 value. For UINT_8, and UINT16 valid values are in the range [0, UINT_8_MAX_VALUE] and [0, UINT_16_MAX_VALUE] respectively. For UINT_32 and UINT_64, all non negative values are valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that was the reason thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
private static final Logger LOG = LoggerFactory.getLogger(ValidatingRecordConsumer.class); | ||
|
||
private static final int UINT_8_MAX_VALUE = 255; | ||
private static final int UINT_16_MAX_VALUE = 65535; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UINT_8, UINT_16, and UINT_32 annotate an int32 value, UINT_64 annotatyes a i64 value. For UINT_8, and UINT16 valid values are in the range [0, UINT_8_MAX_VALUE] and [0, UINT_16_MAX_VALUE] respectively. For UINT_32 and UINT_64, all non negative values are valid.
private static final int UINT_16_MAX_VALUE = 65535; | ||
|
||
private final RecordConsumer delegate; | ||
private final boolean strictUnsignedIntegerValidation; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
6beeef3
to
f3cfe41
Compare
cc @wgtmac this is ready thanks! |
Summary:
Testing: