Skip to content

Conversation

ArnavBalyan
Copy link
Member

@ArnavBalyan ArnavBalyan commented Aug 19, 2025

Summary:

Testing:

  • Added UTs


public static class Builder extends ParquetWriter.Builder<Group, Builder> {
private MessageType type = null;
private boolean strictUnsignedIntegerValidation = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this should be true always. At least should be true by default.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure will will update if there are no other changes requested, cc @shangxinli @wgtmac could you please review thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this default value to true thanks!

/**
* Integration test for strict unsigned integer validation in ExampleParquetWriter.
*/
public class TestStrictUnsignedIntegerValidation {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test cases cover all unsigned types
and both valid and invalid scenarios. Consider adding a test case for
the maximum values of each unsigned type (255 for UINT_8, 65535 for
UINT_16, etc.) to ensure boundary conditions are properly handled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated thanks!

Type currentType = getCurrentFieldType();
if (currentType != null && currentType.isPrimitive()) {
LogicalTypeAnnotation logicalType = currentType.asPrimitiveType().getLogicalTypeAnnotation();
if (logicalType instanceof LogicalTypeAnnotation.IntLogicalTypeAnnotation) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using LogicalTypeAnnotationVisitor here and in validateUnsignedLong method, instead of instanceof.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated if you could PTAL

* to unsigned integer fields (UINT_8, UINT_16, UINT_32, UINT_64) are within
* the valid unsigned range before delegating to the wrapped consumer.
*/
public class ValidatingUnsignedIntegerRecordConsumer extends RecordConsumer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's a good idea. Keep in mind that we will get the validation in any place where a writer is created withValidation, not just in the example. I'm in favor of it though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is consistent with other validations. I just think that adding a new class for unsigned integer is an overkill, especially when we already have a ValidatingRecordConsumer.java that serves the same purpose.

Copy link
Member Author

@ArnavBalyan ArnavBalyan Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sure, will update to piggyback on ValidatingRecordConsumer

@ArnavBalyan
Copy link
Member Author

cc @parthchandra @wgtmac could you PTAL thanks

@ArnavBalyan ArnavBalyan force-pushed the arnavb/test-pqt branch 2 times, most recently from d8cc9ca to 6beeef3 Compare August 29, 2025 08:01
private static final int UINT_16_MAX_VALUE = 65535;

private final RecordConsumer delegate;
private final boolean strictUnsignedIntegerValidation;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be validated at all times? This is already the ValidatingRecordConsumer so it might need to validate everything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately there are several test breaking and existing code built with the assumption of no validation. Since we put the changes directly within ValidatingRecordConsumer it causes existing behaviour to break. I added this with a feature flag so new users can start using the validation. I'll investigate as a followup if we can cleanup/fix existing code thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you help investigate why existing test break? Are all of them failed because of unsigned integer overflow? If yes, they are actually bugs and should be fixed in this PR so we don't have to add a separate flag for unsigned integer.

private static final Logger LOG = LoggerFactory.getLogger(ValidatingRecordConsumer.class);

private static final int UINT_8_MAX_VALUE = 255;
private static final int UINT_16_MAX_VALUE = 65535;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only 8 and 16?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UINT_8, UINT_16, and UINT_32 annotate an int32 value, UINT_64 annotatyes a i64 value. For UINT_8, and UINT16 valid values are in the range [0, UINT_8_MAX_VALUE] and [0, UINT_16_MAX_VALUE] respectively. For UINT_32 and UINT_64, all non negative values are valid.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that was the reason thanks!

Copy link
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

private static final Logger LOG = LoggerFactory.getLogger(ValidatingRecordConsumer.class);

private static final int UINT_8_MAX_VALUE = 255;
private static final int UINT_16_MAX_VALUE = 65535;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UINT_8, UINT_16, and UINT_32 annotate an int32 value, UINT_64 annotatyes a i64 value. For UINT_8, and UINT16 valid values are in the range [0, UINT_8_MAX_VALUE] and [0, UINT_16_MAX_VALUE] respectively. For UINT_32 and UINT_64, all non negative values are valid.

private static final int UINT_16_MAX_VALUE = 65535;

private final RecordConsumer delegate;
private final boolean strictUnsignedIntegerValidation;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@ArnavBalyan
Copy link
Member Author

cc @wgtmac this is ready thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ExampleParquetWriter may write illegal values for uint_8 and uint_16
5 participants