-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce code duplication in VectorizedParquetDefinitionLevelReader #11661
base: main
Are you sure you want to change the base?
Conversation
int idx, | ||
int numValues, | ||
ArrowBuf validityBuffer) { | ||
int bufferIdx = idx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incremented in the method body and incrementing a parameter (idx
) is a checkstyle violation. Hence this variable.
The same applies to 3 other methods below.
@nastra can you please review this? |
...java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java
Outdated
Show resolved
Hide resolved
...java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java
Outdated
Show resolved
Hide resolved
int left = numValsToRead; | ||
while (left > 0) { | ||
if (currentCount == 0) { | ||
readNextGroup(); | ||
} | ||
int numValues = Math.min(left, currentCount); | ||
|
||
byte[] byteArray = null; | ||
if (typeWidth > -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the previous code this was only done for the BaseReader
and not the NumericBaseReader
but now it's generally done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct. We initialize a byte array and it gets passed to nextRleBatch
and nextPackedBatch
; these two methods are abstract and overridden by NumericBaseReader
and BaseReader
; in NumericBaseReader
, these two methods simply ignore the byteArray
that is passed.
@nastra thank you for reviewing this. I have done some renaming as you suggested. |
Hi @nastra, I have addressed your feedback last week. Can you please review this again? |
1f08116
to
e7ffac8
Compare
@nastra @RussellSpitzer ping. |
Hi @RussellSpitzer, I see that @nastra requested a review from you on this. I believe he wants a second pair of eyes to check that the refactor is correct. Can you please review? |
@nastra any chance of moving this forward? |
@RussellSpitzer @amogh-jahagirdar @pvary could you take a look at this one please? |
@amogh-jahagirdar can you please take a look? |
In
VectorizedParquetDefinitionLevelReader
, there are a number of nested reader classes; these extend one of two nested base classes,NumericBaseReader
andBaseReader
. Each of these base classes definenextBatch
andnextDictEncodedBatch
methods. These 4 methods (NumericBaseReader::nextBatch
,NumericBaseReader::nextDictEncodedBatch
,BaseReader::nextBatch
,BaseReader::nextDictEncodedBatch
) are actually structurally similar:We can separate out the differences into what gets called in the RLE and PACKED cases for each of the two base classes, and even there, for dict encoded batches, there is no difference in logic between
NumericBaseReader
andBaseReader
. Thus there is quite a bit of duplication that can be reduced.This PR simply refactors the common logic into a superclass (
CommonBaseReader
) ofNumericBaseReader
andBaseReader
, in anextCommonBatch
method thatnextBatch
andnextDictEncodedBatch
delegate to.No tests are added as there is no functionality change. The methods are exercised by existing tests.
Additional notes:
NumericBaseReader
andBaseReader
both define abstractnextDictEncodedVal
methods with the same set of parameters but in different order! In this refactor, we define a commonnextDictEncodedVal
for them inCommonBaseReader
.NumericBaseReader
andBaseReader
each define abstractnextVal
methods as well, but with different sets of parameters, and these we have left alone.