Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce code duplication in VectorizedParquetDefinitionLevelReader #11661

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

wypoon
Copy link
Contributor

@wypoon wypoon commented Nov 27, 2024

In VectorizedParquetDefinitionLevelReader, there are a number of nested reader classes; these extend one of two nested base classes, NumericBaseReader and BaseReader. Each of these base classes define nextBatch and nextDictEncodedBatch methods. These 4 methods (NumericBaseReader::nextBatch, NumericBaseReader::nextDictEncodedBatch, BaseReader::nextBatch, BaseReader::nextDictEncodedBatch) are actually structurally similar:

      int idx = startOffset;
      int left = numValsToRead;
      while (left > 0) {
        if (currentCount == 0) {
          readNextGroup();
        }
        int numValues = Math.min(left, currentCount);
        ...
        switch (mode) {
          case RLE:
            ...
          case PACKED:
            ...
        }
        left -= numValues;
        currentCount -= numValues;
      }

We can separate out the differences into what gets called in the RLE and PACKED cases for each of the two base classes, and even there, for dict encoded batches, there is no difference in logic between NumericBaseReader and BaseReader. Thus there is quite a bit of duplication that can be reduced.

This PR simply refactors the common logic into a superclass (CommonBaseReader) of NumericBaseReader and BaseReader, in a nextCommonBatch method that nextBatch and nextDictEncodedBatch delegate to.

No tests are added as there is no functionality change. The methods are exercised by existing tests.

Additional notes:
NumericBaseReader and BaseReader both define abstract nextDictEncodedVal methods with the same set of parameters but in different order! In this refactor, we define a common nextDictEncodedVal for them in CommonBaseReader.
NumericBaseReader and BaseReader each define abstract nextVal methods as well, but with different sets of parameters, and these we have left alone.

@github-actions github-actions bot added the arrow label Nov 27, 2024
int idx,
int numValues,
ArrowBuf validityBuffer) {
int bufferIdx = idx;
Copy link
Contributor Author

@wypoon wypoon Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incremented in the method body and incrementing a parameter (idx) is a checkstyle violation. Hence this variable.
The same applies to 3 other methods below.

@wypoon
Copy link
Contributor Author

wypoon commented Nov 27, 2024

@nastra can you please review this?
My impetus for the refactor is that I want to implement Parquet page-skipping, and with the refactor, a core change in logic can be applied in one method instead of 4 similarly structured methods.

int left = numValsToRead;
while (left > 0) {
if (currentCount == 0) {
readNextGroup();
}
int numValues = Math.min(left, currentCount);

byte[] byteArray = null;
if (typeWidth > -1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the previous code this was only done for the BaseReader and not the NumericBaseReader but now it's generally done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. We initialize a byte array and it gets passed to nextRleBatch and nextPackedBatch; these two methods are abstract and overridden by NumericBaseReader and BaseReader; in NumericBaseReader, these two methods simply ignore the byteArray that is passed.

@wypoon
Copy link
Contributor Author

wypoon commented Dec 3, 2024

@nastra thank you for reviewing this. I have done some renaming as you suggested.

@wypoon
Copy link
Contributor Author

wypoon commented Dec 12, 2024

Hi @nastra, I have addressed your feedback last week. Can you please review this again?

@wypoon
Copy link
Contributor Author

wypoon commented Dec 18, 2024

@nastra @RussellSpitzer ping.

@wypoon
Copy link
Contributor Author

wypoon commented Jan 7, 2025

Hi @RussellSpitzer, I see that @nastra requested a review from you on this. I believe he wants a second pair of eyes to check that the refactor is correct. Can you please review?

@wypoon
Copy link
Contributor Author

wypoon commented Feb 4, 2025

@nastra any chance of moving this forward?

@nastra
Copy link
Contributor

nastra commented Feb 5, 2025

@RussellSpitzer @amogh-jahagirdar @pvary could you take a look at this one please?

@wypoon
Copy link
Contributor Author

wypoon commented Mar 6, 2025

@amogh-jahagirdar can you please take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants