
DRILL-5356: Refactor Parquet Record Reader #789

Closed

paul-rogers wants to merge 2 commits into apache:master from paul-rogers:DRILL-5356

Conversation

@paul-rogers
Contributor

The Parquet reader is Drill's premier data source and has worked very well
for many years. As with any piece of code, it has grown in complexity over
that time and has become hard to understand and maintain.

In work in another project, we found that Parquet is accidentally creating
"low density" batches: record batches with little actual data compared to
the amount of memory allocated. We'd like to fix that.

However, the current complexity of the reader code creates a barrier to
making improvements: the code is so complex that it is often better to
leave bugs unfixed, or risk spending large amounts of time struggling to
make small changes.

This commit helps revitalize the Parquet reader. Functionality is
identical to the code in master, but the code has been pulled apart into
various classes, each of which focuses on one part of the task: building
up a schema, keeping track of read state, choosing a strategy for reading
various combinations of records, and so on. The idea is that several small,
focused classes are easier to understand than one huge, complex class;
indeed, the idea of small, focused classes is common in the industry and
is nothing new.

Unit tests pass with the change. Since no logic has changed (we only moved
lines of code), that is a good indication that everything still works.
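To make the "low density" problem concrete, here is a minimal sketch of the metric it implies: the ratio of bytes of actual data to bytes of allocated memory. The class and method names below are invented for illustration; they are not Drill's actual API.

```java
// Hypothetical illustration of batch "density": actual data vs. allocated
// memory. The names are invented for clarity; they are not Drill classes.
public class BatchDensity {

    /** Fraction of the allocated memory that holds real data, in [0, 1]. */
    public static double density(long dataBytes, long allocatedBytes) {
        if (allocatedBytes == 0) {
            return 0.0;
        }
        return (double) dataBytes / allocatedBytes;
    }

    public static void main(String[] args) {
        // A "low density" batch: 64 KB of data in a 16 MB allocation,
        // i.e. over 99% of the allocated memory is wasted.
        System.out.printf("density = %.4f%n",
            density(64 * 1024, 16 * 1024 * 1024));
    }
}
```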

MaterializedField field;
// ParquetMetadataConverter metaConverter = new ParquetMetadataConverter();
// FileMetaData fileMetaData;

Contributor

instead of commenting, remove these lines if not needed.

Contributor Author

Was being paranoid, but sure, removed the lines.

if (! fieldSelected(colMd.field)) {
  continue;
}
columnMd.add(colMd);
Contributor

I suggest that we rename columnMd as columnsMetadata and colMd as columnMetadata. It is confusing to infer that columnMd is column metadata for all columns.

Contributor Author

Fixed. And, taking a fresh look at the fields in this class, I found I could get rid of a few that were either redundant, or used only locally in a single method.

}

public ParquetSchema(OptionManager options, Collection<SchemaPath> selectedCols) {
  this.options = options;
Contributor

It is not clear which constructor is supposed to be used when. Please add some comments. why is rowGroupIndex not needed for the second case ?

Contributor Author

Merged constructors as suggested and added comments. Please let me know where additional comments are needed to clarify what's happening.

if (isStarQuery()) {
  schema = new ParquetSchema(fragmentContext.getOptions(), rowGroupIndex);
} else {
  schema = new ParquetSchema(fragmentContext.getOptions(), getColumns());
Contributor

why do we need to pass rowGroupIndex in one case and not other ? can we add comments here ? Is it possible to have a single constructor for ParquetSchema ?

Contributor Author

Fixed. Added comments.
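For readers following along, the merged-constructor idea can be sketched roughly as follows. The types and names are simplified stand-ins (not Drill's actual OptionManager or SchemaPath), and the convention assumed here is that a null column list means a star (SELECT *) query, so the two old constructors collapse into one.

```java
import java.util.Collection;
import java.util.List;

// Illustrative sketch of merging the two ParquetSchema constructors.
// Types and names are simplified stand-ins, not Drill's actual code.
public class SchemaSketch {
    private final Collection<String> selectedCols;  // null means SELECT *
    private final int rowGroupIndex;

    // One constructor covers both cases: the row group index is always
    // known, and the column selection distinguishes star from explicit.
    public SchemaSketch(int rowGroupIndex, Collection<String> selectedCols) {
        this.rowGroupIndex = rowGroupIndex;
        this.selectedCols = selectedCols;
    }

    public boolean isStarQuery() {
        return selectedCols == null;
    }

    public int rowGroupIndex() {
        return rowGroupIndex;
    }

    public static void main(String[] args) {
        System.out.println(new SchemaSketch(0, null).isStarQuery());
        System.out.println(new SchemaSketch(0, List.of("a")).isStarQuery());
    }
}
```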

public void testVariableWidth() throws Exception {
  String sql = "SELECT s_name, s_address, s_phone, s_comment\n" +
      "FROM `cp`.`tpch/supplier.parquet` LIMIT 20";
  client.queryBuilder().sql(sql).printCsv();
Contributor

do you want to comment this line ?

Contributor Author

The fluent style is not self-explanatory? Build a query, using a SQL statement, that prints CSV to the console?

Actually, just commented out the line since it was primarily to help me debug the test; the real testing is below.
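For readers unfamiliar with the fluent style under discussion: it chains calls by having each method return the builder itself. The following is a generic sketch of the pattern only, not Drill's actual QueryBuilder test API.

```java
// Generic sketch of the fluent-builder pattern discussed above; this is
// not Drill's actual test framework, just the shape it follows.
public class QueryBuilderSketch {
    private String sql;

    public QueryBuilderSketch sql(String sql) {
        this.sql = sql;
        return this;  // returning 'this' is what enables chaining
    }

    public String run() {
        // A real builder would submit the query; here we just echo it.
        return "ran: " + sql;
    }

    public static void main(String[] args) {
        // Reads like a sentence: build a query from SQL, then run it.
        System.out.println(new QueryBuilderSketch().sql("SELECT 1").run());
    }
}
```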

public void testMissing() throws Exception {
  String sql = "SELECT s_suppkey, bogus\n" +
      "FROM `cp`.`tpch/supplier.parquet` LIMIT 20";
  client.queryBuilder().sql(sql).printCsv();
Contributor

Do you want to comment this line ? This test is not doing anything. If we plan to fix it later, add comments accordingly.

Contributor Author

This one I did comment, explaining why the rest of the test is commented out, and how we'll eventually implement the "all nulls" test.

}

public ParquetSchema schema() { return schema; }
public List<ColumnReader<?>> getReaders() { return columnStatuses; }
Contributor

for clarity, should we rename this function getColumnReaders ?

Contributor Author

Fixed. I also renamed "columnStatuses" since I could never figure out what that meant. Now is "columnReaders".

@Override
protected long getReadCount(ColumnReader<?> firstColumnStatus) {
  if (readState.mockRecordsRead == readState.schema().getGroupRecordCount()) {
    return 0;
Contributor

How about moving mockRecordsRead to this class instead of keeping it in readState ?

Contributor Author

Was being lazy. We have a mock read count and a "real" read count, but only one or the other is used. Got rid of the mock count and just used the same record count variable for all cases.
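The simplification described here, one counter instead of parallel mock and real counters, might look roughly like the sketch below; all names are invented for illustration and are not Drill's actual fields.

```java
// Illustrative sketch: a single records-read counter serves both the mock
// and the real read paths. Names are invented, not Drill's actual fields.
public class ReadStateSketch {
    private final long groupRecordCount;
    private long recordsRead;  // one counter for mock and real reads alike

    public ReadStateSketch(long groupRecordCount) {
        this.groupRecordCount = groupRecordCount;
    }

    /** Records left in the row group; 0 means the group is exhausted. */
    public long remaining() {
        return groupRecordCount - recordsRead;
    }

    /** Advance the counter after a (mock or real) read of 'count' records. */
    public void advance(long count) {
        recordsRead += count;
    }
}
```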

@ppadma (Contributor) left a comment

Overall, LGTM.


/**
* Strategy for reading mock records. (What are these?)
*/
Contributor

Please add brief explanation what these are instead of "what are these ?"

Contributor Author

Fixed. Finally found out what this means. Thanks Jinfeng!

*/
private int getDataTypeLength() {
  if (! isFixedLength()) {
    return -1;
Contributor

Use static final instead of -1.

Contributor Author

Original code, but sure, since I'm mucking about, I defined UNDEFINED_LENGTH.
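The named-constant fix can be sketched as below; the class and field names are illustrative, not the exact Drill code.

```java
// Sketch of replacing the magic -1 with a named constant. The class and
// field names here are illustrative, not the exact Drill code.
public class ColumnLengthSketch {
    /** Sentinel for variable-width columns, which have no single length. */
    public static final int UNDEFINED_LENGTH = -1;

    private final boolean fixedLength;
    private final int typeLength;

    public ColumnLengthSketch(boolean fixedLength, int typeLength) {
        this.fixedLength = fixedLength;
        this.typeLength = typeLength;
    }

    public int getDataTypeLength() {
        // The constant documents intent at the call site, where a bare -1
        // would force the reader to guess its meaning.
        return fixedLength ? typeLength : UNDEFINED_LENGTH;
    }
}
```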

}

@SuppressWarnings("resource") FixedWidthRepeatedReader makeRepeatedFixedWidthReader(ParquetRecordReader reader, int recordsPerBatch) throws Exception {
final RepeatedValueVector repeatedVector = RepeatedValueVector.class.cast(vector);
Contributor

Enter after @SuppressWarnings("resource")

Contributor Author

Fixed.

}
}
}
schema.buildSchema(footer, batchSize);
Contributor

May be pass footer in the constructor of ParquetSchema itself ?

Contributor Author

Nice improvement. Thanks!
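The design choice here is to avoid two-phase initialization: rather than constructing the schema object and then calling a separate build step with the footer, the footer arrives at construction and the object is complete once the constructor returns. A rough sketch, with invented stand-in types:

```java
// Sketch of the reviewer's suggestion: pass the footer at construction so
// there is no separate buildSchema() step. Types are invented stand-ins,
// not Drill's actual ParquetSchema or footer classes.
public class FooterSchemaSketch {

    /** Minimal stand-in for the Parquet file footer metadata. */
    public static class Footer {
        final long recordCount;
        public Footer(long recordCount) { this.recordCount = recordCount; }
    }

    private final Footer footer;
    private final int batchSize;

    // Fully initialized on return; callers can never observe a schema
    // that exists but has not yet been built.
    public FooterSchemaSketch(Footer footer, int batchSize) {
        this.footer = footer;
        this.batchSize = batchSize;
    }

    public long recordCount() { return footer.recordCount; }
    public int batchSize() { return batchSize; }
}
```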

.run();
}


Contributor

remove extra line

Contributor Author

Fixed.

}
return false;
}
/**
Contributor

add a new line after end of function

Contributor Author

Fixed.

@paul-rogers
Contributor Author

Squashed and rebased on master. @parthchandra, Padma would like you to review this in addition to her review.

This PR would be incredibly helpful for the next step in the project to fix memory fragmentation: we will implement code to limit batch size -- kind of like the original intent, but this time it should work. That work will be much easier to do on top of the refactored code than on top of the original code.

@parthchandra
Contributor

+1 LGTM. Thanks for the cleanup Paul!

@sudheeshkatkam
Contributor

Are the changes only in 1494915 (to cherry pick)?

@parthchandra
Contributor

I took the entire patch and applied it to master (use git am -3). Git manages to figure out that the commits are already applied. One commit caused a merge conflict and I skipped it. In the end it left me with only the one commit.

@paul-rogers
Contributor Author

Thanks. I'll clean up the messy commits today. Not sure how it picked up the other six commits...

Also includes fixes based on review comments.
@paul-rogers
Contributor Author

Cleaned up the multi-commit mess, rebased on the latest master, and fixed minor issues raised in code review comments. Should be ready to commit.

sudheeshkatkam pushed a commit to sudheeshkatkam/drill that referenced this pull request May 26, 2017
closes apache#789
jinfengni pushed a commit to jinfengni/incubator-drill that referenced this pull request Jun 1, 2017
closes apache#789
@asfgit asfgit closed this in 676ea88 Jun 3, 2017