DRILL-5356: Refactor Parquet Record Reader #789

paul-rogers wants to merge 2 commits into apache:master
Conversation
```java
MaterializedField field;
// ParquetMetadataConverter metaConverter = new ParquetMetadataConverter();
// FileMetaData fileMetaData;
```
**Reviewer:** Instead of commenting these out, remove the lines if they are not needed.

**paul-rogers:** Was being paranoid, but sure; removed the lines.
```java
if (! fieldSelected(colMd.field)) {
  continue;
}
columnMd.add(colMd);
```
**Reviewer:** I suggest we rename columnMd to columnsMetadata and colMd to columnMetadata. It is hard to infer that columnMd is the metadata for all columns.

**paul-rogers:** Fixed. And, taking a fresh look at the fields in this class, I found I could get rid of a few that were either redundant or used only locally in a single method.
```java
public ParquetSchema(OptionManager options, Collection<SchemaPath> selectedCols) {
  this.options = options;
```
**Reviewer:** It is not clear which constructor is supposed to be used when. Please add some comments. Why is rowGroupIndex not needed for the second case?

**paul-rogers:** Merged the constructors as suggested and added comments. Please let me know where additional comments are needed to clarify what's happening.
```java
if (isStarQuery()) {
  schema = new ParquetSchema(fragmentContext.getOptions(), rowGroupIndex);
} else {
  schema = new ParquetSchema(fragmentContext.getOptions(), getColumns());
```
**Reviewer:** Why do we need to pass rowGroupIndex in one case and not the other? Can we add comments here? Is it possible to have a single constructor for ParquetSchema?

**paul-rogers:** Fixed. Added comments.
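The merged-constructor idea discussed above might look something like the following sketch. This is a hypothetical simplification, not Drill's actual code: the class, field names, and the use of a `null` column list to signal a star query are all stand-ins chosen for illustration.

```java
import java.util.Arrays;
import java.util.Collection;

public class Main {
    // Simplified stand-in for Drill's ParquetSchema: one constructor serves
    // both cases, with a null column list signalling a star (SELECT *) query.
    static class ParquetSchema {
        private final Collection<String> selectedCols; // null => star query
        private final int rowGroupIndex;               // meaningful only for star queries

        ParquetSchema(Collection<String> selectedCols, int rowGroupIndex) {
            this.selectedCols = selectedCols;
            this.rowGroupIndex = rowGroupIndex;
        }

        boolean isStarQuery() { return selectedCols == null; }
    }

    public static void main(String[] args) {
        ParquetSchema star = new ParquetSchema(null, 0);
        ParquetSchema projected = new ParquetSchema(Arrays.asList("s_name"), 0);
        System.out.println(star.isStarQuery());      // true
        System.out.println(projected.isStarQuery()); // false
    }
}
```

A single constructor with a documented sentinel keeps call sites uniform; a comment on the field then answers the reviewer's question about when rowGroupIndex matters.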
```java
public void testVariableWidth() throws Exception {
  String sql = "SELECT s_name, s_address, s_phone, s_comment\n" +
      "FROM `cp`.`tpch/supplier.parquet` LIMIT 20";
  client.queryBuilder().sql(sql).printCsv();
```
**Reviewer:** Do you want to comment out this line?

**paul-rogers:** The fluent style is not self-explanatory? Build a query, using a SQL statement, that prints CSV to the console? Actually, I just commented out the line, since it was primarily to help me debug the test; the real testing is below.
```java
public void testMissing() throws Exception {
  String sql = "SELECT s_suppkey, bogus\n" +
      "FROM `cp`.`tpch/supplier.parquet` LIMIT 20";
  client.queryBuilder().sql(sql).printCsv();
```
**Reviewer:** Do you want to comment out this line? This test is not doing anything. If we plan to fix it later, add comments accordingly.

**paul-rogers:** This one I did comment, explaining why the rest of the test is commented out, and how we'll eventually implement the "all nulls" test.
```java
public ParquetSchema schema() { return schema; }
public List<ColumnReader<?>> getReaders() { return columnStatuses; }
```
**Reviewer:** For clarity, should we rename this function getColumnReaders?

**paul-rogers:** Fixed. I also renamed columnStatuses, since I could never figure out what that meant. It is now columnReaders.
```java
@Override
protected long getReadCount(ColumnReader<?> firstColumnStatus) {
  if (readState.mockRecordsRead == readState.schema().getGroupRecordCount()) {
    return 0;
```
**Reviewer:** How about moving mockRecordsRead to this class instead of keeping it in readState?

**paul-rogers:** Was being lazy. We have a mock read count and a "real" read count, but only one or the other is ever used. Got rid of the mock count and just used the same record-count variable for all cases.
```java
/**
 * Strategy for reading mock records. (What are these?)
 */
```
**Reviewer:** Please add a brief explanation of what these are instead of "(What are these?)".

**paul-rogers:** Fixed. Finally found out what this means. Thanks Jinfeng!
```java
 */
private int getDataTypeLength() {
  if (! isFixedLength()) {
    return -1;
```
**Reviewer:** Use a static final constant instead of -1.

**paul-rogers:** Original code, but sure; since I'm mucking about anyway, I defined UNDEFINED_LENGTH.
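The constant suggested above replaces a magic number with a name callers can test against. The sketch below is a simplified illustration of that change, not Drill's actual class; the boolean parameter and bit-width argument are stand-ins for the real reader's state.

```java
public class Main {
    // Named constant for "this column has no fixed length" (replaces bare -1).
    static final int UNDEFINED_LENGTH = -1;

    // Hypothetical stand-in for the reader's getDataTypeLength(): variable-width
    // columns report UNDEFINED_LENGTH; fixed-width columns report their width.
    static int getDataTypeLength(boolean fixedLength, int bitWidth) {
        if (!fixedLength) {
            return UNDEFINED_LENGTH;
        }
        return bitWidth;
    }

    public static void main(String[] args) {
        System.out.println(getDataTypeLength(false, 32)); // -1
        System.out.println(getDataTypeLength(true, 32));  // 32
    }
}
```

Callers can now write `length == UNDEFINED_LENGTH` rather than comparing against a literal whose meaning must be guessed from context.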
```java
@SuppressWarnings("resource")
FixedWidthRepeatedReader makeRepeatedFixedWidthReader(ParquetRecordReader reader, int recordsPerBatch) throws Exception {
  final RepeatedValueVector repeatedVector = RepeatedValueVector.class.cast(vector);
```

```java
schema.buildSchema(footer, batchSize);
```
**Reviewer:** Maybe pass the footer in the constructor of ParquetSchema itself?

**paul-rogers:** Nice improvement. Thanks!
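The suggestion above moves the footer from a method argument to constructor state. The following is a hypothetical sketch of that shape only; `Footer`, its `rowCount` field, and the body of `buildSchema` are trivial stand-ins for Parquet's real metadata types and Drill's real schema-building logic.

```java
public class Main {
    // Trivial stand-in for the Parquet file footer metadata.
    static class Footer {
        final long rowCount;
        Footer(long rowCount) { this.rowCount = rowCount; }
    }

    // Simplified ParquetSchema: the footer arrives once, in the constructor,
    // so buildSchema() no longer needs it passed in at every call site.
    static class ParquetSchema {
        private final Footer footer;
        ParquetSchema(Footer footer) { this.footer = footer; }

        long buildSchema(int batchSize) {
            // Real code walks the footer's column metadata; this stand-in just
            // shows the footer being used without being re-supplied.
            return Math.min(batchSize, footer.rowCount);
        }
    }

    public static void main(String[] args) {
        ParquetSchema schema = new ParquetSchema(new Footer(20));
        System.out.println(schema.buildSchema(100)); // 20
    }
}
```

Constructor injection like this also makes the dependency explicit: a ParquetSchema cannot exist without a footer, so there is no partially initialized state to misuse.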
```java
    .run();
}
```

```java
  }
  return false;
}
/**
```
**Reviewer:** Add a blank line after the end of the function.
|
**paul-rogers:** Squashed and rebased on master. @parthchandra, Padma would like you to review this in addition to her review. This PR would be incredibly helpful for the next step in the project to fix memory fragmentation: we will implement code to limit batch size, kind of like the original intent, but this time it should work. That work will be much easier to do on top of the refactored code than on top of the original code.
|
**Reviewer:** +1, LGTM. Thanks for the cleanup, Paul!
|
**Reviewer:** Are the changes only in 1494915 (to cherry-pick)?
|
**paul-rogers:** I took the entire patch and applied it to master (using `git am -3`). Git manages to figure out that the commits are already applied. One commit caused a merge conflict and I skipped it; in the end it left me with only the one commit.
|
**paul-rogers:** Thanks. I'll clean up the messy commits today. Not sure how it picked up the other six commits...
The Parquet reader is Drill's premier data source and has worked very well for many years. As with any piece of code, it has grown in complexity over that time and has become hard to understand and maintain.

In work on another project, we found that Parquet is accidentally creating "low density" batches: record batches with little actual data compared to the amount of memory allocated. We'd like to fix that. However, the current complexity of the reader code creates a barrier to making improvements: the code is so complex that it is often better to leave bugs unfixed than to risk spending large amounts of time struggling to make small changes.

This commit helps revitalize the Parquet reader. Functionality is identical to the code in master, but the code has been pulled apart into various classes, each of which focuses on one part of the task: building up a schema, keeping track of read state, a strategy for reading various combinations of records, and so on. The idea is that it is easier to understand several small, focused classes than one huge, complex class. Indeed, the idea of small, focused classes is common in the industry; it is nothing new.

Unit tests pass with the change. Since no logic has changed and we only moved lines of code, that is a good indication that everything still works.

Also includes fixes based on review comments.
|
**paul-rogers:** Cleaned up the multi-commit mess, rebased on the latest master, and fixed minor issues raised in code review comments. Should be ready to commit.
closes apache#789