
DRILL-5356: Refactor Parquet Record Reader #789

Closed

paul-rogers wants to merge 2 commits into apache:master from paul-rogers:DRILL-5356

Conversation

@paul-rogers
Contributor

The Parquet reader is Drill's premier data source and has worked very well
for many years. As with any piece of code, it has grown in complexity over
that time and has become hard to understand and maintain.

In work in another project, we found that Parquet is accidentally creating
"low density" batches: record batches with little actual data compared to
the amount of memory allocated. We'd like to fix that.

However, the current complexity of the reader code creates a barrier to
making improvements: the code is so complex that it is often better to
leave bugs unfixed, or risk spending large amounts of time struggling to
make small changes.

This commit helps revitalize the Parquet reader. Functionality is
identical to the code in master, but the code has been pulled apart into
various classes, each of which focuses on one part of the task: building
up a schema, keeping track of read state, choosing a strategy for reading
various combinations of records, and so on. The idea is that several small,
focused classes are easier to understand than one huge, complex class;
indeed, the idea of small, focused classes is common in the industry and
is nothing new.

Unit tests pass with the change. Since no logic has changed (we only moved
lines of code), that is a good indication that everything still works.
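To make the "low density" problem concrete, here is a minimal sketch of the metric it implies: the ratio of bytes of actual data to bytes of allocated memory. The class and method names below are invented for illustration; they are not Drill's actual API.

```java
// Hypothetical illustration of batch "density": actual data vs. allocated
// memory. The names are invented for clarity; they are not Drill classes.
public class BatchDensity {

    /** Fraction of the allocated memory that holds real data, in [0, 1]. */
    public static double density(long dataBytes, long allocatedBytes) {
        if (allocatedBytes == 0) {
            return 0.0;
        }
        return (double) dataBytes / allocatedBytes;
    }

    public static void main(String[] args) {
        // A "low density" batch: 64 KB of data in a 16 MB allocation,
        // i.e. over 99% of the allocated memory is wasted.
        System.out.printf("density = %.4f%n",
            density(64 * 1024, 16 * 1024 * 1024));
    }
}
```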

MaterializedField field;
// ParquetMetadataConverter metaConverter = new ParquetMetadataConverter();
// FileMetaData fileMetaData;

Contributor

instead of commenting, remove these lines if not needed.

Contributor Author

Was being paranoid, but sure, removed the lines.

if (! fieldSelected(colMd.field)) {
  continue;
}
columnMd.add(colMd);
Contributor

I suggest that we rename columnMd as columnsMetadata and colMd as columnMetadata. It is confusing to infer that columnMd is column metadata for all columns.

Contributor Author

Fixed. And, taking a fresh look at the fields in this class, I found I could get rid of a few that were either redundant, or used only locally in a single method.

}

public ParquetSchema(OptionManager options, Collection<SchemaPath> selectedCols) {
  this.options = options;
Contributor

It is not clear which constructor is supposed to be used when. Please add some comments. why is rowGroupIndex not needed for the second case ?

Contributor Author

Merged constructors as suggested and added comments. Please let me know where additional comments are needed to clarify what's happening.

if (isStarQuery()) {
  schema = new ParquetSchema(fragmentContext.getOptions(), rowGroupIndex);
} else {
  schema = new ParquetSchema(fragmentContext.getOptions(), getColumns());
Contributor

why do we need to pass rowGroupIndex in one case and not other ? can we add comments here ? Is it possible to have a single constructor for ParquetSchema ?

Contributor Author

Fixed. Added comments.
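For readers following along, the merged-constructor idea can be sketched roughly as follows. The types and names are simplified stand-ins (not Drill's actual OptionManager or SchemaPath), and the convention assumed here is that a null column list means a star (SELECT *) query, so the two old constructors collapse into one.

```java
import java.util.Collection;
import java.util.List;

// Illustrative sketch of merging the two ParquetSchema constructors.
// Types and names are simplified stand-ins, not Drill's actual code.
public class SchemaSketch {
    private final Collection<String> selectedCols;  // null means SELECT *
    private final int rowGroupIndex;

    // One constructor covers both cases: the row group index is always
    // known, and the column selection distinguishes star from explicit.
    public SchemaSketch(int rowGroupIndex, Collection<String> selectedCols) {
        this.rowGroupIndex = rowGroupIndex;
        this.selectedCols = selectedCols;
    }

    public boolean isStarQuery() {
        return selectedCols == null;
    }

    public int rowGroupIndex() {
        return rowGroupIndex;
    }

    public static void main(String[] args) {
        System.out.println(new SchemaSketch(0, null).isStarQuery());
        System.out.println(new SchemaSketch(0, List.of("a")).isStarQuery());
    }
}
```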

public void testVariableWidth() throws Exception {
  String sql = "SELECT s_name, s_address, s_phone, s_comment\n" +
      "FROM `cp`.`tpch/supplier.parquet` LIMIT 20";
  client.queryBuilder().sql(sql).printCsv();
Contributor

do you want to comment this line ?

Contributor Author

The fluent style is not self-explanatory? Build a query, using a SQL statement, that prints CSV to the console?

Actually, just commented out the line since it was primarily to help me debug the test; the real testing is below.
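For readers unfamiliar with the fluent style under discussion: it chains calls by having each method return the builder itself. The following is a generic sketch of the pattern only, not Drill's actual QueryBuilder test API.

```java
// Generic sketch of the fluent-builder pattern discussed above; this is
// not Drill's actual test framework, just the shape it follows.
public class QueryBuilderSketch {
    private String sql;

    public QueryBuilderSketch sql(String sql) {
        this.sql = sql;
        return this;  // returning 'this' is what enables chaining
    }

    public String run() {
        // A real builder would submit the query; here we just echo it.
        return "ran: " + sql;
    }

    public static void main(String[] args) {
        // Reads like a sentence: build a query from SQL, then run it.
        System.out.println(new QueryBuilderSketch().sql("SELECT 1").run());
    }
}
```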

public void testMissing() throws Exception {
  String sql = "SELECT s_suppkey, bogus\n" +
      "FROM `cp`.`tpch/supplier.parquet` LIMIT 20";
  client.queryBuilder().sql(sql).printCsv();
Contributor

Do you want to comment this line ? This test is not doing anything. If we plan to fix it later, add comments accordingly.

Contributor Author

This one I did comment, explaining why the rest of the test is commented out, and how we'll eventually implement the "all nulls" test.

}

public ParquetSchema schema() { return schema; }
public List<ColumnReader<?>> getReaders() { return columnStatuses; }
Contributor

for clarity, should we rename this function getColumnReaders ?

Contributor Author

Fixed. I also renamed "columnStatuses" since I could never figure out what that meant. Now is "columnReaders".

@Override
protected long getReadCount(ColumnReader<?> firstColumnStatus) {
  if (readState.mockRecordsRead == readState.schema().getGroupRecordCount()) {
    return 0;
Contributor

How about moving mockRecordsRead to this class instead of keeping it in readState ?

Contributor Author

Was being lazy. We have a mock read count and a "real" read count, but only one or the other is used. Got rid of the mock count and just used the same record count variable for all cases.
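The simplification described here, one counter instead of parallel mock and real counters, might look roughly like the sketch below; all names are invented for illustration and are not Drill's actual fields.

```java
// Illustrative sketch: a single records-read counter serves both the mock
// and the real read paths. Names are invented, not Drill's actual fields.
public class ReadStateSketch {
    private final long groupRecordCount;
    private long recordsRead;  // one counter for mock and real reads alike

    public ReadStateSketch(long groupRecordCount) {
        this.groupRecordCount = groupRecordCount;
    }

    /** Records left in the row group; 0 means the group is exhausted. */
    public long remaining() {
        return groupRecordCount - recordsRead;
    }

    /** Advance the counter after a (mock or real) read of 'count' records. */
    public void advance(long count) {
        recordsRead += count;
    }
}
```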

@ppadma (Contributor) left a comment

Overall, LGTM.


/**
* Strategy for reading mock records. (What are these?)
*/
Contributor

Please add brief explanation what these are instead of "what are these ?"

Contributor Author

Fixed. Finally found out what this means. Thanks Jinfeng!

*/
private int getDataTypeLength() {
  if (! isFixedLength()) {
    return -1;
Contributor

Use static final instead of -1.

Contributor Author

Original code, but sure, since I'm mucking about, I defined UNDEFINED_LENGTH.
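The named-constant fix can be sketched as below; the class and field names are illustrative, not the exact Drill code.

```java
// Sketch of replacing the magic -1 with a named constant. The class and
// field names here are illustrative, not the exact Drill code.
public class ColumnLengthSketch {
    /** Sentinel for variable-width columns, which have no single length. */
    public static final int UNDEFINED_LENGTH = -1;

    private final boolean fixedLength;
    private final int typeLength;

    public ColumnLengthSketch(boolean fixedLength, int typeLength) {
        this.fixedLength = fixedLength;
        this.typeLength = typeLength;
    }

    public int getDataTypeLength() {
        // The constant documents intent at the call site, where a bare -1
        // would force the reader to guess its meaning.
        return fixedLength ? typeLength : UNDEFINED_LENGTH;
    }
}
```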

}

@SuppressWarnings("resource") FixedWidthRepeatedReader makeRepeatedFixedWidthReader(ParquetRecordReader reader, int recordsPerBatch) throws Exception {
final RepeatedValueVector repeatedVector = RepeatedValueVector.class.cast(vector);
Contributor

Enter after @SuppressWarnings("resource")

Contributor Author

Fixed.

}
}
}
schema.buildSchema(footer, batchSize);
Contributor

May be pass footer in the constructor of ParquetSchema itself ?

Contributor Author

Nice improvement. Thanks!
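The design choice here is to avoid two-phase initialization: rather than constructing the schema object and then calling a separate build step with the footer, the footer arrives at construction and the object is complete once the constructor returns. A rough sketch, with invented stand-in types:

```java
// Sketch of the reviewer's suggestion: pass the footer at construction so
// there is no separate buildSchema() step. Types are invented stand-ins,
// not Drill's actual ParquetSchema or footer classes.
public class FooterSchemaSketch {

    /** Minimal stand-in for the Parquet file footer metadata. */
    public static class Footer {
        final long recordCount;
        public Footer(long recordCount) { this.recordCount = recordCount; }
    }

    private final Footer footer;
    private final int batchSize;

    // Fully initialized on return; callers can never observe a schema
    // that exists but has not yet been built.
    public FooterSchemaSketch(Footer footer, int batchSize) {
        this.footer = footer;
        this.batchSize = batchSize;
    }

    public long recordCount() { return footer.recordCount; }
    public int batchSize() { return batchSize; }
}
```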

.run();
}


Contributor

remove extra line

Contributor Author

Fixed.

}
return false;
}
/**
Contributor

add a new line after end of function

Contributor Author

Fixed.

@paul-rogers
Contributor Author

Squashed and rebased on master. @parthchandra, Padma would like you to review this in addition to her review.

This PR would be incredibly helpful for the next step in the project to fix memory fragmentation: we will implement code to limit batch size -- kind of like the original intent, but this time it should work. That work will be much easier to do on top of the refactored code than on top of the original code.

@parthchandra
Contributor

+1 LGTM. Thanks for the cleanup Paul!

@sudheeshkatkam
Contributor

Are the changes only in 1494915 (to cherry pick)?

@parthchandra
Contributor

I took the entire patch and applied it to master (use git am -3). Git manages to figure out that the commits are already applied. One commit caused a merge conflict and I skipped it. In the end it left me with only the one commit.

@paul-rogers
Contributor Author

Thanks. I'll clean up the messy commits today. Not sure how it picked up the other six commits...

Also includes fixes based on review comments.
@paul-rogers
Contributor Author

Cleaned up the multi-commit mess, rebased on the latest master, and fixed minor issues raised in code review comments. Should be ready to commit.

sudheeshkatkam pushed a commit to sudheeshkatkam/drill that referenced this pull request May 26, 2017
closes apache#789
jinfengni pushed a commit to jinfengni/incubator-drill that referenced this pull request Jun 1, 2017
closes apache#789
@asfgit asfgit closed this in 676ea88 Jun 3, 2017