Flink: Add Orc value reader, writer implementations #1255
Conversation
I've fixed the broken unit test in the latest patch. It can be reviewed now.
Rebase to resolve the conflicts with #1197.
    return visit(flinkType, schema.asStruct(), visitor);
  }

  private static <T> T visit(LogicalType flinkType, Type iType, FlinkOrcSchemaVisitor<T> visitor) {
It seems the flinkType here is just used for the exception message at the end of the primitive method, right?
The logicalType is mainly used for getting fields from list/map/struct; you can see ListWriter, MapWriter, and StructWriter. We generate an elementGetter for ListWriter and use it to read the elements of a list.
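A minimal sketch of that idea, assuming Flink's ArrayData.createElementGetter API is available; the class and method names below are illustrative, not the PR's actual code:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.table.data.ArrayData;
import org.apache.flink.table.types.logical.LogicalType;

class ListElementReadSketch {
  // Illustrative sketch, not the PR's code: the element LogicalType is what lets
  // us build a getter that pulls typed elements out of an ArrayData one by one.
  static List<Object> readElements(ArrayData array, LogicalType elementType) {
    ArrayData.ElementGetter getter = ArrayData.createElementGetter(elementType);
    List<Object> elements = new ArrayList<>(array.size());
    for (int i = 0; i < array.size(); i++) {
      elements.add(getter.getElementOrNull(array, i));
    }
    return elements;
  }
}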
  case STRING:
    return FlinkOrcWriters.strings();
  case UUID:
    return GenericOrcWriters.uuids();
For the UUID type, Flink should return GenericOrcWriters.bytes() instead of GenericOrcWriters.uuids(); I will fix this in the next patch. The reader will also need to be fixed.
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

abstract class FlinkSchemaVisitor<T> {
I think I understand why you needed to build a FlinkSchemaVisitor. Unlike Spark, Flink seems to have no common super interface for its internal data types (list, map, struct) like Spark's SpecializedGetters. So we have to know up front which type to pass to the struct, map, and list writers, whereas in Spark we can simply pass the parent SpecializedGetters object and get the right data type from within it.
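To illustrate the point, here is a sketch (not the PR's actual visitor code) assuming Flink's RowData.createFieldGetter API: reading a struct field requires its LogicalType up front, which is why the visitor has to carry the Flink type alongside the Iceberg type.

import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.RowType;

class StructFieldReadSketch {
  // Illustrative sketch: RowData, ArrayData, and MapData share no SpecializedGetters-like
  // interface, so a typed getter must be built from each field's LogicalType before reading.
  static Object[] readFields(RowData row, RowType rowType) {
    Object[] fields = new Object[rowType.getFieldCount()];
    for (int i = 0; i < rowType.getFieldCount(); i++) {
      RowData.FieldGetter getter = RowData.createFieldGetter(rowType.getTypeAt(i), i);
      fields[i] = getter.getFieldOrNull(row);
    }
    return fields;
  }
}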
The high-level approach looks fine to me.
Ping @rdblue, could this be merged into the master branch now?
@@ -87,7 +87,7 @@ private GenericOrcWriters() {
     return UUIDWriter.INSTANCE;
   }

-  public static OrcValueWriter<byte[]> fixed() {
+  public static OrcValueWriter<byte[]> bytes() {
Should the FixedWriter class also be renamed to BytesWriter?
Sounds good.
@@ -326,7 +326,7 @@ public void nonNullWrite(int rowId, LocalDateTime data, ColumnVector output) {
   public void nonNullWrite(int rowId, BigDecimal data, ColumnVector output) {
     // TODO: validate precision and scale from schema
     ((DecimalColumnVector) output).vector[rowId]
-        .setFromLongAndScale(data.unscaledValue().longValueExact(), scale);
+        .setFromLongAndScale(data.unscaledValue().longValueExact(), data.scale());
What about the TODO to check the scale matches the column's scale? As long as we're updating this, does it make sense to fix that, since we just had a decimal scale problem?
Actually, we don't need to change this now, because this merged patch has already fixed it: 6f96b36#diff-b1b07b15f036000a3f2bed76fdd9f961R334
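For reference, the validation being discussed amounts to a check like the following before writing (a sketch only; the exact form and error messages in the merged commit may differ):

import java.math.BigDecimal;

class DecimalScaleCheckSketch {
  // Fail fast when the value does not fit the column's declared decimal(precision, scale),
  // instead of silently writing a mismatched unscaled long.
  static void check(BigDecimal data, int precision, int scale) {
    if (data.scale() != scale) {
      throw new IllegalArgumentException(String.format(
          "Cannot write value as decimal(%d,%d), wrong scale: %s", precision, scale, data));
    }
    if (data.precision() > precision) {
      throw new IllegalArgumentException(String.format(
          "Cannot write value as decimal(%d,%d), too large: %s", precision, scale, data));
    }
  }
}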
ListColumnVector listVector = (ListColumnVector) vector;
int offset = (int) listVector.offsets[row];
int length = (int) listVector.lengths[row];
List<T> elements = Lists.newArrayListWithExpectedSize(length);
Can this list be reused? Allocating a new ArrayList each time could lead to poor performance.
One way to reuse the ArrayList is to make it a ThreadLocal<List>, so each thread shares the same List instance. When reading ArrayData, we get the ThreadLocal list and clear it (the ArrayList's capacity won't shrink; clearing only frees the references to the elements), then read the values into the list. One thing to consider is the size of the list: if we read an ArrayData with many elements, the ThreadLocal list may expand into a huge list too, which would waste a lot of memory. I don't have a good idea for handling that case, and I see that other ORC readers also allocate the list, so maybe we can handle this in a separate issue.
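A sketch of the reuse pattern described above (illustrative only; as noted, it was not adopted in this patch):

import java.util.ArrayList;
import java.util.List;

class ReusableListSketch {
  // One list per reader thread; clear() drops the element references but keeps the
  // backing array, so repeated reads avoid re-allocation. The trade-off is that a
  // single very large array keeps its capacity for the lifetime of the thread.
  private static final ThreadLocal<List<Object>> REUSED_LIST =
      ThreadLocal.withInitial(ArrayList::new);

  static List<Object> startList() {
    List<Object> list = REUSED_LIST.get();
    list.clear();
    return list;
  }
}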
public TemporaryFolder temp = new TemporaryFolder();

@Override
protected void writeAndValidate(Schema schema) throws IOException {
It looks like this validates records read by the reader against records written by the writer and subsequently read by the reader. I think it should validate the reader and writer separately, in two parts:
- Write using generics, read using Flink, and use assertEquals(schema, genericRow, flinkRow).
- Write using generics, read with Flink, write with Flink, read with generics, and use assertEquals(genericRow, endGenericRow).
That way, we're always comparing results against the generics that were originally generated. I think we already have the assertEquals code to do it.
Your testing method is correct, but we didn't have assertEquals(schema, genericRow, flinkRow) before Junjie's Parquet readers & writers patch got in. So I changed to another way to verify the data, sketched after this list:
- generate a list of records with a random generator;
- convert the records to a RowData list;
- write the records from step 1 to an ORC file, read them back into a RowData list, and compare with the RowData from step 2;
- write the RowData from step 2 into an ORC file, read them back into records, and compare with the records from step 1.
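In outline, that verification looks like the following (a sketch; the helper methods here are hypothetical stand-ins for the real random data generator, converters, file I/O, and the shared assertEquals helpers):

import java.util.List;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.data.Record;

abstract class OrcRoundTripSketch {
  // Hypothetical helpers; the real comparisons go through TestHelpers-style
  // field-by-field assertions rather than Object.equals.
  abstract List<Record> generateRandomRecords();                        // step 1
  abstract List<RowData> convertToRowData(List<Record> records);        // step 2
  abstract List<RowData> writeRecordsReadRowData(List<Record> records); // step 3: generic write, Flink read
  abstract List<Record> writeRowDataReadRecords(List<RowData> rows);    // step 4: Flink write, generic read
  abstract void assertSame(List<?> expected, List<?> actual);

  void validate() {
    List<Record> records = generateRandomRecords();
    List<RowData> rows = convertToRowData(records);
    assertSame(rows, writeRecordsReadRowData(records));   // reader checked against step 2
    assertSame(records, writeRowDataReadRecords(rows));   // writer checked against step 1
  }
}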
@openinx, sorry to block you for so long. Now it is merged. You might want to take a look.
Since your TestHelpers got merged, I don't have to write the AssertHelpers now and can reuse your work. Thanks for the work; I've updated the patch.
I've rebased the patch and rewritten the unit tests to assert the read records/RowData against the generated records/RowData. I also added ORC to the parameterized unit tests. Let's see the result from Travis. If all tests pass, @rdblue you may want to take a final look.
Looks like tests are failing because ORC wasn't added to the appender factory. Should be an easy fix?
Sure, let's see the Travis result again.
@@ -91,6 +91,12 @@ private WriteBuilder() {
       case BOOLEAN:
         return GenericOrcWriters.booleans();
       case INTEGER:
+        switch (flinkPrimitive.getTypeRoot()) {
+          case TINYINT:
+            return GenericOrcWriters.bytes();
This fixed the broken unit tests in Flink: 2ec4ce6#diff-6820fc22b4e5cbfa4a1c029bf5c8c789L255.
We may need to add a similar unit test in Spark so that we can write tinyint and smallint to Spark correctly; I will take a look.
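Completing the dispatch shown in the truncated diff above, the shape is roughly the following (a sketch; it assumes GenericOrcWriters also exposes shorts() and ints() writers, and the import paths are assumptions):

import org.apache.flink.table.types.logical.LogicalType;
import org.apache.iceberg.data.orc.GenericOrcWriters;
import org.apache.iceberg.orc.OrcValueWriter;

class IntegerWriterDispatchSketch {
  // Sketch only: Iceberg has a single INTEGER type while Flink distinguishes
  // TINYINT, SMALLINT, and INT, so the Flink type root decides which writer to use.
  static OrcValueWriter<?> forIcebergInteger(LogicalType flinkPrimitive) {
    switch (flinkPrimitive.getTypeRoot()) {
      case TINYINT:
        return GenericOrcWriters.bytes();
      case SMALLINT:
        return GenericOrcWriters.shorts();
      case INTEGER:
        return GenericOrcWriters.ints();
      default:
        throw new IllegalArgumentException("Unsupported type root: " + flinkPrimitive.getTypeRoot());
    }
  }
}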
Thanks @openinx! I merged this.