Refactor reusable map data in Spark and Flink parquet readers #1331 #1572

Conversation

@holdenk (Contributor) commented Oct 9, 2020

No description provided.


package org.apache.iceberg.parquet;

public interface ReusableArrayData {
Contributor

Was it necessary to make this an interface? Because this does allocation, it seems much harder to make it an interface than to make it an abstract class that handles allocation and the values array internally.
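For reference, a rough sketch of the abstract-class shape being suggested here (hypothetical names and sizes, not code from this PR): the base class would own the backing array, growth, and element count itself.

// Hypothetical sketch only: an abstract base class that owns allocation and growth,
// so subclasses never touch the backing array directly.
abstract class ReusableArrayDataBase {
  private Object[] values = new Object[16];
  private int numElements = 0;

  int capacity() {
    return values.length;
  }

  void grow() {
    // Double the backing array, preserving existing elements.
    Object[] bigger = new Object[values.length * 2];
    System.arraycopy(values, 0, bigger, 0, values.length);
    values = bigger;
  }

  void update(int pos, Object value) {
    values[pos] = value;
  }

  Object get(int pos) {
    return values[pos];
  }

  void setNumElements(int newSize) {
    this.numElements = newSize;
  }

  int numElements() {
    return numElements;
  }
}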

Contributor Author

Yeah, the problem is Spark's base class is an abstract class and Flink's is an interface :/
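In other words (a hypothetical, stripped-down illustration; SparkBase and FlinkBase are stand-ins, not the real Spark/Flink types): a Java class gets only one superclass, so shared behavior that must combine with Spark's abstract base class can only arrive through an interface.

// Hypothetical stand-ins for the engine base types.
abstract class SparkBase {          // Spark's base is an abstract class
  public abstract int numElements();
}

interface FlinkBase {               // Flink's base is an interface
  int size();
}

// The shared type has to be an interface: the Spark-side class already spends
// its single "extends" slot on SparkBase.
interface SharedReusable {
  int capacity();
}

class SparkSide extends SparkBase implements SharedReusable {
  @Override
  public int numElements() {
    return 0;
  }

  @Override
  public int capacity() {
    return 0;
  }
}

class FlinkSide implements FlinkBase, SharedReusable {
  @Override
  public int size() {
    return 0;
  }

  @Override
  public int capacity() {
    return 0;
  }
}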

Contributor Author

I was thinking about this today. Originally I had assumed that I could not have the Flink one inherit from Spark because Flink & Spark wouldn't want to take dependencies on each other. But since the Flink one is an interface, I'll give it a shot and see if I can make this an abstract class.

Contributor Author

Circled back, and yeah I think this does have to be an interface :/

Contributor

Yes, I think so as well. That just means I'll need to be a bit more careful with the review since it is more complicated than I thought. I should have time today, hopefully.

Member

Is it possible to design ReusableArrayData as a generic class, ReusableArrayData<T>?

Then FlinkReusableArrayData and SparkReusableArrayData would each have a private ReusableArrayData<Object> member, and both of them could delegate methods such as capacity(), setNumElements(), and getNumElements() to it.

Making ReusableArrayData an interface confuses me a bit; to me it should have a complete implementation, even if its element type is generic.
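A hypothetical sketch of the composition idea being proposed (invented names, not code from this PR): one concrete generic holder carries the buffer, and each engine-specific class wraps it and forwards calls.

// Hypothetical sketch of the delegation approach: a single concrete generic class
// owns the reusable buffer and its bookkeeping.
class ReusableArrayHolder<T> {
  private Object[] values = new Object[16];
  private int numElements = 0;

  int capacity() {
    return values.length;
  }

  void setNumElements(int newSize) {
    this.numElements = newSize;
  }

  int getNumElements() {
    return numElements;
  }

  void update(int pos, T value) {
    values[pos] = value;
  }

  @SuppressWarnings("unchecked")
  T get(int pos) {
    return (T) values[pos];
  }
}

// Each engine-specific class keeps whatever superclass/interface its engine requires
// and delegates the buffer bookkeeping to the holder.
class SparkStyleReusableArrayData {
  private final ReusableArrayHolder<Object> holder = new ReusableArrayHolder<>();

  public int numElements() {
    return holder.getNumElements();
  }
}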

Contributor Author

I don't think so, since we can't do multiple inheritance with a class.

    }
    keys().update(size(), key);
    values().update(size(), value);
    setNumElements(size() + 1);
Contributor

Does this need to be set every time a new pair is added? I think it would be better to set the valid size just once, before returning, to avoid the extra work.

@kbendick (Contributor) commented Oct 12, 2020

I might be misinterpreting your question/concern, but it seems like at least in SparkReusableMapData this is the case.

If I'm reading this correctly, addPair is overridden and setNumElements is only called in buildMap. Again, I might be misinterpreting the question/concern, but for a default implementation this seems fine, given that it is overridden on the Spark side and on the Flink side too (where there's a buildMap function).

private static class MapReader<K, V> extends RepeatedKeyValueReader<MapData, SparkReusableMapData, K, V> {
    // Reduced to the bare minimum ...
    @Override
    protected void addPair(SparkReusableMapData map, K key, V value) {
      if (writePos >= map.capacity()) {
        map.grow();
      }

      map.keys.values[writePos] = key;
      map.values.values[writePos] = value;

      writePos += 1;
    }

    @Override
    protected MapData buildMap(SparkReusableMapData map) {
      map.setNumElements(writePos);
      return map;
    }

Contributor

Yes, my point is that we don't need to set the number of valid elements until the end, just before MapData is returned.

Contributor Author

So yes, we could do that, but then we'd be keeping track of the write position in the MapReader, so this approach just moves the grow code into the base class. I can swap it back around, though.

Contributor Author

Oh right, because this is an interface we need to call a method anyway, since we can't have a concrete field in the interface.
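A minimal sketch of the point being made, assuming hypothetical names: a Java interface cannot hold instance fields, so a shared default method has to reach all state through accessor methods on the implementing class.

// Hypothetical sketch: the shared addPair logic can live in a default method,
// but every piece of state (capacity, size, the arrays) sits behind a method call.
interface ArrayBuffer {
  int capacity();
  void grow();
  void update(int pos, Object value);
}

interface MapBuffer {
  ArrayBuffer keys();
  ArrayBuffer values();
  int size();
  void setNumElements(int newSize);

  default void addPair(Object key, Object value) {
    if (size() >= keys().capacity()) {
      keys().grow();
      values().grow();
    }
    keys().update(size(), key);
    values().update(size(), value);
    setNumElements(size() + 1);
  }
}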

@@ -655,6 +653,11 @@ public int numElements() {
    return numElements;
  }

  @Override
  public int getNumElements() {
Contributor

Before, this was called numElements. Could you revert the name change?

Contributor

It looks like the function numElements is still there, just above this one

@kbendick (Contributor) commented Oct 10, 2020

It looks like the function numElements is still there (it's part of the catalyst abstract class), in the lines just above these ones.

But unless there's something going on with Spark, I agree with @rdblue.

I looked to see whether getNumElements was needed on the Flink side, since Flink strongly favors POJOs for performance reasons in its own serialization (which it can infer), and for style reasons it also tends to use POJO-style names even when they aren't strictly needed. But the Flink ArrayData interface uses size instead, so having the Iceberg ReusableArrayData interface use numElements seems like it would kill two birds with one stone by also fulfilling the catalyst contract.

But maybe there's something we're not seeing.

Contributor Author

Yeah, this is a good point. I'll simplify the interface so we don't have both numElements and getNumElements.

@kbendick (Contributor)

This is an important task, so thank you for taking it on, @holdenk.

@holdenk (Contributor Author) commented Oct 14, 2020

Thanks y'all for the reviews. Sorry I'm a little slow responding to this; I'm currently dealing with a race condition inside some new Spark code, but I'm hoping to circle back to this PR before the end of the week.

  protected RowData buildStruct(GenericRowData struct) {
    return struct;
  }

  private static class FlinkReusableMapData implements ReusableMapData, MapData {
Contributor

Was the order of classes changed? It looks like that may be why there are more changes in this diff.

Contributor Author

Let me see if I can minimize the changes in the diff this weekend.

    values().setNumElements(numElements);
  }

  int size();
Member

nit: how about making this method return keys().getNumElements() by default? Then we wouldn't have to implement it in the subclasses.
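For clarity, a sketch of the default-method shape being suggested (hypothetical names); as the reply below notes, this doesn't actually work because size() is already required by the Spark/Flink parent type.

// Hypothetical sketch of the suggestion: size() delegating to the keys array by default.
interface ArrayBuffer {
  int getNumElements();
}

interface MapBuffer {
  ArrayBuffer keys();

  default int size() {
    return keys().getNumElements();
  }
}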

Contributor Author

I tried that; it doesn't work since size is required by the Spark/Flink class/interface.

@holdenk force-pushed the refactor-ReusableMapData-in-spark-and-flink-parquet-readers-1331 branch from c62215b to d7c33c5 on December 16, 2020 19:41
@rdblue (Contributor) commented Dec 17, 2020

I just talked with @holdenk directly about this and we agreed that it probably isn't the right direction to go because the shared code needs to be carried by an interface instead of an abstract class. The interface can't handle its own state, so it ends up having way more method calls to get state from the child classes. That splits state that should be managed by a single class across multiple places (e.g. buffer growth) and could introduce unnecessary dispatch costs due to the calls.

We agreed that it is cleaner to have some code duplication instead, so I'll close this issue. Thanks for investigating it and working on the prototype, @holdenk!

@rdblue closed this Dec 17, 2020