[SPARK-27650][SQL] separate the row iterator functionality from ColumnarBatch #24546

cloud-fan · 2019-05-07T10:46:09Z

What changes were proposed in this pull request?

ColumnarBatch is user-facing, so we should expose as few details as possible.

ColumnarBatch is used to carry the data from data source to Spark. Accessing the data in a row-wise style is not its responsibility.

This PR creates ColumnarBatchRowView, and moves the row iterator functionality from ColumnarBatch to ColumnarBatchRowView.

This PR avoids referring to internal classes (MutableColumnarRow) in ColumnarBatch, so that we can move the DS v2 APIs to the catalyst module later in #24416.
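The split described above can be sketched roughly as follows. This is a minimal illustration with simplified stand-in classes (`IntColumn`, `Batch`, `BatchRowView` are hypothetical names, not Spark's actual `ColumnVector`, `ColumnarBatch`, or `ColumnarBatchRowView`), and rows are modeled as plain `int[]` arrays rather than `InternalRow`.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical, simplified stand-ins; not Spark's real classes.
final class IntColumn {
    private final int[] values;
    IntColumn(int... values) { this.values = values; }
    int getInt(int rowId) { return values[rowId]; }
}

// Carrier only: holds the columns and the row count, no row-wise access.
final class Batch {
    private final IntColumn[] columns;
    private final int numRows;

    Batch(int numRows, IntColumn... columns) {
        this.numRows = numRows;
        this.columns = columns;
    }

    int numRows() { return numRows; }
    int numCols() { return columns.length; }
    IntColumn column(int ordinal) { return columns[ordinal]; }
}

// Row-wise access lives in a separate view that wraps the batch.
final class BatchRowView {
    private final Batch batch;

    BatchRowView(Batch batch) { this.batch = batch; }

    Iterator<int[]> rowIterator() {
        return new Iterator<int[]>() {
            private int rowId = 0;

            @Override public boolean hasNext() { return rowId < batch.numRows(); }

            @Override public int[] next() {
                if (!hasNext()) throw new NoSuchElementException();
                // Materialize one row by reading each column at the current row id.
                int[] row = new int[batch.numCols()];
                for (int c = 0; c < row.length; c++) {
                    row[c] = batch.column(c).getInt(rowId);
                }
                rowId++;
                return row;
            }
        };
    }
}
```

In this sketch the carrier class has no dependency on any row class, which is the property the description is after; the cost is that callers who want rows must construct the view themselves.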

How was this patch tested?

existing tests

cloud-fan · 2019-05-07T10:46:54Z

cc @ueshin @rdblue @gatorsmile @gengliangwang

SparkQA · 2019-05-07T10:58:16Z

Test build #105214 has finished for PR 24546 at commit 2145dc8.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public final class ColumnarBatchRowView

SparkQA · 2019-05-07T16:00:16Z

Test build #105218 has finished for PR 24546 at commit 6fcc10c.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public final class ColumnarBatchRowView

SparkQA · 2019-05-07T20:51:17Z

Test build #105227 has finished for PR 24546 at commit 8eed58c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public final class ColumnarBatchRowView

rdblue · 2019-05-07T21:11:57Z

Is this necessary to make #24416 possible? If so, why?

cloud-fan · 2019-05-08T02:41:33Z

@rdblue As I mentioned above, this PR avoids referring to internal classes (MutableColumnarRow) in ColumnarBatch. MutableColumnarRow refers to other internal classes, so without this PR we would need to move many internal classes to the catalyst package.

kiszk · 2019-05-08T06:20:19Z

I agree with this direction regarding the separation. Can we make ColumnarBatch a public class like ColumnVector?

cloud-fan · 2019-05-08T06:31:17Z

ColumnarBatch is a public class. It's in the same package as ColumnVector.

viirya · 2019-05-08T09:55:34Z

sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java

- * batch so that Spark can access the data row by row. Instance of it is meant to be reused during
- * the entire data loading process.
+ * This class wraps multiple {@link ColumnVector}s as a table-like data batch. Instance of it is
+ * meant to be reused during the entire data loading process.
 */
 @Evolving
 public final class ColumnarBatch {
  private int numRows;


Is it still proper to carry this info here now? Row-wise access isn't at ColumnarBatch anymore.

Spark needs to know the row count to read the columnar data.

viirya · 2019-05-08T10:02:29Z

project/MimaExcludes.scala

-    ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.mllib.feature.IDF#DocumentFrequencyAggregator.idf")
+    ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.mllib.feature.IDF#DocumentFrequencyAggregator.idf"),
+
+    // [SPARK-27650][SQL] sepate the row iterator functionality from ColumnarBatch


nit typo: sepate -> separate

SparkQA · 2019-05-08T14:33:07Z

Test build #105256 has finished for PR 24546 at commit acd0aaa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2019-05-08T15:57:54Z

@cloud-fan, I was asking for a more thorough explanation. The PR description just says that this avoids referring to MutableColumnarRow in the new class, not that it changes some structure needed to avoid that reference. Can you explain how it is structured today that is a problem and what this changes?

kiszk · 2019-05-08T19:27:41Z

@cloud-fan Sorry for my mistake. I wanted to say public API instead of public class like ColumnVector.

kiszk · 2019-05-08T19:33:35Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatchRowView.java

+  // Staging row returned from `getRow`.
+  private final MutableColumnarRow row;
+
+  public ColumnarBatchRowView(ColumnarBatch batch) {


Would it make sense to add another constructor that does not allocate MutableColumnarRow, as an optimization?
Most use cases immediately call rowIterator(), which never calls getRow().
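One way to get the effect asked about here, without adding a second constructor, is to allocate the staging row lazily on the first `getRow` call. The sketch below uses hypothetical stand-in classes (`StagingRow`, `LazyRowView`), not the classes in this PR, and swaps the suggested extra constructor for lazy initialization; the `allocations` counter exists only to make the behavior observable.

```java
// Hypothetical, simplified stand-ins; not Spark's real classes.
final class StagingRow {
    static int allocations = 0;   // counts constructions, for illustration only
    int rowId;

    StagingRow() { allocations++; }
}

final class LazyRowView {
    private final int numRows;
    private StagingRow row;       // stays null until getRow is first called

    LazyRowView(int numRows) { this.numRows = numRows; }

    StagingRow getRow(int rowId) {
        if (rowId < 0 || rowId >= numRows) {
            throw new IndexOutOfBoundsException("rowId: " + rowId);
        }
        if (row == null) {
            row = new StagingRow();   // allocate once, then reuse
        }
        row.rowId = rowId;
        return row;
    }
}
```

The trade-off in this variant is that the field can no longer be `final`, and every `getRow` call pays for a null check.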

cloud-fan · 2019-05-09T02:52:45Z

> The PR description just says that this avoids referring to MutableColumnarRow in the new class

This avoids referring to MutableColumnarRow in the old class (ColumnarBatch), so that ColumnarBatch does not refer to any internal classes and can be moved to the catalyst package. The related functionality that needs MutableColumnarRow is moved to the new class ColumnarBatchRowView, and the new class is internal. @rdblue please let me know if you need further explanation.

@kiszk The responsibility of ColumnarBatch is just to carry the columnar data to Spark. We can make it an interface, but I'd imagine most people would have very similar implementations. I think it's better to keep it as a class, so users can use it directly.

rdblue · 2019-05-09T23:29:40Z

@cloud-fan, I don't think that separating the iterator functionality from ColumnarBatch is the right approach.

For implementations to actually use the columnar API in practice, this iterator is really useful. For example, sources need to build tests to validate batches and those tests need a way to read through a ColumnarBatch. Using InternalRow to access and validate each row makes sense, and it is better if implementations can use the same code that Spark would use to produce the rows. The iterator itself doesn't need to be removed because it uses only public (Iterator) or effectively public (InternalRow) classes.

I think it would be better to either use a different InternalRow implementation (that is read-only to avoid depending on WritableColumnVector), or to move MutableColumnarRow but mark it private and continue to use it as the concrete implementation of InternalRow.

I don't see a good reason to remove useful functionality from ColumnarBatch just to keep an implementation class in a different module.

cloud-fan · 2019-05-10T14:10:36Z

@rdblue that's a good point. I checked the code base: MutableColumnarRow does need to be mutable when it's used in HashAggregateExec, but the row returned by ColumnarBatch.rowIterator doesn't need to be mutable. Let me create a special row to replace MutableColumnarRow in ColumnarBatch.
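The distinction drawn here can be sketched as two row types over the same column data: a read-only row that depends only on a read interface, and a mutable row that additionally needs writable columns. All names below (`ReadableColumn`, `WritableColumn`, `ReadOnlyRow`, `MutableRow`) are hypothetical simplifications, not the classes eventually added to Spark.

```java
// Hypothetical, simplified stand-ins; not Spark's real classes.
interface ReadableColumn {
    int getInt(int rowId);
}

// Writable columns add mutation; only the mutable row depends on this type.
final class WritableColumn implements ReadableColumn {
    private final int[] values;

    WritableColumn(int capacity) { this.values = new int[capacity]; }

    @Override public int getInt(int rowId) { return values[rowId]; }
    void putInt(int rowId, int value) { values[rowId] = value; }
}

// Read-only row: enough for iterating over a batch, depends only on ReadableColumn.
final class ReadOnlyRow {
    private final ReadableColumn[] columns;
    int rowId;

    ReadOnlyRow(ReadableColumn[] columns) { this.columns = columns; }

    int getInt(int ordinal) { return columns[ordinal].getInt(rowId); }
}

// Mutable row: needs writable columns, e.g. when used as an aggregation buffer.
final class MutableRow {
    private final WritableColumn[] columns;
    int rowId;

    MutableRow(WritableColumn[] columns) { this.columns = columns; }

    int getInt(int ordinal) { return columns[ordinal].getInt(rowId); }
    void setInt(int ordinal, int value) { columns[ordinal].putInt(rowId, value); }
}
```

Because `ReadOnlyRow` never touches `WritableColumn`, a batch class that returns it from its iterator would not pull the writable column hierarchy along with it in this model.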

cloud-fan · 2019-05-10T14:41:54Z

I've created #24581, please take a look, thanks!

## What changes were proposed in this pull request?

To move DS v2 API to the catalyst module, we can't refer to an internal class (`MutableColumnarRow`) in `ColumnarBatch`. This PR creates a read-only version of `MutableColumnarRow`, and use it in `ColumnarBatch`.

close apache#24546

## How was this patch tested?

existing tests

Closes apache#24581 from cloud-fan/mutable-row.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

cloud-fan force-pushed the vector branch 2 times, most recently from ef4088a to 6fcc10c Compare May 7, 2019 14:10

sepate the row iterator functionality from ColumnarBatch

8eed58c

cloud-fan force-pushed the vector branch from 6fcc10c to 8eed58c Compare May 7, 2019 18:21

viirya reviewed May 8, 2019


cloud-fan changed the title ~~[SPARK-27650][SQL] sepate the row iterator functionality from ColumnarBatch~~ [SPARK-27650][SQL] separate the row iterator functionality from ColumnarBatch May 8, 2019

address comment

acd0aaa

kiszk reviewed May 8, 2019


cloud-fan mentioned this pull request May 10, 2019

[SPARK-27675][SQL] do not use MutableColumnarRow in ColumnarBatch #24581

Closed

HyukjinKwon closed this in 9ff77b1 May 12, 2019

GulajavaMinistudio mentioned this pull request May 12, 2019

fixed issue where tests GulajavaMinistudio/spark#576

Merged
