Skip to content

[SPARK-27650][SQL] separate the row iterator functionality from ColumnarBatch #24546

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

ColumnarBatch is user-facing, we should expose as fewer details as possible.

ColumnarBatch is used to carry the data from data source to Spark. Accessing the data in a row-wise style is not its responsibility.

This PR creates ColumnarBatchRowView, and moves the row iterator functionality from ColumnarBatch to ColumnarBatchRowView.

This PR avoids referring to internal classes(MutableColumnarRow) in ColumnarBatch, so that we can move DS v2 APIs to catalyst module later in #24416

How was this patch tested?

existing tests

@cloud-fan
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented May 7, 2019

Test build #105214 has finished for PR 24546 at commit 2145dc8.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class ColumnarBatchRowView

@cloud-fan cloud-fan force-pushed the vector branch 2 times, most recently from ef4088a to 6fcc10c Compare May 7, 2019 14:10
@SparkQA
Copy link

SparkQA commented May 7, 2019

Test build #105218 has finished for PR 24546 at commit 6fcc10c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class ColumnarBatchRowView

@SparkQA
Copy link

SparkQA commented May 7, 2019

Test build #105227 has finished for PR 24546 at commit 8eed58c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class ColumnarBatchRowView

@rdblue
Copy link
Contributor

rdblue commented May 7, 2019

Is this necessary to make #24416 possible? If so, why?

@cloud-fan
Copy link
Contributor Author

@rdblue As I mentioned above, this PR avoids referring to internal classes(MutableColumnarRow) in ColumnarBatch. MutableColumnarRow refers to other internal classes, without this PR, we need to move many internal classes to catalyst package.

@kiszk
Copy link
Member

kiszk commented May 8, 2019

I agree with this direction regarding the separation. Can we make columnarBatch as a public class like ColumnVector?

@cloud-fan
Copy link
Contributor Author

ColumnarBatch is a public class. It's in the same package as ColumnVector.

* batch so that Spark can access the data row by row. Instance of it is meant to be reused during
* the entire data loading process.
* This class wraps multiple {@link ColumnVector}s as a table-like data batch. Instance of it is
* meant to be reused during the entire data loading process.
*/
@Evolving
public final class ColumnarBatch {
private int numRows;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it still proper to carry this info here now? Row-wise access isn't at ColumnarBatch anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark needs to know the row count to read the columnar data.

ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.mllib.feature.IDF#DocumentFrequencyAggregator.idf")
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.mllib.feature.IDF#DocumentFrequencyAggregator.idf"),

// [SPARK-27650][SQL] sepate the row iterator functionality from ColumnarBatch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit typo: sepate -> separate

@cloud-fan cloud-fan changed the title [SPARK-27650][SQL] sepate the row iterator functionality from ColumnarBatch [SPARK-27650][SQL] separate the row iterator functionality from ColumnarBatch May 8, 2019
@SparkQA
Copy link

SparkQA commented May 8, 2019

Test build #105256 has finished for PR 24546 at commit acd0aaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Copy link
Contributor

rdblue commented May 8, 2019

@cloud-fan, I was asking for a more thorough explanation. The PR description just says that this avoids referring to MutableColumnarRow in the new class, not that it changes some structure needed to avoid that reference. Can you explain how it is structured today that is a problem and what this changes?

@kiszk
Copy link
Member

kiszk commented May 8, 2019

@cloud-fan Sorry for my mistake. I wanted to say public API instead of public class like ColumnVector.

// Staging row returned from `getRow`.
private final MutableColumnarRow row;

public ColumnarBatchRowView(ColumnarBatch batch) {
Copy link
Member

@kiszk kiszk May 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to create another constructor not to allocate MutableColumnarRow as an optimization?
This is because most of the use cases are to immediately call rowIterator() that obviously never calls getRow().

@cloud-fan
Copy link
Contributor Author

cloud-fan commented May 9, 2019

The PR description just says that this avoids referring to MutableColumnarRow in the new class

This avoids referring to MutableColumnarRow in the old class(ColumnarBatch), so that ColumnarBatch does not refer to any internal classes and can be moved to the catalyst package. The related functionality that needs MutableColumnarRow is moved the new class ColumnarBatchRowView , and the new class is internal. @rdblue please let me know if you need further explanation.

@kiszk The responsibility of ColumnarBatch is just to carry the columnar data to Spark. We can make it an interface, but I'd image all the people will have very similar implementations. I think it's better to keep it as a class, and users can use it directly.

@rdblue
Copy link
Contributor

rdblue commented May 9, 2019

@cloud-fan, I don't think that separating the iterator functionality from ColumnarBatch is the right approach.

For implementations to actually use the columnar API in practice, this iterator is really useful. For example, sources need to build tests to validate batches and those tests need a way to read through a ColumnarBatch. Using InternalRow to access and validate each row makes sense, and it is better if implementations can use the same code that Spark would use to produce the rows. The iterator itself doesn't need to be removed because it uses only public (Iterator) or effectively public (InternalRow) classes.

I think it would be better to either use a different InternalRow implementation (that is read-only to avoid depending on WritableColumnVector), or to move MutableColumnarRow but mark it private and continue to use it as the concrete implementation of InternalRow.

I don't see a good reason to remove useful functionality from ColumnarBatch just to keep an implementation class in a different module.

@cloud-fan
Copy link
Contributor Author

@rdblue that's a good point. I checked the code base, MutableColumnarRow does need to be mutable, when it's used in HashAggregateExec. But the row returned by ColumnarBatch.rowIterator doesn't need to be mutable. Let me create a special row to replace MutableColumnarRow in ColumnarBatch.

@cloud-fan
Copy link
Contributor Author

I've created #24581, please take a look, thanks!

mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019
## What changes were proposed in this pull request?

To move DS v2 API to the catalyst module, we can't refer to an internal class (`MutableColumnarRow`) in `ColumnarBatch`.

This PR creates a read-only version of `MutableColumnarRow`, and use it in `ColumnarBatch`.

close apache#24546

## How was this patch tested?

existing tests

Closes apache#24581 from cloud-fan/mutable-row.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants