-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-27650][SQL] separate the row iterator functionality from ColumnarBatch #24546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #105214 has finished for PR 24546 at commit
|
ef4088a
to
6fcc10c
Compare
Test build #105218 has finished for PR 24546 at commit
|
Test build #105227 has finished for PR 24546 at commit
|
Is this necessary to make #24416 possible? If so, why? |
@rdblue As I mentioned above, this PR avoids referring to internal classes( |
I agree with this direction regarding the separation. Can we make |
|
* batch so that Spark can access the data row by row. Instance of it is meant to be reused during | ||
* the entire data loading process. | ||
* This class wraps multiple {@link ColumnVector}s as a table-like data batch. Instance of it is | ||
* meant to be reused during the entire data loading process. | ||
*/ | ||
@Evolving | ||
public final class ColumnarBatch { | ||
private int numRows; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it still proper to carry this info here now? Row-wise access isn't at ColumnarBatch
anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark needs to know the row count to read the columnar data.
project/MimaExcludes.scala
Outdated
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.mllib.feature.IDF#DocumentFrequencyAggregator.idf") | ||
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.mllib.feature.IDF#DocumentFrequencyAggregator.idf"), | ||
|
||
// [SPARK-27650][SQL] sepate the row iterator functionality from ColumnarBatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit typo: sepate -> separate
Test build #105256 has finished for PR 24546 at commit
|
@cloud-fan, I was asking for a more thorough explanation. The PR description just says that this avoids referring to |
@cloud-fan Sorry for my mistake. I wanted to say |
// Staging row returned from `getRow`. | ||
private final MutableColumnarRow row; | ||
|
||
public ColumnarBatchRowView(ColumnarBatch batch) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to create another constructor not to allocate MutableColumnarRow
as an optimization?
This is because most of the use cases are to immediately call rowIterator()
that obviously never calls getRow()
.
This avoids referring to @kiszk The responsibility of |
@cloud-fan, I don't think that separating the iterator functionality from For implementations to actually use the columnar API in practice, this iterator is really useful. For example, sources need to build tests to validate batches and those tests need a way to read through a I think it would be better to either use a different I don't see a good reason to remove useful functionality from |
@rdblue that's a good point. I checked the code base, |
I've created #24581, please take a look, thanks! |
## What changes were proposed in this pull request? To move DS v2 API to the catalyst module, we can't refer to an internal class (`MutableColumnarRow`) in `ColumnarBatch`. This PR creates a read-only version of `MutableColumnarRow`, and use it in `ColumnarBatch`. close apache#24546 ## How was this patch tested? existing tests Closes apache#24581 from cloud-fan/mutable-row. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?
ColumnarBatch
is user-facing, we should expose as fewer details as possible.ColumnarBatch
is used to carry the data from data source to Spark. Accessing the data in a row-wise style is not its responsibility.This PR creates
ColumnarBatchRowView
, and moves the row iterator functionality fromColumnarBatch
toColumnarBatchRowView
.This PR avoids referring to internal classes(MutableColumnarRow) in
ColumnarBatch
, so that we can move DS v2 APIs to catalyst module later in #24416How was this patch tested?
existing tests