[SPARK-24971][SQL] remove SupportsDeprecatedScanRow #21921
Conversation
Test build #93796 has finished for PR 21921 at commit
lgtm
thanks, merging to master!
@cloud-fan, I thought it was a requirement to have a committer +1 before merging. Or is this list of committers out of date?
@@ -91,7 +90,7 @@ class RateStreamContinuousReader(options: DataSourceOptions)
         i,
         numPartitions,
         perPartitionRate)
-        .asInstanceOf[InputPartition[Row]]
+        .asInstanceOf[InputPartition[InternalRow]]
Why is this cast necessary?
I didn't dig into it, as the cast was already there. The reason seems to be that `java.util.List` isn't covariant.
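A minimal sketch of the invariance issue, using simplified stand-in types rather than Spark's real classes:

```scala
import java.util.{List => JList}
import scala.collection.JavaConverters._

trait InputPartition[T]
class MyPartition extends InputPartition[String]

// java.util.List is invariant, so JList[MyPartition] is not a
// JList[InputPartition[String]], even though MyPartition extends
// InputPartition[String]:
// val bad: JList[InputPartition[String]] = Seq(new MyPartition).asJava  // does not compile

// Casting each element, as the diff above does, makes the types line up:
val viaCast: JList[InputPartition[String]] =
  Seq(new MyPartition().asInstanceOf[InputPartition[String]]).asJava
```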
I don't think it's a good idea to leave casts. Can you check to see if this can be avoided? I found in #21118 that many of the casts were unnecessary if variables had declared types and it is much better to avoid explicit casts that work around the type system.
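For illustration, a sketch (same stand-in types as above) of the declared-type alternative: giving the value an explicit element type lets inference widen each element, with no `asInstanceOf`:

```scala
import java.util.{List => JList}
import scala.collection.JavaConverters._

trait InputPartition[T]
class MyPartition extends InputPartition[String]

// The declared element type widens each element at compile time:
val partitions: Seq[InputPartition[String]] = Seq(new MyPartition)
val asJavaList: JList[InputPartition[String]] = partitions.asJava
```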
@@ -169,7 +170,7 @@ class RateStreamMicroBatchReader(options: DataSourceOptions, checkpointLocation:
     (0 until numPartitions).map { p =>
       new RateStreamMicroBatchInputPartition(
         p, numPartitions, rangeStart, rangeEnd, localStartTimeMs, relativeMsPerValue)
-        : InputPartition[Row]
+        : InputPartition[InternalRow]
Is this needed? Doesn't RateStreamMicroBatchInputPartition implement InputPartition[InternalRow]?
ditto
This is fine since it isn't a cast, but it's generally better to check whether these are still necessary after refactoring.
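A sketch of the distinction, again with stand-in types: the ascription only widens the static type at compile time, so unlike `asInstanceOf` nothing can fail at runtime:

```scala
import java.util.{List => JList}
import scala.collection.JavaConverters._

trait InputPartition[T]
class MyPartition extends InputPartition[String]

// `: InputPartition[String]` is a type ascription, not a cast; it has
// no runtime effect:
val viaAscription: JList[InputPartition[String]] =
  Seq(new MyPartition: InputPartition[String]).asJava
```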
@@ -121,17 +121,6 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
     }
   }

-  test("unsafe row scan implementation") {
-    Seq(classOf[UnsafeRowDataSourceV2], classOf[JavaUnsafeRowDataSourceV2]).foreach { cls =>
Why remove unsafe tests?
That's a follow-up of #21118: you removed `SupportsScanUnsafeRow` there, and then this test becomes meaningless.
Ok.
-  override def planRowInputPartitions(): JList[InputPartition[Row]] = {
-    val lowerBound = filters.collect {
+  override def planInputPartitions(): JList[InputPartition[InternalRow]] = {
+    val lowerBound = filters.collectFirst {
Nit: this is an unrelated change.
I agree, but this is really minor. When I changed the code nearby, the IDE showed a warning about not using `collectFirst` here, so I went for it.
Fine by me since it is so small, just wanted to point it out.
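For reference, a small standalone sketch of the difference: `collectFirst` stops at the first match and returns an `Option`, instead of building a whole collection:

```scala
val filters = Seq("b < 2", "a > 1", "a > 3")

// collect keeps every match; collectFirst returns only the first one:
val all: Seq[String]      = filters.collect { case f if f.startsWith("a") => f }      // Seq("a > 1", "a > 3")
val first: Option[String] = filters.collectFirst { case f if f.startsWith("a") => f } // Some("a > 1")
```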
   }
 }

-class UnsafeRowDataSourceV2 extends DataSourceV2 with ReadSupport {
These aren't Row implementations. Why remove them?
Same answer as above, I'm guessing.
This looks fine other than the possibly unnecessary cast.
@rdblue I vaguely remember that, if the PR author is himself a committer, we can merge a PR with one more LGTM from the community, as long as no one objects for several days. I'm sorry if that's not the case.
@cloud-fan To be safe, let us get one more LGTM from another committer.
retest this please
@cloud-fan, @gatorsmile, I'm fine with that if it's documented somewhere. I wasn't aware of that convention, and no one brought it up the last time I pointed out commits without a committer +1.
@rdblue I do not think it is documented. Let us be more conservative and collect LGTMs from committers no matter whether the PR author is a committer or not.
It sounds like GitHub is experiencing a very bad delay. @cloud-fan Could you submit a follow-up PR to address the comments from @rdblue?
addressed in #21948 (comment)
Yeah, I'd say that if it isn't documented then let's go with the usual RTC conventions.
Test build #93887 has finished for PR 21921 at commit
What changes were proposed in this pull request?

This is a follow-up of #21118.

In #21118 we added `SupportsDeprecatedScanRow`. Ideally a data source should produce `InternalRow` instead of `Row` for better performance. We should remove `SupportsDeprecatedScanRow` and encourage data sources to produce `InternalRow`, which is also very easy to build.

How was this patch tested?

Existing tests.
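For illustration, a minimal sketch of building an `InternalRow` directly; note that values must use Spark's internal representations (e.g. `UTF8String` for strings):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// InternalRow.apply wraps the given values in a generic row:
val row: InternalRow = InternalRow(1, UTF8String.fromString("hello"))
```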