chore: Override node name for CometSparkToColumnar #958
base: main
Conversation
@andygrove can I get a review please? Thank you in advance.
Thanks for the PR @JensonChoi. The other change that will need to be made is to update the golden files for the tests that check for expected plans. You can find more information in the contributor guide: https://datafusion.apache.org/comet/contributor-guide/development.html#plan-stability-testing
@andygrove Following our discussions on Discord, I ran the following commands to update the golden files.
The odd thing is that no file change was detected by git even though the commands ran successfully. Is it possible that updating the node name is not something that would be picked up by stability testing? Thank you in advance.
It is possible that the golden files got written to a different location if you have
@andygrove I reran the following commands without the
And they all passed. In addition, I briefly looked at the stability test suites here, and it appears none of them contains the node CometSparkToColumnar.
Thanks for investigating this @JensonChoi. The golden files do not contain any CometSparkToColumnar nodes.
    c.nodeName
}
assert(nodeNames.length == 1)
assert(nodeNames.head == "CometSparkRowToColumnar")
Could you also add a test that will generate a plan that uses CometSparkColumnarToColumnar so that we are testing both cases?
I think you could have a copy of this test that writes the dataframe to a Parquet file and then reads the Parquet file back with the following configs. This will use Spark's vectorized Parquet reader which returns Spark columns.
SQLConf.USE_V1_SOURCE_LIST.key -> "",
CometConf.COMET_NATIVE_SCAN_ENABLED.key -> "false",
CometConf.COMET_CONVERT_FROM_PARQUET_ENABLED.key -> "true") {
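
Fleshing that suggestion out, here is a minimal sketch of what the round-trip test could look like. The operator class name CometSparkToColumnarExec, the test helpers withSQLConf and withTempPath, and the single-node assertion are assumptions for illustration, not code from this PR:

import org.apache.comet.CometConf
import org.apache.spark.sql.comet.CometSparkToColumnarExec
import org.apache.spark.sql.internal.SQLConf

test("CometSparkToColumnarExec reports CometSparkColumnarToColumnar") {
  withSQLConf(
    SQLConf.USE_V1_SOURCE_LIST.key -> "",
    CometConf.COMET_NATIVE_SCAN_ENABLED.key -> "false",
    CometConf.COMET_CONVERT_FROM_PARQUET_ENABLED.key -> "true") {
    withTempPath { dir =>
      // Write with Spark, then read back through Spark's vectorized
      // Parquet reader, which produces Spark columnar batches.
      spark.range(1000).toDF("id").write.parquet(dir.toString)
      val df = spark.read.parquet(dir.toString)
      df.collect() // force execution so the final plan is materialized
      val nodeNames = df.queryExecution.executedPlan.collect {
        case c: CometSparkToColumnarExec => c.nodeName
      }
      // With AQE enabled, the plan may need stripAQEPlan(...) first.
      assert(nodeNames.nonEmpty)
      assert(nodeNames.head == "CometSparkColumnarToColumnar")
    }
  }
}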
Hey @andygrove, I'm a bit stuck on the unit test for CometSparkColumnarToColumnar. I pushed a commit that contains what I've been working on, so you can take a look. However, I'm getting this error when I run the unit test:
Cause: java.lang.ClassCastException: class org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to class org.apache.spark.sql.catalyst.InternalRow (org.apache.spark.sql.vectorized.ColumnarBatch and org.apache.spark.sql.catalyst.InternalRow are in unnamed module of loader 'app')
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:389)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
Would appreciate any help. Thank you.
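
For context on the trace: SparkPlan.getByteArrayRdd (visible in the stack) drives the plan through the row-based execute() path, which must yield an RDD[InternalRow]. The cast fails when an operator on that path actually emits ColumnarBatch, typically because supportsColumnar disagrees with what the operator really produces, or because a ColumnarToRow transition is missing. A hypothetical diagnostic (not from this PR) for spotting the mismatch:

import org.apache.spark.sql.execution.SparkPlan

// Hypothetical helper: print which operators claim columnar output so a
// missing ColumnarToRow transition is visible before collect() fails.
def dumpColumnarFlags(plan: SparkPlan): Unit =
  plan.foreach(p => println(s"${p.nodeName}: supportsColumnar=${p.supportsColumnar}"))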
Hi @JensonChoi. I will look at this today.
Which issue does this PR close?
Closes #936
Rationale for this change
What changes are included in this PR?
How are these changes tested?
Added a unit test for row input. Still looking for a Spark query that would force the use of the CometSparkColumnarToColumnar node name.
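
For context, a minimal sketch of the naming rule the title and tests imply. The conditional is an inference from the two node names asserted above, not the actual diff:

import org.apache.spark.sql.execution.SparkPlan

// Inferred rule (illustrative only): report the more specific name based
// on whether the child plan produces columnar batches or rows.
def sparkToColumnarNodeName(child: SparkPlan): String =
  if (child.supportsColumnar) "CometSparkColumnarToColumnar"
  else "CometSparkRowToColumnar"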