
chore: Override node name for CometSparkToColumnar #958

Open · wants to merge 3 commits into main

Conversation

@JensonChoi commented Sep 21, 2024

Which issue does this PR close?

Closes #936

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Added a unit test for row input. I'm still looking for a Spark query that would force the use of the CometSparkColumnarToColumnar node name.
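
For reference, a minimal sketch of the kind of override this PR describes, assuming the operator class is CometSparkToColumnarExec and that the child plan's supportsColumnar flag distinguishes row input from columnar input (an illustration, not the exact diff):

    // Report a different node name depending on whether the child feeds
    // this transition Spark rows or Spark columnar batches.
    override def nodeName: String =
      if (child.supportsColumnar) "CometSparkColumnarToColumnar"
      else "CometSparkRowToColumnar"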

@JensonChoi marked this pull request as draft September 21, 2024 21:52
@JensonChoi changed the title from "[WIP] CometSparkToColumnar override node name for row vs columnar input" to "chore: Override node name for CometSparkToColumnar" Sep 23, 2024
@JensonChoi marked this pull request as ready for review September 29, 2024 23:06
@JensonChoi (Author)

@andygrove can I get a review please? Thank you in advance.

@andygrove (Member)

> @andygrove can I get a review please? Thank you in advance.

Thanks for the PR @JensonChoi. The other change that will need to be made is to update the golden files for the tests that check for expected plans. You can find more information in the contributor guide: https://datafusion.apache.org/comet/contributor-guide/development.html#plan-stability-testing

@JensonChoi (Author)

@andygrove Following our discussions on Discord, I ran the following commands to update the golden files.

make clean; make release PROFILES="-Pspark-3.4"
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.4 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.4 -nsu test

make clean; make release PROFILES="-Pspark-3.5"
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.5 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.5 -nsu test

make clean; make release PROFILES="-Pspark-4.0"
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-4.0 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-4.0 -nsu test

The odd thing is that no file change was detected by git even though the commands ran successfully. Is it possible that updating the node name is not something that would be picked up by stability testing? Thank you in advance.

@andygrove (Member)

> The odd thing is that no file change was detected by git even though the commands ran successfully. Is it possible that updating the node name is not something that would be picked up by stability testing?

It is possible that the golden files got written to a different location if you have SPARK_HOME set. Could you try unsetting that env var if you have it set? If you don't have it set, perhaps add logging to see where the files are being written?

@JensonChoi (Author)

SPARK_HOME is currently set to /home/jenson/datafusion-comet, which is my local clone of the repo. Does that sound about right? I will try the logging route you suggested to see where the files are being written. Thanks.

@JensonChoi (Author) commented Oct 28, 2024

@andygrove I reran the following commands without the SPARK_GENERATE_GOLDEN_FILES=1 environment variable:

make clean; make release PROFILES="-Pspark-3.4"
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.4 -nsu test
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.4 -nsu test

make clean; make release PROFILES="-Pspark-3.5"
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.5 -nsu test
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.5 -nsu test

make clean; make release PROFILES="-Pspark-4.0"
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-4.0 -nsu test
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-4.0 -nsu test

They all passed. In addition, I briefly looked at the stability test suites here, and it appears that none of them contains the CometSparkToColumnar node. What should the next steps be?

@andygrove (Member)

> They all passed. In addition, I briefly looked at the stability test suites here, and it appears that none of them contains the CometSparkToColumnar node. What should the next steps be?

Thanks for investigating this @JensonChoi. The golden files do not contain any RowToColumnar transitions, so this explains why they did not need updating.

Code context from the new unit test:

      c.nodeName
    }
    assert(nodeNames.length == 1)
    assert(nodeNames.head == "CometSparkRowToColumnar")
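
For orientation, the fragment above plausibly sits inside a test shaped roughly like this (a hedged reconstruction; the class name CometSparkToColumnarExec and the input DataFrame are assumptions, not quoted from the PR):

    // An in-memory DataFrame arrives as Spark rows, so the transition
    // node should report the row variant of the name.
    val df = spark.range(1000).toDF("id")
    val nodeNames = df.queryExecution.executedPlan.collect {
      case c: CometSparkToColumnarExec => c.nodeName
    }
    assert(nodeNames.length == 1)
    assert(nodeNames.head == "CometSparkRowToColumnar")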
@andygrove (Member)

Could you also add a test that will generate a plan that uses CometSparkColumnarToColumnar so that we are testing both cases?

I think you could have a copy of this test that writes the DataFrame to a Parquet file and then reads it back with the following configs. This will use Spark's vectorized Parquet reader, which returns Spark columnar batches.

        withSQLConf( // opener reconstructed; the quoted fragment ended with ") {", matching Spark's standard withSQLConf test helper
          SQLConf.USE_V1_SOURCE_LIST.key -> "",
          CometConf.COMET_NATIVE_SCAN_ENABLED.key -> "false",
          CometConf.COMET_CONVERT_FROM_PARQUET_ENABLED.key -> "true") {
          // ... read the Parquet data back and assert on the node name ...
        }
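
A minimal sketch of such a test, assuming the withTempPath/withSQLConf helpers from the suite's test base and the operator class CometSparkToColumnarExec (these names are assumptions; the configs are the ones quoted above):

    test("CometSparkColumnarToColumnar node name") {
      withTempPath { dir =>
        // Write a DataFrame out as Parquet first.
        spark.range(1000).toDF("id").write.parquet(dir.toString)
        withSQLConf(
          SQLConf.USE_V1_SOURCE_LIST.key -> "",
          CometConf.COMET_NATIVE_SCAN_ENABLED.key -> "false",
          CometConf.COMET_CONVERT_FROM_PARQUET_ENABLED.key -> "true") {
          // Spark's vectorized Parquet reader returns columnar batches,
          // so the transition node should report the columnar variant.
          val df = spark.read.parquet(dir.toString)
          val nodeNames = df.queryExecution.executedPlan.collect {
            case c: CometSparkToColumnarExec => c.nodeName
          }
          assert(nodeNames.length == 1)
          assert(nodeNames.head == "CometSparkColumnarToColumnar")
        }
      }
    }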

@JensonChoi (Author)

Hey @andygrove, I'm a bit stuck on the unit test for CometSparkColumnarToColumnar. I pushed a commit that contains what I've been working on, so you can take a look. However, I'm getting this error when I run the unit test:

  Cause: java.lang.ClassCastException: class org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to class org.apache.spark.sql.catalyst.InternalRow (org.apache.spark.sql.vectorized.ColumnarBatch and org.apache.spark.sql.catalyst.InternalRow are in unnamed module of loader 'app')
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:389)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
  at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
  at org.apache.spark.scheduler.Task.run(Task.scala:139)

Would appreciate any help. Thank you.

@andygrove (Member)

Hi @JensonChoi. I will look at this today.

Linked issue #936: CometSparkToColumnar should have different name for row vs columnar input