
chore: Override node name for CometSparkToColumnar #958

Open · wants to merge 3 commits into main

Conversation

@JensonChoi commented Sep 21, 2024

Which issue does this PR close?

Closes #936

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Added a unit test for row input. I'm still looking for a Spark query that would force the use of the CometSparkColumnarToColumnar node name.
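
For reference, a minimal sketch of the kind of override this PR describes, assuming the operator class is CometSparkToColumnarExec and that the child plan's supportsColumnar flag distinguishes row input from columnar input (an illustration, not the exact diff):

    // Report a different node name depending on whether the child feeds
    // this transition Spark rows or Spark columnar batches.
    override def nodeName: String =
      if (child.supportsColumnar) "CometSparkColumnarToColumnar"
      else "CometSparkRowToColumnar"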

@JensonChoi marked this pull request as draft September 21, 2024 21:52
@JensonChoi changed the title from "[WIP] CometSparkToColumnar override node name for row vs columnar input" to "chore: Override node name for CometSparkToColumnar" Sep 23, 2024
@JensonChoi marked this pull request as ready for review September 29, 2024 23:06
@JensonChoi (Author)

@andygrove can I get a review please? Thank you in advance.

@andygrove (Member)

> @andygrove can I get a review please? Thank you in advance.

Thanks for the PR @JensonChoi. The other change that will need to be made is to update the golden files for the tests that check for expected plans. You can find more information in the contributor guide: https://datafusion.apache.org/comet/contributor-guide/development.html#plan-stability-testing

@JensonChoi (Author)

@andygrove Following our discussions on Discord, I ran the following commands to update the golden files.

make clean; make release PROFILES="-Pspark-3.4"
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.4 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.4 -nsu test

make clean; make release PROFILES="-Pspark-3.5"
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.5 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.5 -nsu test

make clean; make release PROFILES="-Pspark-4.0"
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-4.0 -nsu test
SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-4.0 -nsu test

The odd thing is that no file change was detected by git even though the commands ran successfully. Is it possible that updating the node name is not something that would be picked up by stability testing? Thank you in advance.

@andygrove (Member)

> The odd thing is that no file change was detected by git even though the commands ran successfully. Is it possible that updating the node name is not something that would be picked up by stability testing?

It is possible that the golden files got written to a different location if you have SPARK_HOME set. Could you try unsetting that env var if you have it set? If you don't have it set, perhaps add logging to see where the files are being written?

@JensonChoi (Author)

SPARK_HOME is currently set to /home/jenson/datafusion-comet, which is my local clone of the repo. Does that sound about right? I will try the logging route you suggested to see where the files are being written. Thanks.

@JensonChoi (Author) commented Oct 28, 2024

@andygrove I reran the following commands without the SPARK_GENERATE_GOLDEN_FILES=1 environment variable:

make clean; make release PROFILES="-Pspark-3.4"
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.4 -nsu test
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.4 -nsu test

make clean; make release PROFILES="-Pspark-3.5"
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-3.5 -nsu test
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-3.5 -nsu test

make clean; make release PROFILES="-Pspark-4.0"
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite" -Pspark-4.0 -nsu test
./mvnw -pl spark -Dsuites="org.apache.spark.sql.comet.CometTPCDSV2_7_PlanStabilitySuite" -Pspark-4.0 -nsu test

They all passed. In addition, I briefly looked at the stability test suites here, and it appears that none of them contains the CometSparkToColumnar node. What should the next steps be?

@andygrove (Member)

> They all passed. In addition, I briefly looked at the stability test suites here, and it appears that none of them contains the CometSparkToColumnar node. What should the next steps be?

Thanks for investigating this @JensonChoi. The golden files do not contain any RowToColumnar transitions, so this explains why they did not need updating.

Code context from the new unit test:

      c.nodeName
    }
    assert(nodeNames.length == 1)
    assert(nodeNames.head == "CometSparkRowToColumnar")
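
For orientation, the fragment above plausibly sits inside a test shaped roughly like this (a hedged reconstruction; the class name CometSparkToColumnarExec and the input DataFrame are assumptions, not quoted from the PR):

    // An in-memory DataFrame arrives as Spark rows, so the transition
    // node should report the row variant of the name.
    val df = spark.range(1000).toDF("id")
    val nodeNames = df.queryExecution.executedPlan.collect {
      case c: CometSparkToColumnarExec => c.nodeName
    }
    assert(nodeNames.length == 1)
    assert(nodeNames.head == "CometSparkRowToColumnar")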
@andygrove (Member)

Could you also add a test that will generate a plan that uses CometSparkColumnarToColumnar so that we are testing both cases?

I think you could have a copy of this test that writes the DataFrame to a Parquet file and then reads it back with the following configs. This will use Spark's vectorized Parquet reader, which returns Spark columnar batches.

        withSQLConf( // opener reconstructed; the quoted fragment ended with ") {", matching Spark's standard withSQLConf test helper
          SQLConf.USE_V1_SOURCE_LIST.key -> "",
          CometConf.COMET_NATIVE_SCAN_ENABLED.key -> "false",
          CometConf.COMET_CONVERT_FROM_PARQUET_ENABLED.key -> "true") {
          // ... read the Parquet data back and assert on the node name ...
        }
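
A minimal sketch of such a test, assuming the withTempPath/withSQLConf helpers from the suite's test base and the operator class CometSparkToColumnarExec (these names are assumptions; the configs are the ones quoted above):

    test("CometSparkColumnarToColumnar node name") {
      withTempPath { dir =>
        // Write a DataFrame out as Parquet first.
        spark.range(1000).toDF("id").write.parquet(dir.toString)
        withSQLConf(
          SQLConf.USE_V1_SOURCE_LIST.key -> "",
          CometConf.COMET_NATIVE_SCAN_ENABLED.key -> "false",
          CometConf.COMET_CONVERT_FROM_PARQUET_ENABLED.key -> "true") {
          // Spark's vectorized Parquet reader returns columnar batches,
          // so the transition node should report the columnar variant.
          val df = spark.read.parquet(dir.toString)
          val nodeNames = df.queryExecution.executedPlan.collect {
            case c: CometSparkToColumnarExec => c.nodeName
          }
          assert(nodeNames.length == 1)
          assert(nodeNames.head == "CometSparkColumnarToColumnar")
        }
      }
    }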

@JensonChoi (Author)

Hey @andygrove, I'm a bit stuck on the unit test for CometSparkColumnarToColumnar. I pushed a commit that contains what I've been working on, so you can take a look. However, I'm getting this error when I run the unit test:

  Cause: java.lang.ClassCastException: class org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to class org.apache.spark.sql.catalyst.InternalRow (org.apache.spark.sql.vectorized.ColumnarBatch and org.apache.spark.sql.catalyst.InternalRow are in unnamed module of loader 'app')
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:389)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
  at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
  at org.apache.spark.scheduler.Task.run(Task.scala:139)

Would appreciate any help. Thank you.

@andygrove (Member)

Hi @JensonChoi. I will look at this today.

Linked issue #936: CometSparkToColumnar should have different name for row vs columnar input