
feat: Add CometRowToColumnar operator #206

Merged: 5 commits into apache:main on Apr 10, 2024

Conversation

advancedxy
Contributor

@advancedxy advancedxy commented Mar 13, 2024

Which issue does this PR close?

This closes #119 and partially resolves #137

Rationale for this change

To ease testing with the RangeExec operator in the short term.
In the long term, this PR introduces a general way to enable Comet for row-based source exec nodes.

What changes are included in this PR?

  1. Introduces CometRowToColumnarExec to transform Spark's InternalRow into ColumnarBatch (a rough sketch of the idea follows this list)
  2. Adds CometArrowConverters and its corresponding ArrowWriters
  3. Adds glue code to apply the RowToColumnar transition
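A rough, hedged sketch of the idea (not the PR's actual code: the real operator goes through Comet's shaded Arrow classes and its ArrowWriters and handles arbitrary schemas, whereas this sketch hard-codes a single non-nullable int column and uses vanilla Arrow imports):

```scala
import java.util.Collections

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{IntVector, VectorSchemaRoot}
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

object RowToColumnarSketch {
  // Buffer a batch of InternalRows into an Arrow vector and expose it to Spark as a
  // ColumnarBatch -- the same shape of work CometRowToColumnarExec performs per batch.
  def toBatch(rows: Seq[InternalRow]): ColumnarBatch = {
    val allocator = new RootAllocator(Long.MaxValue)
    val field = new Field(
      "i", FieldType.notNullable(new ArrowType.Int(32, true)), Collections.emptyList[Field]())
    val root = VectorSchemaRoot.create(new Schema(Collections.singletonList(field)), allocator)
    val vector = root.getVector("i").asInstanceOf[IntVector]
    vector.allocateNew(rows.size)
    rows.zipWithIndex.foreach { case (row, idx) => vector.set(idx, row.getInt(0)) }
    root.setRowCount(rows.size)
    // Wrap the Arrow vector so downstream columnar operators can consume it.
    new ColumnarBatch(Array[ColumnVector](new ArrowColumnVector(vector)), rows.size)
  }
}
```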

How are these changes tested?

  1. Added new tests; existing tests also exercise the changes
  2. Verified the plan transition in the Spark UI

@codecov-commenter

codecov-commenter commented Mar 13, 2024

Codecov Report

Attention: Patch coverage is 15.70513%, with 263 lines in your changes missing coverage. Please review.

Project coverage is 33.36%. Comparing base (aa6ddc5) to head (fb49a88).
Report is 1 commit behind head on main.

Files Patch % Lines
...spark/sql/comet/execution/arrow/ArrowWriters.scala 0.00% 198 Missing ⚠️
...l/comet/execution/arrow/CometArrowConverters.scala 0.00% 42 Missing ⚠️
.../scala/org/apache/spark/sql/comet/util/Utils.scala 0.00% 8 Missing ⚠️
...ain/scala/org/apache/comet/vector/NativeUtil.scala 0.00% 6 Missing ⚠️
...org/apache/comet/CometSparkSessionExtensions.scala 78.57% 0 Missing and 3 partials ⚠️
...pache/spark/sql/comet/CometRowToColumnarExec.scala 90.32% 1 Missing and 2 partials ⚠️
...n/scala/org/apache/spark/sql/comet/operators.scala 33.33% 1 Missing and 1 partial ⚠️
...n/scala/org/apache/comet/vector/StreamReader.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #206      +/-   ##
============================================
- Coverage     33.48%   33.36%   -0.13%     
- Complexity      776      791      +15     
============================================
  Files           108      111       +3     
  Lines         37178    37479     +301     
  Branches       8146     8192      +46     
============================================
+ Hits          12448    12503      +55     
- Misses        22107    22351     +244     
- Partials       2623     2625       +2     


@@ -78,6 +83,9 @@ object Utils {
case _: ArrowType.FixedSizeBinary => BinaryType
case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
case ts: ArrowType.Timestamp
Contributor Author

Looks like the TimestampNTZType case is missing.
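A hedged guess at what the missing branch could look like, modeled on Spark's ArrowUtils.fromArrowType; the exact guard Comet ends up using may differ:

```scala
import org.apache.arrow.vector.types.TimeUnit
import org.apache.arrow.vector.types.pojo.ArrowType
import org.apache.spark.sql.types.{DataType, TimestampNTZType, TimestampType}

// Map an Arrow timestamp to a Spark type: a missing timezone means the "no time zone" variant.
def timestampToSparkType(ts: ArrowType.Timestamp): DataType =
  if (ts.getUnit == TimeUnit.MICROSECOND && ts.getTimezone == null) TimestampNTZType
  else TimestampType
```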

@advancedxy advancedxy marked this pull request as ready for review March 15, 2024 14:30
@advancedxy advancedxy changed the title [WIP] feat: Add CometRowToColumnar operator feat: Add CometRowToColumnar operator Mar 15, 2024
import org.apache.spark.sql.errors.QueryExecutionErrors
import org.apache.spark.sql.types._

private[arrow] object ArrowWriter {
Contributor Author

This is mostly copied from Spark's side.
Since we are shading Arrow in Comet, we cannot use Spark's code directly.

Member

It'd be useful to add a comment noting which Spark class this is copied from, to help it be maintained in the future.

Contributor Author

Of course, let me add some comments.

@advancedxy
Contributor Author

Gently ping @sunchao and @viirya

@sunchao
Member

sunchao commented Mar 22, 2024

Sorry for the delay, @advancedxy. I'll try to take a look soon.

Member

@sunchao sunchao left a comment

Overall this looks good, I just have some minor comments so far.


common/src/main/scala/org/apache/comet/CometConf.scala (outdated review comment, resolved)
case (BinaryType, vector: LargeVarBinaryVector) => new LargeBinaryWriter(vector)
case (DateType, vector: DateDayVector) => new DateWriter(vector)
case (TimestampType, vector: TimeStampMicroTZVector) => new TimestampWriter(vector)
case (TimestampNTZType, vector: TimeStampMicroVector) => new TimestampNTZWriter(vector)
Member

I think this is not compatible with Spark 3.2?

Contributor Author

Hmmm, although TimestampNTZType was removed from Spark 3.2 by this PR: https://github.com/apache/spark/pull/33444, it is still possible to reference the TimestampNTZType type, which keeps the code a lot cleaner for later versions of Spark: we don't have to add version-specific source directories for Spark 3.2/3.3/3.4, etc.

Per my understanding, since Spark 3.2 will never produce a schema with TimestampNTZType, this pattern-match case is effectively a no-op there, while it takes effect for Spark 3.3 and Spark 3.4.

try {
if (!closed) {
if (currentBatch != null) {
arrowWriter.reset()
Member

Do we need to close arrowWriter too? For example, to close all the ValueVectors in the writer.

Contributor Author

There's no close method on arrowWriter. The ColumnarBatch should be closed to close all the ValueVectors, which is already achieved by root.close.
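A small hedged demo of that point, using vanilla (unshaded) Arrow classes: closing the VectorSchemaRoot releases the buffers of every ValueVector it owns, so the writer itself has nothing extra to close.

```scala
import java.util.Collections

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{IntVector, VectorSchemaRoot}
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema}

object RootCloseDemo extends App {
  val allocator = new RootAllocator(Long.MaxValue)
  val field = new Field(
    "i", FieldType.nullable(new ArrowType.Int(32, true)), Collections.emptyList[Field]())
  val root = VectorSchemaRoot.create(new Schema(Collections.singletonList(field)), allocator)
  root.getVector("i").asInstanceOf[IntVector].allocateNew(16)

  root.close() // frees the buffers of all child ValueVectors
  assert(allocator.getAllocatedMemory == 0)
  allocator.close() // would throw if any buffer were still allocated
}
```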

}

override def hasNext: Boolean = rowIter.hasNext || {
close(false)
Member

I wonder why we need to call close here and whether just calling close in the TaskCompletionListener is sufficient. Will this iterator be used again once it drains all the rows from the input iterator?

Contributor Author

I wonder why we need to call close here and whether just calling close in the TaskCompletionListener is sufficient.

It might not be sufficient to close only in the TaskCompletionListener, as the task might live much longer than the iterator. For example, if a task contains Range -> CometRowToColumnar -> Sort -> ShuffleWrite, the CometRowToColumnar operator will be drained much earlier than Sort or ShuffleWrite, so we need to close the iterator early to make sure there's no memory buffering (or leaking).

However, due to the comments in https://github.com/apache/arrow-datafusion-comet/pull/206/files#diff-04037044481f9a656275a63ebb6a3a63badf866f19700d4a6909d2e17c8d7b72R37-R46, we cannot close the allocator right after the iterator is consumed: the data has already been exported to the native side and might be dropped later than the iterator is consumed. So I added the allocator.close call to the task completion callback.

Member

I see, makes sense.
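For illustration, a hedged sketch of the lifecycle being described (class and member names are assumptions, not the PR's actual code): batch-level state is released as soon as the input rows are drained, while the allocator, whose buffers may still be referenced from the native side after export, is only closed when the task completes.

```scala
import org.apache.arrow.memory.{BufferAllocator, RootAllocator}
import org.apache.spark.TaskContext
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

class RowToColumnarIterSketch(rowIter: Iterator[InternalRow]) extends Iterator[ColumnarBatch] {
  private val allocator: BufferAllocator = new RootAllocator(Long.MaxValue)
  private var closed = false

  // Defer allocator.close() to task completion: exported Arrow buffers can outlive this iterator.
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener[Unit](_ => allocator.close())
  }

  override def hasNext: Boolean = rowIter.hasNext || {
    close() // input drained: release per-batch resources early, but keep the allocator open
    false
  }

  override def next(): ColumnarBatch =
    throw new UnsupportedOperationException("illustrative only; real code fills Arrow vectors here")

  private def close(): Unit = if (!closed) {
    // the real operator would close its VectorSchemaRoot / current ColumnarBatch here
    closed = true
  }
}
```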

}
new StructWriter(vector, children.toArray)
case (NullType, vector: NullVector) => new NullWriter(vector)
case (_: YearMonthIntervalType, vector: IntervalYearVector) =>
Member

I think these are not supported yet for dictionary vectors; see the minor-type check in CometDictionary.

Contributor Author

Hmm, I believe this ArrowWriter doesn't produce dictionary-encoded ColumnVectors.
But I do believe we should match how org.apache.spark.sql.comet.util.Utils#toArrowType maps Spark types to Arrow types. Let me comment out the following pattern-match cases.

@advancedxy
Contributor Author

@sunchao would you mind taking another look at this? I should have addressed most of your comments; please let me know if you have any others.

And sorry for the late update, I wasn't feeling well last week.

Member

@sunchao sunchao left a comment

Sorry for being late on the review, @advancedxy. Overall LGTM. Please rebase the PR.


"spark.comet.rowToColumnar.enabled")
.internal()
.doc("Whether to enable row to columnar conversion in Comet. When this is turned on, " +
"Comet will convert row-based operators in spark.comet.rowToColumnar.sourceNodeList into " +
Member

@viirya viirya Apr 8, 2024

Suggested change
"Comet will convert row-based operators in spark.comet.rowToColumnar.sourceNodeList into " +
"Comet will convert row-based data scan operators in spark.comet.rowToColumnar.sourceNodeList into " +

Contributor Author

@advancedxy advancedxy Apr 8, 2024

Hmm, actually CometRowToColumnar is general enough that it can be used to convert any row-based operator to a columnar one.
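For reference, a hedged usage example; the config keys come from this PR, but the node name in sourceNodeList is illustrative only, so check CometConf for the exact accepted values:

```scala
// Illustrative only: verify the accepted node names against CometConf.
spark.conf.set("spark.comet.rowToColumnar.enabled", "true")
spark.conf.set("spark.comet.rowToColumnar.sourceNodeList", "Range")
```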

@@ -238,6 +239,11 @@ class CometSparkSessionExtensions
val nativeOp = QueryPlanSerde.operator2Proto(op).get
CometScanWrapper(nativeOp, op)

case op if shouldApplyRowToColumnar(conf, op) =>
val cometOp = CometRowToColumnarExec(op)
val nativeOp = QueryPlanSerde.operator2Proto(cometOp).get
Member

Hmm, is the source always able to be converted to a Comet scan? If there are unsupported types, operator2Proto will return None.

Contributor Author

shouldApplyRowToColumnar already checks the output types of op; only supported types are allowed.
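A hedged sketch of the guard being described; the helper name, the configured-node check, and the set of supported types are placeholders for illustration rather than Comet's actual implementation:

```scala
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.types._

object RowToColumnarGuardSketch {
  def shouldApplyRowToColumnar(allowedNodes: Set[String], op: SparkPlan): Boolean = {
    def typeSupported(dt: DataType): Boolean = dt match {
      case BooleanType | ByteType | ShortType | IntegerType | LongType |
           FloatType | DoubleType | StringType | BinaryType | DateType | TimestampType => true
      case _ => false
    }
    // Only convert operators the user listed as row-based sources, and only when every output
    // attribute has a supported type -- otherwise operator2Proto could return None downstream.
    allowedNodes.contains(op.nodeName) && op.output.forall(attr => typeSupported(attr.dataType))
  }
}
```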

Comment on lines 587 to 607
// 2. Consecutive operators of CometRowToColumnarExec and ColumnarToRowExec, which might be
// possible for Comet to add a `CometRowToColumnarExec` for row-based operators first, then
// Spark only requests row-based output.
Member

@viirya viirya Apr 8, 2024

Do you actually mean:

Comet adds `CometRowToColumnarExec` on top of row-based data scan operators, but the
downstream operator is a Spark operator which takes row-based input. So Spark adds another
`ColumnarToRowExec` after `CometRowToColumnarExec`. In this case, we remove the pair of
`CometRowToColumnarExec` and `ColumnarToRowExec`.

Contributor Author

but the downstream operator is a Spark operator which takes row-based input

Hmm, this is another possibility; let me update the comment to include it. The case I described above is that Spark only requests row-based output at the end of the plan: the row-based requirement may be pushed down to the CometRowToColumnarExec, and then we end up with a pair of CometRowToColumnarExec and ColumnarToRowExec.

Contributor Author

I refined the comments; hopefully that clarifies things. Please let me know if you have any other comments.
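A hedged sketch of the simplification this thread discusses, assuming CometRowToColumnarExec is a unary case class (as its construction CometRowToColumnarExec(op) elsewhere in the diff suggests); the real rule lives in CometSparkSessionExtensions:

```scala
import org.apache.spark.sql.comet.CometRowToColumnarExec
import org.apache.spark.sql.execution.{ColumnarToRowExec, SparkPlan}

// A ColumnarToRowExec directly on top of a CometRowToColumnarExec is a row -> columnar -> row
// round trip, so the pair can be collapsed back to the original child plan.
def eliminateRedundantTransitions(plan: SparkPlan): SparkPlan = plan.transformUp {
  case ColumnarToRowExec(CometRowToColumnarExec(child)) => child
}
```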

@sunchao sunchao merged commit 60fe431 into apache:main Apr 10, 2024
28 checks passed
@sunchao
Member

sunchao commented Apr 10, 2024

Merged, thanks @advancedxy! If any comments from @viirya are not addressed, we can handle them in a separate PR. This PR has been open for too long :)

@advancedxy
Contributor Author

If any comments from @viirya are not addressed, we can do it in a separate PR.

Of course.

Thanks to @sunchao and @viirya for the review, really appreciate it.

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
Successfully merging this pull request may close these issues: Add support for InMemoryRelation; Add CometRowToColumnar operator.