Skip to content

Conversation

@parthchandra
Copy link
Contributor

outputPartitioning for CometScanExec was incorrectly returning 0 causing unit tests to fail

@parthchandra
Copy link
Contributor Author

@viirya could you review?
Without this unit tests were failing with Can't zip RDDs with unequal numbers of partitions:... (in CometNativeExec.doExecuteColumnar)


override lazy val (outputPartitioning, outputOrdering): (Partitioning, Seq[SortOrder]) =
(wrapped.outputPartitioning, wrapped.outputOrdering)
(UnknownPartitioning(wrapped.inputRDD.getNumPartitions), wrapped.outputOrdering)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the output partitioning is when it fails the test? I assume that the outputPartitioning should have correct partition number but not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were seeing zero partitions. Here is some debug output from testing:

firstNonBroadcastPlan = Some((CometScan parquet [_1#6,_2#7] Batched: true, DataFilters: [isnotnull(_1#6)], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-a21845bf-72dc-49d1-94f0-f785bb6f2f18], PartitionFilters: [], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:int,_2:int>
,0))

firstNonBroadcastPlanNumPartitions = Some(0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. wrapped.outputPartitioning was zero

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is somehow weird. It means the outputPartitioning info is not correct in original ScanExec and not matched the output RDD's partition number. 🤔


override lazy val (outputPartitioning, outputOrdering): (Partitioning, Seq[SortOrder]) =
(wrapped.outputPartitioning, wrapped.outputOrdering)
(UnknownPartitioning(wrapped.inputRDD.getNumPartitions), wrapped.outputOrdering)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I think we should keep original partition instead a hard-coded UnknownPartitioning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this based on what we do in CometNativeScanExec as well as the original output partition. The original wrapped.outputPartitioning for CometScanExec inherits from FileSourceScanLike and this returns a hardcoded UnknownPartitioning(0) for non-bucketed scan.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

@andygrove andygrove merged commit b63570b into apache:comet-parquet-exec Dec 11, 2024
14 of 75 checks passed
andygrove added a commit that referenced this pull request Dec 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants