Skip to content

Use Row Format in SortExec #7053

Open
Open
@tustvold

Description

Is your feature request related to a problem or challenge?

Currently SortExec::sort_batch_stream uses lexsort_to_indices to sort the produced RecordBatch. For multi-column sorts this makes use of LexicographicalComparator. The branching and dynamic dispatch involved in this comparator is relatively expensive. Converting to the row format first, and comparing these rows has been found to offer significant performance advantages in similar applications - #3386.

Describe the solution you'd like

SortExec should:

  • If single sort column, use sort_to_indices to sort the input batches
  • If multiple columns, convert to the row format and sort using this representation
  • If performing a subsequent merge, preserve the row encoding to avoid redundant work

Describe alternatives you've considered

No response

Additional context

This is likely not a good first issue, and I do not recommend people pick it up, creating primarily for tracking purposes. I will likely pick it up at some point in the near-ish future.

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions