Performance Regression: Backtraces in errors slow down planning time (Expensive backtraces) #7522

Closed
@crepererum

Description

DataFusion creates loads of errors even on the happy path. However, as of #7434 we gather a backtrace for each error, which is rather expensive. Here is a profile dump from a prod workload:

[Image: Screenshot from 2023-09-11 18-35-55]

In this workload, a LOT of time goes to bookkeeping (that's why I had the profiler running in the first place), but the backtraces alone account for 30% of the time. The plan looks like this:

logical:
Projection: ...
  TableScan: ...

physical:
ProjectionExec: ..
  CoalesceBatchesExec: target_batch_size=8192
    FilterExec: ...
      ParquetExec: ...

(I had to remove a good amount of detail due to data protection, but the filters / predicates are rather simple)

I think there are two paths forward:

  • do NOT generate backtraces for errors
  • do NOT use errors for the happy path but rather Option or some other enum
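To make the two options concrete, here is a minimal sketch in plain Rust. It is not DataFusion code: `Schema`, `SchemaError`, `find`, `index_of` and the `capture` flag are hypothetical names made up for illustration. The only real API used is `std::backtrace::Backtrace::force_capture` from the standard library. The idea is that a "not found" probe on the happy path returns an `Option` and never builds an error, and that an error, when one is really needed, only pays for a backtrace if the caller opts in.

```rust
use std::backtrace::Backtrace;
use std::collections::HashMap;
use std::fmt;

/// Hypothetical error type (NOT DataFusion's real error enum).
#[derive(Debug)]
struct SchemaError {
    field: String,
    /// Only populated when the caller explicitly asks for it.
    backtrace: Option<Backtrace>,
}

impl SchemaError {
    fn new(field: &str, capture: bool) -> Self {
        // Option 1: do not pay for stack unwinding unless requested.
        let backtrace = if capture {
            Some(Backtrace::force_capture())
        } else {
            None
        };
        Self { field: field.to_string(), backtrace }
    }
}

impl fmt::Display for SchemaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "field not found: {}", self.field)
    }
}

impl std::error::Error for SchemaError {}

/// Hypothetical stand-in for a schema that maps field names to indices.
struct Schema {
    fields: HashMap<String, usize>,
}

impl Schema {
    /// Option 2: a miss is an expected outcome during planning,
    /// so it is reported as `None` and no error is constructed.
    fn find(&self, name: &str) -> Option<usize> {
        self.fields.get(name).copied()
    }

    /// An error is only built at the boundary where a miss is a real failure.
    fn index_of(&self, name: &str, capture: bool) -> Result<usize, SchemaError> {
        self.find(name).ok_or_else(|| SchemaError::new(name, capture))
    }
}

/// Small helper so the example has something to query.
fn sample_schema() -> Schema {
    Schema {
        fields: HashMap::from([("a".to_string(), 0), ("b".to_string(), 1)]),
    }
}

fn main() {
    let schema = sample_schema();

    // Happy-path probing: try one name, fall back to another, no errors involved.
    let idx = schema.find("missing").or_else(|| schema.find("a"));
    println!("probe result: {idx:?}");

    // A genuine failure, reported without capturing a backtrace.
    if let Err(e) = schema.index_of("missing", false) {
        println!("{e} (backtrace captured: {})", e.backtrace.is_some());
    }
}
```

The two options are not mutually exclusive: the `Option`-returning probe removes error construction from the hot path, while the opt-in flag keeps backtraces available for debugging real failures.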

Labels: bug
