Orc performance issue #630

@richox

Describe the bug
we found that orc scan has poor performance while running the tpcds benchmark:

the same scan operator is multiple times slower on orc than on parquet (from tpcds q3).

To Reproduce
Steps to reproduce the behavior:

  1. generate parquet and orc datasets using /tpcds/datagen.
  2. run benchmarks on both datasets using /tpcds/benchmark-runner.
  3. compare the performance of NativeParquetScan and NativeOrcScan.

Expected behavior
orc should have performance similar to parquet.

Screenshots
image

Edit
the main reason is that orc-rust reads all data without column pruning or predicate filtering; after applying column pruning with datafusion-contrib/datafusion-orc#133, the performance is much better:

image
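
to illustrate why column pruning matters for a scan like this, here is a minimal toy sketch. it is not orc-rust or datafusion code: `make_file`, `scan`, and the `projection` parameter are hypothetical names, and the "file" is just a dict of in-memory lists. it only counts how many values a reader would have to decode with and without a projection.

```python
# Toy model of a columnar file: each column is stored separately, so a
# reader can choose which columns to decode.
# All names here are hypothetical; none of this is an orc-rust or
# DataFusion API.

def make_file(num_rows, column_names):
    """Build a fake columnar file as {column name: list of values}."""
    return {
        name: [f"{name}-{i}" for i in range(num_rows)]
        for name in column_names
    }

def scan(file, projection=None):
    """Read the file and return (columns, number of values decoded).

    projection=None models a reader without column pruning: every
    column is decoded even if the query only needs a few of them.
    """
    wanted = list(file) if projection is None else projection
    decoded = 0
    columns = {}
    for name in wanted:
        column = file[name]
        columns[name] = list(column)  # stands in for decoding the column
        decoded += len(column)
    return columns, decoded

# A table with 20 columns, of which the query needs only 3.
file = make_file(1000, [f"c{i}" for i in range(20)])

_, decoded_full = scan(file)                        # no pruning
_, decoded_pruned = scan(file, ["c0", "c3", "c7"])  # with pruning

print(decoded_full, decoded_pruned)  # 20000 3000
```

in this toy model the decode work shrinks in proportion to the number of projected columns. real gains depend on column widths, encodings, and io, so the numbers above are only illustrative.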

currently orc is still 20%~30% slower than parquet, which may be related to predicate filtering not being supported yet.
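
for context on what predicate filtering could save, below is a minimal sketch of skipping stripes by min/max statistics. the orc format does store such statistics, but everything in the code (`make_stripes`, `scan_eq`, the stripe layout) is a hypothetical toy, not the orc-rust api, and the data is assumed to be sorted so that skipping is effective. whether this explains the remaining 20%~30% gap is not established here.

```python
# Toy model of predicate filtering using per-stripe min/max statistics.
# All names are hypothetical; this is not how orc-rust is implemented.

def make_stripes(num_stripes, rows_per_stripe):
    """Build stripes of consecutive integers, each with min/max stats."""
    stripes = []
    for s in range(num_stripes):
        start = s * rows_per_stripe
        values = list(range(start, start + rows_per_stripe))
        stripes.append({"min": values[0], "max": values[-1], "values": values})
    return stripes

def scan_eq(stripes, target, use_stats):
    """Find rows equal to target; return (matches, values decoded).

    With use_stats=True, a stripe whose [min, max] range cannot contain
    the target is skipped without decoding its values.
    """
    matches = []
    decoded = 0
    for stripe in stripes:
        if use_stats and not (stripe["min"] <= target <= stripe["max"]):
            continue  # statistics prove no row in this stripe can match
        decoded += len(stripe["values"])
        matches.extend(v for v in stripe["values"] if v == target)
    return matches, decoded

stripes = make_stripes(10, 1000)

matches_all, decoded_all = scan_eq(stripes, 4242, use_stats=False)
matches_skip, decoded_skip = scan_eq(stripes, 4242, use_stats=True)

print(matches_all, decoded_all)    # [4242] 10000
print(matches_skip, decoded_skip)  # [4242] 1000
```

both scans return the same rows; only the amount of decoded data differs. on unsorted data the min/max ranges overlap more and fewer stripes can be skipped.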
