Orc performance issue

**Describe the bug**
We found that the ORC scan performs poorly when running the TPC-DS benchmark:


The same scan operator is multiple times slower on ORC than on Parquet (measured on TPC-DS q3; see the screenshot below).

**To Reproduce**
Steps to reproduce the behavior:
1. Generate Parquet and ORC datasets using /tpcds/datagen.
2. Run benchmarks on both datasets using /tpcds/benchmark-runner.
3. Compare the performance of NativeParquetScan and NativeOrcScan.

**Expected behavior**
ORC should perform similarly to Parquet.

**Screenshots**
<img width="157" alt="image" src="https://github.com/user-attachments/assets/403a520f-9829-46c6-ad49-58ac33070797">

**Edit**
The main reason is that orc-rust reads all data, without column pruning or predicate filtering. After applying column pruning with https://github.com/datafusion-contrib/datafusion-orc/pull/133 , performance is much better:

<img width="158" alt="image" src="https://github.com/user-attachments/assets/f55702b2-0bef-4152-8910-b29ea4fbe49e">

Currently ORC is still 20%~30% slower than Parquet, which may be related to predicate filtering not being supported yet.
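
To illustrate why these two optimizations matter, here is a minimal, self-contained sketch of column pruning and statistics-based predicate filtering on a toy in-memory layout. It does not use orc-rust or datafusion-orc; `Stripe`, `scan`, `sample_stripes`, and the column names are all hypothetical, and the min/max values are computed on the fly to stand in for the statistics a real ORC file stores alongside each stripe.

```rust
/// Toy stand-in for one ORC stripe (hypothetical type, not orc-rust's).
struct Stripe {
    columns: Vec<(String, Vec<i64>)>,
}

impl Stripe {
    /// Look up a column by name.
    fn column(&self, name: &str) -> Option<&Vec<i64>> {
        self.columns
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .map(|(_, v)| v)
    }

    /// Min/max of a column, standing in for stored stripe statistics.
    fn min_max(&self, name: &str) -> Option<(i64, i64)> {
        let values = self.column(name)?;
        Some((*values.iter().min()?, *values.iter().max()?))
    }
}

struct ScanResult {
    rows: Vec<Vec<i64>>,
    /// How many values had to be decoded; a rough proxy for scan cost.
    values_decoded: usize,
}

/// Scan with an optional equality filter `(column, value)`.
/// Assumes every requested column exists in every stripe.
fn scan(stripes: &[Stripe], projection: &[&str], filter: Option<(&str, i64)>) -> ScanResult {
    let mut result = ScanResult { rows: Vec::new(), values_decoded: 0 };
    for stripe in stripes {
        // Predicate filtering: skip the stripe if its statistics rule out a match.
        if let Some((name, target)) = filter {
            match stripe.min_max(name) {
                Some((min, max)) if min <= target && target <= max => {}
                _ => continue,
            }
        }
        // Column pruning: decode only the projected columns (plus the filter column).
        let mut needed: Vec<&str> = projection.to_vec();
        if let Some((name, _)) = filter {
            if !needed.contains(&name) {
                needed.push(name);
            }
        }
        let decoded: Vec<&Vec<i64>> = needed
            .iter()
            .map(|n| stripe.column(*n).expect("column exists"))
            .collect();
        result.values_decoded += decoded.iter().map(|c| c.len()).sum::<usize>();

        let row_count = decoded.first().map_or(0, |c| c.len());
        for i in 0..row_count {
            let keep = match filter {
                Some((name, target)) => {
                    let idx = needed.iter().position(|n| *n == name).unwrap();
                    decoded[idx][i] == target
                }
                None => true,
            };
            if keep {
                // Emit only the projected columns, not the extra filter column.
                result.rows.push(decoded[..projection.len()].iter().map(|c| c[i]).collect());
            }
        }
    }
    result
}

/// Two small stripes with made-up data.
fn sample_stripes() -> Vec<Stripe> {
    let make = |cols: Vec<(&str, Vec<i64>)>| Stripe {
        columns: cols.into_iter().map(|(n, v)| (n.to_string(), v)).collect(),
    };
    vec![
        make(vec![("id", vec![1, 2, 3]), ("year", vec![1998, 1998, 1999]), ("price", vec![10, 20, 30])]),
        make(vec![("id", vec![4, 5, 6]), ("year", vec![2001, 2002, 2002]), ("price", vec![40, 50, 60])]),
    ]
}

fn main() {
    let stripes = sample_stripes();
    let full = scan(&stripes, &["id", "year", "price"], None);
    let pruned = scan(&stripes, &["price"], Some(("year", 1999)));
    println!("full scan decoded {} values", full.values_decoded);
    println!("pruned + filtered scan decoded {} values", pruned.values_decoded);
}
```

In this toy setup the full scan decodes 18 values, while the pruned and filtered scan decodes 6: the second stripe is skipped outright because its `year` range cannot contain 1999, and the `id` column is never touched. The real gap between ORC and Parquet depends on the actual readers and data, so this only shows the mechanism, not the measured numbers above.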

Orc performance issue #630

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Orc performance issue #630

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions