Spark iceberg runtime - predicate pushdown in parquet reader #12428

Open
@nateagr

Description

Hello!

After migrating some of our parquet tables (in Hive) to Iceberg (still parquet), I've noticed that reading the new Iceberg tables with Spark is much slower (at least 4x) than reading the original parquet tables. I've been trying to understand this slowdown, and it seems that Iceberg doesn't push predicates down to the parquet reader. I've written a unit test that reads one of our new Iceberg tables with Spark, and I always see the NoOp row group filter in the parquet reader. However, when reading one of our original parquet tables, I see a row group filter that actually filters row groups based on statistics, dictionaries, etc.
Is my understanding correct? If so, since I've read many times that Iceberg supports predicate pushdown, at what point is it applied? After the parquet files have been read?
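To make the distinction concrete, here is a minimal Python sketch (not Iceberg's or Spark's actual code) of what statistics-based row-group filtering does, compared with a NoOp filter that reads everything. The `RowGroup` class, its min/max values, and the greater-than predicate are made up for illustration; real parquet readers use per-column min/max statistics stored in the file footer in the same way.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    # Per-column min/max recorded in the parquet footer (illustrative values)
    min_value: int
    max_value: int

def prune_row_groups(groups, predicate_gt):
    """Keep only row groups that could contain rows with value > predicate_gt.

    A group whose max is <= the predicate bound cannot match and is skipped,
    so the reader never touches its pages on disk.
    """
    return [g for g in groups if g.max_value > predicate_gt]

groups = [RowGroup(0, 9), RowGroup(10, 19), RowGroup(20, 29)]

# Statistics-based filtering for "value > 10" skips the first group:
print(len(prune_row_groups(groups, 10)))  # -> 2

# A NoOp filter ignores the predicate and reads every group:
print(len(groups))  # -> 3
```

When the NoOp filter is used, the predicate is still applied to the rows after they are decoded, so query results stay correct, but the I/O and decoding savings from skipping whole row groups are lost.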

@RussellSpitzer, I'm pinging you since you've answered several questions about predicate pushdown in Iceberg.

Metadata
Labels: question (Further information is requested)
