Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for pushdown like filter (endsWith and contains) #9683

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

yabola
Copy link
Contributor

@yabola yabola commented Feb 8, 2024

This PR supports pushing down endsWith (like %x) and contains (like %x%) to Iceberg. The benefits are:

  1. Support for early filtering of partition columns in partitioned tables, before this PR iceberg needs to scan the entire table.
  2. Support for pushdown agg under certain scenarios (before endWiths or contains could not be pushed down, so it could not pushdown agg).
  3. Support for filtering files using Parquet dictionaries for regular columns in tables.

Before this PR, iceberg only support startWith.

@yabola yabola changed the title Support pushdown like filter (endsWith and contains) Support for pushdown like filter (endsWith and contains) Feb 8, 2024
@yabola
Copy link
Contributor Author

yabola commented Feb 8, 2024

I have an example of performance comparison.
Table : p_lineorder_ice has partition column LO_ORDERDATE.

  1. pushdown partition column
    test sql: select * from ice.ssb10.p_lineorder_ice where LO_ORDERDATE like '%01'
    before this pr:
image

after this pr:
image

  1. pushdown agg column
    test sql: select count(1) from ice.ssb10.p_lineorder_ice where LO_ORDERDATE like '%01';
    before this PR:
image

after this PR:
image

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yabola This is very cool, could we break up the PR though for easier review? Let me take a deeper look at the code before I propose a way to break it up into separate PRs. On the surface though it seems like we should be able to break apart the expression changes in API/Core and then have separate PRs for the file formats and Spark integration?

@amogh-jahagirdar
Copy link
Contributor

amogh-jahagirdar commented Feb 8, 2024

Also, I'll need to think more if we can actually support this for delete files. If not, this will need to only be applied for CoW tables. For example, for agg pushdown, we don't support MoR tables (well unless it's compacted)

@yabola
Copy link
Contributor Author

yabola commented Feb 9, 2024

@yabola This is very cool, could we break up the PR though for easier review? Let me take a deeper look at the code before I propose a way to break it up into separate PRs. On the surface though it seems like we should be able to break apart the expression changes in API/Core and then have separate PRs for the file formats and Spark integration?

I will break up my PR. Thanks for your advice.

@yabola
Copy link
Contributor Author

yabola commented Feb 11, 2024

Also, I'll need to think more if we can actually support this for delete files. If not, this will need to only be applied for CoW tables. For example, for agg pushdown, we don't support MoR tables (well unless it's compacted)

@amogh-jahagirdar I checked the code. When there are delete rows, it won't pushdown agg and I have tested it.

if (!task.deletes().isEmpty()) {
LOG.info("Skipping aggregate pushdown: detected row level deletes");
return false;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants