@adriangb
Contributor

@adriangb adriangb commented Jun 18, 2025

Penultimate part of #7955.

This implements pushdown of self filters for HashJoinExec, similar to what DuckDB does.

We plan to follow up with pushdown of a reference to the hash table itself which will benefit joins where there is no order to the join condition (e.g. UUIDs).
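
For readers following along, here is a minimal sketch of the idea, assuming the pushed-down filter takes the form of min/max bounds on the join key collected from the build side. The names `KeyBounds`, `build_side_bounds`, and `prune_probe_keys` are hypothetical and are not the types used in this PR.

```rust
/// Inclusive min/max of the build-side join keys (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct KeyBounds {
    pub min: i64,
    pub max: i64,
}

/// Compute the bounds once the build side has been fully collected.
/// Returns `None` when the build side is empty.
pub fn build_side_bounds(build_keys: &[i64]) -> Option<KeyBounds> {
    let min = *build_keys.iter().min()?;
    let max = *build_keys.iter().max()?;
    Some(KeyBounds { min, max })
}

/// Apply `key >= min AND key <= max` to probe-side keys before the join.
/// With an empty build side no probe row can match an inner join,
/// so everything is pruned.
pub fn prune_probe_keys(probe_keys: &[i64], bounds: Option<KeyBounds>) -> Vec<i64> {
    match bounds {
        Some(b) => probe_keys
            .iter()
            .copied()
            .filter(|k| (b.min..=b.max).contains(k))
            .collect(),
        None => Vec::new(),
    }
}
```

The filter is conservative: a probe key inside the bounds may still have no match, so the join itself still has to do the exact comparison.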

@github-actions github-actions bot added core Core DataFusion crate physical-plan Changes to the physical-plan crate labels Jun 18, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 18, 2025
@Dandandan
Contributor

I think we should also consider a heuristic for not evaluating the filter if it's not useful.

Also, I think doing only the lookup is preferable to also computing / checking the bounds; I think the latter might create more overhead.

@Dandandan Dandandan closed this Jun 18, 2025
@Dandandan Dandandan reopened this Jun 18, 2025
@Dandandan
Contributor

Sorry, misclicked a button.

@adriangb
Contributor Author

adriangb commented Jun 18, 2025

> I think doing only the lookup is preferable to also computing / checking the bounds; I think the latter might create more overhead

My thought was that for some cases the bounds checks are going to be quite effective at pruning and they should always be cheap to compute and cheap to apply. I'm surprised you say that they might create a lot of overhead?

@Dandandan
Contributor

Dandandan commented Jun 18, 2025

> > I think doing only the lookup is preferable to also computing / checking the bounds; I think the latter might create more overhead
>
> My thought was that for some cases the bounds checks are going to be quite effective at pruning and they should always be cheap to compute and cheap to apply. I'm surprised you say that they might create a lot of overhead?

Maybe I should articulate it a bit more.

  • If we are only filtering out based on statistics, min/max might make sense to quickly filter out large chunks of rows.
  • If we are filtering on values (e.g. filter pushdown), I think it makes sense to filter only on the shared hashmap and not bother with the min/max values. Creating hashes and doing a single table lookup is quite fast, so I think we want to avoid also evaluating the min/max expression (at least for all rows).

I think it also makes sense to think about a heuristic so that we use this pushdown only when we think it might be useful, e.g. the left side is much smaller than the right side, or we know (based on column statistics) that it will filter out rows.
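
To make the first suggestion concrete, a heuristic of this kind could be sketched as below. This is an illustration only: `JoinEstimates`, `should_push_dynamic_filter`, the `min_ratio` threshold, and the `when_unknown` fallback are all hypothetical names and knobs, not something that exists in DataFusion.

```rust
/// Hypothetical row-count estimates; either side may be unknown.
#[derive(Debug, Clone, Copy)]
pub struct JoinEstimates {
    pub build_rows: Option<u64>,
    pub probe_rows: Option<u64>,
}

/// Illustrative rule: push the dynamic filter only when the probe side is
/// at least `min_ratio` times larger than the build side. When either
/// estimate is missing, fall back to `when_unknown`.
pub fn should_push_dynamic_filter(
    est: JoinEstimates,
    min_ratio: u64,
    when_unknown: bool,
) -> bool {
    match (est.build_rows, est.probe_rows) {
        // saturating_mul avoids overflow for very large build sides.
        (Some(build), Some(probe)) => build.saturating_mul(min_ratio) <= probe,
        _ => when_unknown,
    }
}
```

The `when_unknown` fallback matters in practice, because the rule is only as good as the estimates that feed it.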

@adriangb
Contributor Author

> I think it makes sense to filter only on the shared hashmap and not bother with the min/max values. Creating hashes and doing a single table lookup is quite fast, so I think we want to avoid also evaluating the min/max expression (at least for all rows)

I'm surprised that the hash table lookup, even if O(1), has such a small constant factor that it costs about as much as a couple of binary comparisons. That said, a reason to still do both is stats and filter caching: simple filters like col >= 123 and col <= 456 can be used for stats pruning and can easily be cached (for example, for indexing based on filter caching). So even if performance is not strictly better, there is still something to be said for including a simple filter in addition to the hash table lookup.

@xudong963 xudong963 self-requested a review June 19, 2025 10:14
@adriangb
Contributor Author

> I think it also makes sense to think about a heuristic so that we use this pushdown only when we think it might be useful, e.g. the left side is much smaller than the right side, or we know (based on column statistics) that it will filter out rows

DataFusion is generally not great at these things: we often don't have enough stats / info to make decisions like this.

@Dandandan
Contributor

> > I think it makes sense to filter only on the shared hashmap and not bother with the min/max values. Creating hashes and doing a single table lookup is quite fast, so I think we want to avoid also evaluating the min/max expression (at least for all rows)
>
> I'm surprised that the hash table lookup, even if O(1), has such a small constant factor that it costs about as much as a couple of binary comparisons. That said, a reason to still do both is stats and filter caching: simple filters like col >= 123 and col <= 456 can be used for stats pruning and can easily be cached (for example, for indexing based on filter caching). So even if performance is not strictly better, there is still something to be said for including a simple filter in addition to the hash table lookup.

It's hard to say generally, but a hashtable lookup which fits into cache on a u64 key can be really fast.

@adriangb
Contributor Author

> It's hard to say generally, but a hashtable lookup which fits into cache on a u64 key can be really fast.

I guess only benchmarks can tell. But I still think the scalar bounds are worth keeping for stats pruning reasons.
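
A rough way to compare the three options outside of DataFusion is sketched below, assuming u64 keys. It uses `std::collections::HashSet` (SipHash by default), which is not the hash table DataFusion uses, so absolute timings from it would not transfer; the helper names are hypothetical.

```rust
use std::collections::HashSet;
use std::time::{Duration, Instant};

/// Rows that pass `min <= key <= max` (two comparisons per row).
pub fn count_in_bounds(probe: &[u64], min: u64, max: u64) -> usize {
    probe.iter().filter(|&&k| k >= min && k <= max).count()
}

/// Rows whose key is present in the build-side set (one hash and one lookup per row).
pub fn count_in_set(probe: &[u64], build: &HashSet<u64>) -> usize {
    probe.iter().filter(|&k| build.contains(k)).count()
}

/// Bounds check first; the hash lookup only runs for rows that survive it.
pub fn count_bounds_then_set(
    probe: &[u64],
    min: u64,
    max: u64,
    build: &HashSet<u64>,
) -> usize {
    probe
        .iter()
        .filter(|&&k| k >= min && k <= max && build.contains(&k))
        .count()
}

/// Run `f` once and return its result with the wall-clock time it took.
pub fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}
```

Wrapping each counter in `timed` over the same probe slice gives a first impression. A real comparison would need a proper benchmark harness, realistic key distributions, and a build side both inside and outside of cache.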

Member

@xudong963 xudong963 left a comment

Do we have any metrics to record how much data is filtered by the dynamic join filter?

}

#[tokio::test]
async fn test_hashjoin_dynamic_filter_pushdown() {
Member

Can we add some tests for multiple joins? Such as

Join (t1.a = t2.b)
/        \
t1    Join(t2.c = t3.d)
        /    \
       t3   t2

Member

Such a test can check that:

  1. dynamic filters are pushed down to the right scan node
  2. dynamic filters aren't missed during pushdown

Contributor Author

I've added a test that I think matches your suggestion

@adriangb adriangb force-pushed the hash-join-pushdown branch from 49d1636 to 04efcc1 Compare June 24, 2025 19:08
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Jun 24, 2025
@adriangb
Contributor Author

Guess it just doesn't make a difference then. Thanks!

@adriangb
Contributor Author

adriangb commented Jul 31, 2025

Here it is in action:

COPY (
    with data as (
        select unnest(generate_series(1, 99999999)) as id
    )
    select 
        id,
        (id / 100000) as partition_id
    from data
) TO 'test/t1/' STORED AS parquet PARTITIONED BY (partition_id);

CREATE EXTERNAL TABLE t1 (
    id int
)
STORED AS PARQUET
PARTITIONED BY (partition_id int)
LOCATION 'test/t1/';

COPY (
    with data as (
        select unnest(generate_series(1, 100)) as id
    )
    select 
        id,
        (id / 100000) as partition_id
    from data
) TO 'test/t2/' STORED AS parquet PARTITIONED BY (partition_id);

CREATE EXTERNAL TABLE t2 (
    id int
)
STORED AS PARQUET
PARTITIONED BY (partition_id int)
LOCATION 'test/t2/';

SET datafusion.optimizer.enable_dynamic_filter_pushdown = false;

explain analyze
SELECT count(*)
FROM t1
JOIN t2 USING (id);

SET datafusion.optimizer.enable_dynamic_filter_pushdown = true;

explain analyze
SELECT count(*)
FROM t1
JOIN t2 USING (id);

| dynamic filters | time (ms) | bytes scanned |
| --- | --- | --- |
| on | 55 | 376,155 |
| off | 401 | 296,192,099 |

And that's without filter pushdown enabled, only benefiting from stats pruning.
I expect that for cases with strings or otherwise more expensive data, and with filter pushdown enabled, the speedup would be even larger.
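
To see why stats pruning alone helps so much here, the sketch below models the demo layout: ids 1 through 99999999 written with partition_id = id / 100000, and bounds of [1, 100] taken from t2. It assumes the id min/max statistics of each partition's files stay within that partition's id range; the `Layout` type and its methods are hypothetical and only illustrate the overlap test.

```rust
/// Hypothetical description of the demo layout: ids `min_id..=max_id`
/// (positive), written to partitions by `partition_id = id / rows_per_partition`.
#[derive(Debug, Clone, Copy)]
pub struct Layout {
    pub min_id: i64,
    pub max_id: i64,
    pub rows_per_partition: i64,
}

impl Layout {
    /// Inclusive id range stored in one partition (its min/max statistics).
    pub fn id_range(&self, partition_id: i64) -> (i64, i64) {
        let start = partition_id * self.rows_per_partition;
        let lo = start.max(self.min_id);
        let hi = (start + self.rows_per_partition - 1).min(self.max_id);
        (lo, hi)
    }

    /// Total number of partitions written.
    pub fn partition_count(&self) -> i64 {
        self.max_id / self.rows_per_partition - self.min_id / self.rows_per_partition + 1
    }

    /// Partitions whose id statistics overlap the inclusive `bounds`;
    /// all others can be skipped without reading any rows.
    pub fn partitions_to_scan(&self, bounds: (i64, i64)) -> Vec<i64> {
        let first = self.min_id / self.rows_per_partition;
        let last = self.max_id / self.rows_per_partition;
        (first..=last)
            .filter(|&p| {
                let (lo, hi) = self.id_range(p);
                lo <= bounds.1 && bounds.0 <= hi
            })
            .collect()
    }
}
```

Under these assumptions, bounds of (1, 100) overlap only partition 0 out of 1000, which is consistent with the large drop in bytes scanned reported above.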

Explain plans
❯ ./target/release/datafusion-cli
DataFusion CLI v49.0.0
> COPY (
    with data as (
        select unnest(generate_series(1, 99999999)) as id
    )
    select 
        id,
        (id / 100000) as partition_id
    from data
) TO 'test/t1/' STORED AS parquet PARTITIONED BY (partition_id);


CREATE EXTERNAL TABLE t1 (
    id int
)
STORED AS PARQUET
PARTITIONED BY (partition_id int)
LOCATION 'test/t1/';



COPY (
    with data as (
        select unnest(generate_series(1, 100)) as id
    )
    select 
        id,
        (id / 100000) as partition_id
    from data
) TO 'test/t2/' STORED AS parquet PARTITIONED BY (partition_id);


CREATE EXTERNAL TABLE t2 (
    id int
)
STORED AS PARQUET
PARTITIONED BY (partition_id int)
LOCATION 'test/t2/';
+----------+
| count    |
+----------+
| 99999999 |
+----------+
1 row(s) fetched. 
Elapsed 16.986 seconds.

0 row(s) fetched. 
Elapsed 0.015 seconds.

+-------+
| count |
+-------+
| 100   |
+-------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

0 row(s) fetched. 
Elapsed 0.000 seconds.

> SET datafusion.optimizer.enable_dynamic_filter_pushdown = false;

0 row(s) fetched. 
Elapsed 0.000 seconds.

> explain analyze
SELECT count(*)
FROM t1
JOIN t2 USING (id);
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | ProjectionExec: expr=[count(Int64(1))@0 as count(*)], metrics=[output_rows=1, elapsed_compute=417ns]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                   |   AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))], metrics=[output_rows=1, elapsed_compute=18.874µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                   |     CoalescePartitionsExec, metrics=[output_rows=12, elapsed_compute=7.959µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                   |       AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))], metrics=[output_rows=12, elapsed_compute=14.54µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                   |         ProjectionExec: expr=[], metrics=[output_rows=100, elapsed_compute=219ns]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                   |           CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=100, elapsed_compute=1.570227ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                   |             HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(id@0, id@0)], metrics=[output_rows=100, elapsed_compute=694.331448ms, build_input_batches=1, build_input_rows=100, input_batches=13000, input_rows=99999999, output_batches=13000, build_mem_used=2680, build_time=1.906461ms, join_time=692.419193ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                   |               DataSourceExec: file_groups={1 group: [[Users/adriangb/GitHub/datafusion/test/t2/partition_id=0/0lwnifc1mVkAu3uv.parquet]]}, projection=[id], file_type=parquet, metrics=[output_rows=100, elapsed_compute=1ns, batches_splitted=0, bytes_scanned=318, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=0, page_index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, bloom_filter_eval_time=2ns, metadata_load_time=102.71µs, page_index_eval_time=2ns, row_pushdown_eval_time=2ns, statistics_eval_time=2ns, time_elapsed_opening=1.735375ms, time_elapsed_processing=1.865291ms, time_elapsed_scanning_total=224.5µs, time_elapsed_scanning_until_data=217.333µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                   |               DataSourceExec: file_groups={12 groups: [[Users/adriangb/GitHub/datafusion/test/t1/partition_id=0/pE27ne7XoO3Ozi2m.parquet:0..332561, Users/adriangb/GitHub/datafusion/test/t1/partition_id=1/pE27ne7XoO3Ozi2m.parquet:0..296777, Users/adriangb/GitHub/datafusion/test/t1/partition_id=10/pE27ne7XoO3Ozi2m.parquet:0..296764, Users/adriangb/GitHub/datafusion/test/t1/partition_id=100/pE27ne7XoO3Ozi2m.parquet:0..296777, Users/adriangb/GitHub/datafusion/test/t1/partition_id=101/pE27ne7XoO3Ozi2m.parquet:0..296764, ...], [Users/adriangb/GitHub/datafusion/test/t1/partition_id=173/pE27ne7XoO3Ozi2m.parquet:65945..296772, Users/adriangb/GitHub/datafusion/test/t1/partition_id=174/pE27ne7XoO3Ozi2m.parquet:0..296774, Users/adriangb/GitHub/datafusion/test/t1/partition_id=175/pE27ne7XoO3Ozi2m.parquet:0..296761, Users/adriangb/GitHub/datafusion/test/t1/partition_id=176/pE27ne7XoO3Ozi2m.parquet:0..296771, Users/adriangb/GitHub/datafusion/test/t1/partition_id=177/pE27ne7XoO3Ozi2m.parquet:0..296764, ...], [Users/adriangb/GitHub/datafusion/test/t1/partition_id=248/pE27ne7XoO3Ozi2m.parquet:167870..296761, Users/adriangb/GitHub/datafusion/test/t1/partition_id=249/pE27ne7XoO3Ozi2m.parquet:0..296774, Users/adriangb/GitHub/datafusion/test/t1/partition_id=25/pE27ne7XoO3Ozi2m.parquet:0..296767, Users/adriangb/GitHub/datafusion/test/t1/partition_id=250/pE27ne7XoO3Ozi2m.parquet:0..296761, Users/adriangb/GitHub/datafusion/test/t1/partition_id=251/pE27ne7XoO3Ozi2m.parquet:0..296775, ...], [Users/adriangb/GitHub/datafusion/test/t1/partition_id=322/pE27ne7XoO3Ozi2m.parquet:269778..296761, Users/adriangb/GitHub/datafusion/test/t1/partition_id=323/pE27ne7XoO3Ozi2m.parquet:0..296775, Users/adriangb/GitHub/datafusion/test/t1/partition_id=324/pE27ne7XoO3Ozi2m.parquet:0..296764, Users/adriangb/GitHub/datafusion/test/t1/partition_id=325/pE27ne7XoO3Ozi2m.parquet:0..296771, Users/adriangb/GitHub/datafusion/test/t1/partition_id=326/pE27ne7XoO3Ozi2m.parquet:0..296764, ...], 
[Users/adriangb/GitHub/datafusion/test/t1/partition_id=399/pE27ne7XoO3Ozi2m.parquet:74933..296772, Users/adriangb/GitHub/datafusion/test/t1/partition_id=4/pE27ne7XoO3Ozi2m.parquet:0..296767, Users/adriangb/GitHub/datafusion/test/t1/partition_id=40/pE27ne7XoO3Ozi2m.parquet:0..296764, Users/adriangb/GitHub/datafusion/test/t1/partition_id=400/pE27ne7XoO3Ozi2m.parquet:0..296761, Users/adriangb/GitHub/datafusion/test/t1/partition_id=401/pE27ne7XoO3Ozi2m.parquet:0..296775, ...], ...]}, projection=[id], file_type=parquet, metrics=[output_rows=99999999, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=296192099, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=0, page_index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, bloom_filter_eval_time=2.022µs, metadata_load_time=597.091275ms, page_index_eval_time=2.022µs, row_pushdown_eval_time=2.022µs, statistics_eval_time=2.022µs, time_elapsed_opening=9.99707ms, time_elapsed_processing=1.943677056s, time_elapsed_scanning_total=3.949963979s, time_elapsed_scanning_until_data=2.625103918s] |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.401 seconds.

> SET datafusion.optimizer.enable_dynamic_filter_pushdown = true;

0 row(s) fetched. 
Elapsed 0.000 seconds.

> explain analyze
SELECT count(*)
FROM t1
JOIN t2 USING (id);
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | ProjectionExec: expr=[count(Int64(1))@0 as count(*)], metrics=[output_rows=1, elapsed_compute=333ns]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |   AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))], metrics=[output_rows=1, elapsed_compute=4.372µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |     CoalescePartitionsExec, metrics=[output_rows=12, elapsed_compute=3.166µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |       AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))], metrics=[output_rows=12, elapsed_compute=7.5µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |         ProjectionExec: expr=[], metrics=[output_rows=100, elapsed_compute=136ns]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |           CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=100, elapsed_compute=2.546µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |             HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(id@0, id@0)], filter=[id@0 >= 1 AND id@0 <= 100], metrics=[output_rows=100, elapsed_compute=811.172µs, build_input_batches=1, build_input_rows=100, input_batches=3, input_rows=20480, output_batches=3, build_mem_used=2680, build_time=607.5µs, join_time=131.88µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |               DataSourceExec: file_groups={1 group: [[Users/adriangb/GitHub/datafusion/test/t2/partition_id=0/0lwnifc1mVkAu3uv.parquet]]}, projection=[id], file_type=parquet, predicate=true, metrics=[output_rows=100, elapsed_compute=1ns, batches_splitted=0, bytes_scanned=318, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=0, page_index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, bloom_filter_eval_time=2ns, metadata_load_time=797.418µs, page_index_eval_time=126ns, row_pushdown_eval_time=2ns, statistics_eval_time=2ns, time_elapsed_opening=824.667µs, time_elapsed_processing=245.335µs, time_elapsed_scanning_total=196.501µs, time_elapsed_scanning_until_data=193.583µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |               DataSourceExec: file_groups={12 groups: [[Users/adriangb/GitHub/datafusion/test/t1/partition_id=0/pE27ne7XoO3Ozi2m.parquet:0..332561, Users/adriangb/GitHub/datafusion/test/t1/partition_id=1/pE27ne7XoO3Ozi2m.parquet:0..296777, Users/adriangb/GitHub/datafusion/test/t1/partition_id=10/pE27ne7XoO3Ozi2m.parquet:0..296764, Users/adriangb/GitHub/datafusion/test/t1/partition_id=100/pE27ne7XoO3Ozi2m.parquet:0..296777, Users/adriangb/GitHub/datafusion/test/t1/partition_id=101/pE27ne7XoO3Ozi2m.parquet:0..296764, ...], [Users/adriangb/GitHub/datafusion/test/t1/partition_id=173/pE27ne7XoO3Ozi2m.parquet:65945..296772, Users/adriangb/GitHub/datafusion/test/t1/partition_id=174/pE27ne7XoO3Ozi2m.parquet:0..296774, Users/adriangb/GitHub/datafusion/test/t1/partition_id=175/pE27ne7XoO3Ozi2m.parquet:0..296761, Users/adriangb/GitHub/datafusion/test/t1/partition_id=176/pE27ne7XoO3Ozi2m.parquet:0..296771, Users/adriangb/GitHub/datafusion/test/t1/partition_id=177/pE27ne7XoO3Ozi2m.parquet:0..296764, ...], [Users/adriangb/GitHub/datafusion/test/t1/partition_id=248/pE27ne7XoO3Ozi2m.parquet:167870..296761, Users/adriangb/GitHub/datafusion/test/t1/partition_id=249/pE27ne7XoO3Ozi2m.parquet:0..296774, Users/adriangb/GitHub/datafusion/test/t1/partition_id=25/pE27ne7XoO3Ozi2m.parquet:0..296767, Users/adriangb/GitHub/datafusion/test/t1/partition_id=250/pE27ne7XoO3Ozi2m.parquet:0..296761, Users/adriangb/GitHub/datafusion/test/t1/partition_id=251/pE27ne7XoO3Ozi2m.parquet:0..296775, ...], [Users/adriangb/GitHub/datafusion/test/t1/partition_id=322/pE27ne7XoO3Ozi2m.parquet:269778..296761, Users/adriangb/GitHub/datafusion/test/t1/partition_id=323/pE27ne7XoO3Ozi2m.parquet:0..296775, Users/adriangb/GitHub/datafusion/test/t1/partition_id=324/pE27ne7XoO3Ozi2m.parquet:0..296764, Users/adriangb/GitHub/datafusion/test/t1/partition_id=325/pE27ne7XoO3Ozi2m.parquet:0..296771, Users/adriangb/GitHub/datafusion/test/t1/partition_id=326/pE27ne7XoO3Ozi2m.parquet:0..296764, ...], 
[Users/adriangb/GitHub/datafusion/test/t1/partition_id=399/pE27ne7XoO3Ozi2m.parquet:74933..296772, Users/adriangb/GitHub/datafusion/test/t1/partition_id=4/pE27ne7XoO3Ozi2m.parquet:0..296767, Users/adriangb/GitHub/datafusion/test/t1/partition_id=40/pE27ne7XoO3Ozi2m.parquet:0..296764, Users/adriangb/GitHub/datafusion/test/t1/partition_id=400/pE27ne7XoO3Ozi2m.parquet:0..296761, Users/adriangb/GitHub/datafusion/test/t1/partition_id=401/pE27ne7XoO3Ozi2m.parquet:0..296775, ...], ...]}, projection=[id], file_type=parquet, predicate=DynamicFilterPhysicalExpr [ id@0 >= 1 AND id@0 <= 100 ], pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 >= 1 AND id_null_count@1 != row_count@2 AND id_min@3 <= 100, required_guarantees=[] |
|                   | , metrics=[output_rows=20480, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=376155, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=20480, page_index_rows_pruned=79519, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=999, bloom_filter_eval_time=37.804µs, metadata_load_time=298.876746ms, page_index_eval_time=132.682µs, row_pushdown_eval_time=2.022µs, statistics_eval_time=4.021308ms, time_elapsed_opening=296.060463ms, time_elapsed_processing=118.217967ms, time_elapsed_scanning_total=27.708559ms, time_elapsed_scanning_until_data=27.457468ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.055 seconds.
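
In the plan above, the dynamic filter `id@0 >= 1 AND id@0 <= 100` is derived from the build side and reaches the probe-side scan, where `row_groups_pruned_statistics=999` shows most row groups being skipped. As a rough standalone sketch of that idea (not DataFusion's actual implementation; `RowGroupStats`, `build_bounds`, `may_match`, and `surviving_row_groups` are hypothetical names introduced only for illustration), the bounds computation and a statistics check shaped like the `pruning_predicate` might look like this:

```rust
/// Hypothetical per-row-group statistics for a single Int64 column.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
    null_count: u64,
    row_count: u64,
}

/// Compute (min, max) of the build-side join keys, or None if there are none.
fn build_bounds(build_keys: &[i64]) -> Option<(i64, i64)> {
    let min = *build_keys.iter().min()?;
    let max = *build_keys.iter().max()?;
    Some((min, max))
}

/// Same shape as the pruning predicate shown in the plan:
/// id_null_count != row_count AND id_max >= lo AND id_min <= hi
fn may_match(stats: &RowGroupStats, (lo, hi): (i64, i64)) -> bool {
    stats.null_count != stats.row_count && stats.max >= lo && stats.min <= hi
}

/// Return the indices of row groups that cannot be skipped.
fn surviving_row_groups(
    groups: &[RowGroupStats],
    bounds: Option<(i64, i64)>,
) -> Vec<usize> {
    match bounds {
        // Assumption for this sketch: with an empty build side an inner join
        // produces no rows, so no probe-side row group needs to be read.
        None => Vec::new(),
        Some(b) => groups
            .iter()
            .enumerate()
            .filter(|(_, g)| may_match(g, b))
            .map(|(i, _)| i)
            .collect(),
    }
}
```

With build keys spanning 1 to 100, a row group whose statistics are `min=101, max=200` fails the `min <= hi` check and can be skipped without being read, while a group with `min=0, max=99` overlaps the range and must still be scanned.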

@adriangb
Contributor Author

Personally I would like to move forward with this and then make another PR to push down a reference to the entire hash table.

@alamb
Contributor

alamb commented Jul 31, 2025

Thanks @adriangb -- I will put this on my list of PRs to review more carefully in the next few days

// Create build side with limited values
let build_batches = vec![record_batch!(
("a", Utf8, ["aa", "ab"]),
("b", Utf8, ["ba", "bb"]),
Contributor

We could also add some test cases with Utf8View fields, since the default mapping has already changed to Utf8View.

Contributor Author

@adriangb adriangb Aug 1, 2025

@xudong963
Member

I'll have a look tomorrow

Member

@xudong963 xudong963 left a comment

The tests also need to be updated; otherwise LGTM. Thank you @adriangb

@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Aug 8, 2025
@adriangb
Contributor Author

adriangb commented Aug 8, 2025

Fixed @xudong963, thanks so much for your review!

I will leave this up until early next week to see if we get any more feedback.

@adriangb
Contributor Author

I plan on merging this in a couple of hours if there are no objections.

@adriangb adriangb merged commit ff77b70 into apache:main Aug 11, 2025
29 checks passed
@adriangb adriangb deleted the hash-join-pushdown branch August 11, 2025 17:58
@adriangb
Contributor Author

Would anyone here like to review #17153 which enables this for left semi joins / IN (subquery) / WHERE {NOT} EXISTS?

@alamb
Contributor

alamb commented Aug 14, 2025

@nuno-faria has reported a bug related to this PR:

@alamb
Contributor

alamb commented Aug 18, 2025

Would anyone here like to review #17153 which enables this for left semi joins / IN (subquery) / WHERE {NOT} EXISTS?

I am not likely to have time, but if I do I will take a look

@adriangb
Contributor Author

Would anyone here like to review #17153 which enables this for left semi joins / IN (subquery) / WHERE {NOT} EXISTS?

I am not likely to have time, but if I do I will take a look

No worries, Andrew. I think @kosiew is going to be working on that, and I'll help review.

LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Sep 2, 2025
LiaCastaneda added a commit to DataDog/datafusion that referenced this pull request Sep 9, 2025
* Enable physical filter pushdown for hash joins (apache#16954)

(cherry picked from commit b10f453)

* Add ExecutionPlan::reset_state (apache#17028)

* Add ExecutionPlan::reset_state

Co-authored-by: Robert Ream <robert@stably.io>

* Update datafusion/sqllogictest/test_files/cte.slt

* Add reference

* fmt

* add to upgrade guide

* add explain plan, implement in more plans

* fmt

* only explain

---------

Co-authored-by: Robert Ream <robert@stably.io>

* Add dynamic filter (bounds) pushdown to HashJoinExec (apache#16445)

(cherry picked from commit ff77b70)

* Push dynamic pushdown through CooperativeExec and ProjectionExec (apache#17238)

(cherry picked from commit 4bc0696)

* Fix dynamic filter pushdown in HashJoinExec (apache#17201)

(cherry picked from commit 1d4d74b)

* Fix HashJoinExec sideways information passing for partitioned queries (apache#17197)

(cherry picked from commit 64bc58d)

* disallow pushdown of volatile functions (apache#16861)

* disallow pushdown of volatile PhysicalExprs

* fix

* add FilteredVec helper to handle filter / remap pattern (#34)

* checkpoint: Address PR feedback in https://github.com/apach...

* add FilteredVec to consolidate handling of filter / remap pattern

* lint

* Add slt test for pushing volatile predicates down (#35)

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
(cherry picked from commit 94e8548)

* fix bounds accumulator reset in HashJoinExec dynamic filter pushdown (apache#17371)

---------

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
Co-authored-by: Robert Ream <robert@stably.io>
Co-authored-by: Jack Kleeman <jackkleeman@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate
