Push down InList or hash table references from HashJoinExec depending on the size of the build side #18393
Conversation
Leaving my review comments, will post benchmarks afterwards.

Although this PR is large I think there's a clear path to split it up into independent smaller PRs:

1. Refactor `create_hashes` to accept references (changes only to `hash_utils.rs`).
2. Refactor `InListExpr` to store arrays and support structs, re-using `create_hashes_from_arrays` from (1) (changes only to `in_list.rs`).
3. Refactor the data structures used to track pushdown data in HashJoinExec (changes only to files in `datafusion/physical-plan/src/joins/hash_join/`).
4. Introduce the CASE statement structure into the filter pushdown of `HashJoinExec` and the repartition hash PhysicalExpr (changes only to files in `datafusion/physical-plan/src/joins/hash_join/`, adds `HashExpr` in `datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs`).
5. Add hash table pushdown (adds `HashTableLookupExpr` in `datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs`, adds `HashTable` to `PushdownStrategy`, adds `create_membership_predicate`, etc.).
6. Add `InListExpr` pushdown (adds `PushdownStrategy::InList`, etc.).
There is also potential to remove the barrier expression by filtering only on the partitions we know about (`CASE ... ELSE true`), i.e. if we don't yet have information for all partitions, only filter out data that we know won't match in our partition, and then update the filter to `ELSE false` once we have information from all partitions. This might be useful for the distributed case, not sure though.
I think we could somehow unify the hashing / inner data structures of the join hash table and the InList expression - they are very similar - to at least eliminate one round of hashing. I wonder if there's a version of an InList expression that avoids building a Vec<ScalarValue> altogether and instead just wraps an ArrayRef + metadata (data types etc) + an optional hash lookup. That would be quite versatile, we could essentially replace the join hash tables with that structure.
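To make that idea concrete, here is a minimal sketch (names and fields are mine, not code from this PR) of a value set that wraps an `ArrayRef` plus an optional hash lookup instead of materializing a `Vec<ScalarValue>`:

```rust
use std::collections::HashMap;

use arrow::array::ArrayRef;
use arrow::datatypes::DataType;

/// Hypothetical array-backed value set: keeps the build-side keys in Arrow
/// form and (optionally) a precomputed hash -> row-index lookup. Candidates
/// returned by the lookup still need an equality check to rule out collisions.
struct ArrayValueSet {
    values: ArrayRef,
    data_type: DataType,
    hash_lookup: Option<HashMap<u64, Vec<usize>>>,
}

impl ArrayValueSet {
    fn new(values: ArrayRef, hash_lookup: Option<HashMap<u64, Vec<usize>>>) -> Self {
        let data_type = values.data_type().clone();
        Self { values, data_type, hash_lookup }
    }

    /// Row indices in `values` whose hash matches `hash` (collision candidates).
    fn candidates(&self, hash: u64) -> &[usize] {
        self.hash_lookup
            .as_ref()
            .and_then(|m| m.get(&hash))
            .map(|v| v.as_slice())
            .unwrap_or(&[])
    }
}
```

Something of this shape is roughly what the join hash table already stores, which is why unifying the two structures seems plausible.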
datafusion/common/src/hash_utils.rs (outdated diff):

```diff
 #[cfg(not(feature = "force_hash_collisions"))]
-pub fn create_hashes<'a>(
-    arrays: &[ArrayRef],
+pub fn create_hashes_from_arrays<'a>(
+    arrays: &[&dyn Array],
```
Recommend this for its own PR.
I think this is a nice refactor for this function; however, I decided not to deprecate / replace it to avoid churn. If this were its own PR I think it would be worth it to just make a new version and deprecate the old one, replacing all references in DataFusion (which I imagine is most users; even if this is `pub` I think it's mostly pub to be used in other crates within this repo). I did not investigate whether this can help avoid clones in any other call sites.
```text
- RepartitionExec: partitioning=Hash([a@0, b@1], 12), input_partitions=1
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ a@0 >= ab AND a@0 <= ab AND b@1 >= bb AND b@1 <= bb OR a@0 >= aa AND a@0 <= aa AND b@1 >= ba AND b@1 <= ba ]
- RepartitionExec: partitioning=Hash([a@0, b@1], 4), input_partitions=1
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ CASE hash_repartition % 4 WHEN 0 THEN a@0 >= aa AND a@0 <= aa AND b@1 >= ba AND b@1 <= ba AND struct(a@0, b@1) IN (SET) ([{c0:aa,c1:ba}]) WHEN 2 THEN a@0 >= ab AND a@0 <= ab AND b@1 >= bb AND b@1 <= bb AND struct(a@0, b@1) IN (SET) ([{c0:ab,c1:bb}]) ELSE false END ]
```
Note that this automatically excludes empty partitions and defaults to false, which works with inner joins. I'm not sure how we'd structure this for other join types.
```text
- RepartitionExec: partitioning=Hash([a@0, b@1], 12), input_partitions=1
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ a@0 >= aa AND a@0 <= ab AND b@1 >= ba AND b@1 <= bb ]
- RepartitionExec: partitioning=Hash([a@0, b@1], 4), input_partitions=1
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ a@0 >= aa AND a@0 <= ab AND b@1 >= ba AND b@1 <= bb AND struct(a@0, b@1) IN (SET) ([{c0:aa,c1:ba}, {c0:ab,c1:bb}]) ]
```
In this case there's only 1 partition with data (because of hash collisions) -> we optimize away the CASE expression. This is relevant because the same thing would happen with a point lookup primary key join.
```rust
/// Specialized Set implementation for StructArray
struct StructArraySet {
    array: Arc<StructArray>,
    hash_set: ArrayHashSet,
}
```
I think this is a nice improvement / feature that can easily be its own PR.
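For intuition, here is a simplified sketch of the hash-then-verify membership idea behind such a set, using plain Rust tuples instead of Arrow struct rows (the real implementation presumably hashes the StructArray columns):

```rust
use std::collections::HashMap;

/// Membership set over "rows": bucket build-side row indices by hash, then
/// confirm candidates with a full comparison to handle hash collisions.
struct RowSet {
    rows: Vec<(i64, String)>,
    by_hash: HashMap<u64, Vec<usize>>,
}

impl RowSet {
    fn new(rows: Vec<(i64, String)>, hash: impl Fn(&(i64, String)) -> u64) -> Self {
        let mut by_hash: HashMap<u64, Vec<usize>> = HashMap::new();
        for (i, row) in rows.iter().enumerate() {
            by_hash.entry(hash(row)).or_default().push(i);
        }
        Self { rows, by_hash }
    }

    fn contains(&self, probe: &(i64, String), hash: impl Fn(&(i64, String)) -> u64) -> bool {
        self.by_hash
            .get(&hash(probe))
            .map_or(false, |idxs| idxs.iter().any(|&i| &self.rows[i] == probe))
    }
}
```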
```rust
let has_nulls = self.array.null_count() != 0;

// Compute hashes for all rows in the input array
let mut input_hashes = vec![0u64; v.len()];
```
Could we use a thread local to avoid repeated re-allocations here? I imagine that would work quite well.
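For illustration, a sketch of the thread-local idea (not code from this PR; `hash_rows` stands in for the real hashing call):

```rust
use std::cell::RefCell;

thread_local! {
    // One reusable scratch buffer of hash values per thread.
    static HASH_BUFFER: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

/// Run `hash_rows` against a thread-local buffer sized to `len`, instead of
/// allocating `vec![0u64; len]` for every probe batch.
fn with_hash_buffer<R>(len: usize, hash_rows: impl FnOnce(&mut [u64]) -> R) -> R {
    HASH_BUFFER.with(|buf| {
        let mut buf = buf.borrow_mut();
        buf.clear();
        buf.resize(len, 0);
        hash_rows(&mut buf)
    })
}
```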
```rust
/// Create a new `JoinLeftData` from its parts
pub(super) fn new(
```
Clippy was complaining about too many arguments; a constructor was not necessary anyway.
```rust
let membership = if num_rows == 0 {
    PushdownStrategy::Empty
} else {
    // If the build side is small enough we can use IN list pushdown.
    // If it's too big we fall back to pushing down a reference to the hash table.
    // See `PushdownStrategy` for more details.
    if let Some(in_list_values) =
        build_struct_inlist_values(&left_values, max_inlist_size)?
    {
        PushdownStrategy::InList(in_list_values)
    } else {
        PushdownStrategy::HashTable(Arc::clone(&hash_map))
    }
};
```
I did some remodeling of the data structures we use to track state. I think the new structure is much better - even if we didn't move forward with the rest of the PR.
```rust
// Size check using built-in method
// This is not 1:1 with the actual size of ScalarValues, but it is a good approximation
// and at this point is basically "free" to compute since we have the arrays already.
let estimated_size = join_key_arrays
    .iter()
    .map(|arr| arr.get_array_memory_size())
    .sum::<usize>();

if estimated_size > max_size_bytes {
    return Ok(None);
}
```
Note: this is where we check the estimated size against the size limit.
```text
@@ -0,0 +1,292 @@
// Licensed to the Apache Software Foundation (ASF) under one
```
This is the part used to handle the cases where we push down the entire hash table.
```rust
/// Build-side data reported by a single partition
pub(crate) enum PartitionBuildData {
    Partitioned {
        partition_id: usize,
        pushdown: PushdownStrategy,
        bounds: PartitionBounds,
    },
    CollectLeft {
        pushdown: PushdownStrategy,
        bounds: PartitionBounds,
    },
}

/// Per-partition accumulated data (Partitioned mode)
#[derive(Clone)]
struct PartitionData {
    bounds: PartitionBounds,
    pushdown: PushdownStrategy,
}

/// Build-side data organized by partition mode
enum AccumulatedBuildData {
    Partitioned {
        partitions: Vec<Option<PartitionData>>,
    },
    CollectLeft {
        data: Option<PartitionData>,
    },
}
```
This is the refactoring of the data structures we store to track state. It's now much cleaner, e.g. avoids comparing partitions by id.
Here's one interesting benchmark:

```sql
COPY (SELECT uuid() as k, uuid() as v FROM generate_series(1, 5) t(i))
TO 'small_table_uuids.parquet'
OPTIONS (
    'MAX_ROW_GROUP_SIZE' '50000',
    'BLOOM_FILTER_ENABLED::k' 'true'
);
COPY (SELECT random()::text as v1, random()::text as v2, uuid() as k FROM generate_series(1, 100000000) t(i))
TO 'large_table_uuids.parquet'
OPTIONS (
    'MAX_ROW_GROUP_SIZE' '50000',
    'BLOOM_FILTER_ENABLED::k' 'true'
);
CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION 'small_table_uuids.parquet';
CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION 'large_table_uuids.parquet';
SET datafusion.execution.parquet.pushdown_filters = true;
SET datafusion.execution.parquet.reorder_filters = true;
SET datafusion.runtime.metadata_cache_limit = 0;
-- Join the two tables, with a filter on small_table
SELECT *
FROM small_table s JOIN large_table l ON s.k = l.k;
```

This is similar to the query we benchmarked in our recent blog post but using UUIDs instead of …

Full benchmark script:

```python
#!/usr/bin/env python3
"""
Benchmark script comparing DataFusion with/without inlist pushdown vs DuckDB.
Groups:
1. branch (no inlist): hash_join_inlist_pushdown_max_size = 0
2. branch (w/ inlist): hash_join_inlist_pushdown_max_size = default (999999)
3. main: using datafusion-cli-main
4. duckdb: using duckdb CLI
"""
import subprocess
import tempfile
import time
import os
import sys
from pathlib import Path
# Configuration
DATAFUSION_CLI = "./target/release/datafusion-cli"
DATAFUSION_CLI_MAIN = "./datafusion-cli-main"
DUCKDB_CLI = "duckdb"
NUM_RUNS = 5 # Number of times to run each benchmark
# Data generation settings
SMALL_TABLE_SIZE = 5
LARGE_TABLE_SIZE = 100_000_000
SMALL_TABLE_FILE = "small_table_uuids.parquet"
LARGE_TABLE_FILE = "large_table_uuids.parquet"
def run_command(cmd, input_sql=None, description=""):
"""Run a command and measure execution time."""
print(f" Running: {description}...", end=" ", flush=True)
start = time.time()
try:
if input_sql:
result = subprocess.run(
cmd,
input=input_sql,
capture_output=True,
text=True,
timeout=600 # 10 minute timeout
)
else:
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=600
)
elapsed = time.time() - start
if result.returncode != 0:
print(f"FAILED (exit code {result.returncode})")
print(f" stderr: {result.stderr}")
return None
print(f"{elapsed:.3f}s")
return elapsed
except subprocess.TimeoutExpired:
print("TIMEOUT")
return None
except Exception as e:
print(f"ERROR: {e}")
return None
def create_data():
"""Create test data files if they don't exist."""
if os.path.exists(SMALL_TABLE_FILE) and os.path.exists(LARGE_TABLE_FILE):
print(f"Data files already exist, skipping creation.")
return True
print(f"Creating test data...")
data_gen_sql = f"""
COPY (SELECT uuid() as k, uuid() as v FROM generate_series(1, {SMALL_TABLE_SIZE}) t(i))
TO '{SMALL_TABLE_FILE}'
OPTIONS (
'MAX_ROW_GROUP_SIZE' '50000',
'BLOOM_FILTER_ENABLED::k' 'true'
);
COPY (SELECT random()::text as v1, random()::text as v2, uuid() as k FROM generate_series(1, {LARGE_TABLE_SIZE}) t(i))
TO '{LARGE_TABLE_FILE}'
OPTIONS (
'MAX_ROW_GROUP_SIZE' '50000',
'BLOOM_FILTER_ENABLED::k' 'true'
);
"""
result = subprocess.run(
[DATAFUSION_CLI],
input=data_gen_sql,
capture_output=True,
text=True
)
if result.returncode != 0:
print(f"Failed to create data: {result.stderr}")
return False
print(f"Data created successfully.")
return True
def create_datafusion_sql(inlist_size):
"""Create SQL for DataFusion with specified inlist pushdown size."""
return f"""
CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION '{SMALL_TABLE_FILE}';
CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION '{LARGE_TABLE_FILE}';
SET datafusion.execution.parquet.pushdown_filters = true;
SET datafusion.execution.parquet.reorder_filters = true;
SET datafusion.optimizer.hash_join_inlist_pushdown_max_size = {inlist_size};
SET datafusion.runtime.metadata_cache_limit = '0M';
SELECT *
FROM small_table s JOIN large_table l ON s.k = l.k;
"""
def create_duckdb_sql():
"""Create SQL for DuckDB."""
return f"""
SELECT *
FROM '{SMALL_TABLE_FILE}' s JOIN '{LARGE_TABLE_FILE}' l ON s.k = l.k;
"""
def run_benchmark_group(name, cmd, sql_content, num_runs=NUM_RUNS):
"""Run a benchmark group multiple times and collect results."""
print(f"\n{name}:")
times = []
for i in range(num_runs):
elapsed = run_command(cmd, input_sql=sql_content, description=f"Run {i+1}/{num_runs}")
if elapsed is not None:
times.append(elapsed)
if times:
avg = sum(times) / len(times)
min_time = min(times)
max_time = max(times)
print(f" Results: min={min_time:.3f}s, avg={avg:.3f}s, max={max_time:.3f}s")
return times
else:
print(f" No successful runs")
return []
def main():
print("=" * 60)
print("DataFusion Inlist Pushdown Benchmark")
print("=" * 60)
# Verify executables exist
if not os.path.exists(DATAFUSION_CLI):
print(f"Error: {DATAFUSION_CLI} not found")
sys.exit(1)
if not os.path.exists(DATAFUSION_CLI_MAIN):
print(f"Error: {DATAFUSION_CLI_MAIN} not found")
sys.exit(1)
try:
subprocess.run([DUCKDB_CLI, "--version"], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
print(f"Error: duckdb CLI not found or not working")
sys.exit(1)
# Create data
if not create_data():
sys.exit(1)
# Run benchmarks
results = {}
# 1. Branch without inlist pushdown
results["branch_no_inlist"] = run_benchmark_group(
"Branch (no inlist, size=0)",
[DATAFUSION_CLI],
create_datafusion_sql(0)
)
# 2. Branch with inlist pushdown
results["branch_with_inlist"] = run_benchmark_group(
"Branch (w/ inlist, size=999999)",
[DATAFUSION_CLI],
create_datafusion_sql(999999)
)
# 3. Main branch
results["main"] = run_benchmark_group(
"Main branch",
[DATAFUSION_CLI_MAIN],
create_datafusion_sql(999999)
)
# 4. DuckDB
results["duckdb"] = run_benchmark_group(
"DuckDB",
[DUCKDB_CLI],
create_duckdb_sql()
)
# Summary
print("\n" + "=" * 60)
print("SUMMARY")
print("=" * 60)
for name, times in results.items():
if times:
avg = sum(times) / len(times)
print(f"{name:25s}: {avg:.3f}s avg over {len(times)} runs")
else:
print(f"{name:25s}: No successful runs")
print("\nAll times (seconds):")
for name, times in results.items():
if times:
times_str = ", ".join(f"{t:.3f}" for t in times)
print(f" {name}: [{times_str}]")
if __name__ == "__main__":
main() |
Surprised that DuckDB takes so long, I was thinking it could also push down an InList for hash joins.
As far as I know they only do min/max stats: https://duckdb.org/2024/09/09/announcing-duckdb-110#dynamic-filter-pushdown-from-joins
Here are some new numbers after a couple more optimizations to InListExpr:

I'd like to clarify why the InListExpr makes such a difference: I think it's mainly that latter optimization that provides the win.

And TPCH:
Marking as draft again. Looking at the changes, while they're mostly in the right spirit there are a lot of details that need more manual care. I'll do that before marking as ready for review.
I've done some cleanup, the first three PRs are ready for review:

That will be most of the changes; the only two follow-up PRs will be to push down the hash table references and InListExpr.
## Background

This PR is part of an EPIC to push down hash table references from HashJoinExec into scans. The EPIC is tracked in apache#17171. A "target state" is tracked in apache#18393.

There is a series of PRs to get us to this target state in smaller more reviewable changes that are still valuable on their own:

- (This PR): apache#18448
- apache#18449 (depends on apache#18448)
- apache#18451

## Changes in this PR

Change create_hashes and related functions to work with &dyn Array references instead of requiring ArrayRef (Arc-wrapped arrays). This avoids unnecessary Arc::clone() calls and enables calls that only have an &dyn Array to use the hashing utilities.

- Add create_hashes_from_arrays(&[&dyn Array]) function
- Refactor hash_dictionary, hash_list_array, hash_fixed_list_array to use references instead of cloning
- Extract hash_single_array() helper for common logic

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
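As a usage sketch of the reference-based API this commit describes (only the `&[&dyn Array]` input is stated above; the random-state and output-buffer parameters are assumed to mirror the existing `create_hashes`):

```rust
use ahash::RandomState;
use arrow::array::{Array, Int32Array, StringArray};
use datafusion_common::hash_utils::create_hashes_from_arrays;
use datafusion_common::Result;

/// Hash two borrowed columns row-by-row without wrapping them in Arc.
fn hash_borrowed_columns(a: &dyn Array, b: &dyn Array) -> Result<Vec<u64>> {
    let random_state = RandomState::with_seeds(0, 0, 0, 0);
    let mut hashes = vec![0u64; a.len()];
    // Assumed to mirror create_hashes: (&[&dyn Array], &RandomState, &mut Vec<u64>)
    create_hashes_from_arrays(&[a, b], &random_state, &mut hashes)?;
    Ok(hashes)
}

fn main() -> Result<()> {
    let ints = Int32Array::from(vec![1, 2, 3]);
    let strs = StringArray::from(vec!["x", "y", "z"]);
    let hashes = hash_borrowed_columns(&ints, &strs)?;
    assert_eq!(hashes.len(), 3);
    Ok(())
}
```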
…nfrastructure (#18449)

## Background

This PR is part of an EPIC to push down hash table references from HashJoinExec into scans. The EPIC is tracked in #17171. A "target state" is tracked in #18393.

There is a series of PRs to get us to this target state in smaller more reviewable changes that are still valuable on their own:

- #18448
- (This PR): #18449 (depends on #18448)
- #18451

## Changes in this PR

- Enhance InListExpr to efficiently store homogeneous lists as arrays and avoid a conversion to Vec<PhysicalExpr> by adding an internal InListStorage enum with Array and Exprs variants
- Re-use existing hashing and comparison utilities to support Struct arrays and other complex types
- Add public function `in_list_from_array(expr, list_array, negated)` for creating InList from arrays

Although the diff looks large most of it is actually tests and docs. I think the actual code change is a negative LOC change, or at least negative complexity (eliminates a trait, a macro, matching on data types).

---------

Co-authored-by: David Hewitt <mail@davidhewitt.dev>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
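The storage split described above could look roughly like this (illustrative names, not necessarily the exact shape merged in #18449):

```rust
use std::sync::Arc;

use arrow::array::ArrayRef;
use datafusion_physical_expr::PhysicalExpr;

/// Keep homogeneous IN lists as a single Arrow array (enabling hashing and
/// struct support), and fall back to per-expression evaluation otherwise.
enum InListStorage {
    /// Homogeneous values kept in Arrow form; membership can be hash-based.
    Array { values: ArrayRef },
    /// Arbitrary expressions evaluated against each probe batch.
    Exprs { list: Vec<Arc<dyn PhysicalExpr>> },
}
```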
This is getting close! The main blocker is a review on #18451, which will bring the size of this PR down to only a couple hundred LOC.
…for more precise filters (#18451)

## Background

This PR is part of an EPIC to push down hash table references from HashJoinExec into scans. The EPIC is tracked in #17171. A "target state" is tracked in #18393.

There is a series of PRs to get us to this target state in smaller more reviewable changes that are still valuable on their own:

- #18448
- #18449 (depends on #18448)
- (This PR): #18451

## Changes in this PR

This PR refactors state management in HashJoinExec to make filter pushdown more efficient and prepare for pushing down membership tests.

- Refactor internal data structures to clean up state management and make usage more idiomatic (use `Option` instead of comparing integers, etc.)
- Uses CASE expressions to evaluate pushed-down filters selectively by partition

Example: `CASE hash_repartition % N WHEN partition_id THEN condition ELSE false END`

---------

Co-authored-by: Lía Adriana <lia.castaneda@datadoghq.com>
I think this is now ready for review. @rkrishn7, @LiaCastaneda or @Dandandan would one of you like to review? @alamb could you kick off some benchmarks please?
I can take a look later today 👍
```rust
/// Note that this will not deduplicate values, that will happen later when building an InList expression from this array.
///
/// Returns `None` if the estimated size exceeds `max_size_bytes`.
/// Performs deduplication to ensure unique values only.
```
🤔 I think it doesn't dedup here
It was a bad comment, updated
```rust
/// The default is 128kB per partition.
/// This should allow point lookup joins (e.g. joining on a unique primary key) to use InList pushdown in most cases
/// but avoids excessive memory usage or overhead for larger joins.
pub hash_join_inlist_pushdown_max_size: usize, default = 128 * 1024
```
Maybe a future improvement could be to also expose an option to limit the number of distinct values that can be inside an IN LIST. For instance, we could end up with a very large list like x IN (1, 2, 3, ..., 1000000) that fits in 128KB but is still inefficient because we'd be duplicating values and performance might decrease.
I got this idea from trino:
https://trino.io/docs/current/admin/dynamic-filtering.html#dynamic-filter-collection-thresholds
> For instance, we could end up with a very large list like x IN (1, 2, 3, ..., 1000000) that fits in 128KB but is still inefficient because we'd be duplicating values and performance might decrease
Could you elaborate? In my mind a large InList is not any more or less efficient than pushing down the hash table itself, but if it's big it loses access to the bloom filter pushdown optimization, so it's probably not faster than the hash table itself. That said, there are still reasons to push it down instead, namely that custom execution nodes that downcast match on a PhysicalExpr can recognize it.
So the idea with the 128kB is to balance how much CPU we burn upfront building the filter. But I agree it could be in terms of rows as well.
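For example, a combined guard could look like this (a sketch; `max_rows` is a hypothetical knob, not a config added in this PR):

```rust
/// Decide whether to build an InList filter or fall back to pushing a
/// reference to the hash table itself.
fn use_in_list_pushdown(
    estimated_bytes: usize,
    num_rows: usize,
    max_bytes: usize, // e.g. hash_join_inlist_pushdown_max_size
    max_rows: usize,  // hypothetical row-count limit
) -> bool {
    estimated_bytes <= max_bytes && num_rows <= max_rows
}
```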
Added a config
So if the IN LIST is very large, it loses the pruning advantage on the probe side - like you say, bloom filters become ineffective with that many values - so once we lose that, we might as well use the hash table instead and avoid having to copy the data for the IN LIST.
Not sure if I'm missing anything here; anyway, this is just a thought and it would be better to check this with the benchmarks.
Yeah that's the idea. As the build side gets larger:
- It becomes more expensive to build the InListExpr (I think we can make it cheaper but it will probably always be more expensive than copying a reference)
- It's less likely optimizations like bloom filters will help. In fact, bloom filters will only be hit with < 20 items (this is set deep in the PruningPredicate code)
So at some point it makes sense to cut the losses and go through each row with the hash table.
The InListExpr approach is going to shine when there is a point lookup type query (i.e. one row from the build side) that can hit a bloom filter on the probe side.
Naive question: is the size (in bits) of the bloom filter tunable at the moment?
In principle you could use NDV to tune its size at build time, and lift the limitation on the number of elements (within a reasonable limit, of course).
I think the PR is already very useful as is, this could be tackled in a follow-up PR if the point raised makes sense (possibly another addition to the list of places where NDV can help, as tracked in #18628)
🤔 As a Parquet write property, I don't think it can be tuned; you can control some write properties like NDV or FPP, which in theory help control the size of the bloom filter as well.
Yeah I think there may be a bit of confusion:
- The bloom filters I am referring to here are written into the parquet file like Lia said.
- The thing we are pushing down is an expression like `col IN (1, 2, 3)`, which inside of the pruning logic gets converted to `col = 1 OR col = 2 OR col = 3` as long as there are <20 elements in the list; then `col = 1` gets evaluated against any bloom filters on `col`, then `col = 2`, etc. for each row group.
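To illustrate that rewrite (a sketch using the logical Expr API, not the actual PruningPredicate code):

```rust
use datafusion_expr::{col, lit, Expr};

/// Expand a short IN list into OR-of-equalities so each equality can be
/// checked against Parquet bloom filters during pruning; longer lists are
/// left alone (returning None here).
fn expand_in_list_for_pruning(column: &str, values: &[i64], max_len: usize) -> Option<Expr> {
    if values.is_empty() || values.len() >= max_len {
        return None;
    }
    values
        .iter()
        .map(|v| col(column).eq(lit(*v)))
        .reduce(|acc, e| acc.or(e))
}

// expand_in_list_for_pruning("col", &[1, 2, 3], 20)
// => Some(col = 1 OR col = 2 OR col = 3)
```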
```rust
// Combine membership and bounds expressions
let filter_expr = match (membership_expr, bounds_expr) {
```
I probably missed the explanation somewhere in previous threads, but is there a special benefit of pushing both bounds and IN LIST filters into the consumer?
I think so because:
- You have to calculate the bounds anyway in case you need to fall back to that.
- Downstream operators may be able to do things with bounds that they can't with InListExpr (e.g. stats pruning).
- The bounds are going to be cheaper to evaluate and thus may short circuit the InListExpr if they are false.
ah, makes sense!
```rust
        as Arc<dyn PhysicalExpr>
    }
    (Some(membership), None) => membership,
    (None, Some(bounds)) => bounds,
```
Are `(None, Some(bounds))` and `(Some(membership), None)` actually reachable? If we have no data, we shouldn't have any bounds either, right?
Yes, I added a note explaining that it's easier to handle it defensively, so we might as well (as opposed to using `unreachable!`).
```rust
// Optimize for single partition: skip CASE expression entirely
let filter_expr = if when_then_branches.is_empty() {
    // All partitions are empty: no rows can match
    lit(false)
} else if when_then_branches.len() == 1 {
    // Single partition: just use the condition directly
    // since hash % 1 == 0 always, the WHEN 0 branch will always match
    Arc::clone(&when_then_branches[0].1)
} else {
```
This makes sense, so we avoid calling create_hashes for every single row on the probe side if it's going to end up landing on the same branch. Can we add a comment to the tests that changed in datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs about this?
Code comments or review comments? I added a review comment for now.
As a code comment. Before looking into this file I was looking at the diff in test_dynamic_filter_pushdown_through_hash_join_with_topk and wondered why there's no CASE structure if it's a partitioned join. Super nit -- just mentioning it for clarity.
Okay will add!
```text
- HashJoinExec: mode=Partitioned, join_type=Inner, on=[(a@0, d@0)]
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, c], file_type=test, pushdown_supported=true
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[d, e, f], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ CASE hash_repartition % 1 WHEN 0 THEN d@0 >= aa AND d@0 <= ab ELSE false END ] AND DynamicFilter [ e@1 IS NULL OR e@1 < bb ]
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[d, e, f], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ d@0 >= aa AND d@0 <= ab AND d@0 IN (SET) ([aa, ab]) ] AND DynamicFilter [ e@1 IS NULL OR e@1 < bb ]
```
Note that we dropped the CASE expression here because we now optimize that away if there's only 1 partition
```rust
// See `PushdownStrategy` for more details.
let estimated_size = left_values
    .iter()
    .map(|arr| arr.get_array_memory_size())
```
If we have a query like SELECT * FROM t1 JOIN t2 ON t1.a = t2.x AND t1.a = t2.y, left_values would have t1.a twice (same ArrayRef). Since both are references to the same underlying data, estimated_size would double count the memory. However, I guess this overaccounting is acceptable because we are estimating CPU cost?
Makes sense. But maybe it's still okay since we would end up duplicating the values in the InListExpr? Either way like you say I think it's not a big deal, it's just a ballpark estimate...
Yeah, since estimated_size is used to estimate CPU time spent building the filter (rather than actual memory), it makes sense to 'double account' because in theory it's ~double the CPU work for building the filter, I guess.
@adriangb I think you will have to update the branch to include
LiaCastaneda left a comment:
This lgtm! It would be interesting to know the benchmark results
@gabotechs, @2010YOUY01 or @comphead would you be willing to review this?
Hi @adriangb, thanks for the PR, the numbers look good. I'll check it out this week!
I'm planning to introduce TPCDS benchmarks in addition to TPCH and see how this PR performs.
This PR is part of an EPIC to push down hash table references from HashJoinExec into scans. The EPIC is tracked in #17171.
A "target state" is tracked in #18393 (this PR).
There is a series of PRs to get us to this target state in smaller, more reviewable changes that are still valuable on their own:

- #18448
- #18449
- Refactor state management in HashJoinExec and use CASE expressions for more precise filters #18451

As those are merged I will rebase this PR to keep track of the "remaining work", and we can use this PR to explore big picture ideas or benchmarks of the final state.