Query.Farm SQL Scan Planning
A Python library for intelligent file filtering using SQL expressions and metadata-based scan planning. This library enables efficient data lake query optimization by determining which files need to be scanned based on their statistical metadata.
This module provides predicate pushdown capabilities for file-based data storage systems. By maintaining metadata about file contents (min/max values, value sets, null presence), the library can quickly determine which files contain data that could satisfy a given SQL WHERE
clause, significantly reducing query execution time.
- SQL Expression Parsing: Parse and evaluate complex
SQL WHERE
clauses using SQLGlot - Metadata-Based Filtering: Support for both range-based (min/max) and set-based field metadata
- Null Handling: Comprehensive support for
NULL
value semantics in SQL expressions - Complex Predicates: Handle
AND
,OR
,XOR
,NOT
,IN
,BETWEEN
,CASE
statements, and more - Multiple Data Types: Support for integers, floats, strings, decimals, and
NULL
values. Support for casting between Arrow scalar types. - Dialect Support: Configurable SQL dialect support (default: DuckDB)
pip install query-farm-sql-scan-planning
Or using rye:
rye add query-farm-sql-scan-planning
from query_farm_sql_scan_planning import Planner, RangeFieldInfo, SetFieldInfo
import pyarrow as pa
# Define file metadata
files = [
(
"data_2023_q1.parquet",
{
"sales_amount": RangeFieldInfo(
min_value=pa.scalar(100), max_value=pa.scalar(50000),
has_nulls=False, has_non_nulls=True
),
"region": SetFieldInfo(
values={pa.scalar("US"), pa.scalar("CA"), pa.scalar("MX")},
has_nulls=False, has_non_nulls=True
),
}
),
(
"data_2023_q2.parquet",
{
"sales_amount": RangeFieldInfo(
min_value=pa.scalar(200), max_value=pa.scalar(75000),
has_nulls=False, has_non_nulls=True
),
"region": SetFieldInfo(
values={pa.scalar("US"), pa.scalar("EU"), pa.scalar("UK")},
has_nulls=False, has_non_nulls=True
),
}
),
]
# Create planner
planner = Planner(files)
# Filter files based on SQL expressions
matching_files = set(planner.files("sales_amount > 40000 AND region = 'US'"))
print(matching_files) # {'data_2023_q1.parquet', 'data_2023_q2.parquet'}
# More complex queries
matching_files = set(planner.files("region IN ('EU', 'UK')"))
print(matching_files) # {'data_2023_q2.parquet'}
For fields with known minimum and maximum values:
RangeFieldInfo(
min_value=pa.scalar(0),
max_value=pa.scalar(100),
has_nulls=False, # Whether the field contains NULL values
has_non_nulls=True # Whether the field contains non-NULL values
)
For fields with a known set of possible values (useful for categorical data):
SetFieldInfo(
values={pa.scalar("apple"), pa.scalar("banana"), pa.scalar("cherry")},
has_nulls=False,
has_non_nulls=True
)
Note: SetFieldInfo
can produce false positives - if a value is in the set, the file might contain it, but the file could contain additional values not in the set.
=
,!=
,<>
(equality and inequality)<
,<=
,>
,>=
(range comparisons)IS NULL
,IS NOT NULL
(null checks)IS DISTINCT FROM
,IS NOT DISTINCT FROM
(null-safe comparisons)
AND
,OR
,XOR
(logical connectors)NOT
(negation)
IN
,NOT IN
(membership tests)BETWEEN
,NOT BETWEEN
(range tests)
CASE WHEN ... THEN ... ELSE ... END
(conditional expressions)
CAST
(type casting)
- Numeric literals:
123
,45.67
- String literals:
'hello'
- Boolean literals:
TRUE
,FALSE
- NULL literal:
NULL
# Files with sales between 1000 and 5000
planner.files("sales_amount BETWEEN 1000 AND 5000")
# Files with any sales over 10000
planner.files("sales_amount > 10000")
# Files containing specific regions
planner.files("region IN ('US', 'CA')")
# Files not containing specific regions
planner.files("region NOT IN ('UNKNOWN', 'TEST')")
# Combination of range and set conditions
planner.files(
"sales_amount > 5000 AND region IN ('US', 'EU') AND customer_id IS NOT NULL"
)
# Case expressions
planner.files(
"CASE WHEN region = 'US' THEN sales_amount > 1000 ELSE sales_amount > 500 END"
)
# Files that might contain null values in sales_amount
planner.files("sales_amount IS NULL")
# Files with non-null sales amounts over 1000
planner.files("sales_amount IS NOT NULL AND sales_amount > 1000")
- Metadata Quality: More accurate metadata (tighter ranges, complete value sets) leads to better filtering
- Expression Complexity: Simple expressions evaluate faster than complex nested conditions
- False Positives: The library errs on the side of including files that might match rather than risk excluding files that do match
- Data Lake Query Optimization: Skip irrelevant files in distributed query engines
- ETL Pipeline Optimization: Process only files containing relevant data
- Data Catalog Integration: Enhance metadata catalogs with query planning capabilities
- Columnar Storage: Optimize scans of Parquet, ORC, or similar formats
git clone https://github.com/query-farm/python-sql-scan-planning.git
cd python-sql-scan-planning
rye sync
rye run pytest
rye run ruff check
rye run pytest --mypy
- sqlglot: SQL parsing and AST manipulation
- Python 3.12+: Required for modern type hints and pattern matching
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
This Python module was created by Query.Farm.
MIT Licensed.