Skip to content

A Python library for intelligent file filtering using SQL expressions and metadata-based scan planning. This library enables efficient data lake query optimization by determining which files need to be scanned based on their statistical metadata.

License

Notifications You must be signed in to change notification settings

Query-farm/python-sql-scan-planning

Repository files navigation

Query.Farm SQL Scan Planning

A Python library for intelligent file filtering using SQL expressions and metadata-based scan planning. This library enables efficient data lake query optimization by determining which files need to be scanned based on their statistical metadata.

Overview

This module provides predicate pushdown capabilities for file-based data storage systems. By maintaining metadata about file contents (min/max values, value sets, null presence), the library can quickly determine which files contain data that could satisfy a given SQL WHERE clause, significantly reducing query execution time.

Features

  • SQL Expression Parsing: Parse and evaluate complex SQL WHERE clauses using SQLGlot
  • Metadata-Based Filtering: Support for both range-based (min/max) and set-based field metadata
  • Null Handling: Comprehensive support for NULL value semantics in SQL expressions
  • Complex Predicates: Handle AND, OR, XOR, NOT, IN, BETWEEN, CASE statements, and more
  • Multiple Data Types: Support for integers, floats, strings, decimals, and NULL values. Support for casting between Arrow scalar types.
  • Dialect Support: Configurable SQL dialect support (default: DuckDB)

Installation

pip install query-farm-sql-scan-planning

Or using rye:

rye add query-farm-sql-scan-planning

Quick Start

from query_farm_sql_scan_planning import Planner, RangeFieldInfo, SetFieldInfo
import pyarrow as pa

# Define file metadata
files = [
    (
        "data_2023_q1.parquet",
        {
            "sales_amount": RangeFieldInfo(
                min_value=pa.scalar(100), max_value=pa.scalar(50000),
                has_nulls=False, has_non_nulls=True
            ),
            "region": SetFieldInfo(
                values={pa.scalar("US"), pa.scalar("CA"), pa.scalar("MX")},
                has_nulls=False, has_non_nulls=True
            ),
        }
    ),
    (
        "data_2023_q2.parquet",
        {
            "sales_amount": RangeFieldInfo(
                min_value=pa.scalar(200), max_value=pa.scalar(75000),
                has_nulls=False, has_non_nulls=True
            ),
            "region": SetFieldInfo(
                values={pa.scalar("US"), pa.scalar("EU"), pa.scalar("UK")},
                has_nulls=False, has_non_nulls=True
            ),
        }
    ),
]

# Create planner
planner = Planner(files)

# Filter files based on SQL expressions
matching_files = set(planner.files("sales_amount > 40000 AND region = 'US'"))
print(matching_files)  # {'data_2023_q1.parquet', 'data_2023_q2.parquet'}

# More complex queries
matching_files = set(planner.files("region IN ('EU', 'UK')"))
print(matching_files)  # {'data_2023_q2.parquet'}

Field Information Types

RangeFieldInfo

For fields with known minimum and maximum values:

RangeFieldInfo(
    min_value=pa.scalar(0),
    max_value=pa.scalar(100),
    has_nulls=False,      # Whether the field contains NULL values
    has_non_nulls=True    # Whether the field contains non-NULL values
)

SetFieldInfo

For fields with a known set of possible values (useful for categorical data):

SetFieldInfo(
    values={pa.scalar("apple"), pa.scalar("banana"), pa.scalar("cherry")},
    has_nulls=False,
    has_non_nulls=True
)

Note: SetFieldInfo can produce false positives - if a value is in the set, the file might contain it, but the file could contain additional values not in the set.

Supported SQL Operations

Comparison Operators

  • =, !=, <> (equality and inequality)
  • <, <=, >, >= (range comparisons)
  • IS NULL, IS NOT NULL (null checks)
  • IS DISTINCT FROM, IS NOT DISTINCT FROM (null-safe comparisons)

Logical Operators

  • AND, OR, XOR (logical connectors)
  • NOT (negation)

Set Operations

  • IN, NOT IN (membership tests)
  • BETWEEN, NOT BETWEEN (range tests)

Control Flow

  • CASE WHEN ... THEN ... ELSE ... END (conditional expressions)

Data Types

  • CAST (type casting)

Literals

  • Numeric literals: 123, 45.67
  • String literals: 'hello'
  • Boolean literals: TRUE, FALSE
  • NULL literal: NULL

Examples

Range Queries

# Files with sales between 1000 and 5000
planner.files("sales_amount BETWEEN 1000 AND 5000")

# Files with any sales over 10000
planner.files("sales_amount > 10000")

Set Membership

# Files containing specific regions
planner.files("region IN ('US', 'CA')")

# Files not containing specific regions
planner.files("region NOT IN ('UNKNOWN', 'TEST')")

Complex Conditions

# Combination of range and set conditions
planner.files(
    "sales_amount > 5000 AND region IN ('US', 'EU') AND customer_id IS NOT NULL"
)

# Case expressions
planner.files(
    "CASE WHEN region = 'US' THEN sales_amount > 1000 ELSE sales_amount > 500 END"
)

Null Handling

# Files that might contain null values in sales_amount
planner.files("sales_amount IS NULL")

# Files with non-null sales amounts over 1000
planner.files("sales_amount IS NOT NULL AND sales_amount > 1000")

Performance Considerations

  • Metadata Quality: More accurate metadata (tighter ranges, complete value sets) leads to better filtering
  • Expression Complexity: Simple expressions evaluate faster than complex nested conditions
  • False Positives: The library errs on the side of including files that might match rather than risk excluding files that do match

Use Cases

  • Data Lake Query Optimization: Skip irrelevant files in distributed query engines
  • ETL Pipeline Optimization: Process only files containing relevant data
  • Data Catalog Integration: Enhance metadata catalogs with query planning capabilities
  • Columnar Storage: Optimize scans of Parquet, ORC, or similar formats

Development

Setup

git clone https://github.com/query-farm/python-sql-scan-planning.git
cd python-sql-scan-planning
rye sync

Running Tests

rye run pytest

Code Quality

rye run ruff check
rye run pytest --mypy

Dependencies

  • sqlglot: SQL parsing and AST manipulation
  • Python 3.12+: Required for modern type hints and pattern matching

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Author

This Python module was created by Query.Farm.

License

MIT Licensed.

About

A Python library for intelligent file filtering using SQL expressions and metadata-based scan planning. This library enables efficient data lake query optimization by determining which files need to be scanned based on their statistical metadata.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages