Make pyspark an optional dependency #54

@giampaolocasolla

Description

Problem

pyspark>=3.3.0 is currently a hard dependency in [project] dependencies. This forces installation of the standalone PyPI pyspark package even when:

  1. The user only validates Pandas DataFrames — pyspark is not needed at all.
  2. The user runs on Databricks Runtime — pyspark is already provided by the runtime environment. The Databricks base image includes its own patched pyspark, and installing a second copy from PyPI is redundant.
  3. The user uses databricks-connect — this package provides a Databricks-patched pyspark and is mutually exclusive with standalone pyspark. Having dataframe-expectations force-install pyspark breaks the environment.

Concrete example

In the badge_ranking service, we had to move dataframe-expectations out of our base dependencies into separate prod and test dependency groups to prevent it from pulling pyspark into our dev environment (which uses databricks-connect). This added complexity and duplication to our pyproject.toml. See tdp-ml-pipelines#485 for context.
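To illustrate the kind of split this forces, here is a sketch of the workaround (group names are hypothetical, not the exact badge_ranking configuration):

```toml
# Before: one dependency list, but this drags pyspark into every environment,
# including dev environments that use databricks-connect.
#
# [project]
# dependencies = ["dataframe-expectations>=0.1.0", ...]

# After: dataframe-expectations duplicated across environment-specific groups
# so the dev (databricks-connect) environment never resolves it.
[dependency-groups]
prod = ["dataframe-expectations>=0.1.0"]
test = ["dataframe-expectations>=0.1.0"]
dev = ["databricks-connect>=14.0"]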

Proposal

Make pyspark an optional dependency (extra), e.g.:

[project]
dependencies = [
    "pandas>=1.5.0",
    "pydantic>=2.12.4",
    "tabulate>=0.8.9",
]

[project.optional-dependencies]
pyspark = ["pyspark>=3.3.0"]

Users who need PySpark validation would install:

pip install dataframe-expectations[pyspark]

At runtime, the library can check for pyspark availability and raise a clear error if PySpark-specific features are used without it installed:

try:
    # Detected once at import time; PySpark-specific code paths check this flag.
    from pyspark.sql import DataFrame as SparkDataFrame
    HAS_PYSPARK = True
except ImportError:
    HAS_PYSPARK = False
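Building on that flag, the "clear error" could be a small guard called at the entry point of every PySpark-specific feature. A minimal sketch (the helper name `require_pyspark` is hypothetical, not an existing API of dataframe-expectations):

```python
# Detect pyspark availability once at import time.
try:
    from pyspark.sql import DataFrame as SparkDataFrame
    HAS_PYSPARK = True
except ImportError:
    HAS_PYSPARK = False


def require_pyspark() -> None:
    """Raise an actionable error if PySpark features are used without pyspark.

    Called at the top of any code path that touches pyspark, so users get
    an install hint instead of a bare ModuleNotFoundError deep in a stack trace.
    """
    if not HAS_PYSPARK:
        raise ImportError(
            "PySpark validation requires the optional pyspark dependency. "
            "Install it with: pip install 'dataframe-expectations[pyspark]'"
        )
```

A validator for Spark DataFrames would then call `require_pyspark()` before doing any work, while pandas-only code paths never import pyspark at all.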

This is a common pattern in libraries that support multiple backends (e.g., pandas-stubs, sqlalchemy with database drivers, great-expectations).

Benefits

  • Users who only use Pandas validation don't install pyspark (~300MB)
  • Databricks Runtime users avoid redundant/conflicting pyspark installations
  • databricks-connect users can use dataframe-expectations without dependency group workarounds
  • No breaking change for existing users: pip install dataframe-expectations[pyspark] restores the current behavior
