## Problem
`pyspark>=3.3.0` is currently a hard dependency in `[project] dependencies`. This forces installation of the standalone PyPI `pyspark` even when:
- The user only validates Pandas DataFrames — pyspark is not needed at all.
- The user runs on Databricks Runtime — pyspark is already provided by the runtime environment. The Databricks base image includes its own patched pyspark, and installing a second copy from PyPI is redundant.
- The user uses `databricks-connect`: this package provides a Databricks-patched pyspark and is mutually exclusive with standalone `pyspark`. Having `dataframe-expectations` force-install `pyspark` breaks the environment.
## Concrete example
In the badge_ranking service, we had to move dataframe-expectations out of our base dependencies into separate prod and test dependency groups to prevent it from pulling pyspark into our dev environment (which uses databricks-connect). This added complexity and duplication to our pyproject.toml. See tdp-ml-pipelines#485 for context.
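With an extra available, a downstream project like the one above could keep a single declaration instead of duplicated groups. A hedged sketch (the group name is illustrative, assuming PEP 735 dependency groups):

```toml
[project]
# Base install: Pandas validation only, no pyspark pulled in,
# so databricks-connect dev environments stay intact.
dependencies = ["dataframe-expectations"]

[dependency-groups]
# Only standalone (non-Databricks) environments opt into the extra.
prod = ["dataframe-expectations[pyspark]"]
```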
## Proposal
Make pyspark an optional dependency (extra), e.g.:

```toml
[project]
dependencies = [
    "pandas>=1.5.0",
    "pydantic>=2.12.4",
    "tabulate>=0.8.9",
]

[project.optional-dependencies]
pyspark = ["pyspark>=3.3.0"]
```

Users who need PySpark validation would install:

```shell
pip install "dataframe-expectations[pyspark]"
```

At runtime, the library can check for pyspark availability and raise a clear error if PySpark-specific features are used without it installed:
```python
try:
    from pyspark.sql import DataFrame as SparkDataFrame

    HAS_PYSPARK = True
except ImportError:
    HAS_PYSPARK = False
```

This is a common pattern in libraries that support multiple backends (e.g., pandas-stubs, sqlalchemy with database drivers, great-expectations).
## Benefits
- Users who only use Pandas validation don't install pyspark (~300MB)
- Databricks Runtime users avoid redundant/conflicting pyspark installations
- `databricks-connect` users can use `dataframe-expectations` without dependency group workarounds
- No breaking change for existing users: `pip install "dataframe-expectations[pyspark]"` restores the current behavior