Description
We currently have examples for Pandas but no examples for PySpark, even though we install PySpark as an optional dependency.
We should enable tests for PySpark, at least in single-node cluster mode (we don't have access to clusters on GitHub Actions without some custom runners...).
Checklist
- Add optional dependency to package (or add into `datatests`).
- Add implementation into `pydantic_cereal.examples.pyspark` (since it's an example, support only Parquet); see the sketch below this list.
- Add example notebook into documentation at `docs/examples/pyspark.ipynb`. We probably want to skip re-execution on doc build.
- Update GitHub Actions to support Spark, e.g. using this action.
Possible Issues
How will we test this? I'm not entirely sure. For now, writing the code and testing it against a single-node cluster will be enough; a sketch of a possible test fixture follows below.
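
For the single-node case, a pytest fixture along these lines should suffice: `local[*]` runs Spark inside the test process, so nothing beyond a Java runtime is needed on the GitHub Actions runner. The fixture name and config values are assumptions.

```python
from typing import Iterator

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark() -> Iterator[SparkSession]:
    """Session-scoped, single-node SparkSession for tests (assumed fixture name)."""
    session = (
        SparkSession.builder
        .master("local[*]")  # run driver and executors in-process; no external cluster
        .appName("pydantic-cereal-pyspark-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```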
Later on, we can implement checks (before writing) that all workers can reach the target path...
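
One possible shape for that later check, using a hypothetical `_all_workers_can_reach` helper: run a small job that probes the target path from several executor tasks before writing. This only makes sense for paths that executors resolve as local/shared filesystem paths; object stores would need a different probe.

```python
import os

from pyspark.sql import SparkSession


def _all_workers_can_reach(spark: SparkSession, path: str, probes: int = 8) -> bool:
    """Hypothetical pre-write check: probe `path`'s parent directory from executor tasks."""
    parent = os.path.dirname(path) or "."
    results = (
        spark.sparkContext
        .parallelize(range(probes), numSlices=probes)  # one partition per probe task
        .map(lambda _: os.path.isdir(parent))
        .collect()
    )
    return all(results)
```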