Add examples and tests for PySpark DataFrames #35

Open
@NowanIlfideme

Description

We currently have examples for Pandas but no examples for PySpark, even though we install PySpark as an optional dependency.

We should enable tests for PySpark, at least in single-node cluster mode (we don't have access to clusters on GitHub Actions without some custom runners...).
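
For reference, a minimal sketch of what a single-node test setup could look like, assuming we use pytest; the fixture name and Spark configuration here are illustrative, not settled:

```python
# Minimal sketch of a pytest fixture for single-node ("local mode") Spark tests.
# The fixture name and configuration values are assumptions for illustration.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark_session():
    """Provide a local, single-JVM SparkSession for the whole test run."""
    spark = (
        SparkSession.builder
        .master("local[2]")  # driver and executors share one process, 2 threads
        .appName("pydantic-cereal-pyspark-tests")
        .getOrCreate()
    )
    yield spark
    spark.stop()  # tear down the session after all tests finish
```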

Checklist

  • Add the optional dependency to the package (or add it into datatests).
  • Add an implementation in pydantic_cereal.examples.pyspark (since it's an example, support only Parquet); a rough sketch follows this checklist.
  • Add an example notebook to the documentation at docs/examples/pyspark.ipynb. We probably want to skip re-execution on doc build.
  • Update GitHub Actions to support Spark, e.g. using this action.
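
As a starting point, here is a rough sketch of what the pydantic_cereal.examples.pyspark reader/writer pair could look like. It assumes the same fsspec-based reader/writer callables used by the pandas example and only handles Parquet; the function names, argument order, and the commented-out wrap_type call are assumptions to be checked against the actual package API.

```python
# Hypothetical sketch of `pydantic_cereal/examples/pyspark.py` (Parquet only).
# Function names and argument order mirror the pandas example and are assumptions.
from fsspec import AbstractFileSystem
from pyspark.sql import DataFrame, SparkSession


def spark_write_parquet(obj: DataFrame, fs: AbstractFileSystem, path: str) -> None:
    """Write a PySpark DataFrame to `path` as Parquet."""
    # Spark writes via its own Hadoop I/O layer rather than through fsspec,
    # so pass a full URI; `unstrip_protocol` re-attaches the filesystem protocol.
    obj.write.parquet(fs.unstrip_protocol(path), mode="overwrite")


def spark_read_parquet(fs: AbstractFileSystem, path: str) -> DataFrame:
    """Read a Parquet dataset from `path` back into a PySpark DataFrame."""
    spark = SparkSession.builder.getOrCreate()  # reuse the active session
    return spark.read.parquet(fs.unstrip_protocol(path))


# Wrapping the type would presumably follow the pandas example (assumed API):
# from pydantic_cereal import Cereal
# cereal = Cereal()
# SparkParquetFrame = cereal.wrap_type(
#     DataFrame, reader=spark_read_parquet, writer=spark_write_parquet
# )
```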

Possible Issues

⚠️ One pretty big problem... Spark workers can technically run anywhere, which means they won't necessarily have access to the same physical storage! If you want to save something on a "local" file system, it will only be accessible from the driver (the "main" process) unless the same volumes are somehow mounted on the workers...

How will we test this? I'm not entirely sure. For now, writing code and testing a single-node cluster implementation will be enough.
Later on, we can implement checks (before writing) that all workers can reach the target path...
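
For that later check, one possible approach (a sketch, not a decided design) is to probe the target's parent directory from several executor tasks before writing; this only makes sense for local-filesystem paths, and the function name is made up:

```python
# Sketch of a pre-write check that probed executor tasks can see the target path.
# Only meaningful for local-filesystem paths; name and approach are assumptions.
import os

from pyspark.sql import SparkSession


def all_workers_can_reach(spark: SparkSession, path: str, probes: int = 8) -> bool:
    """Return True if every probed executor task sees the target's parent directory."""
    parent = os.path.dirname(path.rstrip("/")) or "/"
    sc = spark.sparkContext
    # Spread a trivial job over several partitions so it runs on multiple executors.
    results = (
        sc.parallelize(range(probes), numSlices=probes)
        .map(lambda _: os.path.isdir(parent))
        .collect()
    )
    return all(results)
```

In single-node local mode this is trivially true (all tasks run on the driver), so it only starts to matter once real clusters are in play.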

Labels

dependencies, documentation, enhancement, github_actions, python
