Add examples and tests for PySpark DataFrames #35

Open
@NowanIlfideme

Description

We currently have examples for Pandas but no examples for PySpark, even though we install PySpark as an optional dependency.

We should enable tests for PySpark, at least in single-node cluster mode (we don't have access to clusters on GitHub Actions without some custom runners...).
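
For reference, a minimal sketch of what a single-node test setup could look like, assuming we use pytest; the fixture name and Spark configuration here are illustrative, not settled:

```python
# Minimal sketch of a pytest fixture for single-node ("local mode") Spark tests.
# The fixture name and configuration values are assumptions for illustration.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark_session():
    """Provide a local, single-JVM SparkSession for the whole test run."""
    spark = (
        SparkSession.builder
        .master("local[2]")  # driver and executors share one process, 2 threads
        .appName("pydantic-cereal-pyspark-tests")
        .getOrCreate()
    )
    yield spark
    spark.stop()  # tear down the session after all tests finish
```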

Checklist

  • Add the optional dependency to the package (or add it into datatests).
  • Add an implementation in pydantic_cereal.examples.pyspark (since it's an example, support only Parquet); a rough sketch follows this checklist.
  • Add an example notebook to the documentation at docs/examples/pyspark.ipynb. We probably want to skip re-execution on doc build.
  • Update GitHub Actions to support Spark, e.g. using this action.
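
As a starting point, here is a rough sketch of what the pydantic_cereal.examples.pyspark reader/writer pair could look like. It assumes the same fsspec-based reader/writer callables used by the pandas example and only handles Parquet; the function names, argument order, and the commented-out wrap_type call are assumptions to be checked against the actual package API.

```python
# Hypothetical sketch of `pydantic_cereal/examples/pyspark.py` (Parquet only).
# Function names and argument order mirror the pandas example and are assumptions.
from fsspec import AbstractFileSystem
from pyspark.sql import DataFrame, SparkSession


def spark_write_parquet(obj: DataFrame, fs: AbstractFileSystem, path: str) -> None:
    """Write a PySpark DataFrame to `path` as Parquet."""
    # Spark writes via its own Hadoop I/O layer rather than through fsspec,
    # so pass a full URI; `unstrip_protocol` re-attaches the filesystem protocol.
    obj.write.parquet(fs.unstrip_protocol(path), mode="overwrite")


def spark_read_parquet(fs: AbstractFileSystem, path: str) -> DataFrame:
    """Read a Parquet dataset from `path` back into a PySpark DataFrame."""
    spark = SparkSession.builder.getOrCreate()  # reuse the active session
    return spark.read.parquet(fs.unstrip_protocol(path))


# Wrapping the type would presumably follow the pandas example (assumed API):
# from pydantic_cereal import Cereal
# cereal = Cereal()
# SparkParquetFrame = cereal.wrap_type(
#     DataFrame, reader=spark_read_parquet, writer=spark_write_parquet
# )
```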

Possible Issues

⚠️ One pretty big problem... Spark workers can technically run anywhere, which means they won't necessarily have access to the same physical storage! If you want to save something on a "local" file system, it will only be accessible from the driver (the "main" process) unless the same volumes are somehow mounted on the workers...

How will we test this? I'm not entirely sure. For now, writing code and testing a single-node cluster implementation will be enough.
Later on, we can implement checks (before writing) that all workers can reach the target path...
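
For that later check, one possible approach (a sketch, not a decided design) is to probe the target's parent directory from several executor tasks before writing; this only makes sense for local-filesystem paths, and the function name is made up:

```python
# Sketch of a pre-write check that probed executor tasks can see the target path.
# Only meaningful for local-filesystem paths; name and approach are assumptions.
import os

from pyspark.sql import SparkSession


def all_workers_can_reach(spark: SparkSession, path: str, probes: int = 8) -> bool:
    """Return True if every probed executor task sees the target's parent directory."""
    parent = os.path.dirname(path.rstrip("/")) or "/"
    sc = spark.sparkContext
    # Spread a trivial job over several partitions so it runs on multiple executors.
    results = (
        sc.parallelize(range(probes), numSlices=probes)
        .map(lambda _: os.path.isdir(parent))
        .collect()
    )
    return all(results)
```

In single-node local mode this is trivially true (all tasks run on the driver), so it only starts to matter once real clusters are in play.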

Labels

dependencies, documentation, enhancement, github_actions, python
