This library provides an easy way to create pyspark dataframes filled with mock data. Given a spark schema, it builds a dataframe factory whose mock data is generated by both faker and the random package, which is very useful when writing unittests. This package is heavily inspired by pydantic-factories, an awesome project built for generating mock data for Pydantic models.
In its most basic form, you create a custom factory by inheriting from the ModelFactory class and providing the schema at class creation, as in the example below:
from pyspark.sql import SparkSession
from pyspark.sql.types import (BooleanType, DateType, DecimalType, IntegerType,
                               StringType, StructField, StructType,
                               TimestampType)

from pyspark_factories.factory import ModelFactory
# define some pyspark schema
spark_schema = StructType(
    [
        StructField("string_type", StringType(), True),
        StructField("bool_type", BooleanType(), False),
        StructField("date_type", DateType(), False),
        StructField("datetime_type", TimestampType(), False),
        StructField("decimal_type", DecimalType(precision=8, scale=5), False),
        StructField(
            "nested_type",
            StructType([StructField("nested_deeper_type", IntegerType(), True)]),
        ),
    ]
)
class TestFactory(ModelFactory):
    __model__ = spark_schema  # add pyspark schema to class creation
# you need a spark session if you want to create a spark dataframe
spark = SparkSession.builder.getOrCreate()
# generates a dataframe containing mock data with 1 row
results = TestFactory.create(spark, nr_of_rows=1)
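The result is a regular pyspark dataframe, so you can inspect it with the usual dataframe methods:

# show the generated rows and the schema of the result
results.show(truncate=False)
results.printSchema()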
In some cases it can be useful to overwrite specific fields so they are not generated randomly. For example, say we want to give the field string_type from the example above a fixed value. Such overwrites are defined with the Overwrite object, and nested fields can be selected using '.' (dot) notation. An example can be found below.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

from pyspark_factories.factory import ModelFactory, Overwrite
# define some pyspark schema
spark_schema = StructType(
    [
        StructField("string_type", StringType(), True),
        StructField(
            "nested_type",
            StructType([StructField("nested_string", StringType(), True)]),
        ),
    ]
)
class TestFactory(ModelFactory):
    __model__ = spark_schema
    __overwrite__ = [
        Overwrite("string_type", "overwritten"),
        Overwrite("nested_type.nested_string", "overwritten"),
    ]
# you need a spark session if you want to create a spark dataframe
spark = SparkSession.builder.getOrCreate()
# generates a dataframe containing mock data with 1 row
results = TestFactory.create(spark, nr_of_rows=1)
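Because overwritten fields get the same fixed value in every row, you can check them directly. A minimal sketch, assuming create returns a regular pyspark dataframe as in the first example:

# collect the single generated row and verify the fixed values
row = results.collect()[0]
assert row.string_type == "overwritten"
assert row.nested_type.nested_string == "overwritten"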
There are a few more useful attributes you can set when creating a factory:
from faker import Faker


class TestFactory(ModelFactory):
    __model__ = spark_schema
    # whether or not to randomly include None values if the pyspark schema allows optionals
    __allow_none_optionals__ = True  # default: True
    # maximum length of the arrays that are randomly generated
    __max_array_length__ = 10
    # defaults to a standard Faker instance but can be provided by the user
    __faker__ = Faker()
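As a sketch of how these attributes interact, the hypothetical factory below caps randomly generated arrays at 3 elements and may emit None for the nullable field (this assumes ArrayType fields are supported; the schema and field names are made up for illustration):

from faker import Faker
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

from pyspark_factories.factory import ModelFactory

array_schema = StructType(
    [
        # nullable, so values may randomly be None when __allow_none_optionals__ is True
        StructField("maybe_name", StringType(), True),
        # arrays are generated with at most __max_array_length__ elements
        StructField("scores", ArrayType(IntegerType()), False),
    ]
)


class ArrayFactory(ModelFactory):
    __model__ = array_schema
    __allow_none_optionals__ = True
    __max_array_length__ = 3
    __faker__ = Faker(locale="en_US")  # any Faker instance can be supplied


spark = SparkSession.builder.getOrCreate()
results = ArrayFactory.create(spark, nr_of_rows=5)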
To develop this project, fork and clone the repository, and make sure you have Poetry installed.
To install the required development packages, run:
poetry install --with dev
To run the tests, run:
poetry run pytest tests
The package is formatted with black and isort:
poetry run isort .
poetry run black .
The project should also be linted with pflake8:
poetry run pflake8
Contributions are very welcome! To contribute, please follow these steps:
- Fork this repository
- Create an issue describing what needs to be fixed or improved (if one does not already exist).
- Create a pull request from your forked branch and reference it in the issue
Thanks!