This library provides an easy way to create pyspark dataframes filled with mock data. Given a spark schema, it builds a dataframe factory whose mock data is generated by both faker and the random package, which is very useful when writing unittests. This package is heavily inspired by pydantic-factories, an awesome project built for generating mock data for Pydantic models.
In its most basic form, you create a custom factory by inheriting from the ModelFactory class and providing the schema at class creation, as in the example below:
from pyspark.sql import SparkSession
from pyspark.sql.types import (BooleanType, DateType, DecimalType, IntegerType,
                               StringType, StructField, StructType,
                               TimestampType)

from pyspark_factories.factory import ModelFactory
# define some pyspark schema
spark_schema = StructType(
    [
        StructField("string_type", StringType(), True),
        StructField("bool_type", BooleanType(), False),
        StructField("date_type", DateType(), False),
        StructField("datetime_type", TimestampType(), False),
        StructField("decimal_type", DecimalType(precision=8, scale=5), False),
        StructField(
            "nested_type",
            StructType([StructField("nested_deeper_type", IntegerType(), True)]),
        ),
    ]
)
class TestFactory(ModelFactory):
    __model__ = spark_schema  # add pyspark schema to class creation
# you need a spark session if you want to create a spark dataframe
spark = SparkSession.builder.getOrCreate()
# generates a dataframe containing mock data with 1 row
results = TestFactory.create(spark, nr_of_rows=1)
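The result is a regular pyspark dataframe, so you can inspect it with the usual dataframe methods:

# show the generated rows and the schema of the result
results.show(truncate=False)
results.printSchema()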
In some cases it can be useful to overwrite specific fields so they are not generated randomly. For example, say we want to give the field string_type from the example above a fixed value. Such overwrites are defined with the Overwrite object, and nested fields can be selected using '.' (dot) notation. An example can be found below.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

from pyspark_factories.factory import ModelFactory, Overwrite
# define some pyspark schema
spark_schema = StructType(
    [
        StructField("string_type", StringType(), True),
        StructField(
            "nested_type",
            StructType([StructField("nested_string", StringType(), True)]),
        ),
    ]
)
class TestFactory(ModelFactory):
    __model__ = spark_schema
    __overwrite__ = [
        Overwrite("string_type", "overwritten"),
        Overwrite("nested_type.nested_string", "overwritten"),
    ]
# you need a spark session if you want to create a spark dataframe
spark = SparkSession.builder.getOrCreate()
# generates a dataframe containing mock data with 1 row
results = TestFactory.create(spark, nr_of_rows=1)
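Because overwritten fields get the same fixed value in every row, you can check them directly. A minimal sketch, assuming create returns a regular pyspark dataframe as in the first example:

# collect the single generated row and verify the fixed values
row = results.collect()[0]
assert row.string_type == "overwritten"
assert row.nested_type.nested_string == "overwritten"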
There are a few more useful attributes you can set when creating a factory:
from faker import Faker


class TestFactory(ModelFactory):
    __model__ = spark_schema
    # whether or not to randomly include None values if the pyspark schema allows optionals
    __allow_none_optionals__ = True  # default: True
    # maximum length of the arrays that are randomly generated
    __max_array_length__ = 10
    # defaults to a standard Faker instance but can be provided by the user
    __faker__ = Faker()
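As a sketch of how these attributes interact, the hypothetical factory below caps randomly generated arrays at 3 elements and may emit None for the nullable field (this assumes ArrayType fields are supported; the schema and field names are made up for illustration):

from faker import Faker
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

from pyspark_factories.factory import ModelFactory

array_schema = StructType(
    [
        # nullable, so values may randomly be None when __allow_none_optionals__ is True
        StructField("maybe_name", StringType(), True),
        # arrays are generated with at most __max_array_length__ elements
        StructField("scores", ArrayType(IntegerType()), False),
    ]
)


class ArrayFactory(ModelFactory):
    __model__ = array_schema
    __allow_none_optionals__ = True
    __max_array_length__ = 3
    __faker__ = Faker(locale="en_US")  # any Faker instance can be supplied


spark = SparkSession.builder.getOrCreate()
results = ArrayFactory.create(spark, nr_of_rows=5)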
To develop this project, fork and clone the repository, and make sure you have Poetry installed.
To install the required development packages, run:
poetry install --with dev
To run the tests, run:
poetry run pytest tests
The package is formatted with black and isort:
poetry run isort .
poetry run black .
The project should also be linted with pflake8:
poetry run pflake8
Contributions are very welcome! To contribute, please follow these steps:
- Fork this repository
- Create an issue describing what needs to be fixed or improved (if one does not already exist).
- Create a pull request from your forked branch and reference it in the issue
Thanks!