
Data Caterer - Data Generation and Validation

Data Catering

Overview

Generate data for databases, files, JMS or HTTP requests through a Scala/Java API or YAML input, executed via Spark. Run data validations after generating data to ensure it is consumed correctly.

Full docs can be found here.
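As a rough illustration of the Scala API, the sketch below defines a small CSV generation task. The import path and builder methods (csv, field, count, execute) are assumptions based on the documented API and may differ between versions, so treat it as an outline rather than a copy-paste example.

```scala
// Minimal sketch of a generation plan via the Scala API.
// Package name and builder methods are assumptions and may differ per version.
import io.github.datacatering.datacaterer.api.PlanRun

class AccountPlan extends PlanRun {
  // Generate 1,000 CSV records with a pattern-based ID and a DataFaker-backed name
  val accountTask = csv("accounts", "/opt/app/data/accounts")
    .schema(
      field.name("account_id").regex("ACC[0-9]{8}"),
      field.name("name").expression("#{Name.name}")
    )
    .count(count.records(1000))

  execute(accountTask)
}
```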

Features

  • Metadata discovery
  • Batch and/or event data generation
  • Maintain referential integrity across any dataset
  • Create custom data generation/validation scenarios
  • Clean up generated data
  • Data validation
  • Suggest data validations

Basic flow

Quick start

git clone git@github.com:pflooky/data-caterer-example.git
cd data-caterer-example && ./run.sh
# Check the generated report at docker/sample/report/index.html

Integrations

Supported data sources

Data Caterer supports the following data sources:

  1. Database
    1. JDBC
      1. Postgres
      2. MySQL
    2. Cassandra
    3. ElasticSearch (coming soon)
  2. Files (local or cloud)
    1. CSV
    2. Parquet
    3. ORC
    4. Delta (coming soon)
    5. JSON
  3. HTTP (sponsors only)
  4. JMS (sponsors only)
    1. Solace
  5. Kafka (sponsors only)

Metadata sources (sponsors only):

  1. OpenAPI
  2. Marquez (OpenLineage)
  3. OpenMetadata

Supported use cases

  1. Insert into single data sink
  2. Insert into multiple data sinks
    1. Foreign keys associated between data sources (see the sketch after this list)
    2. Number of records per column value
  3. Set random seed at column and whole data generation level
  4. Generate real looking data (via DataFaker) and edge cases
    1. Names, addresses, places etc.
    2. Edge cases for each data type (e.g. newline character in string, maximum integer, NaN, 0)
    3. Nullability
  5. Send events progressively
  6. Automatically insert data into data source
    1. Read metadata from data source and insert for all sub data sources (e.g. tables)
    2. Get statistics from existing data in the data source, if it exists
  7. Track and delete generated data
  8. Extract data profiling and metadata from given data sources
    1. Calculate the total number of combinations
  9. Validate data
    1. Basic column validations (not null, contains, equals, greater than)
    2. Aggregate validations (e.g. group by account_id and the sum of amounts should be less than 100; each account should have at least one transaction)
    3. Upstream data source validations (generate data and then check same data is inserted in another data source with potential transformations)
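The sketch below shows how a foreign key relationship and a couple of validations might be wired up inside a PlanRun class like the one in the Overview. Method names such as addForeignKeyRelationship, validations, validation.col and groupBy are assumptions based on the documented API and may differ by version.

```scala
// Hedged sketch: link generated accounts and transactions by account_id,
// then validate the transaction data after generation (inside a PlanRun class).
val accountTask = csv("accounts", "/opt/app/data/accounts")
  .schema(field.name("account_id").regex("ACC[0-9]{8}"))

val transactionTask = csv("transactions", "/opt/app/data/transactions")
  .schema(
    field.name("account_id"),
    field.name("amount")
  )
  .validations(
    validation.col("amount").greaterThan(0),                      // basic column validation
    validation.groupBy("account_id").sum("amount").lessThan(100)  // aggregate validation
  )

// Every generated transaction references an account_id that exists in accounts
val myPlan = plan.addForeignKeyRelationship(
  accountTask, List("account_id"),
  List(transactionTask -> List("account_id"))
)

// Exact execute overloads may differ by version
execute(myPlan, configuration, accountTask, transactionTask)
```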

Run Configurations

Different ways to run Data Caterer based on your use case:

Types of run configurations

Sponsorship

Data Caterer is set up under a sponsorware model where all features are available to sponsors. A subset of the features is available in this project as the open core, for anyone to use, fork, update or improve.

Sponsors have access to the following features:

  • Metadata discovery
  • All data sources (see here for all data sources)
  • Batch and event data generation
  • Auto generation from data connections or metadata sources
  • Suggest data validations
  • Clean up generated data
  • Run as many times as you want, not charged by usage, plus more to come

Find out more details here on how to help with sponsorship.

This is inspired by the mkdocs-material project which follows the same model.

Additional Details

High Level Flow

Data Caterer high level design

Roadmap

Check here for the full list.

Challenges

  • How to apply foreign keys across datasets
  • Providing functions for data generators
  • Setting out the Plan -> Task -> Step model
  • How to process the data in batches
  • Data cleanup after run
    • Save data into parquet files. Can read and delete when needed
    • Have option to delete directly
    • Have to do in particular order due to foreign keys
  • Relationships/constraints between fields (see the sketch after this list)
    • e.g. if transaction has type purchase, then it is a debit
    • if country is Australia, then country code should be AU
    • could be one to one, one to many, many to many mapping
  • Predict the type of string expression to use from DataFaker
    • Utilise the metadata for the field
  • Having intermediate fields and not including them in the output
    • Allow for SQL expressions
  • Issues with spark streaming to write real-time data
    • Using rate format, have to manage the connection to the data source yourself
    • Connection per batch, stopped working for Solace after 125 messages (5 per second)
  • Generating regex pattern given data samples
  • Database-generated column values
    • Auto increment
    • On update current_timestamp
    • Omit generating columns (only if they are not used as foreign keys)
  • Metadata storage and referencing
    • How will it interact with a data dictionary?
    • Updated schema/metadata
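As a concrete illustration of the relationship and intermediate-field challenges above, a derived field can be expressed as SQL over other generated fields. The .sql, .oneOf and .omit builders shown here are assumptions based on the documented API and may differ by version.

```scala
// Hedged sketch: derive country_code from the generated country field via a SQL
// expression, and mark a helper field as omitted from the final output.
val customerTask = csv("customers", "/opt/app/data/customers")
  .schema(
    field.name("country").oneOf("Australia", "United States", "United Kingdom"),
    field.name("country_code").sql(
      "CASE WHEN country = 'Australia' THEN 'AU' " +
        "WHEN country = 'United States' THEN 'US' ELSE 'GB' END"
    ),
    field.name("internal_rank").omit(true)  // intermediate field, excluded from output
  )
```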

Resources

Spark test data generator

Java 17 VM Options

--add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED
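If you run through sbt, one way to apply these flags is to fork the JVM and pass them as javaOptions. This is a sketch assuming an sbt-based build, not something the project mandates.

```scala
// build.sbt sketch (assumes an sbt build): fork a JVM and pass the Java 17
// --add-opens flags that Spark needs to access JDK internals.
fork := true
javaOptions ++= Seq(
  "--add-opens=java.base/java.lang=ALL-UNNAMED",
  "--add-opens=java.base/java.util=ALL-UNNAMED",
  "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
  // ...plus the remaining --add-opens flags listed above
)
```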