Generate data for databases, files, JMS or HTTP requests through a Scala/Java API or YAML input, executed via Spark. Run data validations after generating data to ensure it is consumed correctly.
Full docs can be found here. A demo of the UI can be found here.
- Metadata discovery
- Batch and/or event data generation
- Maintain referential integrity across any dataset
- Create custom data generation/validation scenarios
- Clean up generated data
- Data validation
- Suggest data validations
- Mac download
- Windows download
  - After downloading, go to the 'Downloads' folder and select 'Extract All' on data-caterer-windows
  - Double-click 'DataCaterer-1.0.0' to install Data Caterer
  - Click on 'More info', then click 'Run anyway' at the bottom
  - Go to the '/Program Files/DataCaterer' folder and run the DataCaterer application
  - If your browser doesn't open, go to http://localhost:9898 in your preferred browser
- Linux download
- Docker
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.7.0
Once the container is running, open http://localhost:9898.
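The container is started detached (`-d`), so the UI may take a moment before it answers on port 9898. As a minimal sketch (not part of Data Caterer), a generic retry helper can poll until the page responds. The name `wait_until` is hypothetical, and the usage line assumes `curl` is installed.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# (hypothetical helper, not shipped with Data Caterer)
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Run the probe command quietly; stop as soon as it succeeds.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Only pause if another attempt is still to come.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (assumes curl is installed and the container is running):
# wait_until 30 1 curl -fsS -o /dev/null http://localhost:9898 && echo "UI is up"
```

Passing the probe as a command keeps the helper independent of `curl`, so any other check (for example `nc` or `wget`) can be substituted.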
git clone git@github.com:data-catering/data-caterer-example.git
cd data-caterer-example && ./run.sh
# check results in docker/sample/report/index.html
Data Caterer is able to support the following data sources:
| Data Source Type | Data Source | Sponsor |
|---|---|---|
| Database | Postgres, MySQL, Cassandra | N |
| File | CSV, JSON, ORC, Parquet | N |
| Messaging | Kafka, Solace | Y |
| HTTP | REST API | Y |
| Metadata | Marquez, OpenMetadata, OpenAPI/Swagger | Y |
- Insert into single data sink
- Insert into multiple data sinks
- Foreign keys associated between data sources
- Number of records per column value
- Set random seed at column and whole data generation level
- Generate real-looking data (via DataFaker) and edge cases
  - Names, addresses, places etc.
  - Edge cases for each data type (e.g. newline character in string, maximum integer, NaN, 0)
- Nullability
- Send events progressively
- Automatically insert data into data source
- Read metadata from data source and insert for all sub data sources (e.g. tables)
- Get statistics from existing data in the data source, if any exists
- Track and delete generated data
- Extract data profiling and metadata from given data sources
- Calculate the total number of combinations
- Validate data
  - Basic column validations (not null, contains, equals, greater than)
  - Aggregate validations (group by account_id and sum amounts should be less than 100, each account should have at least one transaction)
  - Upstream data source validations (generate data and then check same data is inserted in another data source with potential transformations)
  - Column name validations (check count and ordering of column names)
  - Data migration validations
    - Ensure row counts are equal
    - Check both data sources have same values for key columns
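
To make the data migration checks concrete, here is a minimal sketch of the same two ideas (equal row counts, matching key values) applied to two CSV exports using plain shell tools. This is not how Data Caterer implements validations. The function name `check_migration` and the file layout (comma-separated, one header row) are assumptions for illustration.

```shell
# Compare two CSV files (each with a header row) the way a basic
# migration check would: same row count, same set of key values.
# Usage: check_migration SOURCE_CSV TARGET_CSV KEY_COLUMN_NUMBER
# (hypothetical helper, not part of Data Caterer)
check_migration() {
  src=$1
  tgt=$2
  key_col=$3

  # Count data rows, skipping the header line.
  src_rows=$(awk 'NR > 1 { n++ } END { print n + 0 }' "$src")
  tgt_rows=$(awk 'NR > 1 { n++ } END { print n + 0 }' "$tgt")
  if [ "$src_rows" -ne "$tgt_rows" ]; then
    echo "row count mismatch: $src_rows vs $tgt_rows"
    return 1
  fi

  # Sorted key values from each side. The split on commas is naive,
  # so quoted fields that contain commas are not handled.
  src_keys=$(awk -F, -v c="$key_col" 'NR > 1 { print $c }' "$src" | sort)
  tgt_keys=$(awk -F, -v c="$key_col" 'NR > 1 { print $c }' "$tgt" | sort)
  if [ "$src_keys" != "$tgt_keys" ]; then
    echo "key values differ in column $key_col"
    return 1
  fi

  echo "ok: $src_rows rows, keys match"
}
```

Sorting the keys before comparing means row order in the target does not matter, only the set of key values and their multiplicity.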
Different ways to run Data Caterer based on your use case:
Data Caterer is set up under a sponsorware model where all features are available to sponsors. The core features are available here in this project for all to use/fork/update/improve etc., as the open core.
Sponsors have access to the following features:
- Metadata discovery
- All data sources (see here for all data sources)
- Batch and Event generation
- Auto generation from data connections or metadata sources
- Suggest data validations
- Clean up generated data
- Run as many times as you want, not charged by usage
- Plus more to come
Find out more details here to help with sponsorship.
This is inspired by the mkdocs-material project which follows the same model.
View details here about how you can contribute to the project.
Design motivations and details can be found here.
- Allow the application to run with UI enabled
- Runs as a long-lived app with a UI that interacts with the existing app in a single container
- Ability to run as UI, Spark job or both
- Persist data in files or database (Postgres)
- UI will show history of data generation/validation runs, delete generated data, create new scenarios, define data connections
gradle clean :api:shadowJar :app:shadowJar
docker build --build-arg "APP_VERSION=0.7.0" --build-arg "SPARK_VERSION=3.5.0" --no-cache -t datacatering/data-caterer:0.7.0 .
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone -v data-caterer-data:/opt/data-caterer --name datacaterer datacatering/data-caterer:0.7.0
# open localhost:9898
JPACKAGE_BUILD=true gradle clean :api:shadowJar :app:shadowJar
# Mac
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-mac.cfg"
# Windows
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-windows.cfg"
# Linux
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-linux.cfg"
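
The three jpackage commands differ only in the OS-specific options file, so the choice can be scripted. The sketch below maps the output of `uname -s` to one of the config paths listed above. The function name `jpackage_os_cfg` is hypothetical, and the Windows branch assumes the build runs from a Git Bash, MSYS or Cygwin shell.

```shell
# Map a kernel name (as printed by `uname -s`) to the matching
# OS-specific jpackage options file from the commands above.
# (hypothetical helper, not part of the repository)
jpackage_os_cfg() {
  case "$1" in
    Darwin)               echo "misc/jpackage/jpackage-mac.cfg" ;;
    Linux)                echo "misc/jpackage/jpackage-linux.cfg" ;;
    # Assumption: Windows builds run from Git Bash, MSYS or Cygwin.
    MINGW*|MSYS*|CYGWIN*) echo "misc/jpackage/jpackage-windows.cfg" ;;
    *)                    echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Example: build the installer for the current machine.
# cfg=$(jpackage_os_cfg "$(uname -s)") && jpackage "@misc/jpackage/jpackage.cfg" "@$cfg"
```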