Synthetic Data Solution

Generate high-quality synthetic data, both structured and unstructured, for professional services development, so that development teams can build and test client solutions without handling real, sensitive client data.

Features

Context-Aware Generation: Provide client or project context and get realistic synthetic data in return
Schema Inference: Automatically determine data structures from natural language descriptions
Multiple Data Types:
- Structured: Tabular (CSV, Excel), JSON, SQL with relationships
- Unstructured: Documents, emails, case narratives
Domain Support: Consulting, Financial, Healthcare, Legal, and Generic domains
Validation Workflow: Review samples before generating full corpus
Dual LLM Support: Works with both OpenAI and Anthropic APIs
CLI & API: Use via command line or REST API

Quick Start

Installation

# Clone the repository
git clone https://github.com/your-org/synthetic-data-solution.git
cd synthetic-data-solution

# Install with uv (recommended)
uv sync

# Or with pip
pip install -e ".[dev]"

Configuration

Create a .env file with your LLM API keys:

# Required: At least one LLM provider API key
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Optional: Set default provider (defaults to anthropic)
DEFAULT_LLM_PROVIDER=anthropic
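
As a rough illustration of how these variables could interact, the sketch below picks a provider from the environment. The helper `resolve_provider` and the fallback to whichever key is present are assumptions made for illustration; the project's actual selection logic may differ.

```python
import os

# Hypothetical mapping: names mirror the .env keys above, but the
# project's real selection logic may differ.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def resolve_provider(env=None):
    """Return the provider to use, or raise if no usable API key is set."""
    env = os.environ if env is None else env
    preferred = env.get("DEFAULT_LLM_PROVIDER", "anthropic").lower()
    if preferred not in PROVIDER_KEYS:
        raise ValueError(f"Unknown provider: {preferred!r}")
    # Use the preferred provider when its key is present.
    if env.get(PROVIDER_KEYS[preferred]):
        return preferred
    # Otherwise fall back to any provider that has a key (an assumption).
    for name, key_var in PROVIDER_KEYS.items():
        if env.get(key_var):
            return name
    raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY")
```

Passing a plain dict instead of `os.environ` makes the behavior easy to check without touching real credentials.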

CLI Usage

# Basic generation with inline context
synth generate -c "Healthcare client with 500 patients, need appointment records and billing data"

# Generate with specific output format and directory
synth generate -c "..." --format csv --output ./data/

# Interactive mode with feedback loop
synth generate -c "..." --interactive

# Generate from context file
synth generate --file context.txt --format json --size 1000

# Auto-approve without review (skip sample validation)
synth generate -c "..." --size 5000 --auto-approve

# Validate schemas without generating data
synth validate -c "Legal case files for litigation firm"

# Start API server
synth serve --host 127.0.0.1 --port 8000 --reload

API Usage

Start the server:

# Using CLI
synth serve --reload

# Or directly with uvicorn
uvicorn src.api.main:app --reload

The API will be available at:

API Base: http://localhost:8000/api/v1
Interactive Docs: http://localhost:8000/docs
OpenAPI Schema: http://localhost:8000/openapi.json

Example API Calls

# Analyze context
curl -X POST "http://localhost:8000/api/v1/context/analyze" \
  -H "Content-Type: application/json" \
  -d '{"context": "Healthcare clinic needs patient records, appointments, and billing data"}'

# Generate samples
curl -X POST "http://localhost:8000/api/v1/generate/sample" \
  -H "Content-Type: application/json" \
  -d '{"context": "Legal firm case management system with cases and documents", "sample_size": 5}'

# Start corpus generation (async)
curl -X POST "http://localhost:8000/api/v1/generate/corpus" \
  -H "Content-Type: application/json" \
  -d '{"context": "...", "corpus_size": 1000}'

# Check job status (replace {job_id} with the job's ID, here and in the export call below)
curl "http://localhost:8000/api/v1/jobs/{job_id}/status"

# Export completed job
curl -X POST "http://localhost:8000/api/v1/jobs/{job_id}/export" \
  -H "Content-Type: application/json" \
  -d '{"format": "csv"}'
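
The same calls can be made from Python. The sketch below uses only the standard library and mirrors the sample-generation request above; the helpers `build_post` and `send` are illustrative names, and the shape of the response body is not documented here, so the result is simply decoded as JSON.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"


def build_post(path, payload):
    """Build (but do not send) a JSON POST request for the given API path."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a request and decode the JSON body. Requires a running server."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


sample_request = build_post(
    "/generate/sample",
    {
        "context": "Legal firm case management system with cases and documents",
        "sample_size": 5,
    },
)
# result = send(sample_request)  # uncomment once `synth serve` is running
```

Separating request construction from sending keeps the payloads inspectable before any network call is made.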

Supported Domains

Domain	Description	Example Data Types
Consulting	Professional services projects	Clients, projects, deliverables, timesheets
Financial	Banking and financial services	Accounts, transactions, portfolios, customers
Healthcare	Medical and healthcare data	Patients, appointments, medical records, billing
Legal	Law firm and legal services	Cases, clients, documents, billing, court filings
Generic	General-purpose data	People, organizations, contacts, addresses

Export Formats

CSV: Standard comma-separated values
JSON: Nested JSON with relationships preserved
XLSX: Excel workbook with multiple sheets
SQL: DDL + INSERT statements (PostgreSQL, MySQL, SQLite)
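
To make the "DDL + INSERT" idea concrete, here is a minimal sketch that renders a list of records as a CREATE TABLE statement followed by INSERT statements, then loads the script into an in-memory SQLite database. The functions `sql_type`, `sql_literal`, and `to_sql` and the sample rows are invented for illustration and are not the project's exporter.

```python
import sqlite3


def sql_type(value):
    """Map a Python value to a basic SQL column type."""
    if isinstance(value, int):  # also covers bool
        return "INTEGER"
    if isinstance(value, float):
        return "REAL"
    return "TEXT"


def sql_literal(value):
    """Render a Python value as a SQL literal, escaping single quotes."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return str(int(value))
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def to_sql(table, rows):
    """Render rows (dicts sharing the same keys) as DDL plus INSERTs."""
    columns = list(rows[0])
    col_defs = ", ".join(f"{c} {sql_type(rows[0][c])}" for c in columns)
    col_list = ", ".join(columns)
    statements = [f"CREATE TABLE {table} ({col_defs});"]
    for row in rows:
        values = ", ".join(sql_literal(row[c]) for c in columns)
        statements.append(f"INSERT INTO {table} ({col_list}) VALUES ({values});")
    return "\n".join(statements)


# Made-up example rows, not output from the tool.
patients = [
    {"id": 1, "name": "Ana O'Neil", "balance": 120.5},
    {"id": 2, "name": "Sam Lee", "balance": 0.0},
]
script = to_sql("patients", patients)

conn = sqlite3.connect(":memory:")
conn.executescript(script)
```

Column types are inferred from the first row only, which is a simplification; a real exporter would likely take types from the inferred schema and handle dialect differences between PostgreSQL, MySQL, and SQLite.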

Development

Setup

# Install dev dependencies
uv sync --all-extras

# Install pre-commit hooks (optional)
pre-commit install

Commands

# Run all unit tests
pytest tests/unit/ --override-ini="addopts="

# Run integration tests
pytest tests/integration/ --override-ini="addopts="

# Run E2E tests (requires LLM API keys)
SYNTH_E2E_TESTS=1 pytest tests/e2e/ --override-ini="addopts="

# Run linting
ruff check src/ tests/

# Run type checking
mypy src/

# Format code
black src/ tests/

Project Structure

src/
├── core/           # Config, LLM client abstraction
├── context/        # Context analysis and domain classification
├── schema/         # Schema inference, validation, and templates
├── generators/     # Data generators
│   ├── structured/     # Tabular, JSON, SQL generators
│   └── unstructured/   # Document, email, narrative generators
├── pipeline/       # Orchestration and generation workflows
├── api/            # FastAPI REST API
├── cli/            # Typer CLI application
└── utils/          # Logging, export utilities

tests/
├── unit/           # Unit tests (298 tests)
├── integration/    # Integration tests
├── e2e/            # End-to-end tests with real LLM
├── benchmarks/     # Performance benchmarks
└── fixtures/       # Realistic test data

config/
├── default.yaml    # Default configuration
└── prompts/        # LLM prompt templates

docs/
├── quickstart.md   # Getting started guide
├── api-reference.md # API documentation
├── deployment.md   # Deployment guide
└── context/        # Development context files

Documentation

Configuration

The application can be configured via:

Environment variables
.env file
config/default.yaml

Environment variables take precedence over YAML configuration.

Key Configuration Options

Variable	Description	Default
`OPENAI_API_KEY`	OpenAI API key	-
`ANTHROPIC_API_KEY`	Anthropic API key	-
`DEFAULT_LLM_PROVIDER`	Default provider (openai/anthropic)	anthropic
`GENERATION_SAMPLE_SIZE`	Default sample size	10
`GENERATION_BATCH_SIZE`	Batch size for corpus	100
`LOGGING_LEVEL`	Log level	INFO
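
The precedence rule stated above (environment variables over YAML) can be sketched as a simple overlay. In this illustration a plain dict stands in for the parsed `config/default.yaml`, and `load_settings` is a hypothetical helper; how the application actually merges its sources is not shown in this README.

```python
import os

# Stand-in for values parsed from config/default.yaml (hypothetical shape),
# using the defaults listed in the table above.
YAML_DEFAULTS = {
    "GENERATION_SAMPLE_SIZE": 10,
    "GENERATION_BATCH_SIZE": 100,
    "LOGGING_LEVEL": "INFO",
}


def load_settings(yaml_values, env=None):
    """Overlay environment variables on top of YAML-provided values."""
    env = os.environ if env is None else env
    settings = dict(yaml_values)
    for key, default in yaml_values.items():
        if key in env:
            # Cast to the default's type so the string "25" becomes 25.
            settings[key] = type(default)(env[key])
    return settings


settings = load_settings(YAML_DEFAULTS, {"GENERATION_SAMPLE_SIZE": "25"})
```

Here `settings["GENERATION_SAMPLE_SIZE"]` comes from the environment, while the batch size and log level keep their YAML values.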

License

MIT License - see LICENSE for details.
