joshburslem/synthetic-data-solution

Synthetic Data Solution

Generate high-quality synthetic data (structured and unstructured) for professional services development. Enable development teams to build and test client solutions without handling actual sensitive client data.

Features

  • Context-Aware Generation: Provide client/project context, get realistic synthetic data
  • Schema Inference: Automatically determine data structures from natural language descriptions
  • Multiple Data Types:
    • Structured: Tabular (CSV, Excel), JSON, SQL with relationships
    • Unstructured: Documents, emails, case narratives
  • Domain Support: Consulting, Financial, Healthcare, Legal, and Generic domains
  • Validation Workflow: Review samples before generating full corpus
  • Dual LLM Support: Works with both OpenAI and Anthropic APIs
  • CLI & API: Use via command line or REST API

Quick Start

Installation

# Clone the repository
git clone https://github.com/joshburslem/synthetic-data-solution.git
cd synthetic-data-solution

# Install with uv (recommended)
uv sync

# Or with pip
pip install -e ".[dev]"

Configuration

Create a .env file with your LLM API keys:

# Required: At least one LLM provider API key
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Optional: Set default provider (defaults to anthropic)
DEFAULT_LLM_PROVIDER=anthropic
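
The "at least one key" rule above can be sketched in Python as follows. This is only an illustration of the rule, not the project's own code: `resolve_provider` and `PROVIDER_KEYS` are hypothetical names, and falling back to the other provider when the preferred one has no key is an assumption.

```python
import os
from typing import Mapping, Optional

# Hypothetical mapping; the variable names come from the .env example above.
PROVIDER_KEYS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def resolve_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick an LLM provider that has an API key configured."""
    env = os.environ if env is None else env
    preferred = env.get("DEFAULT_LLM_PROVIDER", "anthropic").strip().lower()
    if preferred not in PROVIDER_KEYS:
        raise ValueError(f"Unknown provider: {preferred!r}")
    # Use the preferred provider when its key is present.
    if env.get(PROVIDER_KEYS[preferred]):
        return preferred
    # Assumed fallback: use whichever other provider has a key.
    for name, key_var in PROVIDER_KEYS.items():
        if env.get(key_var):
            return name
    raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY")
```

Passing a plain dictionary instead of `os.environ` makes the rule easy to check without touching the real environment.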

CLI Usage

# Basic generation with inline context
synth generate -c "Healthcare client with 500 patients, need appointment records and billing data"

# Generate with specific output format and directory
synth generate -c "..." --format csv --output ./data/

# Interactive mode with feedback loop
synth generate -c "..." --interactive

# Generate from context file
synth generate --file context.txt --format json --size 1000

# Auto-approve without review (skip sample validation)
synth generate -c "..." --size 5000 --auto-approve

# Validate schemas without generating data
synth validate -c "Legal case files for litigation firm"

# Start API server
synth serve --host 127.0.0.1 --port 8000 --reload

API Usage

Start the server:

# Using CLI
synth serve --reload

# Or directly with uvicorn
uvicorn src.api.main:app --reload

The example calls below assume the API is available at http://localhost:8000.

Example API Calls

# Analyze context
curl -X POST "http://localhost:8000/api/v1/context/analyze" \
  -H "Content-Type: application/json" \
  -d '{"context": "Healthcare clinic needs patient records, appointments, and billing data"}'

# Generate samples
curl -X POST "http://localhost:8000/api/v1/generate/sample" \
  -H "Content-Type: application/json" \
  -d '{"context": "Legal firm case management system with cases and documents", "sample_size": 5}'

# Start corpus generation (async)
curl -X POST "http://localhost:8000/api/v1/generate/corpus" \
  -H "Content-Type: application/json" \
  -d '{"context": "...", "corpus_size": 1000}'

# Check job status
curl "http://localhost:8000/api/v1/jobs/{job_id}/status"

# Export completed job
curl -X POST "http://localhost:8000/api/v1/jobs/{job_id}/export" \
  -H "Content-Type: application/json" \
  -d '{"format": "csv"}'
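
The same calls can be made from Python with only the standard library. The sketch below reuses the endpoint paths and request bodies from the curl examples above. The helper names (`build_request`, `call`, `export_job`) are hypothetical, and because the response schema is not documented here, the code prints the status response rather than assuming any field names.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://localhost:8000/api/v1"


def build_request(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET request, or a JSON POST request when a payload is given."""
    url = f"{BASE_URL}{path}"
    if payload is None:
        return urllib.request.Request(url, method="GET")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: Optional[dict] = None) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload)) as response:
        return json.loads(response.read().decode("utf-8"))


def export_job(job_id: str, fmt: str = "csv") -> Any:
    """Check a job's status, then request an export in the given format."""
    status = call(f"/jobs/{job_id}/status")
    print("status response:", status)  # response fields are not documented here
    return call(f"/jobs/{job_id}/export", {"format": fmt})
```

For example, `call("/generate/sample", {"context": "...", "sample_size": 5})` mirrors the "Generate samples" curl call, provided the server is running.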

Supported Domains

| Domain | Description | Example Data Types |
| --- | --- | --- |
| Consulting | Professional services projects | Clients, projects, deliverables, timesheets |
| Financial | Banking and financial services | Accounts, transactions, portfolios, customers |
| Healthcare | Medical and healthcare data | Patients, appointments, medical records, billing |
| Legal | Law firm and legal services | Cases, clients, documents, billing, court filings |
| Generic | General-purpose data | People, organizations, contacts, addresses |

Export Formats

  • CSV: Standard comma-separated values
  • JSON: Nested JSON with relationships preserved
  • XLSX: Excel workbook with multiple sheets
  • SQL: DDL + INSERT statements (PostgreSQL, MySQL, SQLite)
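
To make "DDL + INSERT statements" concrete, here is a minimal sketch of what such an export might look like. It is not the project's exporter: `to_sql` is a hypothetical helper, it targets SQLite only, and it declares columns without types, which SQLite accepts but PostgreSQL and MySQL do not.

```python
import sqlite3
from typing import Sequence


def to_sql(table: str, columns: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Render a CREATE TABLE statement followed by one INSERT per row."""

    def literal(value: object) -> str:
        if value is None:
            return "NULL"
        if isinstance(value, bool):
            return str(int(value))  # store booleans as 0/1
        if isinstance(value, (int, float)):
            return str(value)
        # Escape single quotes by doubling them, as standard SQL requires.
        return "'" + str(value).replace("'", "''") + "'"

    column_list = ", ".join(columns)
    statements = [f"CREATE TABLE {table} ({column_list});"]
    for row in rows:
        values = ", ".join(literal(v) for v in row)
        statements.append(f"INSERT INTO {table} ({column_list}) VALUES ({values});")
    return "\n".join(statements)


if __name__ == "__main__":
    script = to_sql("patients", ["id", "name"], [(1, "O'Brien"), (2, None)])
    conn = sqlite3.connect(":memory:")
    conn.executescript(script)  # load the generated script into SQLite
    print(conn.execute("SELECT id, name FROM patients ORDER BY id").fetchall())
```

Table and column names are interpolated directly, so this sketch is only suitable for trusted, generated schemas.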

Development

Setup

# Install dev dependencies
uv sync --all-extras

# Install pre-commit hooks (optional)
pre-commit install

Commands

# Run all unit tests
pytest tests/unit/ --override-ini="addopts="

# Run integration tests
pytest tests/integration/ --override-ini="addopts="

# Run E2E tests (requires LLM API keys)
SYNTH_E2E_TESTS=1 pytest tests/e2e/ --override-ini="addopts="

# Run linting
ruff check src/ tests/

# Run type checking
mypy src/

# Format code
black src/ tests/

Project Structure

src/
├── core/           # Config, LLM client abstraction
├── context/        # Context analysis and domain classification
├── schema/         # Schema inference, validation, and templates
├── generators/     # Data generators
│   ├── structured/     # Tabular, JSON, SQL generators
│   └── unstructured/   # Document, email, narrative generators
├── pipeline/       # Orchestration and generation workflows
├── api/            # FastAPI REST API
├── cli/            # Typer CLI application
└── utils/          # Logging, export utilities

tests/
├── unit/           # Unit tests (298 tests)
├── integration/    # Integration tests
├── e2e/            # End-to-end tests with real LLM
├── benchmarks/     # Performance benchmarks
└── fixtures/       # Realistic test data

config/
├── default.yaml    # Default configuration
└── prompts/        # LLM prompt templates

docs/
├── quickstart.md   # Getting started guide
├── api-reference.md # API documentation
├── deployment.md   # Deployment guide
└── context/        # Development context files

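Documentation

  • docs/quickstart.md: Getting started guide
  • docs/api-reference.md: API documentation
  • docs/deployment.md: Deployment guide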

Configuration

The application can be configured via:

  1. Environment variables
  2. .env file
  3. config/default.yaml

Environment variables take precedence over YAML configuration.
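
One way to picture this layering is a lookup that checks each source in order. The sketch below is illustrative only. It assumes the list above is in priority order, which places the .env file between environment variables and YAML (the text confirms only that environment variables beat YAML), and it assumes the YAML defaults have already been loaded into a flat dictionary keyed by the same names. `parse_dotenv` and `layered_config` are hypothetical helpers.

```python
from collections import ChainMap
from typing import Dict, Mapping


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def layered_config(
    env: Mapping[str, str],
    dotenv: Mapping[str, str],
    yaml_defaults: Mapping[str, str],
) -> ChainMap:
    """Combine the three sources so that earlier ones override later ones."""
    # ChainMap searches its maps left to right, so env wins over dotenv,
    # and dotenv wins over the YAML defaults.
    return ChainMap(dict(env), dict(dotenv), dict(yaml_defaults))


if __name__ == "__main__":
    defaults = {"LOGGING_LEVEL": "INFO", "GENERATION_SAMPLE_SIZE": "10"}
    dotenv = parse_dotenv("# local overrides\nGENERATION_SAMPLE_SIZE=25\n")
    config = layered_config({"LOGGING_LEVEL": "DEBUG"}, dotenv, defaults)
    print(config["LOGGING_LEVEL"], config["GENERATION_SAMPLE_SIZE"])
```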

Key Configuration Options

| Variable | Description | Default |
| --- | --- | --- |
| OPENAI_API_KEY | OpenAI API key | - |
| ANTHROPIC_API_KEY | Anthropic API key | - |
| DEFAULT_LLM_PROVIDER | Default provider (openai/anthropic) | anthropic |
| GENERATION_SAMPLE_SIZE | Default sample size | 10 |
| GENERATION_BATCH_SIZE | Batch size for corpus | 100 |
| LOGGING_LEVEL | Log level | INFO |
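
As an illustration of how a batch size relates to a corpus size, the sketch below splits a requested corpus into batches of at most GENERATION_BATCH_SIZE records. How the pipeline actually schedules batches is not described here, so `batch_sizes` is a hypothetical helper showing the arithmetic only.

```python
from typing import Iterator


def batch_sizes(corpus_size: int, batch_size: int = 100) -> Iterator[int]:
    """Yield the number of records in each batch, in generation order."""
    if corpus_size < 0 or batch_size <= 0:
        raise ValueError("corpus_size must be >= 0 and batch_size must be > 0")
    remaining = corpus_size
    while remaining > 0:
        current = min(batch_size, remaining)  # the last batch may be smaller
        yield current
        remaining -= current


if __name__ == "__main__":
    print(list(batch_sizes(250)))  # [100, 100, 50]
```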

License

MIT License - see LICENSE for details.
