Generate high-quality synthetic data (structured and unstructured) for professional services development, enabling teams to build and test client solutions without handling real, sensitive client data.
- Context-Aware Generation: Provide client/project context, get realistic synthetic data
- Schema Inference: Automatically determine data structures from natural language descriptions
- Multiple Data Types:
  - Structured: Tabular (CSV, Excel), JSON, SQL with relationships
  - Unstructured: Documents, emails, case narratives
- Domain Support: Consulting, Financial, Healthcare, Legal, and Generic domains
- Validation Workflow: Review samples before generating full corpus
- Dual LLM Support: Works with both OpenAI and Anthropic APIs
- CLI & API: Use via command line or REST API
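As a purely illustrative sketch of what schema inference might produce, a healthcare context could be mapped to a structure like the one below. The field names and the dict layout are invented for illustration; they are not the tool's actual schema format.

```python
# Hypothetical shape of an inferred schema (invented for illustration only).
inferred_schema = {
    "domain": "healthcare",
    "entities": {
        "patients": {
            "fields": {"patient_id": "uuid", "name": "string", "dob": "date"},
        },
        "appointments": {
            "fields": {"appointment_id": "uuid", "patient_id": "uuid", "scheduled_at": "datetime"},
            # Relationships tie child entities back to their parents.
            "relationships": [{"field": "patient_id", "references": "patients.patient_id"}],
        },
    },
}

ref = inferred_schema["entities"]["appointments"]["relationships"][0]
print(ref["references"])  # → patients.patient_id
```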
```bash
# Clone the repository
git clone https://github.com/your-org/synthetic-data-solution.git
cd synthetic-data-solution

# Install with uv (recommended)
uv sync

# Or with pip
pip install -e ".[dev]"
```

Create a `.env` file with your LLM API keys:
```bash
# Required: At least one LLM provider API key
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Optional: Set default provider (defaults to anthropic)
DEFAULT_LLM_PROVIDER=anthropic
```

```bash
# Basic generation with inline context
synth generate -c "Healthcare client with 500 patients, need appointment records and billing data"

# Generate with specific output format and directory
synth generate -c "..." --format csv --output ./data/

# Interactive mode with feedback loop
synth generate -c "..." --interactive

# Generate from context file
synth generate --file context.txt --format json --size 1000

# Auto-approve without review (skip sample validation)
synth generate -c "..." --size 5000 --auto-approve

# Validate schemas without generating data
synth validate -c "Legal case files for litigation firm"

# Start API server
synth serve --host 127.0.0.1 --port 8000 --reload
```

Start the server:
```bash
# Using CLI
synth serve --reload

# Or directly with uvicorn
uvicorn src.api.main:app --reload
```

The API will be available at:
- API Base: http://localhost:8000/api/v1
- Interactive Docs: http://localhost:8000/docs
- OpenAPI Schema: http://localhost:8000/openapi.json
```bash
# Analyze context
curl -X POST "http://localhost:8000/api/v1/context/analyze" \
  -H "Content-Type: application/json" \
  -d '{"context": "Healthcare clinic needs patient records, appointments, and billing data"}'

# Generate samples
curl -X POST "http://localhost:8000/api/v1/generate/sample" \
  -H "Content-Type: application/json" \
  -d '{"context": "Legal firm case management system with cases and documents", "sample_size": 5}'

# Start corpus generation (async)
curl -X POST "http://localhost:8000/api/v1/generate/corpus" \
  -H "Content-Type: application/json" \
  -d '{"context": "...", "corpus_size": 1000}'

# Check job status
curl "http://localhost:8000/api/v1/jobs/{job_id}/status"

# Export completed job
curl -X POST "http://localhost:8000/api/v1/jobs/{job_id}/export" \
  -H "Content-Type: application/json" \
  -d '{"format": "csv"}'
```

| Domain | Description | Example Data Types |
|---|---|---|
| Consulting | Professional services projects | Clients, projects, deliverables, timesheets |
| Financial | Banking and financial services | Accounts, transactions, portfolios, customers |
| Healthcare | Medical and healthcare data | Patients, appointments, medical records, billing |
| Legal | Law firm and legal services | Cases, clients, documents, billing, court filings |
| Generic | General-purpose data | People, organizations, contacts, addresses |
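The asynchronous corpus workflow shown in the curl examples above (submit, poll status, export) lends itself to a small scripted polling loop. The helper below is a generic sketch: it takes any status-fetching callable, and the terminal state names `completed` and `failed` are assumptions, not documented API behaviour.

```python
import time

def poll_until_done(get_status, interval=2.0, timeout=300.0):
    """Call get_status() until it returns a terminal state or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("completed", "failed"):  # assumed terminal states
            return status
        time.sleep(interval)
    raise TimeoutError("job did not finish in time")

# In real use, get_status would GET /api/v1/jobs/{job_id}/status and read a
# status field from the JSON response (field name assumed). Simulated here:
states = iter(["pending", "running", "completed"])
print(poll_until_done(lambda: next(states), interval=0.0))  # → completed
```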
- CSV: Standard comma-separated values
- JSON: Nested JSON with relationships preserved
- XLSX: Excel workbook with multiple sheets
- SQL: DDL + INSERT statements (PostgreSQL, MySQL, SQLite)
```bash
# Install dev dependencies
uv sync --all-extras

# Install pre-commit hooks (optional)
pre-commit install
```

```bash
# Run all unit tests
pytest tests/unit/ --override-ini="addopts="

# Run integration tests
pytest tests/integration/ --override-ini="addopts="

# Run E2E tests (requires LLM API keys)
SYNTH_E2E_TESTS=1 pytest tests/e2e/ --override-ini="addopts="

# Run linting
ruff check src/ tests/

# Run type checking
mypy src/

# Format code
black src/ tests/
```

```
src/
├── core/             # Config, LLM client abstraction
├── context/          # Context analysis and domain classification
├── schema/           # Schema inference, validation, and templates
├── generators/       # Data generators
│   ├── structured/   # Tabular, JSON, SQL generators
│   └── unstructured/ # Document, email, narrative generators
├── pipeline/         # Orchestration and generation workflows
├── api/              # FastAPI REST API
├── cli/              # Typer CLI application
└── utils/            # Logging, export utilities
tests/
├── unit/             # Unit tests (298 tests)
├── integration/      # Integration tests
├── e2e/              # End-to-end tests with real LLM
├── benchmarks/       # Performance benchmarks
└── fixtures/         # Realistic test data
config/
├── default.yaml      # Default configuration
└── prompts/          # LLM prompt templates
docs/
├── quickstart.md     # Getting started guide
├── api-reference.md  # API documentation
├── deployment.md     # Deployment guide
└── context/          # Development context files
```
The application can be configured via:
- Environment variables
- `.env` file
- `config/default.yaml`
Environment variables take precedence over YAML configuration.
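As an illustration of that precedence rule (a sketch of the idea, not the application's actual config loader):

```python
import os

def resolve_setting(name, yaml_config, default=None):
    """Environment variable wins; fall back to YAML config, then the default."""
    env_value = os.environ.get(name)
    if env_value is not None:
        return env_value
    return yaml_config.get(name, default)

# With LOGGING_LEVEL set in the environment, the YAML value is shadowed.
yaml_config = {"LOGGING_LEVEL": "DEBUG"}
os.environ["LOGGING_LEVEL"] = "WARNING"
print(resolve_setting("LOGGING_LEVEL", yaml_config, "INFO"))  # → WARNING
```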
| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | - |
| `ANTHROPIC_API_KEY` | Anthropic API key | - |
| `DEFAULT_LLM_PROVIDER` | Default provider (openai/anthropic) | anthropic |
| `GENERATION_SAMPLE_SIZE` | Default sample size | 10 |
| `GENERATION_BATCH_SIZE` | Batch size for corpus | 100 |
| `LOGGING_LEVEL` | Log level | INFO |
MIT License - see LICENSE for details.