Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Updated Mar 13, 2026 - Python
Mimesis is a fast Python library for generating fake data in multiple languages.
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
Synthetic data generation for tabular data
A procedural Blender pipeline for photorealistic training image generation
Python library for Causal AI
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
SDG is a specialized framework designed to generate high-quality structured tabular data.
UnrealCV: Connecting Computer Vision to Unreal Engine
Synthetic data curation for post-training and structured data extraction
Conditional GAN for generating synthetic tabular data.
A framework for comprehensive diagnosis and optimization of agents using simulated, realistic synthetic interactions
DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤
Generate High-Quality Synthetics, Train, Measure, and Evaluate in a Single Pipeline
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data generation pipeline!
A lightweight library for generating synthetic instruction tuning datasets for your data without GPT.
🎨 NeMo Data Designer: A general library for generating high-quality synthetic data from scratch or based on seed data.
Synthetic Data SDK ✨
Verbalized Sampling, a training-free prompting strategy to mitigate mode collapse in LLMs by requesting responses with probabilities. Achieves 2-3x diversity improvement while maintaining quality. Model-agnostic framework with CLI/API for creative writing, synthetic data generation, and dialogue simulation.
Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips