One-click tool to convert GitHub repositories into high-quality chat datasets for fine-tuning language models like Qwen with MLX.
💡 Perfect for: Training coding assistants, documentation bots, or domain-specific AI models from real codebases.
This tool automatically extracts meaningful code-documentation pairs from GitHub repositories and formats them as conversation data that AI models can learn from.
Think of it as: Taking a repository full of Python functions with docstrings, JavaScript with JSDoc comments, and README files, then converting them into "teacher-student" conversations for AI training.
- Python functions → "Write a docstring for this function" conversations
- JavaScript/TypeScript → "Add JSDoc comments to this code" examples
- Markdown documentation → "Explain this section" Q&A pairs
- Smart processing → removes duplicates, filters by length, splits for training
```bash
# Clone this repository
git clone https://github.com/ArjunDivecha/Repo2Dataset.git
cd Repo2Dataset

# Install the tool
pip install -e .[dev]

# Example: convert the popular 'requests' library into training data
gh-chat-dataset --repo https://github.com/psf/requests.git --out ./requests_dataset

# Or try a smaller example first
gh-chat-dataset --repo https://github.com/pallets/itsdangerous.git --out ./test_dataset
```

```bash
ls test_dataset/
# You'll see:
# dataset.train.jsonl  <- 90% of samples for training
# dataset.valid.jsonl  <- 10% of samples for validation
# stats.json           <- Summary of what was extracted
```

Let's say you run it on a Python repository. Here's what one training sample looks like:
Input (what the AI sees):

```
Write a clear, concise docstring for this function:

def validate_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None
```
Expected Output (what the AI should respond):

```
Validate if an email address has a proper format.

Args:
    email (str): The email address to validate.

Returns:
    bool: True if email format is valid, False otherwise.
```
This creates training data that teaches AI models to write good documentation!
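In the output files, each such pair is stored as one JSON object per line. The sketch below shows roughly how a pair could be serialized as a chat-format record; the exact `meta` fields are illustrative, not the tool's actual schema.

```python
import json

# Illustrative sketch of one chat-format JSONL record; the meta
# fields shown here are hypothetical, not the tool's actual schema.
record = {
    "messages": [
        {"role": "user",
         "content": "Write a clear, concise docstring for this function:\n..."},
        {"role": "assistant",
         "content": "Validate if an email address has a proper format."},
    ],
    "meta": {"source_file": "validators.py"},  # hypothetical metadata
}
line = json.dumps(record)  # one line per sample in dataset.train.jsonl
roundtrip = json.loads(line)
```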
```
gh-chat-dataset [OPTIONS]
```

- `--repo URL` - The GitHub repository to convert (must be public, or you need access)
- `--out DIRECTORY` - Where to save the dataset files
- `--max-tokens 2048` - Skip samples longer than this (prevents memory issues)
- `--split-ratio 0.9` - How much data goes to training vs. validation (0.9 = 90% train, 10% validation)
- `--allow-llm` - (Experimental) Use AI to generate labels where missing
```bash
# Basic usage
gh-chat-dataset --repo https://github.com/user/repo.git --out ./my_dataset

# For larger models (more context)
gh-chat-dataset --repo https://github.com/user/repo.git --out ./my_dataset --max-tokens 4096

# More data for validation
gh-chat-dataset --repo https://github.com/user/repo.git --out ./my_dataset --split-ratio 0.8
```

- ✅ Functions with docstrings → "Write docstring" examples
- ✅ Classes with docstrings → "Document this class" examples
- ✅ Module docstrings → "Summarize this module" examples
- ✅ Functions with JSDoc comments → "Add JSDoc" examples
- ✅ Complex function signatures → Documentation examples
- ✅ README sections → "Explain this concept" Q&A
- ✅ Documentation pages → Knowledge Q&A pairs
- ✅ API docs → Usage explanation examples
- ❌ Files without documentation (can't create good training pairs)
- ❌ Very short or very long samples (poor quality for training)
- ❌ Duplicate content (prevents overfitting)
- ❌ Generated files (node_modules, build outputs, etc.)
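The length filter above can be sketched in a few lines. This is a simplified illustration, not the tool's actual implementation: it approximates token count with a whitespace split, and the thresholds shown are illustrative defaults.

```python
def keep_sample(text, min_tokens=16, max_tokens=2048):
    """Rough length filter for candidate training samples.

    A whitespace split stands in for a real tokenizer here, and the
    thresholds are illustrative -- not the tool's actual defaults.
    """
    n = len(text.split())
    return min_tokens <= n <= max_tokens

print(keep_sample("too short"))     # too few tokens: rejected
print(keep_sample("word " * 100))   # within range: kept
```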
After running the tool, check `stats.json`:

```
{
  "sha": "abc123...",   // Exact version of the repo used
  "counts": {
    "total": 156,       // Total samples created
    "train": 140,       // Training samples (90%)
    "valid": 16         // Validation samples (10%)
  }
}
```

Good numbers to see:

- 50+ total samples for small projects
- 500+ total samples for substantial codebases
- 1000+ total samples for large, well-documented projects
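You can script a quick sanity check over those numbers. A minimal sketch, assuming the `stats.json` layout shown above:

```python
import json

def check_stats(stats):
    """Sanity-check the counts block from stats.json (layout as shown above)."""
    counts = stats["counts"]
    # The train/valid split should account for every sample
    assert counts["train"] + counts["valid"] == counts["total"], \
        "split counts should sum to total"
    if counts["total"] < 50:
        print("Warning: fewer than 50 samples; consider a better-documented repo")
    return counts["total"]

# Example values mirror the stats.json shown above
stats = {"sha": "abc123", "counts": {"total": 156, "train": 140, "valid": 16}}
print(check_stats(stats))
```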
Small repositories to start with:

- `pallets/itsdangerous` - Small, well-documented Python library
- `sindresorhus/is` - Simple JavaScript utilities with good docs
- `getsentry/sentry-python` - Medium-sized Python project

Larger repositories:

- `psf/requests` - Popular Python HTTP library
- `microsoft/TypeScript` - Large TypeScript codebase
- `django/django` - Web framework with extensive docs
- ✅ Well-documented codebases
- ✅ Projects with consistent docstring/JSDoc style
- ✅ Repositories with good README files
- ✅ Active projects (recent commits)
Once you have your dataset, here's how to use it for fine-tuning:
```python
# Load your dataset
import json

def load_dataset(path):
    with open(path, 'r') as f:
        return [json.loads(line) for line in f]

train_data = load_dataset('your_dataset/dataset.train.jsonl')
valid_data = load_dataset('your_dataset/dataset.valid.jsonl')

# Each sample has this format:
sample = train_data[0]
print(sample['messages'])  # The conversation
print(sample['meta'])      # Metadata about the source
```

Pro Tip: The messages format is compatible with most modern fine-tuning frameworks, including MLX, transformers, and OpenAI's fine-tuning API.
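If your framework expects plain prompt/completion text instead of chat messages, flattening is straightforward. A minimal sketch, assuming each sample's `messages` list pairs a user turn with an assistant turn as in the examples above; the template is illustrative, not any framework's required format:

```python
def to_prompt_completion(sample):
    """Flatten a chat-format sample into (prompt, completion) text.

    Assumes 'messages' pairs user and assistant turns, as in the
    examples above; the join template here is illustrative.
    """
    prompt_parts, completion = [], ""
    for msg in sample["messages"]:
        if msg["role"] == "assistant":
            completion = msg["content"]
        else:
            prompt_parts.append(msg["content"])
    return "\n".join(prompt_parts), completion

sample = {"messages": [
    {"role": "user",
     "content": "Write a clear, concise docstring for this function: ..."},
    {"role": "assistant",
     "content": "Validate if an email address has a proper format."},
]}
prompt, completion = to_prompt_completion(sample)
```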
- ✅ Make sure the repo has documented code (docstrings, JSDoc, README)
- ✅ Try a well-known repo first (like `pallets/itsdangerous`)
- ✅ Check if the repo is public or you have access
- ✅ Make sure you can access the repository
- ✅ For private repos, ensure your Git credentials are set up
- ✅ Try with a public repository first
- ✅ Use `--max-tokens 1024` for smaller samples
- ✅ Try processing smaller repositories first
- ✅ Make sure you have sufficient disk space
- ✅ Try repositories with more documentation
- ✅ Look for projects with consistent docstring styles
- ✅ Consider enabling `--allow-llm` for more samples (experimental)
```bash
# Create a script to process multiple repositories
repos=(
  "https://github.com/user/repo1.git"
  "https://github.com/user/repo2.git"
  "https://github.com/user/repo3.git"
)

for repo in "${repos[@]}"; do
  name=$(basename "$repo" .git)
  gh-chat-dataset --repo "$repo" --out "./datasets/$name"
done
```

```bash
# Merge multiple datasets
cat dataset1/dataset.train.jsonl dataset2/dataset.train.jsonl > combined_train.jsonl
cat dataset1/dataset.valid.jsonl dataset2/dataset.valid.jsonl > combined_valid.jsonl
```

Want to improve the tool? Here's how to set up for development:
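Note that concatenating datasets from several repos can reintroduce duplicates across files. A small Python sketch that merges JSONL files while keeping only the first occurrence of each conversation (the hashing-on-`messages` approach is an assumption, not what the tool itself does):

```python
import hashlib
import json

def merge_dedup(in_paths, out_path):
    """Merge JSONL datasets, dropping exact-duplicate conversations.

    Hashes each sample's 'messages' field and keeps only the first
    occurrence -- a sketch, not the tool's own dedup logic.
    """
    seen = set()
    kept = 0
    with open(out_path, "w") as out:
        for path in in_paths:
            with open(path) as f:
                for line in f:
                    sample = json.loads(line)
                    key = hashlib.sha256(
                        json.dumps(sample["messages"], sort_keys=True).encode()
                    ).hexdigest()
                    if key not in seen:
                        seen.add(key)
                        out.write(json.dumps(sample) + "\n")
                        kept += 1
    return kept
```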
```bash
# Clone and set up
git clone https://github.com/ArjunDivecha/Repo2Dataset.git
cd Repo2Dataset

# Install in development mode
pip install -e .[dev]

# Run tests to make sure everything works
pytest

# Check code style
ruff check .

# Make your changes, then test
python -m pytest
```

The tool is designed to be extensible. To add support for new languages:
- Create a new extractor in `gh_chat_dataset/extract_xxx.py`
- Add a builder function in `gh_chat_dataset/builders.py`
- Update the file discovery patterns in `gh_chat_dataset/discover.py`
- Add tests in `tests/`
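A new extractor could look roughly like the sketch below, which pulls doc-comment/function pairs out of Go source. Everything here is hypothetical: the function name, the output shape, and the regex are illustrative, and the real interface expected by `builders.py` may differ.

```python
import re

def extract_go_doc_pairs(source: str):
    """Yield (doc_comment, function_signature) pairs from Go source text.

    Matches a run of '//' comment lines immediately followed by a 'func'
    declaration -- a minimal regex sketch, not a full parser, and a
    hypothetical interface rather than the tool's actual extractor API.
    """
    pattern = re.compile(r"((?:^//.*\n)+)^(func\s+\w+\([^)]*\).*)$", re.MULTILINE)
    for m in pattern.finditer(source):
        doc = "\n".join(l.lstrip("/ ") for l in m.group(1).splitlines())
        yield doc, m.group(2)

src = """\
// Add returns the sum of a and b.
func Add(a, b int) int {
"""
pairs = list(extract_go_doc_pairs(src))
```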
MIT License - feel free to use this for any project!
- 📖 Check the troubleshooting section above
- 🐛 Found a bug? Open an issue
- 💡 Have an idea? Start a discussion
Happy Training! 🤖✨