One-shot conversion from a GitHub repo to a dataset ready for fine-tuning (specifically for MLX and Qwen models)

ArjunDivecha/Repo2Dataset

πŸ€– gh-chat-dataset: Turn Any GitHub Repo into AI Training Data

One-click tool to convert GitHub repositories into high-quality chat datasets for fine-tuning language models like Qwen with MLX.

πŸ’‘ Perfect for: Training coding assistants, documentation bots, or domain-specific AI models from real codebases.

🎯 What Does This Do?

This tool automatically extracts meaningful code-documentation pairs from GitHub repositories and formats them as conversation data that AI models can learn from.

Think of it as: Taking a repository full of Python functions with docstrings, JavaScript with JSDoc comments, and README files, then converting them into "teacher-student" conversations for AI training.

🧠 The Magic Behind It

  • Python Functions β†’ "Write a docstring for this function" conversations
  • JavaScript/TypeScript β†’ "Add JSDoc comments to this code" examples
  • Markdown Documentation β†’ "Explain this section" Q&A pairs
  • Smart Processing β†’ Removes duplicates, filters by length, splits for training

πŸš€ Quick Start (3 Minutes)

Step 1: Install

# Clone this repository
git clone https://github.com/ArjunDivecha/Repo2Dataset.git
cd Repo2Dataset

# Install the tool
pip install -e ".[dev]"

Step 2: Convert a Repository

# Example: Convert the popular 'requests' library into training data
gh-chat-dataset --repo https://github.com/psf/requests.git --out ./requests_dataset

# Or try a smaller example first
gh-chat-dataset --repo https://github.com/pallets/itsdangerous.git --out ./test_dataset

Step 3: Check Your Results

ls test_dataset/
# You'll see:
# dataset.train.jsonl  <- 90% of samples for training
# dataset.valid.jsonl  <- 10% of samples for validation  
# stats.json          <- Summary of what was extracted

πŸ“Š Real Example Output

Let's say you run it on a Python repository. Here's what one training sample looks like:

Input (what the AI sees):

Write a clear, concise docstring for this function:

import re

def validate_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

Expected Output (what the AI should respond):

Validate if an email address has a proper format.

Args:
    email (str): The email address to validate.

Returns:
    bool: True if email format is valid, False otherwise.

This creates training data that teaches AI models to write good documentation!
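On disk, each sample is one JSON object per line. The exact schema comes from the tool, but based on the `messages` and `meta` keys used in the loader example later in this README, a record for the sample above would look roughly like this (field values here are illustrative):

```python
import json

# Hypothetical on-disk shape of one training sample, assuming OpenAI-style
# chat records with "messages" and "meta" keys (the keys the loader example
# later in this README reads).
record = {
    "messages": [
        {"role": "user", "content": "Write a clear, concise docstring for this function:\n\ndef validate_email(email): ..."},
        {"role": "assistant", "content": "Validate if an email address has a proper format. ..."},
    ],
    "meta": {"source_file": "utils/validators.py", "kind": "python_docstring"},
}

# Each line of dataset.train.jsonl is one such record serialized as JSON.
line = json.dumps(record)
parsed = json.loads(line)
print(parsed["messages"][1]["role"])  # assistant
```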

βš™οΈ All Command Options Explained

gh-chat-dataset [OPTIONS]

Required Options

  • --repo URL - The GitHub repository to convert (must be public or you need access)
  • --out DIRECTORY - Where to save the dataset files

Optional Fine-Tuning

  • --max-tokens 2048 - Skip samples longer than this (prevents memory issues)
  • --split-ratio 0.9 - How much data goes to training vs validation (0.9 = 90% train, 10% validation)
  • --allow-llm - (Experimental) Use AI to generate labels where missing

Real Examples

# Basic usage
gh-chat-dataset --repo https://github.com/user/repo.git --out ./my_dataset

# For larger models (more context)
gh-chat-dataset --repo https://github.com/user/repo.git --out ./my_dataset --max-tokens 4096

# More data for validation
gh-chat-dataset --repo https://github.com/user/repo.git --out ./my_dataset --split-ratio 0.8

πŸ“ What Gets Extracted?

From Python Files (.py)

  • βœ… Functions with docstrings β†’ "Write docstring" examples
  • βœ… Classes with docstrings β†’ "Document this class" examples
  • βœ… Module docstrings β†’ "Summarize this module" examples

From JavaScript/TypeScript (.js, .jsx, .ts, .tsx)

  • βœ… Functions with JSDoc comments β†’ "Add JSDoc" examples
  • βœ… Complex function signatures β†’ Documentation examples

From Markdown Files (.md)

  • βœ… README sections β†’ "Explain this concept" Q&A
  • βœ… Documentation pages β†’ Knowledge Q&A pairs
  • βœ… API docs β†’ Usage explanation examples

What Gets Filtered Out

  • ❌ Files without documentation (can't create good training pairs)
  • ❌ Very short or very long samples (poor quality for training)
  • ❌ Duplicate content (prevents overfitting)
  • ❌ Generated files (node_modules, build outputs, etc.)
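The filtering above can be sketched roughly like this. This is a simplified illustration, not the tool's actual code: the length check here counts whitespace-separated words, while the real tool works in tokens via `--max-tokens`.

```python
def filter_samples(samples, min_words=5, max_words=2048):
    """Drop too-short, too-long, and duplicate samples (simplified sketch)."""
    seen = set()
    kept = []
    for text in samples:
        n_words = len(text.split())
        if n_words < min_words or n_words > max_words:
            continue  # too short or too long: poor quality for training
        key = text.strip().lower()
        if key in seen:
            continue  # exact duplicate: prevents overfitting
        seen.add(key)
        kept.append(text)
    return kept

samples = [
    "def f(): pass",  # too short, dropped
    "Validate an email address and return a bool.",
    "Validate an email address and return a bool.",  # duplicate, dropped
]
print(filter_samples(samples))  # ['Validate an email address and return a bool.']
```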

πŸ“ˆ Understanding Your Output

After running the tool, check stats.json:

{
  "sha": "abc123...",           // Exact version of repo used
  "counts": {
    "total": 156,               // Total samples created
    "train": 140,               // Training samples (90%)
    "valid": 16                 // Validation samples (10%)
  }
}
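A quick sanity check on `stats.json`: the train and valid counts should sum to the total, and the train share should sit close to your `--split-ratio`. The snippet below inlines the example values above; swap in `json.load(open("stats.json"))` for your real file.

```python
import json

# Example stats (same values as the stats.json shown above).
stats = {"sha": "abc123", "counts": {"total": 156, "train": 140, "valid": 16}}

counts = stats["counts"]
assert counts["train"] + counts["valid"] == counts["total"]
train_share = counts["train"] / counts["total"]
print(f"train share: {train_share:.2%}")  # ~90% for --split-ratio 0.9
```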

Good numbers to see:

  • 50+ total samples for small projects
  • 500+ total samples for substantial codebases
  • 1000+ total samples for large, well-documented projects

🎯 Perfect Repositories to Try

Great for Beginners

  • pallets/itsdangerous - Small, well-documented Python library
  • sindresorhus/is - Simple JavaScript utilities with good docs
  • getsentry/sentry-python - Medium-sized Python project

For Larger Datasets

  • psf/requests - Popular Python HTTP library
  • microsoft/TypeScript - Large TypeScript codebase
  • django/django - Web framework with extensive docs

Best Results Come From

  • βœ… Well-documented codebases
  • βœ… Projects with consistent docstring/JSDoc style
  • βœ… Repositories with good README files
  • βœ… Active projects (recent commits)

πŸ›  Using Your Dataset with MLX

Once you have your dataset, here's how to use it for fine-tuning:

# Load your dataset
import json

def load_dataset(path):
    with open(path, 'r') as f:
        return [json.loads(line) for line in f]

train_data = load_dataset('your_dataset/dataset.train.jsonl')
valid_data = load_dataset('your_dataset/dataset.valid.jsonl')

# Each sample has this format:
sample = train_data[0]
print(sample['messages'])  # The conversation
print(sample['meta'])      # Metadata about source

Pro Tip: The messages format is compatible with most modern fine-tuning frameworks including MLX, transformers, and OpenAI's fine-tuning API.
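If your fine-tuning stack expects a flat text prompt rather than structured messages, a minimal flattening step might look like this. The `<|role|>` template below is purely illustrative; in practice, use your model's own chat template (e.g. the tokenizer's `apply_chat_template` in transformers).

```python
def to_text(sample):
    """Flatten a chat sample into one training string (illustrative template)."""
    parts = []
    for msg in sample["messages"]:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}")
    return "\n".join(parts)

sample = {"messages": [
    {"role": "user", "content": "Write a docstring for this function: ..."},
    {"role": "assistant", "content": "Validate an email address."},
]}
print(to_text(sample))
```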

πŸ”§ Troubleshooting

"No samples extracted"

  • βœ… Make sure the repo has documented code (docstrings, JSDoc, README)
  • βœ… Try a well-known repo first (like pallets/itsdangerous)
  • βœ… Check if the repo is public or you have access

"Permission denied" errors

  • βœ… Make sure you can access the repository
  • βœ… For private repos, ensure your Git credentials are set up
  • βœ… Try with a public repository first

"Out of memory" errors

  • βœ… Use --max-tokens 1024 for smaller samples
  • βœ… Try processing smaller repositories first
  • βœ… Make sure you have sufficient disk space

Dataset is too small

  • βœ… Try repositories with more documentation
  • βœ… Look for projects with consistent docstring styles
  • βœ… Consider enabling --allow-llm for more samples (experimental)

πŸš€ Advanced Usage

Batch Processing Multiple Repos

# Create a script to process multiple repositories
repos=(
  "https://github.com/user/repo1.git"
  "https://github.com/user/repo2.git" 
  "https://github.com/user/repo3.git"
)

for repo in "${repos[@]}"; do
  name=$(basename "$repo" .git)
  gh-chat-dataset --repo "$repo" --out "./datasets/$name"
done

Combining Datasets

# Merge multiple datasets
cat dataset1/dataset.train.jsonl dataset2/dataset.train.jsonl > combined_train.jsonl
cat dataset1/dataset.valid.jsonl dataset2/dataset.valid.jsonl > combined_valid.jsonl

πŸ‘¨β€πŸ’» Contributing & Development

Want to improve the tool? Here's how to set up for development:

# Clone and setup
git clone https://github.com/ArjunDivecha/Repo2Dataset.git
cd Repo2Dataset

# Install in development mode
pip install -e ".[dev]"

# Run tests to make sure everything works
pytest

# Check code style
ruff check .

# Make your changes, then test
python -m pytest

Adding New File Types

The tool is designed to be extensible. To add support for new languages:

  1. Create a new extractor in gh_chat_dataset/extract_xxx.py
  2. Add a builder function in gh_chat_dataset/builders.py
  3. Update the file discovery patterns in gh_chat_dataset/discover.py
  4. Add tests in tests/
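A new extractor might follow this rough shape. This is a sketch only, using a hypothetical Go extractor as the example; check the existing extractors in gh_chat_dataset/ for the real interface, which may differ.

```python
# gh_chat_dataset/extract_go.py -- hypothetical extractor for Go doc comments.
import re

# One or more consecutive // comment lines immediately before a func.
DOC_COMMENT = re.compile(r"((?:^//.*\n)+)^func (\w+)", re.MULTILINE)

def extract(source: str):
    """Yield (function_name, doc_comment) pairs from Go source (sketch)."""
    for comment, name in DOC_COMMENT.findall(source):
        doc = "\n".join(l.lstrip("/ ") for l in comment.splitlines())
        yield name, doc.strip()

src = '''// Add returns the sum of a and b.
func Add(a, b int) int { return a + b }
'''
print(list(extract(src)))  # [('Add', 'Add returns the sum of a and b.')]
```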

πŸ“„ License

MIT License - feel free to use this for any project!

πŸ™‹β€β™€οΈ Questions?


Happy Training! πŸ€–βœ¨
