🧬 Wolfstitch - AI Training Dataset Creator

Professional-grade dataset creation for AI training with instant startup and progressive premium enhancement

Wolfstitch transforms documents, code, and text files into optimally-chunked, tokenizer-aware datasets for fine-tuning language models. Built for AI developers, researchers, and enterprises who need scalable, accurate dataset creation with complete cost transparency and zero-wait startup.

✨ Key Features

🚀 NEW: Instant Startup with Progressive Enhancement

⚡ Zero-Wait Launch: App starts immediately with full core functionality
🔄 Progressive Premium Loading: Advanced tokenizers and features load in background
📊 Real-Time Loading Progress: Beautiful loading dialog shows premium feature activation
🛡️ Bulletproof Reliability: Core functionality works offline, on slow networks, and behind firewalls
🎯 Smart Fallbacks: Graceful degradation ensures functionality in any environment

🎯 Core Processing Pipeline

📚 Comprehensive Format Support: 40+ file formats including documents, presentations, spreadsheets, and source code
🧠 Context-Aware Cleaning: Preserves code structure while optimizing documents for AI training
💻 Code Intelligence: Automatic detection of minified/auto-generated files with quality control
🌍 International Support: Automatic character encoding detection for global codebases
🔧 Smart Token-Aware Chunking: Configurable token limits (512-4096) with exact tokenization
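
As a rough illustration of token-aware chunking, the sketch below packs paragraphs into chunks that stay under a configurable token limit. It is not Wolfstitch's actual implementation: the function names, the 1.3 tokens-per-word ratio, and the greedy packing strategy are all assumptions made for the example.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    """Word-based estimate. The 1.3 tokens-per-word ratio is an assumed rule of thumb."""
    if not text.strip():
        return 0
    return max(1, round(len(text.split()) * 1.3))


def chunk_by_tokens(
    paragraphs: List[str],
    max_tokens: int = 512,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> List[str]:
    """Greedily pack paragraphs into chunks of at most max_tokens each.

    A single paragraph larger than max_tokens is kept whole in its own chunk.
    Per-paragraph counts are summed, which only approximates the joined count.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for para in paragraphs:
        n = count_tokens(para)
        # Close the current chunk if this paragraph would push it over the limit.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Passing an exact tokenizer as `count_tokens` swaps the estimate for real counts without changing the packing logic.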

💎 Premium Tokenizer System

🔄 Hybrid Architecture: Immediate word-based estimation + background exact tokenizer loading
🎯 5 Professional Tokenizers: GPT-2, GPT-3.5, GPT-4 (tiktoken), BERT, Sentence Transformers
📈 Progressive Accuracy: Start with estimates, upgrade to exact counts as tokenizers load
🔒 Access Control: Premium tokenizers with licensing integration
⚡ Performance Optimized: Background loading never blocks user workflow
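
The hybrid idea (estimate now, exact counts later) can be sketched with a background thread. This is a minimal illustration under assumptions, not the project's code: `HybridTokenCounter`, its methods, and the 1.3 tokens-per-word ratio are hypothetical names and values chosen for the example.

```python
import threading
from typing import Callable, Optional

TokenFn = Callable[[str], int]


class HybridTokenCounter:
    """Answers with a word-based estimate until an exact tokenizer has loaded."""

    def __init__(self, loader: Callable[[], TokenFn]):
        self._exact: Optional[TokenFn] = None
        self._lock = threading.Lock()
        # Daemon thread: a slow or hanging load never blocks startup or exit.
        self._thread = threading.Thread(target=self._load, args=(loader,), daemon=True)
        self._thread.start()

    def _load(self, loader: Callable[[], TokenFn]) -> None:
        try:
            exact = loader()
        except Exception:
            return  # Load failed (offline, missing package): keep using the estimate.
        with self._lock:
            self._exact = exact

    @property
    def is_exact(self) -> bool:
        with self._lock:
            return self._exact is not None

    def count(self, text: str) -> int:
        with self._lock:
            exact = self._exact
        if exact is not None:
            return exact(text)
        return round(len(text.split()) * 1.3)  # assumed tokens-per-word ratio

    def wait(self, timeout: Optional[float] = None) -> None:
        """Block until the background load finishes (mainly useful in tests)."""
        self._thread.join(timeout)
```

A real loader might import `tiktoken`, call `tiktoken.get_encoding("cl100k_base")`, and return `lambda text: len(enc.encode(text))`. Because a failed load simply leaves the estimator in place, the same object covers both the fallback and the premium path.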

💰 Advanced Cost Analysis

💡 15+ Training Approaches: Local, cloud, and hybrid training cost comparison
🔄 Real-Time Pricing: Live pricing from Lambda Labs, Vast.ai, RunPod, and more
📊 ROI Calculations: Break-even analysis and cost optimization recommendations
💎 Progressive Enhancement: Basic estimates immediately, detailed analysis when loaded
📋 Export-Ready Reports: Comprehensive cost reports in JSON, CSV, and Excel formats
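
To make the break-even idea concrete, here is a small worked sketch. The formula (hardware cost divided by hourly savings) is a common simplification and an assumption about how such a calculation could work, not a description of Wolfstitch's calculator. The figures are hypothetical.

```python
def break_even_hours(
    hardware_cost: float,
    cloud_rate_per_hour: float,
    local_rate_per_hour: float,
) -> float:
    """Training hours after which buying local hardware beats renting cloud GPUs.

    local_rate_per_hour covers running costs such as electricity.
    Returns infinity when local running costs are not lower than the cloud rate.
    """
    savings_per_hour = cloud_rate_per_hour - local_rate_per_hour
    if savings_per_hour <= 0:
        return float("inf")
    return hardware_cost / savings_per_hour


# Hypothetical figures: a $2,000 GPU, $1.25/h cloud rental, $0.25/h electricity.
hours = break_even_hours(2000.0, 1.25, 0.25)
print(f"Break-even after {hours:.0f} training hours")
```

With these numbers each local hour saves $1.00, so the purchase pays for itself after 2,000 hours of training.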

🔐 Professional Licensing System

🆓 7-Day Free Trial: Full access to all premium features, no credit card required
🧑‍💻 Demo Mode: WOLFSTITCH_DEMO=true environment variable for development access
🔑 Secure License Management: Encrypted key-based authentication system
⏱️ Trial Tracking: Automatic countdown with upgrade prompts and status indicators
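
The demo switch and trial countdown described above could be checked along these lines. Only the `WOLFSTITCH_DEMO=true` variable and the 7-day trial length come from this README. The function names, the case-insensitive match, and the way the two checks combine are assumptions for illustration.

```python
import os
from datetime import date, timedelta
from typing import Mapping

TRIAL_DAYS = 7  # trial length stated in this README


def demo_mode_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """True when WOLFSTITCH_DEMO is set to 'true' (case-insensitive by assumption)."""
    return env.get("WOLFSTITCH_DEMO", "").strip().lower() == "true"


def trial_days_left(started: date, today: date) -> int:
    """Days remaining in the trial, never negative."""
    expires = started + timedelta(days=TRIAL_DAYS)
    return max(0, (expires - today).days)


def premium_allowed(started: date, today: date, env: Mapping[str, str] = os.environ) -> bool:
    """Premium features are available in demo mode or during an active trial."""
    return demo_mode_enabled(env) or trial_days_left(started, today) > 0
```

Passing `env` and `today` as arguments keeps the logic easy to test without touching the real environment or clock.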

✨ Enhanced User Experience

🎛️ Smart Tokenizer Selection: Seamless dropdown with premium indicators and progressive loading
🔍 Enhanced Preview: Color-coded chunks with efficiency indicators and real-time analytics
📊 Live Analytics: Instant updates as tokenizers and features become available
💬 Progressive Feedback: Clear loading states and feature activation notifications
🎯 Status Indicators: Real-time license status and feature availability display

📋 Supported File Formats

✅ Business Documents

Format	Status	Description	Key Features
PDF	✅ Complete	Adobe PDF documents	Text extraction, multi-page support
Word (.docx)	✅ Complete	Microsoft Word documents	Tables, formatting preservation
Excel (.xlsx)	✅ Complete	Spreadsheets & data	Multi-sheet intelligent extraction
PowerPoint (.pptx)	✅ Complete	Presentations	Slide text, speaker notes, tables
Web/HTML	✅ Complete	Web pages & documentation	Content isolation, clean extraction
Markdown	✅ Complete	Technical documentation	Syntax removal, clean formatting
EPUB	✅ Complete	E-books	Chapter extraction, metadata
Plain Text	✅ Complete	TXT files	Encoding detection, multi-format

✅ Source Code & Configuration Files

Format	Status	Description	Key Features
Python (.py)	✅ Complete	Python source code	Auto-encoding detection, structure preservation
JavaScript (.js/.jsx)	✅ Complete	JS/React code	TypeScript support, quality control
Java (.java)	✅ Complete	Java source code	Comment preservation, structure detection
C/C++ (.c/.cpp/.h)	✅ Complete	C/C++ source & headers	Multiple extensions support
Go (.go)	✅ Complete	Go source code	UTF-8 handling, import detection
Rust (.rs)	✅ Complete	Rust source code	Cargo file support ready
Config Files	✅ Complete	YAML, TOML, INI	Structure preservation, comment handling
30+ Languages	✅ Complete	Swift, Kotlin, Ruby, PHP, etc.	Comprehensive language support

🎯 Use Cases

1. Business Document Fine-Tuning

Transform your organization's knowledge base into training data:

Company policies and procedures
Technical documentation and manuals
Training presentations and materials
Annual reports and business documents

2. Codebase Training

Prepare code repositories for AI model training:

Full Language Support: 30+ programming languages with intelligent extraction
Quality Control: Automatic detection and skipping of minified/auto-generated files
Encoding Handling: Automatic character encoding detection for international codebases
Structure Preservation: Maintains indentation and code structure with context-aware cleaning
Smart Filtering: Configurable file size limits and quality thresholds
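
Quality control of this kind is often built from simple heuristics. The sketch below shows one possible approach, assuming line-length thresholds for minified files and header markers for generated files. The thresholds, marker strings, and function names are illustrative guesses, not Wolfstitch's actual rules.

```python
GENERATED_MARKERS = ("auto-generated", "autogenerated", "do not edit", "@generated")


def looks_minified(source: str, max_avg_line_len: int = 200, max_line_len: int = 1000) -> bool:
    """Flag files whose non-empty lines are unusually long."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if not lines:
        return False
    longest = max(len(ln) for ln in lines)
    average = sum(len(ln) for ln in lines) / len(lines)
    return longest > max_line_len or average > max_avg_line_len


def looks_generated(source: str, header_lines: int = 5) -> bool:
    """Flag files that announce themselves as generated in their first few lines."""
    head = "\n".join(source.splitlines()[:header_lines]).lower()
    return any(marker in head for marker in GENERATED_MARKERS)


def should_skip(source: str, max_bytes: int = 1_000_000) -> bool:
    """Combine the size limit with both quality heuristics."""
    too_large = len(source.encode("utf-8")) > max_bytes
    return too_large or looks_minified(source) or looks_generated(source)
```

Keeping each check in its own function makes the thresholds easy to expose as user-configurable settings.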

3. Research & Academic

Process large document collections for research:

Academic papers and publications
Research datasets and corpora
Multi-format document libraries
Progressive processing with real-time feedback

🛠️ Installation

Prerequisites

Python 3.8 or higher
8GB RAM recommended for large batch processing
Windows, macOS, or Linux
Internet connection for premium features (offline mode available)

Quick Start

# Clone the repository
git clone https://github.com/CLewisMessina/wolfstitch.git
cd wolfstitch

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
.\venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Note: If you see an error about chardet, install it separately:
pip install "chardet>=5.0.0"

# Launch application
python main.py

First Launch Experience

Instant Startup: App launches immediately with core functionality
Progressive Loading: Watch premium tokenizers load in the background
Immediate Usage: Start processing files while advanced features activate
Premium Trial: Enjoy 7-day free trial with full feature access

💡 Usage Guide

Instant Functionality

✅ Immediate file processing with word-based token estimation
✅ Basic analytics and chunk analysis available instantly
✅ Document cleaning and splitting works immediately
✅ Export functionality ready from first launch

Progressive Enhancement

🔄 Exact tokenizers load in background (GPT-2, tiktoken, BERT)
🔄 Premium features activate automatically as they become available
🔄 Cost analysis becomes available once the full calculator has loaded
🔄 Advanced analytics enhance as premium features load

Single File Processing

Click "Select File" or drag & drop a supported file
Choose splitting method (paragraph, sentence, or custom)
Select tokenizer (immediate options available, premium options load progressively)
Click "Process Text" to chunk the document
Preview chunks and export to desired format

Progressive Loading Dialog

📊 Real-time progress for tokenizer and feature loading
🎛️ Continue in background option to dismiss dialog anytime
✅ Auto-completion notification when all features are ready
🔄 Status updates for each premium component

🎯 Roadmap

✅ Phase A: Progressive Enhancement Foundation (Complete)

Hybrid tokenizer architecture with instant fallbacks ✅
Progressive loading UI with real-time status ✅
Context-aware cleaning system ✅
Zero-wait startup architecture ✅

🔄 Phase B: Batch Processing & Smart Chunking (In Progress)

Multi-file selection and batch processing
Token-aware intelligent chunking
Content deduplication and quality scoring
Metadata tracking and provenance

📋 Phase C: Enhanced Output & Integration (Planned)

JSONL export with metadata
Batch analytics dashboard
API integration options
Cloud storage support
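
Since JSONL export with metadata is still planned, the following is only a sketch of what such an export could look like: one JSON object per line, with the chunk text plus provenance fields. The field names (`text`, `source`, `chunk_index`, `tokenizer`) are hypothetical and not a committed schema.

```python
import json
from typing import Iterable


def to_jsonl(chunks: Iterable[str], source: str, tokenizer: str) -> str:
    """Serialize chunks as JSON Lines, one record per chunk."""
    lines = []
    for index, text in enumerate(chunks):
        record = {
            "text": text,
            "source": source,          # originating file, for provenance
            "chunk_index": index,      # position of the chunk within the source
            "tokenizer": tokenizer,    # which counter produced the chunk boundaries
        }
        # ensure_ascii=False keeps non-ASCII text readable in the output file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + ("\n" if lines else "")
```

Writing the returned string to a `.jsonl` file with UTF-8 encoding gives a file that most fine-tuning toolchains can read line by line.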

🏗️ Architecture

Hybrid Progressive System

┌─────────────────────────────────────────┐
│             User Interface             │
├─────────────────────────────────────────┤
│         Progressive Loading UI         │
├─────────────────────────────────────────┤
│      Hybrid Processing Controller      │
├─────────────────────────────────────────┤
│    Immediate Fallbacks  │  Premium     │
│    • Word Estimator     │  • tiktoken  │
│    • Char Estimator     │  • GPT-2     │
│    • Basic Analytics    │  • BERT      │
│                         │  • Cost Calc │
├─────────────────────────────────────────┤
│        Core Processing Pipeline        │
│  Extract → Clean → Split → Analyze     │
└─────────────────────────────────────────┘
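
The bottom layer of the diagram (Extract → Clean → Split → Analyze) can be pictured as four small functions composed in order. The sketch below handles plain text only and uses hypothetical function names. The real pipeline supports many more formats and is not shown here.

```python
import re
from pathlib import Path
from typing import Dict, List


def extract(path: Path) -> str:
    """Read a plain-text file, replacing undecodable bytes instead of failing."""
    return path.read_text(encoding="utf-8", errors="replace")


def clean(text: str) -> str:
    """Normalize line endings, strip trailing whitespace, collapse blank-line runs."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+\n", "\n", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def split(text: str) -> List[str]:
    """Split on blank lines, i.e. paragraph-level chunks."""
    return [part for part in text.split("\n\n") if part.strip()]


def analyze(chunks: List[str]) -> Dict[str, int]:
    """Basic word-count analytics, available without any tokenizer."""
    counts = [len(chunk.split()) for chunk in chunks]
    return {
        "chunks": len(chunks),
        "total_words": sum(counts),
        "max_words": max(counts, default=0),
    }


def run_pipeline(path: Path) -> Dict[str, int]:
    """Extract -> Clean -> Split -> Analyze, in the order shown in the diagram."""
    return analyze(split(clean(extract(path))))
```

Because each stage takes and returns plain values, a stage can be replaced (for example, a PDF extractor or a sentence splitter) without touching the others.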

Progressive Loading Flow

Instant (0ms): Core functionality ready
Background (1-30s): Premium tokenizers load
Enhanced (30s+): Full premium features active
Continuous: Seamless feature activation

🤝 Contributing

Wolfstitch is open source and welcomes contributions! Please see our Contributing Guide for details.

Key Contribution Areas

Format Support: Add support for new file formats
Tokenizer Integration: Contribute new tokenizer implementations
UI/UX Enhancement: Improve progressive loading experience
Performance Optimization: Enhance background loading efficiency

📊 Performance & Reliability

Startup Performance

⚡ 0ms blocking time: App starts immediately
🔄 Background loading: No impact on user workflow
🛡️ Network resilient: Works offline and with slow connections
📱 Resource efficient: Minimal memory usage during startup

Compatibility

🌐 Corporate Networks: Core functionality needs no network access, so it keeps working behind restrictive firewalls
🔌 Offline Mode: Full basic functionality without internet
🐌 Slow Connections: Progressive enhancement adapts to network speed
🚫 API Failures: Graceful fallbacks ensure continuous operation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🎉 Experience the Future of Dataset Creation

Instant startup. Progressive enhancement. Professional results.

Start processing your datasets immediately while premium features activate seamlessly in the background. No waiting, no hanging, no compromises.

Ready to revolutionize your AI training workflow? Download Wolfstitch today!
