Skip to content

Turn files into clean, fine-tuning-ready datasets (TXT/CSV). EPUB, PDF, and token-aware. Local, GUI-based, no cloud required.

License

Notifications You must be signed in to change notification settings

wolflow-ai/wolfstitch

Repository files navigation

🧬 Wolfstitch - AI Training Dataset Creator

Professional-grade dataset creation for AI training with instant startup and progressive premium enhancement

Wolfstitch transforms documents, code, and text files into optimally-chunked, tokenizer-aware datasets for fine-tuning language models. Built for AI developers, researchers, and enterprises who need scalable, accurate dataset creation with complete cost transparency and zero-wait startup.

✨ Key Features

🚀 NEW: Instant Startup with Progressive Enhancement

  • ⚡ Zero-Wait Launch: App starts immediately with full core functionality
  • 🔄 Progressive Premium Loading: Advanced tokenizers and features load in background
  • 📊 Real-Time Loading Progress: Beautiful loading dialog shows premium feature activation
  • 🛡️ Bulletproof Reliability: Works perfectly offline, with slow networks, or behind firewalls
  • 🎯 Smart Fallbacks: Graceful degradation ensures functionality in any environment

🎯 Core Processing Pipeline

  • 📚 Comprehensive Format Support: 40+ file formats including documents, presentations, spreadsheets, and source code
  • 🧠 Context-Aware Cleaning: Preserves code structure while optimizing documents for AI training
  • 💻 Code Intelligence: Automatic detection of minified/auto-generated files with quality control
  • 🌍 International Support: Automatic character encoding detection for global codebases
  • 🔧 Smart Token-Aware Chunking: Configurable token limits (512-4096) with exact tokenization

💎 Premium Tokenizer System

  • 🔄 Hybrid Architecture: Immediate word-based estimation + background exact tokenizer loading
  • 🎯 5 Professional Tokenizers: GPT-2, GPT-3.5, GPT-4 (tiktoken), BERT, Sentence Transformers
  • 📈 Progressive Accuracy: Start with estimates, upgrade to exact counts as tokenizers load
  • 🔒 Access Control: Premium tokenizers with licensing integration
  • ⚡ Performance Optimized: Background loading never blocks user workflow

💰 Advanced Cost Analysis

  • 💡 15+ Training Approaches: Local, cloud, and hybrid training cost comparison
  • 🔄 Real-Time Pricing: Live pricing from Lambda Labs, Vast.ai, RunPod, and more
  • 📊 ROI Calculations: Break-even analysis and cost optimization recommendations
  • 💎 Progressive Enhancement: Basic estimates immediately, detailed analysis when loaded
  • 📋 Export-Ready Reports: Comprehensive cost reports in JSON, CSV, and Excel formats

🔐 Professional Licensing System

  • 🆓 7-Day Free Trial: Full access to all premium features without credit card
  • 🧑‍💻 Demo Mode: WOLFSTITCH_DEMO=true environment variable for development access
  • 🔑 Secure License Management: Encrypted key-based authentication system
  • ⏱️ Trial Tracking: Automatic countdown with upgrade prompts and status indicators

Enhanced User Experience

  • 🎛️ Smart Tokenizer Selection: Seamless dropdown with premium indicators and progressive loading
  • 🔍 Enhanced Preview: Color-coded chunks with efficiency indicators and real-time analytics
  • 📊 Live Analytics: Instant updates as tokenizers and features become available
  • 💬 Progressive Feedback: Clear loading states and feature activation notifications
  • 🎯 Status Indicators: Real-time license status and feature availability display

📋 Supported File Formats

✅ Business Documents

Format Status Description Key Features
PDF ✅ Complete Adobe PDF documents Text extraction, multi-page support
Word (.docx) ✅ Complete Microsoft Word documents Tables, formatting preservation
Excel (.xlsx) ✅ Complete Spreadsheets & data Multi-sheet intelligent extraction
PowerPoint (.pptx) ✅ Complete Presentations Slide text, speaker notes, tables
Web/HTML ✅ Complete Web pages & documentation Content isolation, clean extraction
Markdown ✅ Complete Technical documentation Syntax removal, clean formatting
EPUB ✅ Complete E-books Chapter extraction, metadata
Plain Text ✅ Complete TXT files Encoding detection, multi-format

✅ Source Code & Configuration Files

Format Status Description Key Features
Python (.py) ✅ Complete Python source code Auto-encoding detection, structure preservation
JavaScript (.js/.jsx) ✅ Complete JS/React code TypeScript support, quality control
Java (.java) ✅ Complete Java source code Comment preservation, structure detection
C/C++ (.c/.cpp/.h) ✅ Complete C/C++ source & headers Multiple extensions support
Go (.go) ✅ Complete Go source code UTF-8 handling, import detection
Rust (.rs) ✅ Complete Rust source code Cargo file support ready
Config Files ✅ Complete YAML, TOML, INI Structure preservation, comment handling
30+ Languages ✅ Complete Swift, Kotlin, Ruby, PHP, etc. Comprehensive language support

🎯 Use Cases

1. Business Document Fine-Tuning

Transform your organization's knowledge base into training data:

  • Company policies and procedures
  • Technical documentation and manuals
  • Training presentations and materials
  • Annual reports and business documents

2. Codebase Training

Prepare code repositories for AI model training:

  • Full Language Support: 30+ programming languages with intelligent extraction
  • Quality Control: Automatic detection and skipping of minified/auto-generated files
  • Encoding Handling: Automatic character encoding detection for international codebases
  • Structure Preservation: Maintains indentation and code structure with context-aware cleaning
  • Smart Filtering: Configurable file size limits and quality thresholds

3. Research & Academic

Process large document collections for research:

  • Academic papers and publications
  • Research datasets and corpora
  • Multi-format document libraries
  • Progressive processing with real-time feedback

🛠️ Installation

Prerequisites

  • Python 3.8 or higher
  • 8GB RAM recommended for large batch processing
  • Windows, macOS, or Linux
  • Internet connection for premium features (offline mode available)

Quick Start

# Clone the repository
git clone https://github.com/CLewisMessina/wolfstitch.git
cd wolfstitch

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
.\venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Note: If you see an error about chardet, install it separately:
pip install chardet>=5.0.0

# Launch application
python main.py

First Launch Experience

  1. Instant Startup: App launches immediately with core functionality
  2. Progressive Loading: Watch premium tokenizers load in the background
  3. Immediate Usage: Start processing files while advanced features activate
  4. Premium Trial: Enjoy 7-day free trial with full feature access

💡 Usage Guide

Instant Functionality

  • Immediate file processing with word-based token estimation
  • Basic analytics and chunk analysis available instantly
  • Document cleaning and splitting works immediately
  • Export functionality ready from first launch

Progressive Enhancement

  • 🔄 Exact tokenizers load in background (GPT-2, tiktoken, BERT)
  • 🔄 Premium features activate automatically as they become available
  • 🔄 Cost analysis becomes available with full calculator loading
  • 🔄 Advanced analytics enhance as premium features load

Single File Processing

  1. Click "Select File" or drag & drop a supported file
  2. Choose splitting method (paragraph, sentence, or custom)
  3. Select tokenizer (immediate options available, premium options load progressively)
  4. Click "Process Text" to chunk the document
  5. Preview chunks and export to desired format

Progressive Loading Dialog

  • 📊 Real-time progress for tokenizer and feature loading
  • 🎛️ Continue in background option to dismiss dialog anytime
  • Auto-completion notification when all features are ready
  • 🔄 Status updates for each premium component

🎯 Roadmap

✅ Phase A: Progressive Enhancement Foundation (Complete)

  • Hybrid tokenizer architecture with instant fallbacks ✅
  • Progressive loading UI with real-time status ✅
  • Context-aware cleaning system ✅
  • Zero-wait startup architecture ✅

🔄 Phase B: Batch Processing & Smart Chunking (In Progress)

  • Multi-file selection and batch processing
  • Token-aware intelligent chunking
  • Content deduplication and quality scoring
  • Metadata tracking and provenance

📋 Phase C: Enhanced Output & Integration (Planned)

  • JSONL export with metadata
  • Batch analytics dashboard
  • API integration options
  • Cloud storage support

🏗️ Architecture

Hybrid Progressive System

┌─────────────────────────────────────────┐
│             User Interface             │
├─────────────────────────────────────────┤
│         Progressive Loading UI         │
├─────────────────────────────────────────┤
│      Hybrid Processing Controller      │
├─────────────────────────────────────────┤
│    Immediate Fallbacks  │  Premium     │
│    • Word Estimator     │  • tiktoken  │
│    • Char Estimator     │  • GPT-2     │
│    • Basic Analytics    │  • BERT      │
│                         │  • Cost Calc │
├─────────────────────────────────────────┤
│         Core Processing Pipeline        │
│  Extract → Clean → Split → Analyze     │
└─────────────────────────────────────────┘

Progressive Loading Flow

  1. Instant (0ms): Core functionality ready
  2. Background (1-30s): Premium tokenizers load
  3. Enhanced (30s+): Full premium features active
  4. Continuous: Seamless feature activation

🤝 Contributing

Wolfstitch is open source and welcomes contributions! Please see our Contributing Guide for details.

Key Contribution Areas

  • Format Support: Add support for new file formats
  • Tokenizer Integration: Contribute new tokenizer implementations
  • UI/UX Enhancement: Improve progressive loading experience
  • Performance Optimization: Enhance background loading efficiency

📊 Performance & Reliability

Startup Performance

  • 0ms blocking time: App starts immediately
  • 🔄 Background loading: No impact on user workflow
  • 🛡️ Network resilient: Works offline and with slow connections
  • 📱 Resource efficient: Minimal memory usage during startup

Compatibility

  • 🌐 Corporate Networks: Bypasses firewall restrictions
  • 🔌 Offline Mode: Full basic functionality without internet
  • 🐌 Slow Connections: Progressive enhancement adapts to network speed
  • 🚫 API Failures: Graceful fallbacks ensure continuous operation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🎉 Experience the Future of Dataset Creation

Instant startup. Progressive enhancement. Professional results.

Start processing your datasets immediately while premium features activate seamlessly in the background. No waiting, no hanging, no compromises.

Ready to revolutionize your AI training workflow? Download Wolfstitch today!

About

Turn files into clean, fine-tuning-ready datasets (TXT/CSV). EPUB, PDF, and token-aware. Local, GUI-based, no cloud required.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages