AI-powered real-time sign language recognition using computer vision and machine learning.
Transform hand gestures into text and speech directly in your browser — no server required, no data leaves your device.
The Sign Language Interpretation System (SLIS) is a cutting-edge, browser-based application that interprets sign language gestures in real-time. Using advanced computer vision powered by MediaPipe and machine learning classification, it bridges communication gaps by converting hand movements into readable text and audible speech.
| Capability | Description |
|---|---|
| Real-time Processing | Processes webcam input at 30+ FPS with latency under 50ms |
| Dual Recognition Modes | Pre-trained ASL gestures + custom trainable gestures |
| Complete Privacy | All processing happens locally using WebAssembly/WebGL |
| Zero Dependencies | Single HTML file, no build process, no server required |
| Persistent Learning | Trained models saved to IndexedDB across sessions |
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ WEBCAM │───▶│ MEDIAPIPE │───▶│ FEATURE │───▶│ OUTPUT │
│ CAPTURE │ │ DETECTION │ │ CLASSIFY │ │ TEXT/VOICE │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
30 FPS 21 landmarks 166-dim vector Real-time
╔══════════════════════════════════════════════════════════════════════════════╗
║ ║
║ [ APPLICATION SCREENSHOT PLACEHOLDER ] ║
║ ║
║ Insert a screenshot showing the main interface with: ║
║ • Webcam feed with hand landmark overlay ║
║ • Control panel with training options ║
║ • Recognition output display ║
║ ║
╚══════════════════════════════════════════════════════════════════════════════╝
The detection system uses MediaPipe's HandLandmarker model, which identifies 21 key points on each hand:
- Wrist (1 point): Base reference for all measurements
- Thumb (4 points): CMC, MCP, IP, TIP joints
- Index/Middle/Ring/Pinky (4 points each): MCP, PIP, DIP, TIP joints
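For reference, these points follow MediaPipe's documented landmark ordering (indices 0-20); a small lookup sketch, with constant names chosen here purely for illustration:

```js
// MediaPipe hand landmark indices (0-20). Constant names are illustrative.
const LANDMARK = {
  WRIST: 0,
  THUMB_CMC: 1, THUMB_MCP: 2, THUMB_IP: 3, THUMB_TIP: 4,
  INDEX_MCP: 5, INDEX_PIP: 6, INDEX_DIP: 7, INDEX_TIP: 8,
  MIDDLE_MCP: 9, MIDDLE_PIP: 10, MIDDLE_DIP: 11, MIDDLE_TIP: 12,
  RING_MCP: 13, RING_PIP: 14, RING_DIP: 15, RING_TIP: 16,
  PINKY_MCP: 17, PINKY_PIP: 18, PINKY_DIP: 19, PINKY_TIP: 20
};

// Fingertip indices reused later for the distance features.
const FINGERTIPS = [LANDMARK.THUMB_TIP, LANDMARK.INDEX_TIP, LANDMARK.MIDDLE_TIP,
                    LANDMARK.RING_TIP, LANDMARK.PINKY_TIP];
```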
Step-by-step process:
- Webcam captures RGB frame at ~30 FPS
- Frame passed to MediaPipe via WebAssembly
- Hand regions detected using palm detection model
- Landmarks extracted using hand landmark model
- 3D coordinates (x, y, z) returned for each point
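A minimal sketch of this loop, assuming the @mediapipe/tasks-vision package is available (via a bundler or import map) and a `<video>` element is already attached to the webcam stream; the element lookup and asset paths are illustrative, and the configuration mirrors the one shown later in this document:

```js
// Run as an ES module (top-level await).
import { FilesetResolver, HandLandmarker } from '@mediapipe/tasks-vision';

// Load the WASM runtime and the hand landmark model.
const vision = await FilesetResolver.forVisionTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm'
);
const landmarker = await HandLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: 'hand_landmarker.task', delegate: 'GPU' },
  runningMode: 'VIDEO',
  numHands: 2
});

const video = document.querySelector('video');

function onFrame(timestampMs) {
  // result.landmarks holds one array per detected hand, each with 21 {x, y, z} points.
  const result = landmarker.detectForVideo(video, timestampMs);
  for (const hand of result.landmarks) {
    // hand[0] is the wrist; the 21 points feed the feature extraction stage below.
  }
  requestAnimationFrame(onFrame);
}
requestAnimationFrame(onFrame);
```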
The recognition system transforms raw landmarks into meaningful classifications:
- Normalization: Landmarks centered on wrist, scaled to unit size (sketched after this list)
- Feature Extraction: 166-dimensional vector computed (coords + angles + distances)
- Classification: KNN compares against stored training samples
- Filtering: Temporal smoothing prevents rapid switching between gestures
- Output: Confident predictions displayed and optionally spoken
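The normalization step can be sketched as follows; this is a simplified version, and the function name is illustrative rather than the exact one used in index.html:

```js
// Center the 21 landmarks on the wrist and scale to unit size so that hand
// position and distance from the camera do not affect the features.
function normalizeLandmarks(landmarks) {
  const wrist = landmarks[0];
  const centered = landmarks.map(p => ({
    x: p.x - wrist.x,
    y: p.y - wrist.y,
    z: p.z - wrist.z
  }));
  // Scale by the largest distance from the wrist (a proxy for hand size).
  const scale = Math.max(...centered.map(p => Math.hypot(p.x, p.y, p.z))) || 1;
  return centered.map(p => ({ x: p.x / scale, y: p.y / scale, z: p.z / scale }));
}
```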
| Technology | Role | Implementation Details |
|---|---|---|
| MediaPipe | Hand detection & landmark extraction | WebAssembly build with WebGL acceleration; processes frames in ~15-25ms |
| HTML5 Canvas | Video display & landmark visualization | Real-time rendering of webcam feed with skeleton overlay |
| Web Speech API | Text-to-speech output | Browser-native speech synthesis with voice selection |
| IndexedDB | Persistent storage | Stores trained models, settings, and session data |
| JavaScript ES6+ | Application logic | Vanilla JS with async/await, classes, modules |
| CSS3 | Styling & animations | Custom properties, flexbox, grid, transitions |
| Technology | Role | Implementation Details |
|---|---|---|
| FastAPI | REST API server | Async framework for high-performance endpoints |
| Uvicorn | ASGI server | Production-ready server with hot reload |
| Starlette | HTTP toolkit | Foundation for FastAPI routing and middleware |
Why Browser-Based?
- Privacy: No data transmitted to external servers
- Accessibility: Works on any device with a modern browser
- Performance: WebAssembly + WebGL achieve near-native speeds
- Simplicity: Single file deployment, no installation required
Why KNN Classification?
- Incremental Learning: New gestures added without retraining entire model
- Interpretability: Easy to understand why classifications are made
- Low Latency: Classification in <5ms per frame
- Small Footprint: Model size scales linearly with training data
Component Interactions:
| Component | Input | Output | Connects To |
|---|---|---|---|
| SLISApp | User events, config | Commands, state | All components |
| MediaPipe | RGB frames | Landmark arrays | FeatureExtractor |
| FeatureExtractor | Landmarks | Feature vectors | KNNClassifier |
| KNNClassifier | Feature vectors | Labels + confidence | Output Handler |
| ModuleDB | Save/load requests | Model data | KNNClassifier |
| SpeechSynthesis | Text strings | Audio output | Output Handler |
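The persistence path through ModuleDB can be sketched roughly as follows; the database name, object store, and key used here are assumptions, not necessarily those in the app:

```js
// Persist the classifier's training samples to IndexedDB so they survive reloads.
function saveModel(samples) {
  const req = indexedDB.open('slis', 1);
  req.onupgradeneeded = () => req.result.createObjectStore('models');
  req.onsuccess = () => {
    const tx = req.result.transaction('models', 'readwrite');
    tx.objectStore('models').put(samples, 'trained-gestures');
  };
}

// Load them back on startup; falls back to an empty model if nothing is stored.
function loadModel(onLoaded) {
  const req = indexedDB.open('slis', 1);
  req.onupgradeneeded = () => req.result.createObjectStore('models');
  req.onsuccess = () => {
    const get = req.result
      .transaction('models', 'readonly')
      .objectStore('models')
      .get('trained-gestures');
    get.onsuccess = () => onLoaded(get.result || []);
  };
}
```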
Stage-by-Stage Breakdown:
Webcam Capture:
- Input: Raw video stream from getUserMedia API
- Output: RGB frames at ~30 FPS
- Latency: ~33ms per frame
- Technical Note: Constraints set for optimal resolution (640x480 default)
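A minimal capture setup under these constraints, assuming a `<video>` element with id `webcam` (the id is illustrative):

```js
// Request the webcam at the 640x480 default and attach the stream to the video element.
const stream = await navigator.mediaDevices.getUserMedia({
  video: { width: { ideal: 640 }, height: { ideal: 480 }, frameRate: { ideal: 30 } },
  audio: false
});
const video = document.getElementById('webcam');
video.srcObject = stream;
await video.play();
```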
MediaPipe Hand Detection:
- Input: RGB frame (Uint8Array)
- Output: Array of hand landmarks (21 points × 3 coordinates × 2 hands max)
- Latency: 15-25ms per frame
- Technical Note: WebGL backend preferred; falls back to CPU if unavailable
Feature Extraction:
- Input: Raw landmark coordinates
- Output: 166-dimensional feature vector
- Latency: <1ms
- Technical Note: Rotation-invariant features ensure consistent recognition regardless of hand orientation
Feature Vector Composition:
| Component | Dimensions | Description |
|---|---|---|
| Normalized XYZ | 63 | Wrist-centered coordinates for 21 landmarks |
| Finger Angles | 5 | Bend angle (0-180°) for each finger |
| Fingertip Distances | 5 | Euclidean distance from each fingertip to wrist |
| Inter-finger Distances | 10 | Pairwise distances between all fingertips |
| Total | 83 per hand | 166 for two hands |
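One way to assemble the 83-dimensional per-hand vector from the table above; the helper names and the exact bend-angle definition are assumptions, but the dimension counts match the table:

```js
const dist = (a, b) => Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);

// Angle in degrees at joint b, formed by the segments b->a and b->c.
function angleDeg(a, b, c) {
  const u = { x: a.x - b.x, y: a.y - b.y, z: a.z - b.z };
  const v = { x: c.x - b.x, y: c.y - b.y, z: c.z - b.z };
  const dot = u.x * v.x + u.y * v.y + u.z * v.z;
  const cos = dot / ((Math.hypot(u.x, u.y, u.z) * Math.hypot(v.x, v.y, v.z)) || 1);
  return (Math.acos(Math.min(1, Math.max(-1, cos))) * 180) / Math.PI;
}

// norm = 21 wrist-centered, unit-scaled points from the normalization step.
function extractFeatures(norm) {
  const features = [];
  for (const p of norm) features.push(p.x, p.y, p.z);               // 63 coordinates
  const fingers = [[1, 2, 4], [5, 6, 8], [9, 10, 12], [13, 14, 16], [17, 18, 20]];
  for (const [base, mid, tip] of fingers)
    features.push(angleDeg(norm[base], norm[mid], norm[tip]));      // 5 bend angles
  const tips = [4, 8, 12, 16, 20];
  for (const t of tips) features.push(dist(norm[t], norm[0]));      // 5 fingertip-to-wrist
  for (let i = 0; i < tips.length; i++)
    for (let j = i + 1; j < tips.length; j++)
      features.push(dist(norm[tips[i]], norm[tips[j]]));            // 10 pairwise fingertips
  return features;                                                  // 83 values per hand
}
```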
KNN Classification:
- Input: Feature vector
- Output: Gesture label + confidence score
- Latency: ~5ms
- Technical Note: Uses inverse distance weighting; k=5 neighbors vote on classification
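The weighted vote can be sketched as below, assuming `samples` is an array of `{ label, features }` training entries (names are illustrative):

```js
// k-NN classification with inverse-distance-weighted voting.
function classify(features, samples, k = 5) {
  const euclidean = (a, b) =>
    Math.sqrt(a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0));

  const neighbors = samples
    .map(s => ({ label: s.label, d: euclidean(features, s.features) }))
    .sort((a, b) => a.d - b.d)
    .slice(0, k);

  // Each neighbor votes with weight 1/(d + epsilon), so closer samples count more.
  const votes = {};
  for (const n of neighbors) votes[n.label] = (votes[n.label] || 0) + 1 / (n.d + 1e-6);

  const total = Object.values(votes).reduce((a, b) => a + b, 0);
  const [label, weight] = Object.entries(votes).sort((a, b) => b[1] - a[1])[0];
  return { label, confidence: weight / total };
}
```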
Temporal Filtering:
- Input: Raw classification results
- Output: Stable, filtered output
- Latency: <1ms
- Technical Note: Debounce prevents flickering; confidence threshold filters uncertain predictions
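One way to implement this filter, using the 0.8 confidence threshold and 500 ms debounce shown in the configuration examples later in this document (the exact mechanism in index.html may differ):

```js
// Emit a gesture only after it has stayed the top prediction for debounceMs
// and its confidence clears the threshold; otherwise hold the last stable output.
function createTemporalFilter({ debounceMs = 500, minConfidence = 0.8 } = {}) {
  let candidate = null, candidateSince = 0, stable = null;
  return function update({ label, confidence }, now = performance.now()) {
    if (confidence < minConfidence) return stable;          // drop uncertain predictions
    if (label !== candidate) { candidate = label; candidateSince = now; }
    if (now - candidateSince >= debounceMs) stable = candidate;
    return stable;
  };
}
```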
Output:
- Input: Filtered gesture label
- Output: Visual display + optional speech
- Latency: ~10ms (speech synthesis async)
- Technical Note: Rate limiting prevents speech overlap
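Rate-limited speech via the Web Speech API can be sketched like this (the minimum gap value is an assumption):

```js
// Speak a recognized gesture, skipping output while speech is already playing
// or if the previous utterance was too recent, to avoid overlapping audio.
let lastSpokenAt = 0;
function speak(text, { minGapMs = 1500, voice = null } = {}) {
  const now = Date.now();
  if (speechSynthesis.speaking || now - lastSpokenAt < minGapMs) return;
  const utterance = new SpeechSynthesisUtterance(text);
  if (voice) utterance.voice = voice;   // e.g. a voice from speechSynthesis.getVoices()
  speechSynthesis.speak(utterance);
  lastSpokenAt = now;
}
```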
MediaPipe HandLandmarker Configuration:
const config = {
baseOptions: {
modelAssetPath: 'hand_landmarker.task',
delegate: 'GPU' // WebGL acceleration
},
runningMode: 'VIDEO',
numHands: 2,
minHandDetectionConfidence: 0.5,
minHandPresenceConfidence: 0.5,
minTrackingConfidence: 0.5
};

| Requirement | Minimum | Recommended |
|---|---|---|
| Browser | Chrome 90+, Edge 90+, Firefox 100+ | Chrome/Edge latest |
| Hardware | Webcam, WebGL 1.0 | HD webcam, WebGL 2.0 |
| Permissions | Camera access | Camera + Microphone |
| Storage | 50MB free | 100MB+ for models |
| Network | Initial load only | Offline capable after first load |
# Clone the repository
git clone https://github.com/yourusername/sign-language-interpretation-system.git
# Navigate to project directory
cd sign-language-interpretation-system

Option A: Open index.html directly

# Simply open index.html in your browser
# On macOS:
open index.html
# On Windows:
start index.html
# On Linux:
xdg-open index.html

Note: Some browsers restrict webcam access for file:// URLs. If you encounter issues, use Option B or C.

Option B: Serve over a local HTTP server
# Using Python 3
python -m http.server 8000
# Then navigate to:
# http://localhost:8000

# Using npx (no installation required)
npx serve .
# Or install serve globally
npm install -g serve
serve .
# Then navigate to the displayed URL

Option C: Run the optional Python backend

# Install dependencies
pip install -r requirements.txt
# Start the FastAPI server
uvicorn src.server:app --reload
# Server runs at http://localhost:8000

Permissions:

- Camera Access: Click "Allow" when prompted by your browser
- Microphone (optional): Required only for voice input features
- Storage: Automatically granted for IndexedDB
✓ Webcam feed appears in the main panel
✓ Hand landmarks overlay when hand is visible
✓ Status indicator shows "READY" or "DETECTING"
✓ Control panel is responsive
| Issue | Solution |
|---|---|
| Webcam not detected | Check browser permissions; try different browser |
| Slow performance | Enable hardware acceleration in browser settings |
| Landmarks not showing | Ensure good lighting; hand should be clearly visible |
| Speech not working | Check system volume; some browsers require user interaction first |
Step-by-step:
- Launch Application: Open index.html in your browser
- Allow Camera: Grant webcam permission when prompted
- Position Hand: Hold your hand in front of the camera (12-24 inches away)
- View Results: Recognized gestures appear in the output panel
- Enable Voice: Toggle "Voice Output" to hear gestures spoken aloud
Interface Controls:
| Control | Function | Keyboard Shortcut |
|---|---|---|
| Toggle Skeleton | Show/hide hand landmark overlay | S |
| Toggle Voice | Enable/disable speech output | V |
| Voice Selection | Choose from available system voices | - |
| Clear Output | Reset the output display | C |
Step-by-step:
- Enter Training Mode: Click "Train" button or press T
- Name Your Gesture: Type a label (e.g., "Hello", "Thank You")
- Position Hand: Show the gesture you want to train
- Capture Samples: Click "Capture" or press Space (repeat 15-20 times)
- Add Variations: Slightly vary hand position/rotation between captures
- Save Model: Click "Save" to persist to IndexedDB
- Test: Return to Recognition Mode and test your gesture
Training Best Practices:
| Practice | Reason |
|---|---|
| Capture 15-20 samples per gesture | More samples = better accuracy |
| Vary hand position slightly | Improves robustness to positioning |
| Include different distances | Handles near/far variations |
| Use consistent lighting | Reduces false positives |
| Train similar gestures separately | Helps distinguish between them |
| Action | Description | How To |
|---|---|---|
| Save Model | Export trained gestures to JSON file | Click "Export" → Save file |
| Load Model | Import a previously saved model | Click "Import" → Select file |
| Clear Model | Remove all trained gestures | Click "Clear" → Confirm |
| View Samples | See captured training data | Click gesture name in list |
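Export and import boil down to serializing the training samples as JSON; a rough sketch (the file layout and names are assumptions, not the app's exact schema):

```js
// Export: download the trained samples as a JSON file.
function exportModel(samples, filename = 'slis-model.json') {
  const blob = new Blob([JSON.stringify(samples, null, 2)], { type: 'application/json' });
  const a = document.createElement('a');
  a.href = URL.createObjectURL(blob);
  a.download = filename;
  a.click();
  URL.revokeObjectURL(a.href);
}

// Import: read a previously exported JSON file (from an <input type="file">) back into memory.
async function importModel(file) {
  return JSON.parse(await file.text());
}
```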
Custom Confidence Threshold:
// In browser console:
app.setConfidenceThreshold(0.8); // Higher = stricter matching

Adjust Debounce Time:
// In browser console:
app.setDebounceTime(500); // Milliseconds between output changes

| Path | Type | Purpose |
|---|---|---|
| index.html | File | Complete application (HTML + CSS + JS in single file) |
| requirements.txt | File | Optional Python backend dependencies |
| src/ | Directory | Optional Python server modules |
| src/__init__.py | File | Package initialization with version info |
| readme/ | Directory | Documentation and visual assets |
| readme/readme.md | File | This documentation file |
| readme/assets/ | Directory | SVG diagrams, icons, and images |
| Section | Lines (approx) | Description |
|---|---|---|
| HTML Structure | 1-100 | Document layout, panels, controls |
| CSS Styles | 100-400 | Dark theme, responsive layout, animations |
| JavaScript Core | 400-800 | SLISApp class, state management |
| MediaPipe Integration | 800-1000 | Hand detection, landmark processing |
| Feature Extraction | 1000-1200 | Vector computation, normalization |
| KNN Classifier | 1200-1400 | Training, classification, storage |
| UI Handlers | 1400-1600 | Event listeners, DOM updates |
| Speech Synthesis | 1600-1700 | Voice output, configuration |
| Layer | Technology | File/Location |
|---|---|---|
| Frontend | HTML5, CSS3, JS ES6+ | index.html |
| Detection | MediaPipe HandLandmarker | CDN (loaded at runtime) |
| Classification | Custom KNN | index.html (inline) |
| Storage | IndexedDB | Browser API |
| Speech | Web Speech API | Browser API |
| Backend (optional) | FastAPI, Uvicorn | src/, requirements.txt |
╔══════════════════════════════════════════════════════════════════════════════╗
║ ║
║ [ DEMO VIDEO PLACEHOLDER ] ║
║ ║
║ Insert a video demonstration showing: ║
║ • Application startup and camera initialization ║
║ • Real-time gesture recognition in action ║
║ • Training a custom gesture ║
║ • Voice output functionality ║
║ • Model export/import workflow ║
║ ║
║ Recommended format: MP4/WebM, 30-60 seconds ║
║ Recommended resolution: 1280x720 or higher ║
║ ║
╚══════════════════════════════════════════════════════════════════════════════╝
Video demonstration showing real-time gesture recognition, custom training workflow, and voice output features
