Features • Architecture • How It Works • Tech Stack • Installation • Usage • Structure • Commands
AURA is a production-grade, local desktop AI voice assistant and automation platform built entirely in Python. It enables hands-free control of your Windows PC through natural-language voice commands, with no manual interaction required after startup.
The system features a biometric face-authentication gateway, a dual-layer NLU engine (deterministic rules + LLM fallback), cloud-accelerated speech-to-text via Groq Whisper, and a native Windows SAPI5 TTS engine. These are tied together by a custom event-driven architecture built on a publish/subscribe event bus and a validated state machine.
Built for Scale: Designed as a modular system with clearly separated concerns: audio pipeline, authentication, natural language understanding, action dispatching, and GUI. Each subsystem operates independently and communicates via events.
- On launch, AURA starts silently in the LOCKED state: the microphone is active, but the UI is hidden.
- When the wake phrase is detected, the GUI surfaces and OpenCV's YuNet face detector (ONNX, ~200 KB) scans the camera.
- A 128-dimensional SFace embedding is extracted from the detected face and compared via cosine similarity against stored enrollment embeddings.
- Threshold: a score ≥ 0.75 confidence is required to accept a detection; recognition similarity must also pass a tuned threshold.
- Supports multi-user enrollment stored in local SQLite (`aura.db`). Includes automatic migration from legacy databases.
- Dev bypass via the `--bypass-auth` flag for rapid iteration.
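The embedding comparison described above can be illustrated with a minimal, dependency-free sketch. The function names, the enrollment dictionary layout, and the `0.5` threshold are assumptions made for illustration; they are not taken from AURA's `face_auth.py`, and real SFace embeddings would have 128 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(probe, enrolled, threshold):
    """Return (username, score) for the closest enrolled embedding.

    `enrolled` maps a username to a list of stored embeddings (hypothetical layout).
    The username is None when the best score stays below `threshold`.
    """
    best_user, best_score = None, -1.0
    for user, embeddings in enrolled.items():
        for embedding in embeddings:
            score = cosine_similarity(probe, embedding)
            if score > best_score:
                best_user, best_score = user, score
    if best_score < threshold:
        return None, best_score
    return best_user, best_score


# Toy 2-d vectors stand in for 128-d embeddings.
enrolled = {"alice": [[0.9, 0.1]], "bob": [[0.0, 1.0]]}
user, score = best_match([1.0, 0.0], enrolled, threshold=0.5)
print(user, round(score, 3))
```

Keeping several embeddings per user and taking the maximum score is one simple way to tolerate pose and lighting changes between enrollment and login.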
- A single `MicStream` thread continuously captures 16 kHz mono PCM audio in 30 ms frames (480 samples).
- WebRTC VAD (`webrtcvad`, aggressiveness level 2) processes each frame to detect speech onset and offset.
- A pre-speech ring buffer (5 frames = 150 ms) is prepended to every captured utterance to avoid clipping the first syllable.
- A complete utterance is sent to Groq Whisper (`whisper-large-v3-turbo`) for transcription, then matched against the configured wake phrase ("take control").
- No separate wake-word model is needed: Whisper handles both wake detection and command transcription.
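The pre-speech ring buffer can be sketched with a bounded `collections.deque`. This is an illustrative sketch, not AURA's `VadManager`: the `capture_utterances` name, the `silence_frames_to_end` cutoff, and the `is_speech` callback are assumptions. With `webrtcvad`, the callback could be `lambda frame: vad.is_speech(frame, 16000)` for a `webrtcvad.Vad(2)` instance.

```python
from collections import deque

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 480 samples of 16-bit PCM = 960 bytes
PRE_SPEECH_FRAMES = 5                             # 5 x 30 ms = 150 ms of pre-roll


def capture_utterances(frames, is_speech, silence_frames_to_end=10):
    """Yield complete utterances as bytes, each prefixed with the buffered pre-roll."""
    pre_roll = deque(maxlen=PRE_SPEECH_FRAMES)  # oldest frames fall off automatically
    utterance = []
    silence = 0
    in_speech = False
    for frame in frames:
        speech = is_speech(frame)
        if not in_speech:
            if speech:
                # Speech onset: start the utterance with the buffered pre-roll.
                in_speech = True
                utterance = list(pre_roll) + [frame]
                pre_roll.clear()
                silence = 0
            else:
                pre_roll.append(frame)
        else:
            utterance.append(frame)
            silence = 0 if speech else silence + 1
            if silence >= silence_frames_to_end:
                # Speech offset: enough trailing silence, emit the utterance.
                yield b"".join(utterance)
                in_speech = False
                utterance = []
    if in_speech and utterance:
        yield b"".join(utterance)


# Synthetic stream: 8 silent frames, 3 "speech" frames, 10 silent frames.
silent, voiced = b"\x00" * FRAME_BYTES, b"\x01" * FRAME_BYTES
stream = [silent] * 8 + [voiced] * 3 + [silent] * 10
for audio in capture_utterances(stream, is_speech=lambda f: f[0] == 1):
    print(len(audio) // FRAME_BYTES, "frames captured")
```

Only the last five silent frames before onset survive in the deque, so the utterance carries 150 ms of context without the buffer growing during long silences.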
The NLU pipeline uses a two-stage classification approach that combines reliability with intelligence:

Stage 1: Fast Rule Matcher (Deterministic)
- `FastRuleMatcher` scans normalized text against curated keyword/synonym dictionaries (`synonym_map.json`, `app_mappings.json`, `site_mappings.json`).
- Returns an intent with `confidence = 1.0` immediately: no API call, no network latency.
- Handles the majority of everyday commands (open/close apps, search, time, weather, media, screenshot, system control).

Stage 2: LLM Intent Brain (Groq LLaMA Fallback)
- If Stage 1 returns no match, the text is sent to `llama-3.1-8b-instant` via the Groq API with a structured system prompt.
- Returns a JSON object: `{intent, action, slots, confidence, needs_clarification, requires_confirmation}`.
- Temperature is set to `0.0` for deterministic structured outputs (`response_format: json_object`).

Post-Processing Pipeline:
- Context Resolution: pronouns like "it", "this", "that" are resolved against the `ContextMemory` (last entity mentioned).
- Canonical Cross-Check: if `open_app` targets a known website, the intent is automatically pivoted to `open_website`.
- Confidence Guard: low-confidence results are escalated to conversation fallback.
- Single-Intent Processing: only the highest-confidence intent per utterance is executed (prevents accidental chained actions).
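The rule-first, LLM-fallback flow with the cross-check and confidence guard might look roughly like the sketch below. The `Intent` dataclass, the `RULES` and `KNOWN_SITES` tables, the `0.5` cutoff, and the `llm_fallback` callable are all hypothetical stand-ins; AURA's actual dictionaries live in its JSON config files and its LLM call goes through the Groq API.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    intent: str
    action: str = ""
    slots: dict = field(default_factory=dict)
    confidence: float = 0.0


# Hypothetical keyword and site tables, for illustration only.
RULES = {"open": "open_app", "launch": "open_app", "close": "close_app"}
KNOWN_SITES = {"github": "https://github.com"}


def fast_match(text):
    """Stage 1: deterministic keyword lookup; returns None when nothing matches."""
    words = text.lower().split()
    for i, word in enumerate(words):
        if word in RULES:
            target = " ".join(words[i + 1:])
            return Intent(RULES[word], slots={"target": target}, confidence=1.0)
    return None


def classify(text, llm_fallback, min_confidence=0.5):
    """Run Stage 1, fall back to Stage 2, then apply post-processing."""
    result = fast_match(text)
    if result is None:
        result = llm_fallback(text)  # Stage 2 is only called on a rule miss

    # Canonical cross-check: an "app" that is really a website becomes open_website.
    target = result.slots.get("target", "")
    if result.intent == "open_app" and target in KNOWN_SITES:
        result = Intent("open_website", slots={"url": KNOWN_SITES[target]},
                        confidence=result.confidence)

    # Confidence guard: weak results are escalated to free-form conversation.
    if result.confidence < min_confidence:
        result = Intent("conversation", slots={"text": text},
                        confidence=result.confidence)
    return result


def fake_llm(text):
    """Stand-in for the LLM stage; always returns a low-confidence guess."""
    return Intent("weather", slots={"location": "unknown"}, confidence=0.3)


print(classify("open github", fake_llm).intent)
print(classify("hmm what now", fake_llm).intent)
```

Passing the fallback in as a callable keeps the sketch testable without network access and makes it obvious that the remote model is consulted only when the rules miss.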
- `TTSThread` runs as a dedicated QThread with an internal `queue.Queue` for thread-safe speech requests.
- Uses Windows SAPI5 (`SAPI.SpVoice`) via `comtypes`: no model downloads, no latency overhead.
- Speaks asynchronously (flag `1`) while `WaitUntilDone(100ms)` polls in a tight worker loop.
- Immediate interrupt capability: `stop_speaking()` calls `Speak("", 2)` (SVSFPurgeBeforeSpeak flag) to halt mid-sentence.
- Automatically selects a male voice (searches for "david", "mark", "james", "george" in installed voices).
- COM is initialized per-thread (`pythoncom.CoInitialize()`) and cleaned up on shutdown.
- Emits `speech_started` and `speech_ended` Qt signals; the main window mutes the microphone during speech to eliminate feedback loops.
AURA uses a thread-safe, validated state machine with 7 states:
LOCKED → IDLE → LISTENING → THINKING → EXECUTING → SPEAKING → LISTENING
- All transitions are validated against an allowed-transitions table; illegal transitions are logged and blocked.
- Every state change publishes a `state.changed` event on the EventBus, which updates the GUI, status bar, and orb visualizer in real time.
- The `StateMachine` is a thread-safe singleton using a `threading.Lock`.
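A validated transition table guarded by a lock can be sketched as follows. Only the six states named in the flow above are modeled, and the `ALLOWED` table is an assumption made for illustration; the real table in `state_machine.py` may permit different edges. The `on_change` callback stands in for publishing `state.changed`.

```python
import threading
from enum import Enum


class State(Enum):
    LOCKED = "locked"
    IDLE = "idle"
    LISTENING = "listening"
    THINKING = "thinking"
    EXECUTING = "executing"
    SPEAKING = "speaking"


# Hypothetical allowed-transitions table.
ALLOWED = {
    State.LOCKED: {State.IDLE},
    State.IDLE: {State.LISTENING, State.LOCKED},
    State.LISTENING: {State.THINKING, State.IDLE},
    State.THINKING: {State.EXECUTING, State.LISTENING},
    State.EXECUTING: {State.SPEAKING},
    State.SPEAKING: {State.LISTENING},
}


class StateMachine:
    def __init__(self, on_change=None):
        self._state = State.LOCKED
        self._lock = threading.Lock()
        self._on_change = on_change  # called as on_change(old_state, new_state)

    @property
    def state(self):
        with self._lock:
            return self._state

    def transition(self, new_state):
        """Apply the transition if the table allows it; return True on success."""
        with self._lock:
            old_state = self._state
            if new_state not in ALLOWED.get(old_state, set()):
                return False  # illegal transition is blocked
            self._state = new_state
        # Notify outside the lock so a slow listener cannot stall other threads.
        if self._on_change is not None:
            self._on_change(old_state, new_state)
        return True


machine = StateMachine(on_change=lambda old, new: print(old.name, "->", new.name))
machine.transition(State.IDLE)
print(machine.transition(State.SPEAKING))  # IDLE -> SPEAKING is not in the table
```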
- `EventBus` is a singleton pub/sub bus backed by a PySide6 `QObject` with a `Signal(str, object)`.
- `publish()` is callable from any thread: it emits a Qt signal, ensuring callbacks always run on the Qt main thread (thread-safe UI updates).
- 20+ named event types: `auth.success`, `wake.detected`, `intent.classified`, `tts.start`, `state.changed`, `system.shutdown`, etc.
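The same idea, publishing from any thread while running callbacks on a single thread, can be shown without Qt. The sketch below swaps the Qt signal for a `queue.Queue` that the owning thread drains; it is a plain-Python analogue for illustration, not the PySide6 implementation, and the `drain()` method is an assumption that plays the role of the Qt event loop.

```python
import queue
import threading
from collections import defaultdict


class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)
        self._lock = threading.Lock()
        self._pending = queue.Queue()  # thread-safe hand-off to the owning thread

    def subscribe(self, event, callback):
        with self._lock:
            self._subscribers[event].append(callback)

    def publish(self, event, payload=None):
        """Safe to call from any thread: only enqueues, never runs callbacks."""
        self._pending.put((event, payload))

    def drain(self):
        """Run queued callbacks on the calling thread; return the number of events handled."""
        handled = 0
        while True:
            try:
                event, payload = self._pending.get_nowait()
            except queue.Empty:
                return handled
            with self._lock:
                callbacks = list(self._subscribers.get(event, ()))
            for callback in callbacks:
                callback(payload)
            handled += 1


bus = EventBus()
bus.subscribe("state.changed", lambda payload: print("state is now", payload))

worker = threading.Thread(target=bus.publish, args=("state.changed", "IDLE"))
worker.start()
worker.join()
bus.drain()  # the callback runs here, on the main thread
```

In the Qt version, a queued signal connection provides this hand-off automatically, which is why no explicit drain step appears in the description above.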
- Dangerous commands (`shutdown`, `restart`, `lock`, `format`, `delete`) are tagged `requires_confirmation = True` by both the rule matcher and the LLM.
- A visual confirmation dialog (PySide6) appears simultaneously with a verbal prompt.
- The user can confirm via voice ("yes", "confirm", "proceed") or cancel ("no", "cancel", "stop") within a 6-second timeout window.
- `ConfirmationService` holds the pending `ParsedCommand` and resolves it on a voice or UI response.
- `SessionGuard` enforces role-based access control: certain actions require fresh biometric re-authentication.
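A pending confirmation with a timeout can be modeled with a `threading.Event`. The `PendingConfirmation` class and its methods are hypothetical names for illustration and are not the API of `ConfirmationService`; the word lists and the 6-second default come from the description above. Treating a timeout as a cancel, and letting a cancel word win over a confirm word, are assumed safety choices.

```python
import threading

CONFIRM_WORDS = {"yes", "confirm", "proceed"}
CANCEL_WORDS = {"no", "cancel", "stop"}


class PendingConfirmation:
    def __init__(self, command, timeout_s=6.0):
        self.command = command
        self.timeout_s = timeout_s
        self._answered = threading.Event()
        self._approved = False

    def answer(self, transcript):
        """Feed a voice transcript; return True if it resolved the prompt."""
        words = set(transcript.lower().split())
        if words & CANCEL_WORDS:      # cancel wins if both kinds of word appear
            self._approved = False
        elif words & CONFIRM_WORDS:
            self._approved = True
        else:
            return False              # unrelated speech leaves the prompt pending
        self._answered.set()
        return True

    def wait(self):
        """Block until answered or timed out; a timeout counts as a cancel."""
        return self._answered.wait(self.timeout_s) and self._approved


pending = PendingConfirmation("shutdown", timeout_s=6.0)
pending.answer("yes proceed")
print(pending.wait())
```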
All commands are dispatched through a central `ActionDispatcher` that routes `ParsedCommand` objects to registered handlers via `ActionRegistry`:

| Intent | Actions |
|---|---|
| `app_control` | Open/close any desktop application via subprocess + psutil |
| `browser_control` | Default browser launch, Google/YouTube/website search |
| `system_control` | Shutdown, restart, lock, sleep, volume, brightness |
| `weather` | Real-time weather via Open-Meteo API (geocoding + WMO codes) |
| `time` | Local time/date with formatted spoken response |
| `whatsapp` | Open WhatsApp Web chats, compose message drafts |
| `email` | Gmail compose URL with pre-filled subject/body |
| `screenshot` | Capture screen via Pillow, save to `data/screenshots/` |
| `media_control` | Play/pause/next/prev via pyautogui media keys |
| `reminders` | Schedule reminders/alarms, persist to SQLite, poll every 30s |
| `conversation` | Free-form chat via Groq `llama-3.1-8b-instant` with 10-turn history |
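One common way to build such a routing table is a decorator that registers handlers under an (intent, action) key. The sketch below is an assumption about the general shape, not a copy of AURA's `action_registry.py`; the `register`, `lookup`, and `dispatch` names, the `"now"` action, and the fallback message are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ParsedCommand:
    intent: str
    action: str
    slots: dict = field(default_factory=dict)


class ActionRegistry:
    def __init__(self):
        self._handlers = {}  # (intent, action) -> handler function

    def register(self, intent, action):
        """Decorator that stores the function under its (intent, action) key."""
        def decorator(handler):
            self._handlers[(intent, action)] = handler
            return handler
        return decorator

    def lookup(self, intent, action):
        return self._handlers.get((intent, action))


registry = ActionRegistry()


@registry.register("time", "now")
def tell_time(command):
    return datetime.now().strftime("It is %H:%M.")


def dispatch(command):
    """Route a command to its handler and return the text to be spoken."""
    handler = registry.lookup(command.intent, command.action)
    if handler is None:
        return "Sorry, I can't do that yet."
    return handler(command)


print(dispatch(ParsedCommand("time", "now")))
print(dispatch(ParsedCommand("coffee", "brew")))
```

With this layout, adding a new capability means writing one decorated function; the dispatcher itself never changes.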
- All user turns and AURA responses are logged to SQLite (`aura.db`) via the `SQLModel` ORM.
- The Memory window shows the full conversation history, sortable and browsable.
- Scheduled reminders and alarms are stored in the `reminder` table and polled every 30 seconds via a `QTimer`.
- Context memory resolves pronouns across turns ("open spotify" → "close it").
AURA uses an event-driven, pipe-and-filter architecture with strict separation of concerns. Each subsystem communicates exclusively through the EventBus or Qt signals, with no direct cross-module calls at runtime.
graph TD
User([User Voice]) -->|16kHz PCM| Mic[MicStream\naudio.mic_stream]
Mic -->|30ms frames| Queue[Audio Queue\nmaxsize=300]
Queue -->|frames| VAD[VadManager\nWebRTC VAD]
VAD -->|utterance bytes| Router{State Router\nmain_window}
subgraph Audio Pipeline
Router -->|IDLE/LOCKED| Wake[WakeDetector\naudio.wake_listener]
Router -->|LISTENING| CMD[Command Processor\nmain_window]
end
Wake -->|Groq Whisper| WakeCheck{Wake Phrase\nMatch?}
WakeCheck -->|Yes - LOCKED| Auth[Face Auth\nauth.face_auth]
WakeCheck -->|Yes - IDLE| Listen[State: LISTENING]
Auth -->|Pass| Console[Console UI]
CMD -->|Groq Whisper| STT[WhisperSTT\nspeech.whisper_stt]
STT -->|transcript| Validator[TranscriptValidator]
Validator -->|valid| Engine[IntentEngine\nbrain.core.intent_engine]
subgraph NLU Pipeline
Engine -->|normalize| Normalizer[CommandNormalizer]
Normalizer -->|clean text| Fast[FastRuleMatcher\nStage 1]
Fast -->|no match| LLM[LLMIntentBrain\nGroq LLaMA 3.1]
Fast -->|match| Slots[SlotExtractor]
LLM -->|JSON| Slots
Slots -->|enriched cmd| Guard[ConfidenceGuard]
Guard -->|ParsedCommand| Context[ContextMemory]
end
Context -->|resolved cmd| Policy[SessionGuard\nservices.session_guard]
Policy -->|allowed| Confirm{Requires\nConfirmation?}
Confirm -->|Yes| ConfirmSvc[ConfirmationService]
Confirm -->|No| Dispatch[ActionDispatcher\nservices.action_dispatcher]
ConfirmSvc -->|resolved| Dispatch
Dispatch -->|result| TTS[TTSThread\nSAPI5 SpVoice]
Dispatch -->|result| Memory[MemoryManager\nSQLite aura.db]
TTS -->|speech_started| MicMute[Mic Muted\nduring speech]
TTS -->|speech_ended| MicUnmute[Mic Unmuted\nresume listening]
`app.py` → `MainWindow.__init__()` → `_start_pipeline()`
- The TTS thread starts and warms up the SAPI5 COM object.
- `MicStream` begins capturing 30 ms PCM frames into a bounded queue (`maxsize=300`).
- A dedicated `vad-consumer` daemon thread pulls frames from the queue and feeds `VadManager`.
- The app window stays hidden (system tray / background only).
`VadManager` detects speech → `_on_utterance_captured(audio)` → `_check_wake(audio)` [new Thread]
- `WakeDetector` sends the audio buffer to Groq Whisper (`whisper-large-v3-turbo`).
- PCM bytes are wrapped into a WAV container in memory (`io.BytesIO` + `wave`) before upload.
- The transcript is checked against the configured wake phrase (default: `"take control"`).
Wake detected in LOCKED → `QMetaObject.invokeMethod(_start_face_auth)` → `AuthWindow`
- The window surfaces and the camera activates via OpenCV.
- Per frame: `YuNet.detect()` → `SFace.alignCrop()` → `SFace.feature()` → `cosine_similarity()`.
- A successful match emits the `auth_success` signal → `_on_auth_proceed(username)`.
State: LISTENING → utterance captured → `_process_command(audio)` [new Thread]
State → THINKING → EXECUTING → SPEAKING
- Audio → `WhisperSTT.transcribe()` → raw transcript
- `TranscriptValidator` rejects noise/short/repeated text
- `IntentEngine.process()` → normalize → fast match OR LLM → extract slots → resolve context
- `SessionGuard.verify_access()` → check permissions
- `ConfirmationService` → if dangerous, pause and ask
- `ActionDispatcher.dispatch()` → find handler in `ActionRegistry` → execute
- Response text → `TTSThread.speak()` + `_signals.aura_response.emit()` (GUI transcript)
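Both the wake check and command transcription upload captured PCM to Whisper as a WAV file built in memory. A minimal sketch of that wrapping step, plus a tolerant wake-phrase comparison, is shown below using only the standard library. The function names and the punctuation-stripping normalization are assumptions for illustration; the actual upload call to the Groq API is deliberately omitted.

```python
import io
import wave


def pcm_to_wav_bytes(pcm, sample_rate=16000, channels=1, sample_width=2):
    """Wrap raw 16-bit PCM in a WAV container without touching the disk."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def contains_wake_phrase(transcript, phrase="take control"):
    """Case-insensitive match that ignores punctuation and extra whitespace."""
    kept = "".join(ch for ch in transcript.lower() if ch.isalnum() or ch.isspace())
    return phrase in " ".join(kept.split())


# One 30 ms frame of silence: 480 samples x 2 bytes.
wav_bytes = pcm_to_wav_bytes(b"\x00\x00" * 480)
print(wav_bytes[:4], len(wav_bytes))
print(contains_wake_phrase("Take, control!"))
```

Normalizing the transcript before comparison matters because speech-to-text output often adds capitalization and punctuation that the configured phrase does not contain.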
TTS: `speech_started` → mic muted → `speech_ended` → mic unmuted → State: LISTENING
- After every response, the active window timer (300s) resets.
- If the timer expires with no further commands, the state returns to `IDLE`.
| Layer | Component | Version / Details |
|---|---|---|
| Language | Python | 3.12+ |
| GUI Framework | PySide6 (Qt for Python) | Dark Fusion theme, `QStackedWidget`, custom Orb visualizer |
| STT Engine | Groq Whisper API | `whisper-large-v3-turbo`, cloud-accelerated transcription |
| LLM / NLU | Groq LLaMA | `llama-3.1-8b-instant`, structured JSON intent classification |
| TTS Engine | Windows SAPI5 via `comtypes` | `SAPI.SpVoice`, async + interruptible, male voice selection |
| Face Detection | OpenCV YuNet ONNX | `face_detection_yunet_2023mar.onnx` (~200 KB) |
| Face Recognition | OpenCV SFace ONNX | `face_recognition_sface_2021dec.onnx` (~37 MB), 128-d embeddings |
| VAD | WebRTC VAD (`webrtcvad`) | 30 ms frames @ 16 kHz, aggressiveness level 2 |
| Audio Capture | PyAudio | 16 kHz, mono, 480-sample chunks |
| Storage / ORM | SQLite + SQLModel | Local `aura.db`: conversations, users, reminders |
| Weather API | Open-Meteo (free, no key) | Geocoding + WMO weather codes |
| HTTP Client | `httpx` | Async-capable, used for weather API |
| Logging | Loguru | Rotating file logs + Qt signal bridge for GUI display |
| COM Interop | `comtypes` + `pythoncom` | Windows SAPI5 SpVoice per-thread COM initialization |
- Python 3.12+ installed on Windows.
- A valid Groq API key (free at console.groq.com).
- A working microphone and webcam (the webcam is only required for face authentication).
1. Clone the repository:
   - `git clone https://github.com/Omcodesk/AURA-AI-Voice-Assistant-.git`
   - `cd AURA-AI-Voice-Assistant-`
2. Create a virtual environment:
   - `python -m venv .venv`
   - `.venv\Scripts\Activate.ps1`
3. Install dependencies:
   - `pip install -r requirements.txt`
4. Configure the API key. Create a `.env` file in the root directory containing:
   - `GROQ_API_KEY=gsk_your_groq_api_key_here`
5. First run (auto-initializes the database and downloads the face models):
   - `python app.py --bypass-auth`

Note: On first run, the YuNet and SFace ONNX models (~37 MB total) are downloaded automatically from the OpenCV model zoo.
`python app.py`
- AURA starts silently in the background.
- Say "Take Control": the window surfaces and the camera activates for face verification.
- After successful authentication, say any command.

`python app.py --bypass-auth`
- Skips biometric verification entirely.
- Launches directly into the console UI, logged in as `Omm`.
- Ideal for development and testing.

To enroll a new user:
- Click the Settings tab → Enroll New User.
- Follow the on-screen prompts to capture your face from multiple angles.
- Embeddings are stored locally in `aura.db` and are never uploaded anywhere.
AURA/
│
├── actions/                       # All action handlers, registered in ActionRegistry
│   ├── app_control.py             # Open/close apps via subprocess + psutil
│   ├── browser_control.py         # Browser launch + search routing (Google, YouTube, sites)
│   ├── conversation.py            # LLM chat (Groq LLaMA, 10-turn history)
│   ├── media_control.py           # Media keys (play/pause/next/prev) via pyautogui
│   ├── reminders.py               # Schedule and store reminders/alarms to SQLite
│   ├── screenshot_service.py      # Screen capture via Pillow
│   ├── system_control.py          # OS-level: shutdown/restart/lock/sleep/volume/brightness
│   ├── time_service.py            # Formatted local time/date responses
│   ├── weather_service.py         # Open-Meteo API (geocoding + weather codes)
│   └── whatsapp.py                # WhatsApp Web URL automation
│
├── audio/                         # Audio pipeline: microphone → VAD → utterance
│   ├── mic_stream.py              # Threaded PyAudio capture into bounded queue
│   ├── vad_manager.py             # WebRTC VAD: 30ms frames, ring-buffer, pre-pad
│   └── wake_listener.py           # Wake phrase checker using WhisperSTT
│
├── auth/                          # Biometric security subsystem
│   ├── enroll_manager.py          # Multi-frame face enrollment + embedding storage
│   ├── face_auth.py               # YuNet detection + SFace 128-d embedding + cosine similarity
│   ├── liveness.py                # Anti-spoofing checks (blink / motion detection)
│   └── user_registry.py           # SQLite user store with aura.db / jarvis.db migration
│
├── brain/                         # NLU: intent classification and slot extraction
│   ├── core/
│   │   ├── command_normalizer.py  # Stopword removal, synonym expansion
│   │   ├── confidence_guard.py    # Low-confidence escalation logic
│   │   ├── context_memory.py      # Pronoun resolution (it/this/that)
│   │   ├── fast_rule_matcher.py   # Stage 1: keyword/synonym pattern matching
│   │   ├── intent_engine.py       # Full NLU pipeline orchestrator
│   │   ├── llm_intent_brain.py    # Stage 2: Groq LLaMA JSON classification
│   │   ├── slot_extractor.py      # Entity extraction (app, site, location, time, query)
│   │   └── time_parser.py         # NLP time/date parsing for reminders
│   ├── intent_router.py           # Maps engine output → ParsedCommand objects
│   └── memory_manager.py          # SQLModel ORM: conversations, reminders, aura.db
│
├── config/                        # Configuration files
│   ├── settings.yaml              # All tunable parameters (VAD, STT, TTS, LLM, session)
│   ├── app_mappings.json          # App name → executable mappings
│   ├── site_mappings.json         # Site name → URL mappings
│   └── synonym_map.json           # Natural language synonym dictionary
│
├── core/                          # Core infrastructure, no business logic
│   ├── action_registry.py         # Handler registration table (intent+action → function)
│   ├── command_parser.py          # Raw intent + slots → ParsedCommand dataclass
│   ├── config_loader.py           # YAML config + .env loader (singleton)
│   ├── event_bus.py               # Thread-safe pub/sub bus via Qt signals
│   ├── logger.py                  # Loguru setup (rotating files + Qt bridge)
│   ├── policy_engine.py           # Safety blocklist (blocks "bye", "thanks", etc.)
│   ├── result_types.py            # ParsedCommand + ExecutionResult dataclasses
│   ├── session_manager.py         # Session lifecycle (auth, touch, auto-lock)
│   └── state_machine.py           # Validated 7-state FSM with thread-safe transitions
│
├── gui/                           # PySide6 user interface
│   ├── admin_window.py            # Settings panel
│   ├── auth_window.py             # Face authentication + enrollment screen
│   ├── confirmation_dialog.py     # Voice-triggered visual confirm dialog
│   ├── console_window.py          # Main voice console (orb + transcript + state label)
│   ├── enroll_dialog.py           # New user enrollment dialog
│   ├── main_window.py             # Root window, wires all subsystems together
│   ├── memory_window.py           # Conversation history browser
│   ├── theme.qss                  # Dark cyberpunk Qt stylesheet
│   └── widgets/
│       ├── activity_card.py       # "Processing..." activity display
│       ├── orb_widget.py          # Animated orb that reflects system state
│       ├── status_bar.py          # Session countdown + mic status
│       └── transcript_panel.py    # Scrollable user/AURA conversation cards
│
├── services/                      # Application-layer services
│   ├── action_dispatcher.py       # Routes ParsedCommand to registered handler
│   ├── confirmation_service.py    # Manages pending confirmation state
│   └── session_guard.py           # Access control, requires re-auth for sensitive actions
│
├── speech/                        # Speech I/O
│   ├── response_formatter.py      # Cleans LLM output for speech (strips markdown etc.)
│   ├── transcript_validator.py    # Rejects noise/too-short/hallucinated transcripts
│   ├── tts_engine.py              # TTSThread: SAPI5 SpVoice, async, interruptible
│   └── whisper_stt.py             # Groq Whisper: PCM → WAV → API → transcript
│
├── models/face/                   # ONNX face models (auto-downloaded on first run)
│   ├── face_detection_yunet_2023mar.onnx     # ~200 KB
│   └── face_recognition_sface_2021dec.onnx   # ~37 MB
│
├── tests/                         # Tests
│   └── test_universal_brain.py    # Unit tests for intent classification pipeline
│
├── .env.example                   # Template: copy to .env and fill in your API key
├── app.py                         # Main entry point
├── requirements.txt               # All Python dependencies
└── aura_start.bat                 # One-click Windows launcher
| Category | Example Command |
|---|---|
| Wake | "Take Control" |
| App Launch | "Open Chrome" / "Launch Notepad" / "Open VS Code" |
| App Close | "Close Spotify" / "Close Chrome" |
| Web Search | "Search for Python tutorials on Google" |
| YouTube | "Search for lo-fi music on YouTube" |
| Website | "Open GitHub" / "Open web.whatsapp.com" |
| Time | "What time is it?" / "What's today's date?" |
| Weather | "Weather in Delhi" / "How's the weather?" |
| System | "Shutdown the PC" / "Restart" / "Lock the computer" |
| Volume | "Increase volume" / "Mute" |
| Media | "Play" / "Pause" / "Next track" |
| Screenshot | "Take a screenshot" / "Capture screen" |
| WhatsApp | "Send a WhatsApp message to John" |
| Email | "Draft an email to boss" |
| Reminder | "Remind me to drink water at 6 PM" |
| Alarm | "Set an alarm for 7 AM" |
| Conversation | "What is machine learning?" / "Tell me a joke" |
- Offline STT: Local Whisper.cpp integration for 100% air-gapped operation.
- Custom Wake Phrase Training: Real-time acoustic model fine-tuning.
- Plugin System: Drop-in action handlers via a plugin directory.
- Vision Integration: Screen-reading using a vision-language model.
- Android Companion App: Remote monitoring and command via mobile.
Contributions are welcome!
- Fork the project.
- Create your feature branch (`git checkout -b feature/AmazingFeature`).
- Commit your changes (`git commit -m 'Add some AmazingFeature'`).
- Push to the branch (`git push origin feature/AmazingFeature`).
- Open a pull request.

Distributed under the MIT License. See LICENSE for more information.
Built with ❤️ by Omcodesk
