# Juno Vision Guide

A sophisticated ROS-based vision assistant for intelligent object detection and distance estimation, built for integration with the Juno robot platform. This project combines Google Gemini AI, YOLOv8 object detection, and depth estimation to provide voice-controlled object finding with natural language interaction.
## Architecture

The Juno Vision Guide implements a distributed ROS architecture with five interconnected nodes, each built on the common rospy pattern sketched after this list:
- **Speech Recognition** - Captures voice commands using Google Speech Recognition
- **AI Speech Processing** - Uses Google Gemini to extract object names from natural language
- **Object Detection** - Real-time YOLOv8-based detection covering 80 object classes
- **Depth Estimation** - Distance calculation using the external Depth Pro API
- **Text-to-Speech** - Provides voice feedback using Google TTS
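Each node follows the standard rospy publish/subscribe pattern. The skeleton below is an illustrative sketch only; the topic names, message type, and transformation are placeholders, not the project's actual ones:

```python
#!/usr/bin/env python3
# Illustrative skeleton of the pattern the five nodes share: subscribe to the
# previous stage's topic, transform the message, publish for the next stage.
import rospy
from std_msgs.msg import String  # placeholder type; real nodes may differ

class ExampleNode:
    def __init__(self):
        rospy.init_node("example_node")
        self.pub = rospy.Publisher("example_out", String, queue_size=1)
        rospy.Subscriber("example_in", String, self.callback)

    def callback(self, msg):
        # Transform the incoming message and hand it to the next stage.
        self.pub.publish(String(data=msg.data.upper()))

if __name__ == "__main__":
    ExampleNode()
    rospy.spin()
```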
## Features

- **Voice-controlled object finding** - "Find my phone", "Where is my laptop?"
- **Real-time visual detection** - Live camera feed with bounding box overlays
- **Distance estimation** - Depth measurements in meters from the Depth Pro model
- **Natural language processing** - Understands conversational requests
- **Hands-free operation** - Complete audio interaction workflow
## Prerequisites

- OS: Ubuntu 20.04 (the release ROS Noetic officially targets)
- ROS: Noetic Ninjemys
- Python: >= 3.10
- Environment: Anaconda virtual environment
- Editor: Visual Studio Code (recommended)
## Installation

### 1. Install ROS Noetic

Follow the official guide: http://wiki.ros.org/noetic/Installation/Ubuntu
### 2. Create a Catkin Workspace

To avoid conflicts with the default workspace:

```bash
$ mkdir -p ~/catkin_ws_2/src
$ cd ~/catkin_ws_2
$ catkin_make

$ cd ~/catkin_ws_2/src/
$ catkin_create_pkg juno_vision_guide rospy roscpp std_msgs

$ cd ~/catkin_ws_2
$ catkin_make
$ echo "source ~/catkin_ws_2/devel/setup.bash" >> ~/.bashrc
$ source ~/.bashrc
```
### 3. Clone the Repository

```bash
$ cd ~/catkin_ws_2/src/
$ git clone https://github.com/NeoSockCheng/juno-vision-guide.git
$ cd juno-vision-guide
```
### 4. Install Anaconda

Download from: https://www.anaconda.com/products/distribution

Install:

```bash
$ bash ~/Downloads/anaconda_distribution.sh
```

Add Anaconda to PATH (single quotes keep `$PATH` unexpanded until `.bashrc` is sourced):

```bash
$ echo 'export PATH=/home/<your-username>/anaconda3/bin:$PATH' >> ~/.bashrc
$ source ~/.bashrc
$ conda -V  # Check conda installation
```
### 5. Create the Environment and Build

```bash
$ conda env create -f environment.yml
$ conda activate juno_vision_guide
$ cd ~/catkin_ws_2
$ catkin_make
```
## API Setup

### Gemini API Key

The system requires a Google Gemini API key (free of charge) for full functionality:

- Visit https://aistudio.google.com/app/apikey
- Sign in with your Google account
- Generate an API key and copy it
- Replace `your-gemini-api-key-placeholder` in the `.env` file with your actual key (loaded at runtime, as sketched below)
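A minimal sketch of how the key might be read from `.env`, assuming python-dotenv and a hypothetical variable name `GEMINI_API_KEY` (check `.env` for the actual name):

```python
# Minimal sketch: load .env and read the Gemini key.
import os
from dotenv import load_dotenv

load_dotenv()  # reads KEY=value pairs from .env into the environment
api_key = os.getenv("GEMINI_API_KEY")  # variable name is an assumption
```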
### Depth Pro Hosting

We host the Depth Pro model on Hugging Face because it requires a GPU to run: https://huggingface.co/spaces/yzh70/depth-pro/tree/main
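If the Space exposes a standard Gradio API, it can be queried with `gradio_client`; this is only a sketch, and the `api_name` and return format are assumptions (see the Space's "Use via API" page for the real signature):

```python
# Hedged sketch: send an image to the hosted Depth Pro Space.
from gradio_client import Client, handle_file

client = Client("yzh70/depth-pro")  # the Space named above
result = client.predict(handle_file("sample.jpg"), api_name="/predict")
print(result)  # expected to contain the estimated depth output
```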
## Usage

- Start the ROS core (Terminal 1):

```bash
$ roscore
```

- Launch the complete system (Terminal 2):

```bash
$ cd ~/catkin_ws_2
$ source devel/setup.bash
$ roslaunch juno_vision_guide juno_vision_guide.launch
```
- Start using voice commands:
  - Wait for the prompt: "Tell me what you want to find..."
  - Say something like: "Find my phone" or "Where is my laptop?"
  - The system will detect, locate, and estimate the distance to the object
- "Find my phone" β Detects cell phone
- "Where is my laptop?" β Detects laptop
- "Show me the bottle" β Detects bottle
- "Find the chair" β Detects chair
Full object list can be found in
yolo_object_list.json
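For illustration, the Gemini extraction step might look roughly like this with the `google-generativeai` package; the model name and prompt here are assumptions, not the project's actual ones:

```python
# Hypothetical sketch of extracting a YOLO class name from a spoken request.
import google.generativeai as genai

genai.configure(api_key="your-gemini-api-key-placeholder")
model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
reply = model.generate_content(
    "Extract the single YOLO class name the user wants to find "
    "from this request: 'Where is my laptop?'"
)
print(reply.text)  # e.g. "laptop"
```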
## Workflow

- **Voice Input** - Speak your request naturally
- **AI Processing** - Gemini extracts the target object
- **Visual Detection** - YOLOv8 finds the object in the camera feed
- **Distance Calculation** - Depth Pro estimates the distance
- **Voice Response** - The system announces the results
- **Loop to Next Query** - Once complete, the system automatically prompts for the next object to find
## Configuration

### Camera Setup

- Default camera device index: `1` (configured in `google_sr.py`)
- Modify the `device_index` parameter if using a different camera
- Ensure the USB camera is connected and accessible
### Audio Setup

- Microphone device index: `1` (configured in `google_sr.py`; see the capture sketch after this list)
- Check available microphones with:

```bash
$ python -c "import speech_recognition as sr; print(sr.Microphone.list_microphone_names())"
```

- Audio output via `mpg321` - ensure speakers/headphones are connected
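A minimal capture sketch with the `speech_recognition` package, mirroring what `google_sr.py` is described as doing (the exact parameters in the real node may differ):

```python
# Record from the microphone at device_index 1 and transcribe via Google.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone(device_index=1) as source:
    recognizer.adjust_for_ambient_noise(source)  # calibrate for room noise
    audio = recognizer.listen(source)
print(recognizer.recognize_google(audio))  # transcription result
```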
### Detection Settings

- Confidence threshold: 70% (adjustable in `object_detection.py`; applied as in the sketch below)
- Detection timeout: 20 seconds
- Supported objects: 80 YOLO classes (see `yolo_object_list.json`)
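A sketch of the detection step assuming the standard Ultralytics YOLOv8 API; the frame source here is a stand-in for the live camera feed:

```python
# Run YOLOv8 on a frame and keep boxes at or above the 70% threshold.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")              # weights shipped in the repo root
results = model("frame.jpg", conf=0.7)  # conf drops detections below 70%
for box in results[0].boxes:
    name = model.names[int(box.cls)]    # class index -> label, e.g. "cell phone"
    print(name, float(box.conf), box.xyxy[0].tolist())  # label, score, bbox
```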
## ROS Topics

The nodes communicate over the following topics (a manual test sketch follows the list):

- `item_finder_input` - Raw speech recognition results
- `item_finder_object` - Extracted target object names
- `item_finder_response` - System responses for TTS
- `detected_object_bbox` - Object detection bounding boxes
- `detected_object_image/compressed` - Detected object images
- `depth_status` - Depth processing state management
- `item_finder_sr_termination` - Speech recognition control
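With the system running, the pipeline can be exercised by hand; this sketch assumes the topics carry `std_msgs/String`, which may not match the actual message types:

```python
# Publish a target object directly, bypassing speech recognition.
import rospy
from std_msgs.msg import String

rospy.init_node("manual_topic_test")
pub = rospy.Publisher("item_finder_object", String, queue_size=1, latch=True)
rospy.sleep(1.0)  # give the latched publisher time to connect
pub.publish(String(data="cell phone"))
print(rospy.wait_for_message("item_finder_response", String).data)
```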
## Pipeline

Speech Recognition → Speech Processing (Gemini AI) → Object Detection (YOLOv8) → Depth Estimation → Text-to-Speech
## Troubleshooting

**Camera not detected:**

- Check the USB camera connection
- Verify the camera device index in `google_sr.py`
- Test the camera with the command below, or the standalone OpenCV check that follows:

```bash
$ rostopic echo /usb_cam/image_raw
```
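A standalone sanity check outside ROS, assuming the camera sits at index `1` as configured above:

```python
# Try to grab one frame from the camera and report the result.
import cv2

cap = cv2.VideoCapture(1)  # index 1, matching the configured device index
ok, frame = cap.read()
if ok:
    print("camera OK, frame shape:", frame.shape)
else:
    print("camera not readable - check the connection and device index")
cap.release()
```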
**Audio issues:**

- Verify microphone permissions
- Check audio device indices with `speech_recognition`
- Ensure `mpg321` is installed for audio playback (see the playback check below)
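A quick playback check for the voice-output path, assuming gTTS renders the prompt and `mpg321` (the player this README names) plays it back:

```python
# Synthesize a short prompt and play it through mpg321.
import os
from gtts import gTTS

gTTS("Tell me what you want to find...").save("/tmp/prompt.mp3")
os.system("mpg321 -q /tmp/prompt.mp3")  # -q suppresses mpg321's banner output
```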
**API errors:**

- Verify the `.env` file contains valid API keys
- Check the internet connection for API access
- Monitor API rate limits and quotas
**Object not detected:**
- Ensure object is in YOLO's 80-class list
- Improve lighting conditions
- Adjust confidence threshold if needed
- Check camera focus and positioning
## Project Structure

```
juno-vision-guide/
├── launch/
│   └── juno_vision_guide.launch      # ROS launch configuration
├── scripts/
│   ├── google_sr.py                  # Speech recognition node
│   ├── google_tts.py                 # Text-to-speech node
│   ├── speech_input.py               # AI speech processing node
│   ├── object_detection.py           # YOLOv8 detection node
│   ├── object_depth_estimation.py    # Depth estimation node
│   └── .env                          # API keys for Gemini and Depth Pro
├── CMakeLists.txt                    # CMake build configuration
├── package.xml                       # ROS package metadata
├── environment.yml                   # Conda environment dependencies
├── yolo_object_list.json             # YOLO class mappings
├── yolov8n.pt                        # YOLOv8 model weights
└── README.md                         # This file
```
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- YOLOv8 by Ultralytics for object detection
- Google Gemini AI for natural language processing
- ROS Community for the robotics framework
- OpenCV for computer vision capabilities