EchoSight is designed to assist visually impaired individuals by providing audible descriptions of images captured by a camera. It operates in two modes: capturing images with a Raspberry Pi Camera and listening to their voice descriptions, or supplying an image path or URL on any operating system to hear a voice description.
The project generates multiple outputs during operation:
- Image Files: Captured or downloaded images are saved in the `output` directory.
- Text Descriptions: Text descriptions of the images in both English and Turkish are saved as `.txt` files in the `output` directory.
- Audio Files: The Turkish voice description of the image is saved as a `.wav` file in the `output` directory.
- Log Files: Event logs and errors are recorded and saved in `events.log` files within the respective output subdirectories.
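A hypothetical example of the resulting layout (the subdirectory and file names below are illustrative; the actual names are chosen by the scripts):

```
output/
└── 2024_01_15_12_30_00/    # per-run subdirectory (illustrative name)
    ├── image.jpg           # captured or downloaded image
    ├── description_en.txt  # English description
    ├── description_tr.txt  # Turkish description
    ├── description_tr.wav  # Turkish voice description
    └── events.log          # event and error log
```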
- KEY_ACTION: In `rpi.py`, this is set to `'KEY_S'` by default. Modify the `KEY_ACTION` variable to change the key action.
- CAMERA_DELAY: In `rpi.py`, the default camera delay is `0.1` seconds. Adjust the `CAMERA_DELAY` variable to change this setting.
- MAX_WIDTH: In `image2speech.py`, the maximum image width for resizing is controlled by `MAX_WIDTH`. Alter this parameter as needed.
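These settings live at the top of the respective scripts. A sketch of how they might look (the `MAX_WIDTH` default here is an assumption; check `image2speech.py` for the real value):

```python
# rpi.py
KEY_ACTION = 'KEY_S'   # keyboard key that triggers an image capture
CAMERA_DELAY = 0.1     # delay in seconds before the camera takes the shot

# image2speech.py
MAX_WIDTH = 1024       # hypothetical default; wider images are resized to this width
```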
- Ensure Raspberry Pi OS is installed.
- Use Raspberry Pi Imager to prepare your SD card.
- Test your Raspberry Pi Camera: `libcamera-jpeg -o z.jpg`.
- Obtain your Replicate.com API token and export it as an environment variable (a quick way to verify it from Python is shown after the setup commands below):
  - For Bash: `echo 'export REPLICATE_API_TOKEN=your_token_here' >> ~/.bashrc`
  - For Zsh: `echo 'export REPLICATE_API_TOKEN=your_token_here' >> ~/.zshrc`
- Set `keyboard_path` correctly if automatic detection fails (a device-listing sketch is shown after the setup commands below). Refer to this guide.
- Clone and set up the EchoSight environment:
```bash
git clone https://github.com/gusanmaz/echosight
cd echosight
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```
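After restarting your shell (or running `source ~/.bashrc`), you can confirm the token is visible to Python with a quick check like this (a minimal sketch, not part of the project):

```python
import os

# The replicate client library reads REPLICATE_API_TOKEN from the environment.
token = os.environ.get("REPLICATE_API_TOKEN")
if not token:
    raise SystemExit("REPLICATE_API_TOKEN is not set; export it and restart your shell.")
print(f"Token found ({len(token)} characters).")
```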
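If automatic keyboard detection fails, you can list the available input devices to find the right `/dev/input/event*` path for `keyboard_path`. This sketch assumes the `evdev` library is available (a common choice for reading key events on Linux; check `requirements.txt`). You may need to run it with `sudo` or add your user to the `input` group:

```python
from evdev import InputDevice, list_devices

# Print every input device path with its human-readable name;
# pick the entry naming your keyboard and use its path as keyboard_path.
for path in list_devices():
    device = InputDevice(path)
    print(path, "->", device.name)
```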
(Raspberry Pi) Capture an image with the Raspberry Pi Camera by pressing a keyboard key (default: S) and listen to a voice description of the captured image:
```bash
python3 rpi.py
```
(All OSes) Provide an image path or URL to hear a voice description of the image:
```bash
python3 url2speech.py image_path_or_url
```
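For example (the file path and URL are illustrative):

```bash
python3 url2speech.py photos/cat.jpg
python3 url2speech.py https://example.com/images/street.jpg
```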
This project uses models hosted on https://replicate.com/ to generate voice descriptions of images. The models used in this project are listed below.
- CogVLM
- Replicate Model: https://replicate.com/cjwbw/cogvlm
- Github Repo: https://github.com/THUDM/CogVLM
- Seamless Communication
- Replicate Model: https://replicate.com/cjwbw/seamless-communication
- Github Repo: https://github.com/facebookresearch/seamless_communication
- Coqui XTTS-v2
- Replicate Model: https://replicate.com/cjwbw/coqui-xtts-v2
- Github Repo: https://github.com/coqui-ai/TTS
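A rough sketch of how the first stage of such a pipeline can be invoked through the `replicate` Python client. The input parameter names (`image`, `query`) are assumptions and not verified against this project's code; check the model page on replicate.com for the actual schema:

```python
import replicate

# Resolve the latest published version of the model (the project may pin
# a specific version instead).
model = replicate.models.get("cjwbw/cogvlm")
version = model.latest_version

# Stage 1 (assumed): CogVLM produces an English description of the image.
# Stage 2 (assumed): Seamless Communication translates it into Turkish.
# Stage 3 (assumed): Coqui XTTS-v2 reads the Turkish text aloud.
english = replicate.run(
    f"cjwbw/cogvlm:{version.id}",
    input={"image": open("photo.jpg", "rb"), "query": "Describe this image."},
)
print(english)
```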
Future versions may incorporate different models, and the code could be adapted for easier experimentation with various models.
- Conservative Cost Estimate: $0.20 per image
- Conservative Runtime Estimate: 40 seconds per image to produce audio (excluding time spent starting the models on Replicate.com)

For example, processing 100 images would cost roughly $20 and take about 67 minutes of model runtime.
Apache License 2.0