An immersive experience that allows real-time conversation with a digital twin of William Shakespeare. Conversations are context-aware and visually engaging, with facial animations. The experience also has vision capabilities, letting the user show images to Shakespeare and talk about them. The model has been fine-tuned on over 5 million characters of Shakespeare's works.
The project leverages:
- `Gemini` API for image analysis and context generation. We're using the `gemini-1.5-turbo` model.
- `Convai` API for real-time conversation. We're using voices from `ElevenLabs` to generate natural-sounding speech.
- NVIDIA Omniverse's `Audio2Face` for facial animation, as well as `DLSS` for frame generation to optimize performance.
> [!IMPORTANT]
> I have cancelled the Convai subscription, so the application will start throwing errors right out of the box starting Nov 11, 2024. If you need continued support, email me at: rajnish.ankt@gmail.com
trailer_sp_lo.mp4
sp_how_to.mp4
- Install NVIDIA Omniverse Launcher and Audio2Face. Follow steps 1, 2 and 4 from the instructions listed here.
- Ensure that you have the necessary system requirements for running Omniverse listed here.
- After installing Omniverse, you can choose to set up the project in one of the two ways:
  - Load Manually
    - Clone the repository: `git clone https://github.com/aniketrajnish/Omniverse-Shakespeare-Project.git`
    - Open Audio2Face, go to `Window > Extensions > Hamburger Menu Icon > Settings`, and add the path to the `exts` folder in the repository.
  - Load directly from GitHub
    - Make sure you have `git` installed on your system.
    - Open Audio2Face, go to `Window > Extensions > Hamburger Menu Icon > Settings`, and add `git://github.com/aniketrajnish/Omniverse-Shakespeare-Project?branch=main&dir=omni/shakespeare-ai/exts`.
    - In case the above method doesn't work, follow the manual method.
- If you are on a Windows machine, you don't need to install any Python dependencies, as the project uses a compiled PyQt5 executable.
- But in case you wish to run the Python scripts, you will need to install the Python dependencies (e.g., PyQt5, PyAudio, pydub, grpcio, protobuf, requests) yourself.
- To optimize for performance, you can enable `DLSS` in Audio2Face by going to `RTX - Real-Time > Settings Hamburger Icon > NVIDIA DLSS > check Frame Generation`. You need to do this every time you open `Audio2Face`.
- Enable the extension from `Window > Extensions > Third-Party > Shakespeare AI Server`.
- If this is the first time you're using Audio2Face, it will take some time to load the `TensorRT` engine. You can check the progress in the `Audio2Face Tool` window.
- The extension will open a window with two buttons: `Open Project` and `Connect to Server`.
- Click on `Open Project` to load the project in Audio2Face, with the Shakespeare character and streaming audio player already set up.
- Click on `Connect to Server` to start the server and open the PyQt5 executable.
- The executable will connect to the server, and you can start interacting with the Shakespeare character.
- You can either select an image to show to Shakespeare or start talking directly.
> [!CAUTION]
> You can directly run the executable from the `app/build` folder without having to install Omniverse, to get a feel for the real-time conversation and vision capabilities minus the facial animation and visual experience. But it might run into threading issues, as the backend has been designed to work with the `Audio2Face` streaming audio player.
```mermaid
graph TD
A[User opens project from Omniverse extension] --> B[Project loads in Audio2Face]
B --> C[User clicks 'Connect to Server']
C --> D[Socket server starts] & E[PyQt5 executable opens]
E --> F[Executable connects to socket server]
F --> G[PyQt5 window ready for interaction]
G --> H{User selects image?}
H -->|Yes| I[Image sent to Gemini API]
I --> J[Update Shakespeare context in Convai API]
H -->|No| K[User initiates conversation]
J --> K
K --> KA[Mic input captured using PyAudio]
KA --> KB[Audio sent to Convai server]
KB --> L[Convai generates audio response via gRPC]
L --> M[Audio processed with pydub]
M --> N[Processed audio sent to A2F socket server]
N --> O[Audio added to queue using threading.Condition]
O --> P[pydub removes clicks between chunks]
P --> Q[Audio sent in chunks to A2F streaming audio player]
Q --> R[A2F streaming audio player processes audio]
R --> S[Audio converted to blendshapes for facial animation]
S --> T[Auto emotions enabled based on audio]
T --> U[DLSS used for frame generation]
U --> V[Animated Shakespeare character displayed]
V --> W[Conversation continues]
W --> K
W --> X[User ends conversation]
X --> Y[Cleanup and disconnect]
subgraph Omniverse[Omniverse Extension]
A
B
C
D
end
subgraph Bridge[Version Bridge]
D[Socket server starts]
F[PyQt5 executable<br>connects via sockets]
end
subgraph PyQt[PyQt5 Application]
E[PyQt5 v5.15.2]
G
H
end
subgraph API[API Interactions]
I[requests library<br>for Gemini API]
J[requests library<br>for Convai API]
end
subgraph Convai[Convai Integration]
K
KA[Mic input captured<br>using PyAudio]
KB[Data to<br>Convai server]
L[gRPC v1.65.1<br>protobuf v4.21.10]
end
subgraph Audio[Audio Processing]
M[pydub library]
N[struct for packing<br>audio data]
O[threading<br>queue.deque]
end
subgraph A2F[Audio2Face Integration]
P[pydub removes clicks<br>between chunks]
Q[A2F streaming audio player]
R
S
T
U
end
%% Adjust layout for 16:9
Omniverse --> PyQt
Bridge --> API
Convai --> Audio
Audio --> A2F
```
- Project Initialization:
  - The user opens the project from the Omniverse extension.
  - The project loads in Audio2Face.
- Server Connection:
  - The user connects to the socket server from the extension.
  - This action starts a socket server and opens a compiled PyQt5 executable (see the sketch below).
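A minimal sketch of what this step could look like on the extension side. The host, port, and executable path here are assumptions for illustration, not the project's actual values:

```python
import socket
import subprocess
import threading

HOST, PORT = "127.0.0.1", 65432            # assumed host/port, for illustration only
APP_PATH = "app/build/shakespeare_ai.exe"  # hypothetical path to the compiled PyQt5 executable

def run_socket_server():
    """Listen for the PyQt5 executable and wait for it to connect."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((HOST, PORT))
    server.listen(1)
    conn, addr = server.accept()           # blocks until the executable connects
    print(f"PyQt5 client connected from {addr}")
    # ... conn is then used to receive the processed audio chunks ...

# Run the server on a background thread so the Omniverse UI stays responsive,
# then launch the compiled PyQt5 executable, which connects back to this server.
threading.Thread(target=run_socket_server, daemon=True).start()
subprocess.Popen([APP_PATH])
```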
- User Interaction:
  - Users can choose to select an image or start talking directly.
  - If an image is selected, it's sent to the `Gemini` API for analysis.
  - The generated context is then pushed to the `Convai` API to update the conversation context (see the sketch below).
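The architecture diagram notes that both calls go through the `requests` library. A minimal sketch of this step under that assumption follows; the API keys and character ID are placeholders, the Gemini call follows the public `generateContent` REST pattern with the model name quoted in this README, and the Convai endpoint and payload are assumptions rather than verified API details:

```python
import base64
import requests

GEMINI_KEY = "YOUR_GEMINI_KEY"      # placeholder
CONVAI_KEY = "YOUR_CONVAI_KEY"      # placeholder
CHARACTER_ID = "YOUR_CHARACTER_ID"  # placeholder

def describe_image(path: str) -> str:
    """Ask Gemini to describe the selected image (generateContent REST endpoint;
    model name is the one quoted in this README)."""
    with open(path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    url = ("https://generativelanguage.googleapis.com/v1beta/models/"
           f"gemini-1.5-turbo:generateContent?key={GEMINI_KEY}")
    body = {"contents": [{"parts": [
        {"text": "Describe this image so Shakespeare can discuss it."},
        {"inline_data": {"mime_type": "image/jpeg", "data": img_b64}},
    ]}]}
    resp = requests.post(url, json=body, timeout=30)
    resp.raise_for_status()
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"]

def update_convai_context(description: str) -> None:
    """Push the description into the character's backstory so the conversation can reference it.
    The endpoint and payload below are assumptions, not verified Convai API details."""
    requests.post(
        "https://api.convai.com/character/update",                # assumed endpoint
        headers={"CONVAI-API-KEY": CONVAI_KEY},
        json={"charID": CHARACTER_ID, "backstory": description},  # assumed payload
        timeout=30,
    )
```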
- Conversation Flow:
  - Microphone input is captured using `PyAudio` (see the sketch below).
  - User speech is sent to the `Convai` service via `gRPC`.
  - The response is generated in Shakespearean style and sent back.
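A minimal sketch of the microphone-capture step with `PyAudio`. The sample rate, chunk size, and recording length are assumptions, and the Convai gRPC call itself is not shown since its generated stubs aren't part of this README:

```python
import pyaudio

RATE = 16000   # assumed sample rate (Hz)
CHUNK = 1024   # assumed frames per buffer

def capture_microphone(seconds: float = 5.0) -> bytes:
    """Record raw 16-bit mono PCM from the default microphone."""
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * seconds))]
    stream.stop_stream()
    stream.close()
    pa.terminate()
    return b"".join(frames)  # this PCM buffer is what gets streamed to Convai over gRPC
```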
- Audio Processing:
  - The audio response is processed using the `pydub` library.
  - Processed audio is sent to the Audio2Face socket server in chunks.
  - The audio is added to a queue using `threading.Condition`.
  - Clicks between audio chunks are also removed using `pydub` (see the sketch below).
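A minimal sketch of the queueing and click-removal ideas listed above, using `pydub`, `threading.Condition`, and `struct` as named in the architecture diagram. The fade lengths, sample format, and length-prefixed framing are assumptions:

```python
import struct
import threading
from collections import deque

from pydub import AudioSegment

audio_queue = deque()
queue_cond = threading.Condition()

def smooth_chunk(pcm: bytes, rate: int = 16000) -> bytes:
    """Apply short fades at chunk boundaries so consecutive chunks don't click."""
    seg = AudioSegment(data=pcm, sample_width=2, frame_rate=rate, channels=1)
    return seg.fade_in(5).fade_out(5).raw_data  # 5 ms fades, illustrative

def enqueue_response_audio(pcm_chunks):
    """Producer: smooth each chunk of the Convai response and queue it."""
    for pcm in pcm_chunks:
        with queue_cond:
            audio_queue.append(smooth_chunk(pcm))
            queue_cond.notify()

def stream_to_a2f(sock):
    """Consumer: pop chunks as they arrive and send them to the A2F socket server."""
    while True:
        with queue_cond:
            while not audio_queue:
                queue_cond.wait()
            pcm = audio_queue.popleft()
        sock.sendall(struct.pack("<I", len(pcm)) + pcm)  # assumed 4-byte length header
```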
- Facial Animation & Display:
  - Audio is sent in chunks to the `Audio2Face` streaming audio player.
  - The audio is processed and converted to blendshapes for facial animation.
  - Auto emotions are enabled based on the audio.
  - `DLSS` is used for frame generation to optimize performance.
  - The animated Shakespeare character is displayed, speaking the generated response.
- Conversation Loop:
  - The conversation continues until the user ends it.
  - At the end, the system cleans up and disconnects (see the sketch below).
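A minimal sketch of what that teardown could look like; the objects passed in are placeholders for whatever stream and socket handles the real application holds:

```python
def shutdown(stream, pa, client_socket, server_socket):
    """Stop microphone capture and close both ends of the socket bridge on exit."""
    try:
        stream.stop_stream()   # stop the PyAudio input stream
        stream.close()
        pa.terminate()         # release PyAudio resources
    finally:
        client_socket.close()  # PyQt5-side connection
        server_socket.close()  # extension-side listener
```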
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Currently working on:
- Packaging the experience into a standalone application.
- Incorporating idle animations in Audio2Face.
MIT
- Aamir Vora - Facial artist for Shakespeare [ LinkedIn | ArtStation ]