This repository contains a code example inspired by the LinkedIn post *On-Device LLMs: The Future of Private, Personalized AI*. It demonstrates the use of on-device LLMs with Google's MediaPipe.
To run this project, you need to serve it using a static file server. Below are several options, depending on your environment:
For Python 3:

```bash
python3 -m http.server 8000
```

For Python 2:

```bash
python -m SimpleHTTPServer 8000
```

For Node.js, install the http-server package globally:

```bash
npm install -g http-server
```

Then run the server:

```bash
http-server
```

For PHP:

```bash
php -S localhost:8000
```

For Ruby:

```bash
ruby -run -e httpd . -p 8000
```

For Docker, run the following command to serve the project via nginx:

```bash
docker run --name static-server -v $(pwd):/usr/share/nginx/html:ro -p 8080:80 nginx
```

You can find the list of supported models here. These models are compatible with the MediaPipe Tasks GenAI API.
To ensure compatibility with MediaPipe, use pre-converted models. Below are some examples available for download on Kaggle:
- Gemma 2 2B (LiteRT 2b-it-gpu-int8)
- Gemma 1.1 2B (LiteRT 2b-it-gpu-int4)
- Gemma 1.1 7B (LiteRT 7b-it-gpu-int8)
Ensure you select the model variation optimized for your hardware, whether GPU or CPU, based on your system's capabilities.
See the complete list of pre-converted models here.
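
For reference, here is a minimal sketch of how a downloaded model could be wired into MediaPipe's LLM Inference task in the browser. It assumes the `@mediapipe/tasks-genai` web package, an example model path of `./models/gemma2-2b-it-gpu-int8.bin`, and illustrative generation settings; the actual code in this repository may differ:

```ts
import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

async function initLlm(): Promise<LlmInference> {
  // Load the WASM assets that back the GenAI tasks.
  const genai = await FilesetResolver.forGenAiTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm'
  );

  // Point the task at a pre-converted model served alongside the page.
  // The file name below is an assumed example; use the variant you downloaded.
  return LlmInference.createFromOptions(genai, {
    baseOptions: { modelAssetPath: './models/gemma2-2b-it-gpu-int8.bin' },
    maxTokens: 1000, // illustrative generation settings
    topK: 40,
    temperature: 0.8,
  });
}
```

Note that the model file referenced by `modelAssetPath` is fetched over HTTP, so it needs to live under the directory you are serving; this is part of why the static-server step above matters.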
All example prompts can be found in the prompts directory.
To achieve the best results, structure your input using Gemma's prompt format, which wraps each turn in tokens like `<start_of_turn>` and `<end_of_turn>`. Proper formatting helps the model distinguish user and model turns and noticeably improves the quality of its responses.
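
As a rough illustration of that formatting, the snippet below wraps a plain user message in Gemma's turn markers before passing it to the LLM Inference task. The `toGemmaPrompt` and `ask` helpers are hypothetical, and the `llm` instance is assumed to come from a setup like the sketch above:

```ts
import { LlmInference } from '@mediapipe/tasks-genai';

// Wrap a plain user message in Gemma's chat template so the model sees
// an explicit user turn followed by an open model turn to complete.
function toGemmaPrompt(userMessage: string): string {
  return (
    '<start_of_turn>user\n' +
    userMessage +
    '<end_of_turn>\n' +
    '<start_of_turn>model\n'
  );
}

// Example usage with an already-initialized LlmInference instance.
async function ask(llm: LlmInference, question: string): Promise<string> {
  return llm.generateResponse(toGemmaPrompt(question));
}
```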
Watch this YouTube video for a hands-on demonstration of the project in action.
- Make sure to serve the project from the root directory to correctly load assets and dependencies.
- Ensure you select the model variation optimized for your hardware, whether GPU or CPU, based on your system's capabilities.
- Only use models that have been pre-converted for compatibility with MediaPipe and your hardware.