An example with boilerplate showing how to run LLMs in browser extensions (without the cloud) via Google's MediaPipe!
Running an edge LLM on the user's machine, with no need for the cloud, is great for many reasons, not least of which is zero API cost.
| | ☁️ Cloud/Traditional LLM ☁️ | ⚡️ Edge LLM ⚡️ | Winner |
|---|---|---|---|
| Cost | OpEx compounds unpredictably; a single GPT-4 API response can cost 1.65 cents | Free, including even fine-tuning, thanks to small model sizes | ⚡️ Edge LLM ⚡️ Nothing better than free! |
| Performance | High raw capacity allows for better generalized accuracy, but at the cost of being slow even without the CoT often required | Optimized Small Language Models (SLMs) deliver up to 149x higher throughput; models like Microsoft Phi and methodologies like "Solving a Million-Step LLM Task with Zero Errors" prove SOTA accuracy that is at worst competitive and at best far superior for domain-specific tasks (which are most tasks), even if worse at generalized ones | ⚡️ Edge LLM ⚡️ |
| Latency | Unavoidable & unstable network costs of ~50–150 ms, even ignoring uncontrollable initialization & congestion latency | Zero network overhead, easily allowing the sub-10 ms response times essential for real-time control systems and human interaction | ⚡️ Edge LLM ⚡️ |
| Network Dependency | Requires a high-bandwidth, continuous internet connection, risking buffering at best and complete service failure during outages at worst | Guarantees 100% operational independence, ensuring continuous inference and local functionality even when completely offline | ⚡️ Edge LLM ⚡️ |
| Customization | Practically no flexibility, as proprietary APIs restrict access to model weights, making domain-specific fine-tuning expensive or impossible | Full, direct control over the model stack (GGUF, quantization), allowing deep customization for proprietary datasets and core business logic | ⚡️ Edge LLM ⚡️ |
| Community | Vendor-reliant, requiring provider-specific documentation and development, with often poor track records and walked-back decisions restricting or blocking community involvement | Thriving open-source ecosystems (e.g., Llama.cpp, KAITO) provide rapid innovation, broad toolchains, and peer-driven solutions larger & more accessible than proprietary ones | ⚡️ Edge LLM ⚡️ |
| Privacy | Requires trusting proven-untrustworthy companies and their third parties, plus transmission over networks, potentially violating data sovereignty, compliance, and residency mandates | All data stays local, with complete regulatory control (GDPR, HIPAA, etc.) | ⚡️ Edge LLM ⚡️ |
| Censorship | Supplier-imposed guardrails & content filtering block even legitimate uses | No or configurable guardrails, allowing fine-grained control | ⚡️ Edge LLM ⚡️ |
| Supplier flexibility | Vendor lock-in due to API specificity and proprietary model dependency, resulting in high switching costs | Open standards & portable formats, enabling seamless adoption of superior models | ⚡️ Edge LLM ⚡️ |
| Redundancy | Centralized point of failure; even extensive multi-region deployment strategies meant to mitigate single-vendor outages still suffer countless failures | On-device inference never fails, even if the whole internet dies | ⚡️ Edge LLM ⚡️ |
| Environment | Extreme cumulative energy & water consumption of massive data centers harms environments, homes, communities, and our planet, with usage still projected to reach petawatt-hour levels by 2026 globally | Localized processing means energy costs magnitudes lower than ☁️ Cloud/Traditional LLM ☁️ while enabling optimization (e.g., Energy Delay Product (EDP)) | ⚡️ Edge LLM ⚡️ |
These cover nearly all of what creators & consumers want in AI, while simply being the de facto moral choice whether the concern is monopolies, the economy, privacy, the environment, or creative expression.
- Install a WebGPU-compatible browser and enable WebGPU in its settings (you should also enable Vulkan if your browser doesn't enable it by default, since inference is unbearably slow without it)
- Place a MediaPipe-compatible model file in `resorces/models`. Pre-converted models like Google's Gemma & Microsoft's Phi are perfect places to start (you can convert basically whichever industry LLMs you want, as long as they're not too large). For this demo, I used `gemma-3n-E2B-it-int4-Web.litertlm` since it's a powerful multimodal model which runs on even toasters, but for certain use cases, like if you want something even lighter, I'd recommend something like `gemma3-1b-it-int4-web.task` (it runs lightning quick, much, much faster than cloud LLMs, and literally on anything, but is really dumb)! Then, change the `DEFAULT_MODEL_NAME` in `src/index.js` so it points at that model file (see the sketch after this list).
- Run `npm install` and then `npm run build`
- Load the extension as a temporary one from `about:debugging#/runtime/this-firefox` for Firefox or `chrome://extensions` for Chrome (make sure to turn on `Developer Mode` in the top right)
- Click the extension's icon to open a chat window, type into the top text field, and press the "Get Response" button to get the LLM's generated response
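As a quick illustration of that configuration step, here's a hypothetical sketch of pointing `DEFAULT_MODEL_NAME` at your chosen model file (the constant and file names come from this README; the actual contents of `src/index.js` will differ):

```js
// src/index.js (hypothetical sketch): pick which model file the extension loads.
// Use whichever file you placed under resorces/models.
const DEFAULT_MODEL_NAME = 'gemma-3n-E2B-it-int4-Web.litertlm';

// A lighter alternative for weak hardware:
// const DEFAULT_MODEL_NAME = 'gemma3-1b-it-int4-web.task';
```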
Google's MediaPipe tutorial is a great place to start for building an understanding of how MediaPipe works.
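To give a feel for the API, here's a minimal sketch of loading a model and streaming a response with MediaPipe's LLM Inference task; the CDN URL, model path, and options are illustrative assumptions, not this repo's exact code:

```js
// Minimal sketch of MediaPipe's LLM Inference API (paths & options illustrative).
import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

async function runLlm(prompt) {
  // Resolve the WASM backend the task runs on (here pulled from a public CDN).
  const genai = await FilesetResolver.forGenAiTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm'
  );

  // Load the local model file; this is the slow step the offscreen page caches.
  const llm = await LlmInference.createFromOptions(genai, {
    baseOptions: { modelAssetPath: 'resorces/models/gemma3-1b-it-int4-web.task' },
    maxTokens: 512,
    topK: 40,
    temperature: 0.8,
  });

  // Stream partial results; the callback keeps firing until `done` is true.
  llm.generateResponse(prompt, (partialResult, done) => {
    console.log(partialResult);
    if (done) console.log('generation finished');
  });
}

runLlm('Explain edge LLMs in one sentence.');
```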
The components are commented for your use with the general structure being:
- The extension's popup serving as a frontend.
- An offscreen page that loads the LLM so it can be shared between multiple contexts and doesn't have to be reloaded every time the popup is opened.
- A background service acting as the required proxy between the popup and the offscreen page, connected to both via ports.
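To make that wiring concrete, here's a minimal sketch of the proxy pattern; the file name, port names, and message shape are assumptions for illustration, not the repo's actual identifiers:

```js
// background.js (hypothetical sketch): relay messages between contexts so the
// LLM only has to live, and load, in the single offscreen page.
const ports = {};

chrome.runtime.onConnect.addListener((port) => {
  // Each context connects under a name, e.g. chrome.runtime.connect({ name: 'popup' }).
  ports[port.name] = port;

  port.onMessage.addListener((msg) => {
    // Forward popup prompts to the offscreen page, and responses back.
    const target = port.name === 'popup' ? ports.offscreen : ports.popup;
    target?.postMessage(msg);
  });

  port.onDisconnect.addListener(() => {
    delete ports[port.name];
  });
});
```

The popup would then call `chrome.runtime.connect({ name: 'popup' })` and post prompt messages, while the offscreen page connects as `'offscreen'` and posts generated text back.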