Source code for the RunPod template: text-generation-webui-aio
This template runs text-generation-webui on RunPod.
It was inspired by TheBloke's deprecated DockerLLM project.
You can use this template to experiment with nearly any LLM hosted on Hugging Face.
⚠️ Disclaimer: This Dockerfile is intended for experimental use only and is not suitable for production workloads. I take no responsibility for any data loss or security issues. Review the code and proceed at your own risk.
Building the Dockerfile:

```bash
cd text-generation-webui-docker/
docker build -t text-generation-webui-aio .
```
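If you want to sanity-check the image locally before deploying, a run command could look like the sketch below. This assumes you have an NVIDIA GPU with the NVIDIA Container Toolkit installed and that the container honors the same `GRADIO_USERNAME`, `GRADIO_PASSWORD`, and `MY_OPENAI_KEY` environment variables that the RunPod secrets (described below) provide; adjust as needed.

```bash
# Hypothetical local smoke test; ports and env var names mirror the RunPod setup.
docker run --rm --gpus all \
  -p 7860:7860 -p 5000:5000 \
  -e GRADIO_USERNAME=admin \
  -e GRADIO_PASSWORD=changeme \
  -e MY_OPENAI_KEY=sk-local-test \
  text-generation-webui-aio
```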
- Instructions
- Set Up a RunPod Account
- Setting Up SSH Key (Optional)
- Create Network Volume (Optional)
- Create RunPod Secrets
- Creating Your Pod
- Accessing text-generation-webui
- Downloading and Loading a Model
- Calling the API
- Loading an EXL2 model on multiple GPUs
- Loading a GGUF model on multiple GPUs
- Connecting to Open WebUI
- Connecting to SillyTavern
RunPod is a paid cloud GPU provider. Go to https://www.runpod.io/, create an account, and add funds.
For this example, we’ll use an RTX A5000, which costs around $0.21/hour + storage.
If you want SSH access to your pod, add your SSH public key in your RunPod account settings.
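If you don't have a key pair yet, one way to create one (assuming OpenSSH is installed on your machine) is:

```bash
# Create an ed25519 key pair, then paste the contents of the .pub file
# into your RunPod account's SSH settings.
ssh-keygen -t ed25519 -C "runpod"
cat ~/.ssh/id_ed25519.pub
```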
To avoid re-downloading models every time, you can create a persistent network volume. Make sure the volume is in the same region as your pod.
Note: Storage isn't free — 200 GB costs about $15/month.
Create these secrets in RunPod for authentication:
- `GRADIO_USERNAME` and `GRADIO_PASSWORD`: Used to log into the web UI.
- `MY_OPENAI_KEY`: Used for calling the OpenAI-compatible API.
Your pod is the GPU-powered container that runs this template.
- Go to Pods and select a GPU like RTX A5000. Make sure “Secure Cloud” is selected.
- Click Change Template, search for text-generation-webui-aio, and select the one from mattipaivike321/runpod-text-generation-webui.
The correct template:
- Click Edit Template.
- If you don’t need SSH, remove port 22 from Expose TCP Ports.
- You can also adjust the “Volume Disk” size. This is local storage for downloading models. For this guide, use at least 50 GB. Keep in mind that storage also incurs costs (~$0.10/GB/month).
More info: https://docs.runpod.io/pods/storage/types
- Click Set Overrides and then Deploy On-Demand.
Deployment may take several minutes depending on your region.
Once the pod is ready:
- Click Connect, then choose HTTP Service 7860.
(Port 5000 is reserved for API access.)
- Log in using your `GRADIO_USERNAME` and `GRADIO_PASSWORD`.
You can run any model that fits into your GPU’s VRAM.
For this guide, we’ll use an EXL2 model:
MikeRoz/mistralai_Mistral-Small-24B-Instruct-2501-6.0bpw-h6-exl2
Make sure the total size of the .safetensors files fits within your GPU’s VRAM (the RTX A5000 has 24 GB). Some VRAM also needs to be left free for the context.
The same applies to GGUF models. Use this tool to check required VRAM for your model and context window:
LLM VRAM Calculator
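As a very rough sanity check (an approximation only; the calculator above is more accurate), you can compare the size of the weight files against your card's VRAM and leave some headroom for the KV cache. The headroom figure below is a guess, not a measurement:

```bash
# Rough back-of-the-envelope check before downloading a model.
WEIGHTS_GB=18    # total size of the .safetensors files (a 6.0bpw EXL2 quant of a 24B model is ~18 GB)
HEADROOM_GB=4    # hypothetical allowance for the KV cache and overhead at a large context
VRAM_GB=24       # RTX A5000
echo "Fits: $(( WEIGHTS_GB + HEADROOM_GB <= VRAM_GB ))"   # prints 1 if it fits, 0 if not
```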
- Copy the model path from Hugging Face:
- Go to the Model tab, paste the path, and click Download. For GGUF models, you must also enter the exact `.gguf` filename, since most repos contain multiple quant versions.
- Refresh the model list and select your model from the dropdown.
- Wait a few seconds; the loader will be auto-selected (e.g., `ExLlamav2_HF` for EXL2 models, a different one for GGUF).
- Increase the context size (e.g., 32768) if desired, then click Load. Note that the context also consumes VRAM, so there is a limit!
Loading may take a while depending on model size.
Once loaded:
Go to the Chat tab and try it out:
Use chat-instruct or instruct modes — they provide the correct prompt format. Avoid chat mode due to bugs. For more info, see the official documentation.
Working example:
✅ Don’t forget to terminate your pod when you're done to avoid extra charges!
There is a separate example Python script and README.md file in the api-call-example/ folder of this repository. It explains how to call the text-generation-webui API programmatically.
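For a quick test without the script, a minimal request against the OpenAI-compatible endpoint could look like the sketch below. The proxy URL format and the `<POD_ID>` placeholder are assumptions about a typical RunPod setup; `MY_OPENAI_KEY` is the secret you created earlier.

```bash
# Minimal chat completion request; replace <POD_ID> with your pod's ID.
curl "https://<POD_ID>-5000.proxy.runpod.net/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MY_OPENAI_KEY" \
  -d '{
        "messages": [{"role": "user", "content": "Hello! Who are you?"}],
        "max_tokens": 200
      }'
```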
See this instruction on how to load a large (70+ GB) EXL2 model on multiple GPUs.
See this instruction on how to load a large (70+ GB) GGUF model on multiple GPUs.
See openwebui_example.md for instructions on how to connect text-generation-webui backend to Open WebUI.
See sillytavern_example.md for instructions on how to connect text-generation-webui backend to SillyTavern.