# Gemma 3 OpenAI-Compatible API Server

Gemma 3 served behind an OpenAI-compatible API, ready for use with Open WebUI. (A Chinese version of this README is also available.)

This project provides a FastAPI backend server that wraps the Google Gemma 3 (`gemma-3-27b-it`) large language model and exposes it through an API compatible with OpenAI's `/v1/chat/completions` endpoint. This allows you to run the Gemma 3 model locally on your own hardware and interact with it using clients designed for the OpenAI API, such as Open WebUI.
To run the server directly:

```bash
python server.py
```

## Features

- OpenAI API Compatibility: Exposes `/v1/chat/completions` and `/v1/models` endpoints.
- Streaming Support: Provides real-time, token-by-token streaming responses (`stream=True`) using Server-Sent Events (SSE); see the client sketch after this list.
- Non-Streaming Support: Supports standard request/response cycles (`stream=False`).
- Gemma 3 Integration: Uses the `transformers` library to load and run the `google/gemma-3-27b-it` model.
- Multi-GPU Support: Leverages `accelerate` and `device_map="auto"` to distribute the model across multiple GPUs.
- Basic Multimodal Input: Accepts image URLs or base64-encoded images in the OpenAI message format (requires `Pillow`).
- Configurable: Easily change the model path, visible GPUs, host, and port.
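As a rough illustration of the OpenAI compatibility and streaming support, here is a minimal client sketch using the official `openai` Python package. It assumes the server is already running on `localhost:8000` (the default host/port described below), that any placeholder API key is accepted, and that the model ID matches what `/v1/models` reports.

```python
# Minimal streaming client sketch (assumes the server runs on localhost:8000 and
# accepts any API key; the model ID should match what /v1/models reports).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="/home/user/t1/google/gemma-3-27b-it",  # or whatever MODEL_DIR you configured
    messages=[{"role": "user", "content": "Give me a one-sentence fact about llamas."}],
    stream=True,  # tokens arrive incrementally via SSE
)

for chunk in stream:
    # Each chunk carries a partial delta; print tokens as they arrive.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```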
## Requirements

- GPU: One or more powerful NVIDIA GPUs with sufficient VRAM to run the `gemma-3-27b-it` model (likely > 40 GB VRAM total, depending on quantization/configuration). CUDA capability is required.
- RAM: Sufficient system RAM.
- Storage: Enough disk space for the model files and the Python environment.
- Python: 3.8+ (developed with 3.11).
- CUDA Toolkit: A version compatible with your GPU drivers and PyTorch.
- Python Packages: See `requirements.txt` or install manually (see Installation).
- Git: For cloning the repository.
- Model: You need the `gemma-3-27b-it` model files downloaded locally. The default path configured in the script is `/home/user/t1/google/gemma-3-27b-it`; you must update this path in the script if your model is located elsewhere. A download sketch is shown below.
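If you have not downloaded the weights yet, one possible approach (an assumption on my part, not part of this repository) is to use `huggingface_hub`. Gemma is a gated model, so this requires a Hugging Face account that has accepted the license and an access token (for example via `huggingface-cli login`):

```python
# Sketch: download the gated Gemma 3 weights with huggingface_hub.
# Assumes you have accepted the Gemma license on Hugging Face and are logged in
# (e.g. via `huggingface-cli login`) or have an HF_TOKEN set in your environment.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="google/gemma-3-27b-it",
    local_dir="/home/user/t1/google/gemma-3-27b-it",  # should match MODEL_DIR in the script
)
```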
## Installation

1. Clone the repository:

   ```bash
   git clone <your-repository-url>
   cd <your-repository-name>
   ```

2. Create and activate a virtual environment (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```

3. Install the required Python packages:

   ```bash
   pip install fastapi uvicorn pydantic requests pillow accelerate torch transformers bitsandbytes python-dotenv
   # Add any other specific dependencies used
   # Or, if you create a requirements.txt:
   # pip install -r requirements.txt
   ```

   Note: Ensure your `torch` installation is compatible with your CUDA version. A quick check is sketched below.
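As a quick sanity check (my suggestion, not part of the repository), you can verify that PyTorch was built with CUDA support and can see your GPUs before starting the server:

```python
# Quick sanity check: confirm PyTorch has CUDA support and can see your GPUs.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
```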
## Configuration

1. Model Path:
   - Open the main Python script (e.g., `server.py`).
   - Locate the `MODEL_DIR` variable.
   - Change its value to the absolute path where your `gemma-3-27b-it` model files are stored:

     ```python
     # --- Configuration ---
     MODEL_DIR = "/path/to/your/google/gemma-3-27b-it"  # <-- UPDATE THIS
     # ... other configurations
     ```

2. GPU Devices:
   - The script uses the `CUDA_VISIBLE_DEVICES` environment variable to determine which GPUs to use. Set this variable in your terminal before running the server.
   - Example (using GPUs 0, 1, 2, 4, 5, 6, 7):

     ```bash
     export CUDA_VISIBLE_DEVICES="0,1,2,4,5,6,7"
     ```

   - If you only want to use specific GPUs, list their indices. If you have only one GPU (e.g., index 0), use `export CUDA_VISIBLE_DEVICES="0"`.

3. Host and Port (Optional):
   - You can change the `HOST` and `PORT` variables in the script if needed. The default `HOST="0.0.0.0"` allows connections from other machines on your network. A sketch of these settings follows below.
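For illustration only, the host/port settings might look roughly like this; the actual `server.py` wraps the Gemma model and may structure things differently, so treat this as a sketch rather than the real implementation:

```python
# Illustrative sketch of the HOST/PORT configuration described above.
# The real server.py loads the Gemma model, which this stub does not.
import uvicorn
from fastapi import FastAPI

HOST = "0.0.0.0"  # accept connections from other machines on the network
PORT = 8000       # default port assumed throughout this README

app = FastAPI()

if __name__ == "__main__":
    uvicorn.run(app, host=HOST, port=PORT)
```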
## Running the Server

1. Set the `CUDA_VISIBLE_DEVICES` environment variable:

   ```bash
   export CUDA_VISIBLE_DEVICES="0,1,2,4,5,6,7"  # Adjust indices as needed
   ```

2. Start the FastAPI server:

   ```bash
   python server.py
   ```

   The server will start, load the model (this may take some time), and listen for requests on `http://0.0.0.0:8000`. A quick way to confirm it is up is sketched below.
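Once the model has finished loading, you can confirm the server is responding by querying the models endpoint (assuming the default host and port):

```python
# Quick check that the server is up (assumes the default 0.0.0.0:8000 binding).
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())  # should list the configured MODEL_DIR as a model ID
```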
## Connecting with Open WebUI

1. Ensure the API server is running.
2. Open your Open WebUI instance.
3. Navigate to Settings -> Connections.
4. Under the "OpenAI API" section (or the equivalent section for custom connections):
   - Set the API Base URL to `http://<your_server_ip>:8000/v1`.
     - Replace `<your_server_ip>` with the actual IP address of the machine running the FastAPI server. If Open WebUI is on the same machine, you can use `localhost` or `127.0.0.1`.
     - Important: Make sure to include the `/v1` suffix in the URL.
   - Save the connection.
5. Go back to the main chat interface.
6. Click the model selection dropdown. You should see the model ID (e.g., `/home/user/t1/google/gemma-3-27b-it`, or whatever path you configured) listed. Select it.
7. You can now chat with your locally hosted Gemma 3 model, with streaming enabled. You can also try uploading images if your client supports sending them via the OpenAI API format; a sketch of that payload follows below.
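For clients other than Open WebUI, an image can be sent as a base64 data URL inside the OpenAI-style message content. The snippet below is a sketch of that payload; the field names follow OpenAI's chat format, which this server accepts, and `cat.png` is a placeholder file:

```python
# Sketch: sending a base64-encoded image in the OpenAI chat message format.
# Assumes the server is reachable at localhost:8000 and 'cat.png' exists locally.
import base64
import requests

with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "/home/user/t1/google/gemma-3-27b-it",  # your configured MODEL_DIR
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
    "stream": False,
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])
```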
## API Endpoints

- `GET /v1/models`
  - Lists the available model(s). In this setup, it returns the configured `MODEL_DIR`.
  - The response format mimics OpenAI's model list.
- `POST /v1/chat/completions`
  - The main endpoint for generating chat responses.
  - Accepts JSON payloads compatible with the OpenAI Chat Completions API (including `messages`, `model`, `stream`, `max_tokens`, `temperature`, etc.).
  - Supports `stream=True` for Server-Sent Events (SSE) streaming and `stream=False` for a single JSON response. A non-streaming example follows below.
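As a non-streaming counterpart to the earlier streaming sketch, here is a plain `requests` call; the parameter values are only examples, and the response is assumed to follow the OpenAI chat-completion shape described above:

```python
# Non-streaming request sketch (stream=False); field names follow the OpenAI
# Chat Completions format that this server mimics.
import requests

payload = {
    "model": "/home/user/t1/google/gemma-3-27b-it",  # your configured MODEL_DIR
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what Server-Sent Events are in two sentences."},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
    "stream": False,
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
data = resp.json()
print(data["choices"][0]["message"]["content"])
```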
## Limitations

- Memory: The `gemma-3-27b-it` model is large and requires significant GPU VRAM. Ensure your hardware meets the requirements.
- Image Handling: Image processing is basic. It downloads or decodes images specified by URL or base64 data, but it doesn't perform complex analysis or honor OpenAI's `detail` parameter.
- Stop Sequences: Custom stop sequences provided in the API request are not currently implemented in the generation logic.
- `n > 1`: Generating multiple choices (`n > 1`) in a single request is not supported by the current implementation using `model.generate`.
- Error Handling: Error handling is basic; further improvements could be made.
- `finish_reason` (Streaming): The `finish_reason` reported in streaming mode is simplified.
## Contributing

Contributions are welcome! Feel free to open issues or submit pull requests.
## License

(Optional) Specify your license here. E.g.: This project is licensed under the MIT License - see the LICENSE file for details.