Note
The commands in this guide are intended for Ubuntu Linux. If you are using a different platform (e.g., Windows or macOS), please refer to the official documentation of the tool for platform-specific instructions.
Further Reading: Ollama vs. vLLM: Choosing the Best Tool for AI Model Workflows
- Monitoring with Prometheus and Grafana
Tip
Check out `prometheus_grafana` for more details.
- Grafana [Docs]
- Prometheus [Docs]
- OpenAI & other LLM API Pricing Calculator - Calculate the cost of using OpenAI and other Large Language Models (LLMs) APIs
First, clone the repository:
```bash
git clone --recurse-submodules https://github.com/xxrjun/local-inference.git
```
Then, create a new Conda environment and install the required dependencies:
```bash
conda create -n local-inference python=3.12
conda activate local-inference

# Install Python dependencies
pip install -r requirements.txt

# Install Ollama on Linux
curl -fsSL https://ollama.com/install.sh | sh
```
It is recommended to use tmux so that each long-running service gets its own session (detach with `Ctrl-b d`, reattach with `tmux attach -t <session-name>`).
```bash
tmux new -s ollama-serve
./examples/ollama_serve.sh

tmux new -s ollama-run
./examples/ollama_run.sh

tmux new -s vllm-serve
./examples/vllm_serve.sh

tmux new -s open-webui
./examples/open_webui.sh
```
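Once the sessions are up, you can sanity-check that the Ollama server is reachable. The sketch below assumes Ollama's default port `11434`; if `examples/ollama_serve.sh` binds a different host or port, adjust the URL accordingly.

```python
import json
import urllib.request

# List the locally available models via Ollama's REST API.
# Assumes the default endpoint http://localhost:11434; adjust if
# examples/ollama_serve.sh configures a different host or port.
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp)

for model in models.get("models", []):
    print(model["name"])
```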
Copy `.env.example` to `.env`:
```bash
cp .env.example .env
```
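The exact keys to fill in are defined in `.env.example`. As a purely hypothetical illustration (the variable names below are assumptions, not necessarily the repository's actual keys), an OpenAI-compatible setup usually needs a base URL and an API key:

```dotenv
# Hypothetical values for illustration; use the keys from .env.example.
OPENAI_BASE_URL=http://localhost:11434/v1  # Ollama's OpenAI-compatible endpoint
OPENAI_API_KEY=ollama                      # local servers accept any non-empty string
```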
Edit `.env` with the correct values, then run the test script:
```bash
python scripts/test_openai_client.py
```
If the API is working correctly, the output should resemble the following:
```text
ChatCompletionMessage(content='Hello! How can I help you today? If you have any questions or need assistance, feel free to ask.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], reasoning_content=None)
```
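If you want to adapt the test to your own code, it boils down to a standard OpenAI-client call. The sketch below is an assumption of what such a check looks like, not the repository's exact `scripts/test_openai_client.py`; the model name `llama3.2` is a placeholder.

```python
import os

from openai import OpenAI

# The OpenAI SDK reads OPENAI_API_KEY and OPENAI_BASE_URL from the
# environment by default; they are passed explicitly here for clarity.
client = OpenAI(
    base_url=os.environ.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
    api_key=os.environ.get("OPENAI_API_KEY", "ollama"),
)

# "llama3.2" is a placeholder; use a model your server actually serves.
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message)
```

Pointing `base_url` at `http://localhost:11434/v1` targets Ollama's OpenAI-compatible endpoint; vLLM's server exposes the same interface on its own port (typically 8000).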
Refer to My Immersive Translate Setup Guide or the Official Docs.
What is TTS?
TTS (text-to-speech) converts written text into synthesized speech. Refer to My TTS Setup Guide for more details.