ROS 2 bridge/wrapper for llama.cpp with models like GPT-OSS-20B for reasoning-based motor control.
- ROS 2 node exposing /llama_input and /llama_output topics
- Uses llama.cpp backend
- Supports GPU (CUDA) and CPU-only inference
- Builds and handles large models in GGUF format
Requirements
- Ubuntu 24.04
- ROS 2 (Jazzy recommended)
- Python 3.12 (or compatible)
- Models: Hugging Face (tested with GPT-OSS-20B)
Set up ROS 2 if it isn't already installed
sudo apt install software-properties-common curl
sudo add-apt-repository universe
curl -sSL https://raw.githubusercontent.com/ros/rosdistro/master/ros.key | sudo tee /usr/share/keyrings/ros-archive-keyring.gpg >/dev/null
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/ros-archive-keyring.gpg] http://packages.ros.org/ros2/ubuntu $(. /etc/os-release && echo $UBUNTU_CODENAME) main" | sudo tee /etc/apt/sources.list.d/ros2.list >/dev/null
sudo apt update
sudo apt install ros-jazzy-desktop
Clone into your ROS 2 workspace
mkdir -p ~/ros2_ws/src # Or whatever you'd like to call it
cd ~/ros2_ws/src
git clone https://github.com/MaidReality/brain.git
Source the venv
cd ..
# Make a Python venv outside of src if you don't have one yet
python3 -m venv venv
source venv/bin/activate
Install dependencies
python -m pip install --upgrade pip wheel setuptools
cd src/brain
pip install -r requirements.txt
Caution
There might be a version mismatch; check that you have the exact versions, for example:
pip install "setuptools>=68,<70" "setuptools-scm<8"
Export the venv's Python packages so ROS 2 can see them
export PYTHONPATH=$HOME/ros2_ws/venv/lib/python3.12/site-packages:$PYTHONPATH # Adjust if your venv lives elsewhere
Clone all the submodules
GIT_LFS_SKIP_SMUDGE=1 git submodule update --init --recursive
Warning
Since the submodules are externally sourced, add COLCON_IGNORE to them so colcon does not attempt to build them.
e.g. touch src/brain/llama.cpp/COLCON_IGNORE
e.g. touch src/brain/gpt-oss-20b/COLCON_IGNORE
And touch the venv as well: touch venv/COLCON_IGNORE
Install llama.cpp
Important
To enable GPU support:
First install the CUDA toolkit if you don't already have it: sudo apt install nvidia-cuda-toolkit
Then check that your drivers are working: nvidia-smi
Then run cmake with the additional flag -DGGML_CUDA=ON
If you're running CPU-only (system RAM), skip this and just run the following:
Warning
If running on WSL or Ubuntu, make sure to run:
sudo apt-get install ninja-build
sudo apt-get install build-essential
cd llama.cpp
cmake -S . -B build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_BUILD_SERVER=ON
# Add -DGGML_CUDA=ON to the command above for GPU (CUDA) support
cmake --build build --config Release
sudo cmake --install build --config Release
# Refresh cache of shared libraries
sudo ldconfig
# Test
llama-cli --help
Download the model (split into 3 parts) and the tokenizer.json -> here
cd gpt-oss-20b
Replace the pointers with the downloaded files (cut & paste)
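Alternatively, assuming the huggingface_hub package is available in your venv (it is not set up by the steps above, so treat this as a sketch), the weights and tokenizer can be fetched directly:
# Hypothetical alternative to the manual download: fetch the original
# openai/gpt-oss-20b files (safetensors shards, config, tokenizer.json)
# straight into the submodule directory.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openai/gpt-oss-20b",
    local_dir=".",  # run from inside src/brain/gpt-oss-20b
    allow_patterns=["*.safetensors", "*.json"],
)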
Install dependencies
cd .. # you should be in src/brain
python -m pip install --upgrade -r llama.cpp/requirements/requirements-convert_hf_to_gguf.txt
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir # For CUDA (NVIDIA GPU) support; omit CMAKE_ARGS for CPU-only
Convert the model to gguf
python llama.cpp/convert_hf_to_gguf.py gpt-oss-20b/ --outfile models/gpt-oss-20b.gguf
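Optionally, a quick smoke test of the converted GGUF through llama-cpp-python. This is only a sketch: it assumes the embedded chat template is picked up and that the model fits in your RAM.
# Minimal sanity check that the converted GGUF loads and can generate.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gpt-oss-20b.gguf",
    n_gpu_layers=0,   # CPU-only; raise to offload layers to the GPU
    n_ctx=2048,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])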
Build workspace and then launch
cd ~/ros2_ws
colcon build --symlink-install
source install/setup.bash
ros2 launch brain self_llama.launch.py
In another terminal publish a test prompt:
source /opt/ros/jazzy/setup.bash
ros2 topic pub /llama_input std_msgs/String "data: 'hello llama-chan'" -1
In another terminal echo the output:
ros2 topic echo /llama_output
Follow the guide below for Web Server (GUI) to start the llama server.
cd ~/ros2_ws
colcon build --symlink-install
source install/setup.bash
ros2 launch brain llama_bridge.launch.py
In another terminal publish a test prompt:
source /opt/ros/jazzy/setup.bash
ros2 topic pub /llama_input std_msgs/String "data: 'hello llama-chan'" -1
In another terminal echo the output:
ros2 topic echo /llama_output --truncate-length 0
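Instead of the CLI tools, the same topics can be driven from your own node. A minimal rclpy sketch (the node name is made up for illustration; run it with the workspace sourced and the launch file running):
# Minimal client for the bridge: publishes one prompt to /llama_input
# and prints whatever arrives on /llama_output.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class LlamaClient(Node):  # hypothetical example node, not part of the package
    def __init__(self):
        super().__init__("llama_client")
        self.pub = self.create_publisher(String, "/llama_input", 10)
        self.sub = self.create_subscription(String, "/llama_output", self.on_reply, 10)
        # Give discovery a moment, then send the prompt once.
        self.timer = self.create_timer(1.0, self.send_prompt)

    def send_prompt(self):
        self.pub.publish(String(data="hello llama-chan"))
        self.timer.cancel()

    def on_reply(self, msg: String):
        self.get_logger().info(f"llama: {msg.data}")


def main():
    rclpy.init()
    node = LlamaClient()
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        pass
    finally:
        node.destroy_node()
        rclpy.shutdown()


if __name__ == "__main__":
    main()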
Replace # below with the number of layers to keep on the CPU (the rest will be offloaded to the GPU)
cd ..
llama-cli -m models/gpt-oss-20b.gguf --jinja -ngl 99 -fa --n-cpu-moe #
Web Server (GUI): run llama-server the same way, again replacing #
cd ..
llama-server -m models/gpt-oss-20b.gguf --jinja -ngl 99 -fa --n-cpu-moe #
Tested and optimal for speed on 6 GB VRAM: # = 16 gives around 30 tps
Completion endpoint: http://127.0.0.1:8080/completion
OpenAI-compatible chat endpoint: http://localhost:8080/v1/chat/completions
More information:
https://github.com/ggml-org/llama.cpp/tree/master/tools/server#post-v1chatcompletions-openai-compatible-chat-completions-api
https://github.com/ggml-org/llama.cpp/tree/master/tools/server#post-completion-given-a-prompt-it-returns-the-predicted-completion
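For a quick scripted check of the OpenAI-compatible endpoint, a stdlib-only sketch (adjust host/port if you changed llama-server's defaults):
# POST a chat request to llama-server's OpenAI-compatible endpoint and print the reply.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "hello llama-chan"}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])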
https://huggingface.co/openai/gpt-oss-20b
https://github.com/ggml-org/llama.cpp
https://blog.steelph0enix.dev/posts/llama-cpp-guide/
ggml-org/llama.cpp#15396