Triton Inference Server - Poolside fork

See the original Triton Inference Server README for upstream documentation.

Installation with TRT-LLM without containers for local development

This assumes Ubuntu 22.04 with CUDA 12.2, which are the requirements for building TRT-LLM. There are two steps: building TRT-LLM and then building Triton.

Building TRT-LLM

Packages:

sudo apt install openmpi-bin libopenmpi-dev cuda-command-line-tools-12-2 cuda-nvcc-12-2 cuda-nvtx-12-2 libcublas-dev-12-2 libcurand-dev-12-2 libcufft-dev-12-2 libcusolver-dev-12-2 cuda-nvrtc-dev-12-2 libcusparse-dev-12-2 cuda-profiler-api-12-2 git-lfs

TensorRT:

wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.3.0/tensorrt-9.3.0.1.linux.x86_64-gnu.cuda-12.2.tar.gz
tar -xf tensorrt-9.3.0.1.linux.x86_64-gnu.cuda-12.2.tar.gz
sudo mv TensorRT-9.3.0.1 /opt/TensorRT-9.3.0.1
rm tensorrt-9.3.0.1.linux.x86_64-gnu.cuda-12.2.tar.gz
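
Optionally, add the TensorRT libraries to your library path so the later build and run steps can find them (an assumption about your environment; skip this if you handle library paths differently):

# assumption: TensorRT was unpacked to /opt/TensorRT-9.3.0.1 as above
export LD_LIBRARY_PATH=/opt/TensorRT-9.3.0.1/lib:$LD_LIBRARY_PATH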

Clone TRT-LLM:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/poolsideai/TensorRT-LLM
cd TensorRT-LLM
git submodule update --init
git config remote.origin.lfsurl https://github.com/nvidia/TensorRT-LLM.git/info/lfs
git lfs pull
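
To sanity-check that the LFS objects were actually downloaded (optional), list the tracked files; entries marked with "*" are real files rather than pointer stubs:

git lfs ls-files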

Build:

./scripts/build_wheel.py  --skip_building_wheel --cuda_architectures 90-real --trt_root /opt/TensorRT-9.3.0.1/ --build_type RelWithDebInfo --extra-cmake-vars 'USE_CXX11_ABI=1' --cpp_only
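
If the build succeeds, the TRT-LLM C++ libraries should end up under the checkout's cpp/build/ directory (an assumption based on the default build location); a quick way to confirm:

# assumption: default cpp build directory inside the TensorRT-LLM checkout
find cpp/build -name 'libtensorrt_llm*' 2>/dev/null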

Building Triton

Packages:

sudo apt install zlib1g-dev libarchive-dev libxml2-dev libnuma-dev libre2-dev libssl-dev libgoogle-perftools-dev libb64-dev libcurl4-openssl-dev rapidjson-dev datacenter-gpu-manager=1:3.2.6 libcudnn8-dev

Symlink TRT-LLM source directory inside Triton build directory:

mkdir -p build/tensorrtllm
ln -s path/to/TensorRT-LLM/ build/tensorrtllm/tensorrt_llm
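
The Triton build expects the TRT-LLM sources at exactly that path, so it is worth confirming the link resolves before starting the build (optional check):

ls -l build/tensorrtllm/tensorrt_llm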

Build Triton:

./build.py -v --no-container-build --build-dir=$(pwd)/build --enable-logging --enable-stats --enable-metrics --enable-cpu-metrics --enable-gpu-metrics --enable-gpu --backend tensorrtllm --backend=ensemble --backend=python --endpoint http
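
A successful build installs the server binary under build/tritonserver/install/bin/ (the path used by the mpirun invocation below); to confirm it is in place:

ls build/tritonserver/install/bin/tritonserver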

Running the Battle model

mpirun -n 2 build/tritonserver/install/bin/tritonserver --http-port=8080 --log-verbose=10 --backend-directory=/home/ubuntu/server/build/opt/tritonserver/backends --model-repository=/scratch/checkpoints/battle-trt-repo/ --load-model=poolside
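
Once the server logs that the model is ready, you can verify it from another shell with the standard readiness endpoint (same HTTP port as above); a 200 status means the server is ready to accept requests:

curl -s -o /dev/null -w "%{http_code}\n" localhost:8080/v2/health/ready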

Query:

curl -X POST localhost:8080/v2/models/ensemble/generate -d '{
    "text_input": "<root>How do I count to nine in French?</root>",
    "parameters": {
        "temperature": 0.0,
        "top_p": 0.9,
        "random_seed": 777,
        "stream": false,
        "max_tokens": 100,
        "bad_words": [],
        "stop_words": []
    }
}'

Replace /generate with /generate_stream for streaming.
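
For example, the same request against the streaming endpoint (with "stream" set to true; this is just a sketch of the streaming variant, adjust parameters as needed) returns the output as server-sent events:

curl -X POST localhost:8080/v2/models/ensemble/generate_stream -d '{
    "text_input": "<root>How do I count to nine in French?</root>",
    "parameters": {
        "temperature": 0.0,
        "top_p": 0.9,
        "random_seed": 777,
        "stream": true,
        "max_tokens": 100,
        "bad_words": [],
        "stop_words": []
    }
}'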
