Description
System Info
- Ubuntu 20.04
- NVIDIA A100
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- docker run -itd --gpus=all --shm-size=1g -p8000:8000 -p8001:8001 -p8002:8002 -v /share/datasets:/share/datasets nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
- Code version is 0.11.0:
git clone https://github.com/NVIDIA/TensorRT-LLM.git
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
- Perform some serving inference calls with aiohttp
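For reference, the kind of client used for these calls can be sketched as follows. This is not the exact script from my test, only a minimal sketch. The URL, the model name `ensemble`, the payload fields `text_input` and `max_tokens`, and the request and concurrency counts are all assumptions that need to be adapted to the deployed model.

```python
import asyncio
from collections import Counter

# Assumed values, not taken from the report: adjust to your deployment.
URL = "http://localhost:8000/v2/models/ensemble/generate"  # "ensemble" is an assumed model name
TOTAL_REQUESTS = 5000
CONCURRENCY = 16


def build_payload(prompt, max_tokens=32):
    """Build the request body; the field names are assumptions about the model's inputs."""
    return {"text_input": prompt, "max_tokens": max_tokens}


async def run_load(send, total, concurrency):
    """Call send(i) `total` times with at most `concurrency` in flight and count outcomes."""
    semaphore = asyncio.Semaphore(concurrency)
    outcomes = Counter()

    async def one(i):
        async with semaphore:
            try:
                status = await send(i)
                outcomes["ok" if status == 200 else f"http_{status}"] += 1
            except Exception as exc:  # e.g. a connection reset when the server aborts
                outcomes[type(exc).__name__] += 1

    await asyncio.gather(*(one(i) for i in range(total)))
    return outcomes


async def main():
    import aiohttp  # third-party; imported here so the helpers above work without it

    async with aiohttp.ClientSession() as session:

        async def send(i):
            async with session.post(URL, json=build_payload(f"request {i}")) as resp:
                await resp.read()
                return resp.status

        print(await run_load(send, TOTAL_REQUESTS, CONCURRENCY))
```

Running `asyncio.run(main())` against a live server prints a count per outcome, so a server abort shows up as a burst of connection errors rather than HTTP 200 responses. Raising `CONCURRENCY` approximates the high-concurrency case.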
Expected behavior
All requests are processed successfully with no errors.
Actual behavior
When the server performs many inferences, such as 5000, it raises an error:
malloc(): unaligned tcache chunk detected
Signal (6) received.
Both continuous inference and intermittent inference (for example, over one day) cause this error.
When I make 8000 inference calls in one test, it raises an error:
pinned_memory_manager.cc:170] "failed to allocate pinned system memory, falling back to non-pinned system memory
I eventually solved this by setting the parameters cuda-memory-pool-byte-size and pinned-memory-pool-byte-size to 512M each. However, these two parameters are not exposed in the script scripts/launch_triton_server.py, so I would like to know why this problem occurs and whether there is another way to solve it.
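To make the workaround concrete, here is a minimal sketch that builds a `tritonserver` command line with both pool sizes set to 512 MiB. It assumes the server is started directly instead of through scripts/launch_triton_server.py, and it does not reproduce whatever else that script does. The binary path, the model repository path, and the GPU id `0` are assumptions. The helper name `build_tritonserver_command` is hypothetical.

```python
import shlex

MIB = 1024 * 1024


def build_tritonserver_command(model_repo, pool_mib=512, gpu_id=0,
                               tritonserver="/opt/tritonserver/bin/tritonserver"):
    """Build a tritonserver argv with explicit memory pool sizes (hypothetical helper)."""
    pool_bytes = pool_mib * MIB
    return [
        tritonserver,  # assumed install path inside the container
        f"--model-repository={model_repo}",
        # Pinned host memory pool size, in bytes.
        f"--pinned-memory-pool-byte-size={pool_bytes}",
        # CUDA memory pool size, given as <gpu id>:<bytes>.
        f"--cuda-memory-pool-byte-size={gpu_id}:{pool_bytes}",
    ]


cmd = build_tritonserver_command("/path/to/model_repo")  # hypothetical path
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.Popen(cmd)`. The CUDA pool flag is per GPU, so a multi-GPU setup would need one `--cuda-memory-pool-byte-size` entry per device.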
When I call the server with high concurrency, it raises an error:
malloc_consolidate(): unaligned fastbin chunk detected
Signal (6) received.
Hope you can help me solve these problems, thanks very much!
Additional notes
I think this happens because the server does not completely clean up memory after each inference completes.
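One way to check this hypothesis is to poll the Triton metrics endpoint during the test and watch whether used GPU memory keeps growing. The sketch below assumes the metrics endpoint is reachable on port 8002 (as mapped in the docker command) and reads the `nv_gpu_memory_used_bytes` gauge. It only covers GPU memory, not the host heap where the malloc errors are reported, so it can support or weaken the suspicion but not confirm it.

```python
import re
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # assumed host; port taken from the docker command

# Matches samples such as: nv_gpu_memory_used_bytes{gpu_uuid="GPU-abc"} 1234
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][\w:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)(?:\s+\d+)?$'
)


def parse_metric(text, name):
    """Return {label string: value} for every sample of `name` in Prometheus text format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match and match.group("name") == name:
            samples[match.group("labels") or ""] = float(match.group("value"))
    return samples


def fetch_gpu_memory_used(url=METRICS_URL):
    """Fetch the metrics page and return used GPU memory in bytes, keyed by GPU labels."""
    with urllib.request.urlopen(url, timeout=5) as response:
        text = response.read().decode("utf-8")
    return parse_metric(text, "nv_gpu_memory_used_bytes")
```

Calling `fetch_gpu_memory_used()` every few hundred requests and logging the result gives a simple memory trend for the run.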