Triton inference memory allocation

All tests were launched on an Nvidia Tesla V100S-PCIE-32GB.

bash scripts/run.sh

The dummy model

In this example, we observe behaviour similar to a memory leak. Under a heavy load of calls, the model's VRAM usage grows, and the memory is never freed once the load stops. From reading this doc, we understand that the cause could be the memory allocator's strategy. Is there a way to switch the memory allocator's strategy for VRAM?
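
To make the growth described above easier to quantify, VRAM usage can be sampled before, during, and after a load test. The following is a minimal sketch and not part of this repository: the helper names are hypothetical, and it assumes `nvidia-smi` is available on the PATH of the machine running the server.

```python
import subprocess
from typing import List


def parse_memory_used_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi CSV output.

    Expects the output of:
        nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
    which prints one number per line, one line per GPU.
    """
    return [
        int(line.strip())
        for line in nvidia_smi_output.splitlines()
        if line.strip()
    ]


def query_memory_used_mib() -> List[int]:
    """Ask nvidia-smi for the memory currently used on each GPU (hypothetical helper)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_memory_used_mib(result.stdout)
```

Calling `query_memory_used_mib()` once before the stress test and again some time after it ends gives two numbers to compare. If the second stays well above the first, the memory was not returned to the device, which is consistent with either a leak or an allocator that keeps its pool.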

The pyannote model

Under stress and for a specific config, some pyannote processes disappear on version 22.12. With the same config on version 23.08, the processes do not disappear. Was this really fixed, or is it just luck that our configs behave correctly?

Also, models are now loaded concurrently, which creates a race when writing files here. The flag --model-load-thread-count=1 is supposed to serialize the loading, but it does not. We have forced serialization in model.py by sleeping for a time proportional to the instance id.
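
The sleep-based workaround can be sketched roughly as follows. This illustrates the idea and is not the exact code in model.py: the helper names and the 5-second step are hypothetical, and it assumes the instance name passed by the Triton Python backend in `args["model_instance_name"]` ends with the instance index (for example `pyannote_0_2`).

```python
import re
import time


def instance_index(instance_name: str) -> int:
    """Return the trailing integer of an instance name, or 0 if there is none.

    Assumption: names look like '<model>_<group>_<index>', e.g. 'pyannote_0_2'.
    """
    match = re.search(r"(\d+)$", instance_name)
    return int(match.group(1)) if match else 0


def load_delay_seconds(instance_name: str, seconds_per_instance: float = 5.0) -> float:
    """Delay proportional to the instance index, so instances load one after another."""
    return instance_index(instance_name) * seconds_per_instance


class TritonPythonModel:
    def initialize(self, args):
        # Stagger the start of each instance to avoid concurrent file writes.
        # The step (hypothetical value) must exceed the time one load takes.
        time.sleep(load_delay_seconds(args["model_instance_name"]))
        # ... load the model here ...
```

Staggering by sleep only reduces the chance of a collision and does not guarantee exclusion. If the step is shorter than one load, two instances can still overlap, so a file lock around the write would be a more robust alternative.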
