This repository contains code for serving two embedding models (a text embedder and an image embedder) with TorchServe.
- Install dependencies:

```bash
git clone https://github.com/pytorch/serve.git
cd serve
python ./ts_scripts/install_dependencies.py --cuda=cu121
pip install -r requirements.txt
```
- Download the models from Hugging Face (`huggingface-cli download` takes one repo id per call, so each model needs its own command):

```bash
export HF_HOME=<folder for storing models>
export HF_HUB_CACHE=<folder for storing models>
huggingface-cli download sentence-transformers/paraphrase-multilingual-mpnet-base-v2
huggingface-cli download sentence-transformers/clip-ViT-B-32
```
- Check the models and save them to .bin files:

```bash
python convert_models_to_bin.py
```
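For reference, a minimal sketch of what such a conversion step can look like with the sentence-transformers API; the actual convert_models_to_bin.py in this repo may differ:

```python
# Sketch only: load both models, run a quick sanity check on the text model,
# and save the weights to .bin files.
import torch
from sentence_transformers import SentenceTransformer

MODELS = {
    "text_embedder": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    "image_embedder": "sentence-transformers/clip-ViT-B-32",
}

for name, repo_id in MODELS.items():
    model = SentenceTransformer(repo_id)
    if name == "text_embedder":
        # Sanity check: the embedding should be a non-empty vector.
        assert model.encode("hello").shape[-1] > 0
    torch.save(model.state_dict(), f"{name}.bin")
```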
- Create .mar files from the models for serving, using the handler files:

```bash
. scripts/create_mar_files.sh
```
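The script presumably wraps torch-model-archiver, the standard tool for building .mar files. A hedged example for the text model, where the weight and handler file names are assumptions:

```bash
torch-model-archiver \
  --model-name text_embedder \
  --version 1.0 \
  --serialized-file text_embedder.bin \
  --handler text_handler.py \
  --export-path model_store \
  --force
```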
- Specify the necessary parameters in config.properties and start the server:

```bash
. scripts/torchserve_start.sh
```
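A minimal illustrative config.properties: the inference port matches the curl examples below, while the management/metrics ports, worker counts, and batch values are assumptions (batchSize=8 taken from the performance tests):

```properties
inference_address=http://0.0.0.0:9980
management_address=http://0.0.0.0:9981
metrics_address=http://0.0.0.0:9982
model_store=model_store
load_models=text_embedder.mar,image_embedder.mar
models={\
  "text_embedder": {\
    "1.0": {"marName": "text_embedder.mar", "minWorkers": 1, "maxWorkers": 1, "batchSize": 8, "maxBatchDelay": 50}\
  },\
  "image_embedder": {\
    "1.0": {"marName": "image_embedder.mar", "minWorkers": 1, "maxWorkers": 1, "batchSize": 8, "maxBatchDelay": 50}\
  }\
}
```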
- Measure performance using locustfile.py:

```bash
. scripts/locust_test.sh
```
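As a hedged sketch of what such a load test can look like (the repo's locustfile.py may differ; the endpoint path comes from the request examples below, and the payload file is an assumption):

```python
# Sketch only: every simulated user repeatedly posts the sample text
# to the text_embedder endpoint.
from locust import HttpUser, between, task

with open("sample_text.txt", "rb") as f:
    PAYLOAD = f.read()

class TextEmbedderUser(HttpUser):
    wait_time = between(0.01, 0.1)

    @task
    def embed_text(self):
        self.client.post("/predictions/text_embedder", data=PAYLOAD)
```

This can be run headless with, e.g., `locust -f locustfile.py --host http://127.0.0.1:9980 --headless -u 64 -r 16`.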
Examples of sending requests:

```bash
curl -X POST http://127.0.0.1:9980/predictions/text_embedder -T ./sample_text.txt
curl -X POST http://127.0.0.1:9980/predictions/image_embedder -T ./sample_image.jpg
```
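The same requests from Python, assuming the requests package is installed:

```python
import requests

BASE = "http://127.0.0.1:9980/predictions"

with open("./sample_text.txt", "rb") as f:
    print(requests.post(f"{BASE}/text_embedder", data=f).status_code)

with open("./sample_image.jpg", "rb") as f:
    print(requests.post(f"{BASE}/image_embedder", data=f).status_code)
```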
Performance Tests

For two separate models, each with 1 worker:
- batchSize=8: 520 rps (requests per second)
- batchSize=16: 550 rps
- batchSize=32: 580 rps

When both models run simultaneously, the best result, 460 rps, is achieved with batchSize=8.
TODO
- try optimizing the models with TensorRT/ONNX
- resize images to the model's input size before sending them, to reduce the number of bytes transferred (see the sketch after this list)
- optimize data transfer, e.g. using pickle and imageio
- export metrics to Prometheus
- add Docker/docker-compose support
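A hedged sketch of the client-side resize idea from the list above; the 224x224 input size is an assumption based on clip-ViT-B-32, and the helper name is hypothetical:

```python
# Sketch only: downscale the image on the client before upload so that
# far fewer bytes travel over the network.
import io

import requests
from PIL import Image

MODEL_INPUT_SIZE = (224, 224)  # assumed input size for clip-ViT-B-32

def post_resized_image(path, url="http://127.0.0.1:9980/predictions/image_embedder"):
    image = Image.open(path).convert("RGB")
    image = image.resize(MODEL_INPUT_SIZE)
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=90)
    return requests.post(url, data=buf.getvalue())

print(post_resized_image("./sample_image.jpg").status_code)
```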