🚀 The feature
Add an optional command-line argument to torchserve for configuring the number of initial workers. If the argument is not supplied, proceed with the current autoscaling behavior.
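As a rough sketch of the intended usage (the `--initial-workers` flag name is purely illustrative here, not an existing TorchServe option):

```sh
# Hypothetical flag: start the server and cap the model at a single worker
torchserve --start --model-store model_store \
    --models embed=embedding_model.mar \
    --initial-workers 1

# Without the flag, the current autoscaling behavior is preserved
torchserve --start --model-store model_store --models embed=embedding_model.mar
```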
Motivation, pitch
This would make rapid experimentation with model serving more pleasant and seamless on less-memory-capable machines.

Currently, as noted in the Getting Started docs, running TorchServe "automatically scales backend workers." This is a neat feature, but it creates a pain point for folks trying to run TorchServe on a laptop or another less-memory-capable machine.
For example:
I ran TorchServe on my laptop (M2 Mac, 32 GB RAM, 10 cores) to serve a ~4 GB embedding model. Autoscaling attempted to spawn 10 workers and predictably crashed my laptop. A colleague of mine experienced the same thing. Ultimately, I had to use the Management API endpoints to (1) start the server, (2) register the model, and (3) scale to 1 worker before testing served inference.
The simplicity of just calling torchserve to start up the server and initialize a worker is basically out of reach for anyone experimenting on a regular laptop.
Alternatives
Currently, as noted in the Getting Started docs, it's possible to use the fine-grained control offered by the Management API endpoints to (1) start the server, (2) register the model, and (3) scale to 1 worker before testing served inference (see the sketch below). However, as mentioned above, I wish I could just use the simple torchserve command on my laptop 🥲.
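For reference, the workaround looks roughly like this (model and `.mar` file names are illustrative; this assumes the Management API is on its default port, 8081):

```sh
# 1. Start the server without registering any models
torchserve --start --model-store model_store

# 2. Register the model, requesting one initial worker
curl -X POST "http://localhost:8081/models?url=embedding_model.mar&initial_workers=1"

# 3. Or, for an already-registered model, scale down to one worker
curl -X PUT "http://localhost:8081/models/embed?min_worker=1"
```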
Additional context
No response