### 📚 The doc issue
I am new to TorchServe and am looking for the features I would need in order to consider TorchServe for LLM text generation.
There are currently a couple of inference serving solutions out there, including text-generation-inference and vLLM. It would be great if the documentation mentioned how TorchServe currently compares with these. For instance:
- Does TorchServe support continuous batching?
- Does TorchServe support paged attention?
- Does TorchServe support streaming generated text through its inference API? (See the handler sketch after this list.)
- What are some LLMs that TorchServe is known to work well with (e.g., Llama 2, Falcon), apart from the Hugging Face integration example that is provided?
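
For the streaming question above, recent TorchServe releases expose a `send_intermediate_predict_response` helper that a custom handler can call to push partial results back to the client. Below is a minimal sketch, assuming a hypothetical `_generate_tokens` helper standing in for a real token-by-token generation loop; it is illustrative only, not TorchServe's official LLM handler.

```python
# Sketch of a TorchServe custom handler that streams generated text.
# `send_intermediate_predict_response` is the streaming hook available in
# recent TorchServe releases; `_generate_tokens` is a hypothetical helper.
from ts.protocol.otf_message_handler import send_intermediate_predict_response
from ts.torch_handler.base_handler import BaseHandler


class StreamingTextHandler(BaseHandler):
    def inference(self, input_batch):
        # Push each chunk to the client as soon as it is produced,
        # instead of waiting for the full sequence.
        for token_text in self._generate_tokens(input_batch):
            send_intermediate_predict_response(
                [token_text],
                self.context.request_ids,
                "Intermediate Prediction success",
                200,
                self.context,
            )
        return [""]  # the final return closes the stream
```

On the client side, the chunks can be consumed with a streaming HTTP request; the model name in the URL below is a placeholder:

```python
import requests

# Iterate over the response body as chunks arrive from the handler.
response = requests.post(
    "http://localhost:8080/predictions/streaming_text",
    data="my prompt",
    stream=True,
)
for chunk in response.iter_content(chunk_size=None):
    print(chunk.decode("utf-8"), end="", flush=True)
```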
### Suggest a potential alternative/fix
A dedicated documentation page on text generation and LLM inference would make sense, given how many people are likely to be interested in this topic.