Closed
Description
It would be nice if vLLM could also serve Transformer-based embedding models (e.g., BERT).
A single host server that exposes both generative and embedding LLM APIs would simplify deploying applications that rely on vector indexing, such as document retrieval and inserting retrieved memories into prompts.
This may be related to #187 for BERT-derived models, since they are encoder-only.
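For context, encoder-only models like BERT emit one vector per token, so serving them behind an embedding API usually involves pooling those vectors into a single sentence embedding. A minimal illustrative sketch of masked mean pooling in pure Python (not vLLM code; the function name and shapes are assumptions for illustration):

```python
def mean_pool(token_embeddings, attention_mask):
    """Masked mean pooling: average the per-token vectors,
    skipping padding positions.

    token_embeddings: list of per-token vectors (lists of floats)
    attention_mask: list of 0/1 flags, 1 = real token, 0 = padding
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    return [s / count for s in summed]

# Example: two real tokens plus one padding position.
emb = mean_pool(
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    [1, 1, 0],
)
# emb == [2.0, 3.0]  (padding vector is ignored)
```

In practice this pooling would happen inside the serving engine, after a forward pass of the encoder, before returning the vector to the client.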