
[Feature] Serving embedding and reranking models using vLLM #1203

Open
@lvliang-intel


Priority

P1-Stopper

OS type

Ubuntu

Hardware type

Xeon-GNR

Running nodes

Single Node

Description

Feature: Serving Embedding and Reranking Models Using vLLM on Xeon and Gaudi
Description:
Integrate vLLM as a serving framework to enhance the performance and scalability of embedding and reranking models. This feature involves:

Leveraging vLLM's high-throughput serving capabilities to handle embedding and reranking requests efficiently (see the sketch after this list).
Integrating with the ChatQnA pipeline.
Optimizing the vLLM configuration for embedding and reranking use cases, ensuring lower latency and better resource utilization.
Comparing vLLM's performance against the current TEI (Text Embeddings Inference) deployment to determine the best setup for production.
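As a rough illustration of the intended usage, the sketch below queries locally running vLLM servers for embeddings and relevance scores. The model names, ports, endpoint paths, and launch flags are assumptions for illustration only; the exact CLI options and routes depend on the vLLM version actually deployed.

```python
# Minimal sketch of querying vLLM-served embedding and reranking models.
# Assumes two vLLM instances were started along these lines (illustrative
# model names, ports, and flags; exact options vary by vLLM version):
#   vllm serve BAAI/bge-base-en-v1.5   --task embed --port 8001
#   vllm serve BAAI/bge-reranker-v2-m3 --task score --port 8002
import requests

query = "What is OPEA?"
docs = ["OPEA is an open platform for enterprise AI.",
        "vLLM is a high-throughput LLM serving engine."]

# Embedding request via the OpenAI-compatible /v1/embeddings route.
emb = requests.post(
    "http://localhost:8001/v1/embeddings",
    json={"model": "BAAI/bge-base-en-v1.5", "input": [query] + docs},
    timeout=30,
)
emb.raise_for_status()
print([len(item["embedding"]) for item in emb.json()["data"]])

# Reranking request via vLLM's scoring route (pairs the query with each doc).
score = requests.post(
    "http://localhost:8002/score",
    json={"model": "BAAI/bge-reranker-v2-m3", "text_1": query, "text_2": docs},
    timeout=30,
)
score.raise_for_status()
print([item["score"] for item in score.json()["data"]])
```

Comparable calls against the existing TEI endpoints would provide the baseline for the TEI-vs-vLLM comparison mentioned above.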

Expected Outcome:

An additional serving framework available for embedding and reranking models, with better performance expected on Gaudi.
Improved throughput for embedding and reranking services.
Enhanced flexibility to switch between serving frameworks based on specific requirements.

Labels

A3, Maintain, feature (New feature or request)
