
[Feature] Serving embedding and reranking models using vLLM #1203

Open
@lvliang-intel


Priority

P1-Stopper

OS type

Ubuntu

Hardware type

Xeon-GNR

Running nodes

Single Node

Description

Feature: Serving Embedding and Reranking Models Using vLLM on Xeon and Gaudi
Description:
Integrate vLLM as a serving framework to enhance the performance and scalability of embedding and reranking models. This feature involves:

Leveraging vLLM's high-throughput serving capabilities to handle embedding and reranking requests efficiently (see the sketch after this list).
Integrating with the ChatQnA pipeline.
Optimizing the vLLM configuration for embedding and reranking use cases, ensuring lower latency and better resource utilization.
Comparing vLLM's performance against the current TEI (Text Embeddings Inference) deployment to determine the best setup for production.
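As a rough illustration of the intended usage, the sketch below queries locally running vLLM servers for embeddings and relevance scores. The model names, ports, endpoint paths, and launch flags are assumptions for illustration only; the exact CLI options and routes depend on the vLLM version actually deployed.

```python
# Minimal sketch of querying vLLM-served embedding and reranking models.
# Assumes two vLLM instances were started along these lines (illustrative
# model names, ports, and flags; exact options vary by vLLM version):
#   vllm serve BAAI/bge-base-en-v1.5   --task embed --port 8001
#   vllm serve BAAI/bge-reranker-v2-m3 --task score --port 8002
import requests

query = "What is OPEA?"
docs = ["OPEA is an open platform for enterprise AI.",
        "vLLM is a high-throughput LLM serving engine."]

# Embedding request via the OpenAI-compatible /v1/embeddings route.
emb = requests.post(
    "http://localhost:8001/v1/embeddings",
    json={"model": "BAAI/bge-base-en-v1.5", "input": [query] + docs},
    timeout=30,
)
emb.raise_for_status()
print([len(item["embedding"]) for item in emb.json()["data"]])

# Reranking request via vLLM's scoring route (pairs the query with each doc).
score = requests.post(
    "http://localhost:8002/score",
    json={"model": "BAAI/bge-reranker-v2-m3", "text_1": query, "text_2": docs},
    timeout=30,
)
score.raise_for_status()
print([item["score"] for item in score.json()["data"]])
```

Comparable calls against the existing TEI endpoints would provide the baseline for the TEI-vs-vLLM comparison mentioned above.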

Expected Outcome:

An additional serving framework available for embedding and reranking models, with better performance expected on Gaudi.
Improved throughput for embedding and reranking services.
Enhanced flexibility to switch between serving frameworks based on specific requirements.

Labels

A3, Maintain, feature (New feature or request)
