[Feature]: Reduce LoRA latency via speculative decoding

### 🚀 The feature, motivation and pitch

The speculative decoding framework already allows the target model to have LoRAs, but the work to set up batch expansion for them has not yet been done. We can implement batch expansion for LoRA and thereby enable speculative decoding for LoRA requests.

The required work is essentially to implement batch expansion while passing the LoRA arguments through. See "Let’s talk about code" in the following notes: https://docs.google.com/document/d/1z4Tgb1FcDr3YXvFPelyn-T-DEnLqqrlrxRi3TvIyAmg/edit
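To make the idea concrete, here is a minimal, self-contained sketch of batch expansion with LoRA pass-through. It does not use vLLM's real classes: `LoRARef`, `ScoringSeq`, `expand_for_scoring`, and `expand_batch` are hypothetical names invented for illustration. It also assumes that a sequence with k proposed tokens expands into k + 1 scoring sequences, one per proposal prefix. The point it illustrates is only that each expanded sequence should carry its parent's LoRA reference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LoRARef:
    # Hypothetical stand-in for whatever identifies an adapter on a request.
    adapter_id: int
    adapter_path: str


@dataclass
class ScoringSeq:
    # Hypothetical stand-in for a sequence sent to the target model for scoring.
    parent_request_id: str
    token_ids: List[int]
    lora: Optional[LoRARef] = None


def expand_for_scoring(seq: ScoringSeq, proposed: List[int]) -> List[ScoringSeq]:
    """Expand one sequence with k proposed tokens into k + 1 scoring sequences.

    Sequence i holds the original tokens plus the first i proposed tokens,
    and inherits the parent's LoRA reference unchanged.
    """
    return [
        ScoringSeq(
            parent_request_id=seq.parent_request_id,
            token_ids=seq.token_ids + proposed[:i],
            lora=seq.lora,  # pass the LoRA argument through to the expanded copy
        )
        for i in range(len(proposed) + 1)
    ]


def expand_batch(
    seqs: List[ScoringSeq], proposals: List[List[int]]
) -> List[ScoringSeq]:
    """Expand a whole batch; proposals[j] are the draft tokens for seqs[j]."""
    if len(seqs) != len(proposals):
        raise ValueError("expected one proposal list per sequence")
    expanded: List[ScoringSeq] = []
    for seq, proposed in zip(seqs, proposals):
        expanded.extend(expand_for_scoring(seq, proposed))
    return expanded


if __name__ == "__main__":
    # A mixed batch: one request with an adapter, one without.
    batch = [
        ScoringSeq("req-0", [1, 2, 3], LoRARef(7, "/adapters/example")),
        ScoringSeq("req-1", [4, 5], None),
    ]
    for s in expand_batch(batch, [[10, 11], [20]]):
        print(s.parent_request_id, s.token_ids, s.lora)
```

Keeping the adapter reference on each expanded sequence, rather than on the batch as a whole, lets requests with different adapters (or none) share a batch in this sketch.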

I expect this to work well for larger models (e.g. 70B), but it will be more difficult for smaller models because of latency constraints and vLLM overheads. Perhaps with a speculator such as ngram, EAGLE, or MLPSpeculator it can work for 7B models as well.

Note that this work does not include applying LoRA to the speculator; that can be future work.

### Alternatives

_No response_

### Additional context

_No response_

[Feature]: Reduce LoRA latency via speculative decoding #6912
