llama : add batched inference endpoint to server

For those who, like me, are not familiar with C:
It would be great if a new endpoint were added to server.cpp for batched inference.
For example:
endpoint: /completions
post: {"prompts":["promptA","promptB","promptC"]}
response: {"results":["sequenceA","sequenceB","sequenceC"]}
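To make the request concrete, here is a rough sketch of how a JavaScript client might call such an endpoint. Everything here is an assumption based on the example above: the `/completions` batch route does not exist in server.cpp as far as this request is concerned, and the helper names (`buildBatchRequest`, `parseBatchResponse`, `completeBatch`) are hypothetical.

```javascript
// Sketch only: the batch /completions endpoint and its JSON shapes are the
// ones proposed in this issue, not an existing server.cpp API.

// Build the proposed request body: {"prompts": ["promptA", ...]}
function buildBatchRequest(prompts) {
  if (!Array.isArray(prompts) || prompts.some((p) => typeof p !== "string")) {
    throw new TypeError("prompts must be an array of strings");
  }
  return JSON.stringify({ prompts });
}

// Parse the proposed response body: {"results": ["sequenceA", ...]}
// and check that there is exactly one result per prompt.
function parseBatchResponse(bodyText, expectedCount) {
  const data = JSON.parse(bodyText);
  if (!Array.isArray(data.results) || data.results.length !== expectedCount) {
    throw new Error("expected one result per prompt");
  }
  return data.results;
}

// Send all prompts in one POST and return the results in the same order.
// baseUrl is an assumption supplied by the caller.
async function completeBatch(baseUrl, prompts) {
  const response = await fetch(`${baseUrl}/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: buildBatchRequest(prompts),
  });
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }
  return parseBatchResponse(await response.text(), prompts.length);
}
```

With an endpoint like this, a client would call something like `await completeBatch("http://localhost:8080", ["promptA", "promptB", "promptC"])` (the address is an assumed local server) and get back an array with one sequence per prompt.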

This is easy to do with Hugging Face Transformers (which is what I do right now), but it is quite inefficient. I hope to use llama.cpp to improve efficiency one day. Because I am not familiar with C, I cannot use baby llama; I can only use JavaScript to exchange data with server.cpp.

llama : add batched inference endpoint to server #3478
