
Generation with prefix cache is slower than without it? #3154

Closed as not planned

@vin136

Description

I'm running the tutorial vllm/offline_inference_with_prefix.py and measuring the generation times. Below is the same code with timing added, followed by the measured times.

```python
import time

from vllm import LLM, SamplingParams

# Common prefix shared by all prompts.
prefix = (
    "You are an expert school principal, skilled in effectively managing "
    "faculty and staff. Draft 10-15 questions for a potential first grade "
    "Head Teacher for my K-12, all-girls', independent school that emphasizes "
    "community, joyful discovery, and life-long learning. The candidate is "
    "coming in for a first-round panel interview for a 8th grade Math "
    "teaching role. They have 5 years of previous teaching experience "
    "as an assistant teacher at a co-ed, public school with experience "
    "in middle school math teaching. Based on these information, fulfill "
    "the following paragraph: ")

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0)

if __name__ == "__main__":
    # Create an LLM.
    llm = LLM(model="facebook/opt-125m")

    generating_prompts = [prefix + prompt for prompt in prompts]

    # Generate texts from the prompts. The output is a list of RequestOutput
    # objects that contain the prompt, generated text, and other information.
    st = time.perf_counter()
    outputs = llm.generate(generating_prompts, sampling_params)
    end = time.perf_counter()
    print(f"without caching time:{end - st}")

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    print("-" * 80)

    # -1 since the last token can change when concatenating prompts.
    prefix_pos = len(llm.llm_engine.tokenizer.encode(prefix)) - 1

    # The llm.generate call will batch all prompts and send the batch at once
    # if resources allow. The prefix will only be cached after the first batch
    # is processed, so we need to call generate once to compute the prefix and
    # cache it.
    outputs = llm.generate(generating_prompts[0],
                           sampling_params,
                           prefix_pos=[prefix_pos])

    # Subsequent batches can leverage the cached prefix.
    st = time.perf_counter()
    outputs = llm.generate(generating_prompts,
                           sampling_params,
                           prefix_pos=[prefix_pos] * len(generating_prompts))
    end = time.perf_counter()
    print(f"with caching time:{end - st}")

    # Print the outputs. You should see the same outputs as before.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Output:

```
with caching time:1.9611055543646216
without caching time:0.07439832389354706
```

vLLM version: vllm==0.3.3
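
For reference, part of the gap may come from one-time startup costs landing inside the timed region, so a fairer comparison warms the engine up before timing either path. Below is a minimal benchmarking sketch along those lines; it assumes a newer vLLM release where automatic prefix caching can be enabled with `enable_prefix_caching=True`, and the model name, prompts, and `gpu_memory_utilization` value are just placeholders, not taken from this issue.

```python
# Hypothetical benchmarking sketch: warm up each engine with an untimed
# generate() call, then time a second call, so startup costs are excluded.
import time

from vllm import LLM, SamplingParams


def time_generate(llm, prompts, sampling_params):
    llm.generate(prompts, sampling_params)   # untimed warmup pass
    st = time.perf_counter()
    llm.generate(prompts, sampling_params)   # timed pass
    return time.perf_counter() - st


if __name__ == "__main__":
    shared_prefix = "You are an expert school principal. " * 20  # placeholder long prefix
    prompts = [shared_prefix + p for p in
               ["Hello, my name is", "The capital of France is"]]
    sampling_params = SamplingParams(temperature=0.0)

    # Two engines in one process, so cap GPU memory for each (assumed value).
    baseline = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.4)
    cached = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.4,
                 enable_prefix_caching=True)

    print(f"without prefix caching: {time_generate(baseline, prompts, sampling_params):.4f}s")
    print(f"with prefix caching:    {time_generate(cached, prompts, sampling_params):.4f}s")
```

With the warmup pass in place, both configurations are measured in a steady state, so any remaining difference is more plausibly attributable to prefix caching itself rather than first-call overhead.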
