I'm trying to use RoPE scaling to increase max_seq_len. Following #555, I modified the model's config.json to add the rope_scaling key:
{
  "_name_or_path": "m42-health/med42-70b",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-05,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.28.1",
  "use_cache": true,
  "vocab_size": 32000,
  "rope_scaling": {
    "factor": 2.0,
    "type": "dynamic"
  }
}
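For reference, a quick sanity check (just a sketch; the local directory path below is hypothetical) that transformers actually picks up the new key when the edited config is loaded:

from transformers import AutoConfig

# Hypothetical local directory containing the edited config.json
local_model_dir = "/path/to/local/med42-70b"

config = AutoConfig.from_pretrained(local_model_dir)
print(config.rope_scaling)             # expected: {'factor': 2.0, 'type': 'dynamic'}
print(config.max_position_embeddings)  # still 2048 in the file itself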
Then I initialized the vLLM engine with:

from vllm import LLM

cache_dir = "/secure/hf_cache"
model_name_or_path = "m42-health/med42-70b"
llm = LLM(model=model_name_or_path, download_dir=cache_dir, tensor_parallel_size=4, dtype="auto")
However, when I performed inference on long prompts, I still got the warning:
WARNING 01-20 16:48:15 scheduler.py:149] Input prompt (2380 tokens) is too long and exceeds limit of 2048
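To confirm what the engine actually derived, I think the effective context length can be read off the engine's model config. A minimal sketch, assuming the internal attribute names llm_engine and model_config used by the 0.2.x code base (they may change between versions):

# Sketch: inspect the context length vLLM derived from the config.
print(llm.llm_engine.model_config.max_model_len)  # presumably 2048 here, matching the warning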
Has anyone run into this issue before?
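For completeness, the only workaround that comes to mind is forcing a larger context window through the max_model_len engine argument. A minimal sketch, assuming that argument is supported and forwarded to the engine in this version, and that it composes correctly with dynamic RoPE scaling:

# Sketch of a possible workaround: override the derived context length
# explicitly. 4096 = 2048 * rope_scaling factor of 2.0.
llm = LLM(
    model=model_name_or_path,
    download_dir=cache_dir,
    tensor_parallel_size=4,
    dtype="auto",
    max_model_len=4096,
)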
P.S. My vLLM version is 0.2.7.