Long context run error: CUDA error: an illegal memory access was encountered #1700
Comments
The LLaMA model's max_model_len is 2048, so I forcibly changed max_model_len to 60000 in 'vllm/config.py'.
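For reference, a less invasive way to attempt the same thing is to pass the context length through vLLM's own API instead of patching `vllm/config.py`; whether the `max_model_len` argument is available depends on the vLLM version, so treat this as a hedged sketch (model and tokenizer paths reuse the ones from the commands below):

```python
# Sketch only: raise the context limit via the LLM constructor rather than
# editing vllm/config.py. Note that a limit above the model's native 2048
# still needs some form of RoPE scaling to produce sensible outputs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/mnt/disk2/llama-2-13b-chat-hf/",
    tokenizer="/mnt/disk2/lama-tokenizer",
    max_model_len=60000,  # availability of this argument depends on vLLM version
)

params = SamplingParams(max_tokens=256, temperature=0.0)
outputs = llm.generate(["A very long prompt ..."], params)
print(outputs[0].outputs[0].text)
```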
It happened to me too when I tried to apply Dynamic-NTK RoPE scaling; I hit the same CUDA illegal memory access error.
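For context, Dynamic-NTK scaling is typically enabled through the `rope_scaling` field of the model's Hugging Face `config.json`; whether a given vLLM version honors that field is a separate question. A minimal sketch of the HF-side change, with the config path and scaling factor as illustrative assumptions:

```python
# Hedged sketch: enable Dynamic-NTK RoPE scaling in the HF model config.
# The path and factor below are examples, not recommendations.
import json

config_path = "/mnt/disk2/llama-2-13b-chat-hf/config.json"
with open(config_path) as f:
    cfg = json.load(f)

# "dynamic" NTK scaling with factor 2.0 roughly doubles the usable context
# relative to the trained window for LLaMA-family models in transformers.
cfg["rope_scaling"] = {"type": "dynamic", "factor": 2.0}

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)
```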
Long text cannot be used. I have encountered the same problem, and it is quite serious. Please help me solve it.
Side note: with HF transformers, a single A100 80GB is enough for 12k-token inference with falcon-7b. But with vLLM, I can only use a 4k-token prompt, which is much smaller and should easily fit in 80GB of GPU RAM. So this is not an OOM problem.
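A rough sketch of the transformers-side comparison described above (the model ID and token count come from the comment; the prompt and generation settings are placeholders):

```python
# Hedged sketch: run a ~12k-token prompt through falcon-7b with plain
# transformers on a single GPU, as the baseline the comment compares against.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

prompt = "some long document text ... " * 2000  # roughly 12k tokens
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```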
prompt len: 6495, max_tokens: 21000
Running commands:
python benchmark_serving.py --backend=vllm --host=localhost --port=8888 --dataset=/mnt/vllm/benchmarks/fake_data --tokenizer=/mnt/disk2/lama-tokenizer --num-prompts=1
python -m vllm.entrypoints.api_server --model=/mnt/disk2/llama-2-13b-chat-hf/ --tokenizer=/mnt/disk2/lama-tokenizer --tensor-parallel-size=2 --swap-space=64 --engine-use-ray --worker-use-ray --max-num-batched-tokens=60000
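To reproduce the failure without benchmark_serving.py, a minimal client against the api_server started above can be used; this is a sketch assuming the simple `/generate` endpoint of vLLM's demo api_server, with the prompt length and max_tokens taken from the report:

```python
# Hedged sketch: send one long request to the demo api_server on port 8888.
import json
import requests

prompt = "word " * 6495  # roughly the reported prompt length (6495)
payload = {
    "prompt": prompt,
    "max_tokens": 21000,  # max_tokens from the report
    "temperature": 0.0,
    "stream": False,
}
resp = requests.post("http://localhost:8888/generate", json=payload)
print(json.dumps(resp.json(), indent=2))
```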