Is there any plan to support dynamic lora for qwen/chatglm models? #101
Comments
Support for the Alibaba open-source Qwen model would be wonderful.
Hey @KrisWongz @felixstander, thanks for trying out LoRAX! Looking at the code for Qwen, it looks pretty similar to Llama. It sounds like the main difference may be the use of a bias in the QKV computation, which shouldn't be a problem. I can definitely try taking a stab at it and see how it goes.
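(For context, a minimal sketch of what "bias in the QKV computation" refers to, assuming a fused QKV projection as in Llama-style attention; the class and argument names are illustrative, not LoRAX's actual code:)

```python
import torch
import torch.nn as nn

class FusedQKV(nn.Module):
    """Illustrative fused QKV projection (not LoRAX's actual code).

    Llama-style models typically build this projection with bias=False;
    Qwen uses bias=True, which is the difference noted above.
    """

    def __init__(self, hidden_size: int, qkv_bias: bool = True):
        super().__init__()
        # A single matmul produces Q, K, and V together; the bias term is what Qwen adds.
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=qkv_bias)

    def forward(self, hidden_states: torch.Tensor):
        q, k, v = self.qkv_proj(hidden_states).chunk(3, dim=-1)
        return q, k, v
```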
Hey @KrisWongz @felixstander, #103 should add support for Qwen. The base model appears to generate results consistent with the example on Huggingface Hub. Do you have an adapter I can use to test that the adapter loading works as expected?
Note that you'll need to run with --trust-remote-code.
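(For example, a sketch only: the image name/tag, volume mount, and model path below are placeholders and may differ from your setup:)

```
sudo docker run --gpus all -v /data:/data -p 8080:80 \
  ghcr.io/predibase/lorax:latest \
  --model-id /data/Tongyi-Finance-14B-Chat \
  --trust-remote-code
```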
I pulled the latest docker and set --trust-remote-code on startup.

Startup code:

```
sudo docker run --gpus all
```

But it still reports an error:

```
2023-12-06T06:11:56.409578Z INFO lorax_launcher: Args { model_id: "/data/Tongyi-Finance-14B-Chat", adapter_id: "", source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "4d4c7a004768", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }

2023-12-06T06:12:25.164707Z ERROR download: lorax_launcher: Download encountered an error: Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/cli.py", line 199, in download_weights
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/cli.py", line 173, in _download_weights
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/utils/convert.py", line 112, in convert_files
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/utils/convert.py", line 71, in convert_file
  File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 809, in load
  File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 1172, in _load
  File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 1142, in persistent_load
  File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 1112, in load_tensor
  File "/opt/conda/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 579, in _raise_timeout_error
ValueError: Loading this model requires you to execute custom code contained in the model repository on your local machine. Please set the option
Error: DownloadError
```
Really appreciate your work! I haven't tested yet, but I will upload a couple of my fine-tuned adapters to the Hugging Face Hub soon for you guys to test.
Thanks @felixstander! @KrisWongz it looks like loading the pickled model weights is what triggers the custom-code check. This is one of the issues with pickle, though: it can do unpredictable things like this. Can you try converting the weights to safetensors format and trying again?
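(For reference, a minimal sketch of that conversion, assuming a single non-sharded pytorch_model.bin; the file paths are placeholders, and a sharded checkpoint would need the same treatment per shard file:)

```python
import torch
from safetensors.torch import save_file

# Placeholder paths; point these at the local Qwen checkpoint.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# safetensors cannot store tensors that share storage, so clone each
# tensor and make it contiguous before saving.
state_dict = {name: tensor.clone().contiguous() for name, tensor in state_dict.items()}

save_file(state_dict, "model.safetensors", metadata={"format": "pt"})
```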
Currently we rely on flash attention, but we can definitely explore alternatives, like falling back to paged attention during prefill if needed.
@tgaddair I'm testing with Qwen-14-gptq-int4 on RTX3090 right now:

```
2023-12-07T06:58:35.748970Z INFO shard-manager: lorax_launcher: Shard ready in 8.511468629s rank=0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

2023-12-07T06:58:37.188701Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease
```
Setting max_prefill_tokens to be the same as max_input_length isn't working.
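(For reference, the values in that warmup error map to launcher arguments; a sketch assuming the kebab-case flags corresponding to the Args dump above, with illustrative numbers. As noted further down in the thread, lowering them was not enough here because the underlying cause turned out to be a GPT-Q issue:)

```
lorax-launcher \
  --model-id /data/Tongyi-Finance-14B-Chat \
  --max-input-length 1024 \
  --max-total-tokens 2048 \
  --max-batch-prefill-tokens 1024
```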
Thanks a lot! I successfully ran Qwen and multi-LoRA. My solution was to convert Qwen to .safetensors locally.

But there is currently a small problem that I have been dealing with for a long time: when I run inference with Qwen, generation cannot stop and runs until the maximum length limit every time. I suspect this may be due to stop words.

The error:

But I tried other parameters successfully, except 'stop':

By the way, 'do_sample=True' makes the output better, but not every time.
Hey @KrisWongz, can you try using the following param? Example:
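(The original example appears to have been stripped from the page. As a rough reconstruction, assuming the lorax-client Python package and that the suggested parameter is stop_sequences; both the parameter name and the stop strings are assumptions, not confirmed by the thread:)

```python
from lorax import Client  # assumes: pip install lorax-client

client = Client("http://127.0.0.1:8080")

response = client.generate(
    "What is the capital of France?",
    max_new_tokens=128,
    # Assumed parameter name; generation stops when one of these strings is produced.
    # Qwen chat templates typically use <|im_end|> / <|endoftext|>; adjust for your template.
    stop_sequences=["<|im_end|>", "<|endoftext|>"],
)
print(response.generated_text)
```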
Hey @felixstander, it looks like the error about decreasing the max batch size is misleading; the actual error here is:
Let me see if I can reproduce this error on my side.
Hey @felixstander, I wasn't able to reproduce the error using the most recent Docker image. Can you try pulling the latest Docker image and trying again? There were a couple of recent changes for GPT-Q that may have fixed this issue.
It works with the base model, but the output seems unusable after adding the LoRA adapter. It may be a problem with the template settings I used when fine-tuning, since the adapter runs successfully under Qwen's original web_demo.
Any plan for ChatGLM model support? Thanks.
Hey @thincal, we can definitely add ChatGLM support. I can create a separate issue to track that.
@tgaddair it seems that the Qwen model type is qwen2 now, so which version is supported by the current implementation of LoRAX?
Feature request
Cool job! I have successfully run multi-LoRA with llama2-70b.
I would like to ask if the author has any plans to support other models, such as qwen, which would be very helpful.
Motivation
null
Your contribution
null