
Is there any plan to support dynamic LoRA for Qwen/ChatGLM models? #101

Open
KrisWongz opened this issue Dec 5, 2023 · 21 comments
Labels: enhancement (New feature or request)

@KrisWongz

Feature request

Cool job! I have successfully run multi-LoRA with Llama-2-70B.
I would like to ask whether there are any plans to support other models, such as Qwen; that would be very helpful.

Motivation

null

Your contribution

null

@felixstander

Support for the Alibaba open-source Qwen model would be wonderful.

@tgaddair (Contributor) commented Dec 5, 2023

Hey @KrisWongz @felixstander, thanks for trying out LoRAX!

Looking at the code for Qwen, it looks pretty similar to Llama. It sounds like the main difference may be the use of a bias in the QKV computation, which shouldn't be a problem.

I can definitely try taking a stab at it and see how it goes.
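A minimal illustration of that difference (not LoRAX's actual modeling code; the dimensions below are made up): Qwen's fused QKV projection carries a bias term, while Llama's attention projections do not.

import torch.nn as nn

# Hypothetical dimensions, for illustration only.
hidden_size, num_heads, head_dim = 4096, 32, 128

# Qwen-style fused QKV projection uses a bias.
qkv_qwen = nn.Linear(hidden_size, 3 * num_heads * head_dim, bias=True)

# Llama-style attention projections are bias-free.
qkv_llama = nn.Linear(hidden_size, 3 * num_heads * head_dim, bias=False)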

tgaddair added the enhancement (New feature or request) label on Dec 5, 2023
@tgaddair (Contributor) commented Dec 5, 2023

Hey @KrisWongz @felixstander, #103 should add support for Qwen. The base model appears to generate results consistent with the example on Huggingface Hub. Do you have an adapter I can use to test that the adapter loading works as expected?
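For reference, once an adapter is available, adapter loading could be exercised with something along these lines using the lorax Python client (the endpoint and adapter ID below are placeholders, not real values):

from lorax import Client

# Placeholder endpoint; point this at wherever the LoRAX server is listening.
client = Client("http://127.0.0.1:8080")

# Placeholder adapter ID; substitute a real Qwen LoRA adapter repo.
response = client.generate(
    "What is the capital of France?",
    max_new_tokens=32,
    adapter_id="some-user/qwen-lora-adapter",
)
print(response.generated_text)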

@tgaddair (Contributor) commented Dec 5, 2023

Note that you'll need to run with --trust-remote-code when launching LoRAX as the tokenizer is custom and hosted on HF.

@KrisWongz (Author)

> Note that you'll need to run with --trust-remote-code when launching LoRAX as the tokenizer is custom and hosted on HF.

I pulled the latest Docker image and set --trust-remote-code on startup. Launch command:

sudo docker run --gpus all \
  --shm-size 10g \
  -p 8081:80 \
  -v /home/shaohongen/Temp/Models/Qwen:/data \
  ghcr.io/predibase/lorax:latest \
  --model-id /data/Tongyi-Finance-14B-Chat \
  --trust-remote-code

But it still reports an error:

2023-12-06T06:11:56.409578Z INFO lorax_launcher: Args { model_id: "/data/Tongyi-Finance-14B-Chat", adapter_id: "", source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "4d4c7a004768", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2023-12-06T06:11:56.409692Z WARN lorax_launcher: trust_remote_code is set. Trusting that model /data/Tongyi-Finance-14B-Chat do not contain malicious code.
2023-12-06T06:11:56.409982Z INFO download: lorax_launcher: Starting download process.
2023-12-06T06:12:04.991708Z WARN lorax_launcher: cli.py:143 No safetensors weights found for model /data/Tongyi-Finance-14B-Chat at revision None. Converting PyTorch weights to safetensors.

2023-12-06T06:12:25.164707Z ERROR download: lorax_launcher: Download encountered an error: Traceback (most recent call last):

File "/opt/conda/bin/lorax-server", line 8, in
sys.exit(app())

File "/opt/conda/lib/python3.9/site-packages/lorax_server/cli.py", line 199, in download_weights
_download_weights(model_id, revision, extension, auto_convert, source)

File "/opt/conda/lib/python3.9/site-packages/lorax_server/cli.py", line 173, in _download_weights
utils.convert_files(local_pt_files, local_st_files, discard_names)

File "/opt/conda/lib/python3.9/site-packages/lorax_server/utils/convert.py", line 112, in convert_files
convert_file(pt_file, sf_file, discard_names)

File "/opt/conda/lib/python3.9/site-packages/lorax_server/utils/convert.py", line 71, in convert_file
loaded = torch.load(pt_file, map_location="cpu")

File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 809, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)

File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 1172, in _load
result = unpickler.load()

File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 1142, in persistent_load
typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))

File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 1112, in load_tensor
storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage

File "/opt/conda/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 579, in _raise_timeout_error
raise ValueError(

ValueError: Loading this model requires you to execute custom code contained in the model repository on your local machine. Please set the option trust_remote_code=True to permit loading of this model.

Error: DownloadError

@felixstander

> Hey @KrisWongz @felixstander, #103 should add support for Qwen. The base model appears to generate results consistent with the example on Huggingface Hub. Do you have an adapter I can use to test that the adapter loading works as expected?

Really appreciate your work! I haven't tested it yet, but I will upload a couple of my fine-tuned adapters to the Hugging Face Hub soon for you to test.

@tgaddair (Contributor) commented Dec 6, 2023

Thanks @felixstander!

@KrisWongz it looks like the model weights .bin file is trying to execute some code on deserialization. I wasn't able to repro this using the base model from Huggingface here: https://huggingface.co/jxy/Tongyi-Finance-14B-Chat.

This is one of the problems with pickle, though: it can do unpredictable things like this. Can you try converting the weights to safetensors format and trying again?
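One way to do that conversion locally is a short script along these lines (a sketch, assuming the checkpoint loads with transformers and that running its custom code locally is acceptable); afterwards, point --model-id at the converted directory and copy over the tokenizer files:

from transformers import AutoModelForCausalLM

# Load the pickled .bin checkpoint (Qwen ships custom modeling code, hence
# trust_remote_code=True), then re-save it as safetensors shards.
model = AutoModelForCausalLM.from_pretrained(
    "/data/Tongyi-Finance-14B-Chat",
    trust_remote_code=True,
    torch_dtype="auto",
)
model.save_pretrained(
    "/data/Tongyi-Finance-14B-Chat-safetensors",
    safe_serialization=True,
)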

@felixstander

Does LoRAX support the Qwen 4-bit GPTQ version without requiring FlashAttention v2?

As far as I can see, all models you currently support are built on top of FlashAttention by default. Unfortunately, some of our inference GPUs are still V100s, which FlashAttention does not support. :(

@tgaddair (Contributor) commented Dec 7, 2023

Currently we rely on flash attention, but we can definitely explore alternatives, like falling back to paged attention during prefill if needed.

@felixstander

@tgaddair I'm testing with Qwen-14B-Chat-Int4 (GPTQ) on an RTX 3090 right now.
My launch parameters are as follows:
lorax-launcher --model-id /root/autodl-tmp/Qwen-14B-Chat-Int4 --quantize gptq --trust-remote-code --port 6006
But I got the following error:
2023-12-07T06:58:35.659747Z INFO lorax_launcher: server.py:263 Server started at unix:///tmp/lorax-server-0

2023-12-07T06:58:35.748970Z INFO shard-manager: lorax_launcher: Shard ready in 8.511468629s rank=0
2023-12-07T06:58:35.843122Z INFO lorax_launcher: Starting Webserver
2023-12-07T06:58:35.855480Z WARN lorax_router: router/src/main.rs:169: Could not find a fast tokenizer implementation for /root/autodl-tmp/Qwen-14B-Chat-Int4
2023-12-07T06:58:35.855561Z WARN lorax_router: router/src/main.rs:172: Rust input length validation and truncation is disabled
2023-12-07T06:58:35.855586Z WARN lorax_router: router/src/main.rs:197: no pipeline tag found for model /root/autodl-tmp/Qwen-14B-Chat-Int4
2023-12-07T06:58:35.876373Z INFO lorax_router: router/src/main.rs:216: Warming up model
2023-12-07T06:58:37.188252Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
File "/root/lorax/server/lorax_server/models/flash_causal_lm.py", line 843, in warmup
_, batch = self.generate_token(batch)
File "/root/miniconda3/envs/lorax/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/root/lorax/server/lorax_server/models/flash_causal_lm.py", line 939, in generate_token
raise e
File "/root/lorax/server/lorax_server/models/flash_causal_lm.py", line 936, in generate_token
out = self.forward(batch, adapter_data)
File "/root/lorax/server/lorax_server/models/flash_causal_lm.py", line 895, in forward
return self.model.forward(
File "/root/lorax/server/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 471, in forward
hidden_states = self.transformer(
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/lorax/server/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 427, in forward
hidden_states, residual = layer(
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/lorax/server/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 352, in forward
attn_output = self.attn(
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/lorax/server/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 235, in forward
paged_attn.reshape_and_cache(
File "/root/lorax/server/lorax_server/utils/paged_attn.py", line 23, in reshape_and_cache
cache_ops.reshape_and_cache(key, value, key_cache, value_cache, slot_mapping)
RuntimeError: expected scalar type Int but found Long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/root/miniconda3/envs/lorax/bin/lorax-server", line 8, in
sys.exit(app())
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/typer/main.py", line 311, in call
return get_command(self)(*args, **kwargs)
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/click/core.py", line 1130, in call
return self.main(*args, **kwargs)
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/root/lorax/server/lorax_server/cli.py", line 83, in serve
server.serve(
File "/root/lorax/server/lorax_server/server.py", line 271, in serve
asyncio.run(
File "/root/miniconda3/envs/lorax/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/root/miniconda3/envs/lorax/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/root/miniconda3/envs/lorax/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/root/miniconda3/envs/lorax/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/root/miniconda3/envs/lorax/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
return await self.intercept(

File "/root/lorax/server/lorax_server/interceptor.py", line 38, in intercept
return await response
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/root/lorax/server/lorax_server/server.py", line 74, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/root/lorax/server/lorax_server/models/flash_causal_lm.py", line 845, in warmup
raise RuntimeError(
RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens

2023-12-07T06:58:37.188701Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens"))
2023-12-07T06:58:37.245694Z ERROR lorax_launcher: Webserver Crashed
2023-12-07T06:58:37.245719Z INFO lorax_launcher: Shutting down shards
2023-12-07T06:58:37.510784Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed

@felixstander

Setting max_prefill_tokens to the same value as max_input_length isn't working either.
2023-12-07T07:09:11.100552Z ERROR warmup{max_input_length=1024 max_prefill_tokens=1024}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens"))
2023-12-07T07:09:11.154507Z ERROR lorax_launcher: Webserver Crashed
2023-12-07T07:09:11.154534Z INFO lorax_launcher: Shutting down shards
2023-12-07T07:09:11.441994Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed

@felixstander

Even after reducing max-input-length and max-batch-prefill-tokens down to 100 tokens, it still reports the Not Enough Memory error, and I noticed that GPU memory utilization jumps to 100% right before it crashes.

@KrisWongz (Author) commented Dec 7, 2023

> Thanks @felixstander!
>
> @KrisWongz it looks like the model weights .bin file is trying to execute some code on deserialization. I wasn't able to repro this using the base model from Huggingface here: https://huggingface.co/jxy/Tongyi-Finance-14B-Chat.
>
> This is one of the problems with pickle, though: it can do unpredictable things like this. Can you try converting the weights to safetensors format and trying again?

Thanks a lot! I successfully ran Qwen with multi-LoRA. My solution was to convert the Qwen weights to .safetensors locally.

But there is a small problem I have been dealing with for a while: when I run inference with Qwen, generation never stops until it hits the maximum length limit. I suspect this is caused by stop words.
In generate(), I found a stop=[] parameter, but I got an error when I tried to pass it.

client.generate(prompt, max_new_tokens=32,temperature=0.7,stop=["<|im_end|>"]).generated_text

error:

Traceback (most recent call last):
File "/home/shaohongen/Temp/WZ_test/lorax/test_lorax_qwen.py", line 20, in
print(client.generate(prompt, max_new_tokens=32,temperature=0.7,stop=["<|im_end|>"]).generated_text)
TypeError: generate() got an unexpected keyword argument 'stop'

But I tried other parameters successfully, except 'stop':

generate{parameters=GenerateParameters { adapter_id: None, adapter_source: None, best_of: None, temperature: Some(0.7), repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: 32, return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None }

By the way, do_sample=True makes the output better, though not every time.

@tgaddair (Contributor) commented Dec 8, 2023

Hey @KrisWongz, can you try using the param stop_sequences instead of stop?

Example:

client.generate(prompt, max_new_tokens=32, temperature=0.7, stop_sequences=["<|im_end|>"]).generated_text

@tgaddair (Contributor) commented Dec 8, 2023

Hey @felixstander, it looks like the error about decreasing the max batch size is misleading; the actual error here is:

File "/root/lorax/server/lorax_server/utils/paged_attn.py", line 23, in reshape_and_cache
cache_ops.reshape_and_cache(key, value, key_cache, value_cache, slot_mapping)
RuntimeError: expected scalar type Int but found Long

Let me see if I can reproduce this error on my side.
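For context, the mismatch is between the 64-bit (Long) slot_mapping tensor the batch builds and the 32-bit (Int) indices the cache kernel apparently expects. A hypothetical sketch of the kind of cast involved (illustration only, not LoRAX code):

import torch

slot_mapping = torch.arange(16, dtype=torch.int64)  # what the batch provides (Long)
slot_mapping = slot_mapping.to(torch.int32)         # what the kernel expects (Int)
# In lorax_server/utils/paged_attn.py such a cast would sit right before the
# cache_ops.reshape_and_cache(...) call shown in the traceback above.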

@tgaddair (Contributor) commented Dec 8, 2023

Hey @felixstander, I wasn't able to reproduce the error using the most recent Docker image. Can you try pulling the latest image and running again? There were a couple of recent GPT-Q changes that may have fixed this issue.

@KrisWongz (Author)

> Hey @KrisWongz, can you try using the param stop_sequences instead of stop?
>
> Example:
>
> client.generate(prompt, max_new_tokens=32, temperature=0.7, stop_sequences=["<|im_end|>"]).generated_text

It works on the base model, but it seems to have no effect after adding the LoRA adapter. It may be a problem with the prompt template settings I used when fine-tuning, since it runs successfully under Qwen's original web_demo.
I'll keep trying, thanks for your help!

@thincal (Contributor) commented Feb 26, 2024

Any plan for ChatGLM model support? Thanks.

@tgaddair (Contributor)

Hey @thincal, we can definitely add ChatGLM support. I can create a separate issue to track that.

@tgaddair (Contributor)

#280

@thincal (Contributor) commented Mar 4, 2024

@tgaddair it seems that the Qwen model type is qwen2 now, so which version does the current LoRAX implementation support?
Ref: https://modelscope.cn/models/qwen/Qwen1.5-14B-Chat-AWQ/file/view/master/config.json
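For what it's worth, the architecture a checkpoint declares can be checked from its config.json (LoRAX appears to dispatch on this model_type field, so qwen and qwen2 are treated as distinct model types; the path below is a placeholder):

import json

# Placeholder path; inspect the checkpoint's declared architecture.
with open("/path/to/Qwen1.5-14B-Chat-AWQ/config.json") as f:
    config = json.load(f)
print(config["model_type"])  # original Qwen reports "qwen"; Qwen1.5 reports "qwen2"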
