
GPTQ uint4 quantization broken #207

Open
2 of 4 tasks
endomorphosis opened this issue Aug 11, 2024 · 2 comments

Comments

@endomorphosis

System Info

root@laion-gaudi2-00:/home/sdp# docker run -p 8081:80 -v $volume:/data --runtime=habana -e HUGGING_FACE_HUB_TOKEN=$hf_token -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=2,3,4,5,6,7 -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi_gaudi --model-id $model --sharded true --num-shard 5 --max-input-tokens 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 8242
2024-08-11T22:11:55.546868Z INFO text_generation_launcher: Args {
model_id: "hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4",
revision: None,
validation_workers: 2,
sharded: Some(
true,
),
num_shard: Some(
5,
),
quantize: None,
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: Some(
4096,
),
max_input_length: None,
max_total_tokens: Some(
8192,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: Some(
8242,
),
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "0d708a2172ae",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/data",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
}
2024-08-11T22:11:55.546942Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_0.so.3 is not a symbolic link
/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc.so.2 is not a symbolic link
/sbin/ldconfig.real: /usr/local/lib/libtbbbind.so.3 is not a symbolic link
/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc_proxy.so.2 is not a symbolic link
/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_5.so.3 is not a symbolic link
/sbin/ldconfig.real: /usr/local/lib/libtbb.so.12 is not a symbolic link

2024-08-11T22:11:55.740778Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-08-11T22:11:55.740792Z INFO text_generation_launcher: Sharding model on 5 processes
2024-08-11T22:11:55.740878Z INFO download: text_generation_launcher: Starting download process.
2024-08-11T22:11:58.162183Z INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-08-11T22:11:58.543915Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-08-11T22:11:58.544105Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-08-11T22:12:03.786121Z INFO text_generation_launcher: CLI SHARDED = True DTYPE = bfloat16

2024-08-11T22:12:03.786323Z INFO text_generation_launcher: CLI SHARDED = 5

2024-08-11T22:12:03.786401Z INFO text_generation_launcher: CLI server start deepspeed =deepspeed --num_nodes 1 --num_gpus 5 --no_local_rank /usr/local/lib/python3.10/dist-packages/text_generation_server/tgi_service.py --model_id hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 --revision None --sharded True --dtype bfloat16 --trust_remote_code False --uds_path /tmp/text-generation-server

2024-08-11T22:12:08.551807Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-11T22:12:18.560690Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-11T22:12:28.568664Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-11T22:12:38.578065Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-11T22:12:43.403761Z ERROR text_generation_launcher: deepspeed --num_nodes 1 --num_gpus 5 --no_local_rank /usr/local/lib/python3.10/dist-packages/text_generation_server/tgi_service.py --model_id hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 --revision None --sharded True --dtype bfloat16 --trust_remote_code False --uds_path /tmp/text-generation-server exited with status = 1

2024-08-11T22:12:44.082499Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

[WARNING|utils.py:212] 2024-08-11 22:12:02,456 >> optimum-habana v1.13.0.dev0 has been validated for SynapseAI v1.17.0 but habana-frameworks v1.16.0.526 was found, this could lead to undefined behavior!
[WARNING|utils.py:225] 2024-08-11 22:12:03,010 >> optimum-habana v1.13.0.dev0 has been validated for SynapseAI v1.17.0 but the driver version is v1.16.2, this could lead to undefined behavior!
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
[WARNING|utils.py:212] 2024-08-11 22:12:19,432 >> optimum-habana v1.13.0.dev0 has been validated for SynapseAI v1.17.0 but habana-frameworks v1.16.0.526 was found, this could lead to undefined behavior!
[WARNING|utils.py:212] 2024-08-11 22:12:19,435 >> optimum-habana v1.13.0.dev0 has been validated for SynapseAI v1.17.0 but habana-frameworks v1.16.0.526 was found, this could lead to undefined behavior!
[WARNING|utils.py:212] 2024-08-11 22:12:19,437 >> optimum-habana v1.13.0.dev0 has been validated for SynapseAI v1.17.0 but habana-frameworks v1.16.0.526 was found, this could lead to undefined behavior!
[WARNING|utils.py:212] 2024-08-11 22:12:19,437 >> optimum-habana v1.13.0.dev0 has been validated for SynapseAI v1.17.0 but habana-frameworks v1.16.0.526 was found, this could lead to undefined behavior!
[WARNING|utils.py:212] 2024-08-11 22:12:19,440 >> optimum-habana v1.13.0.dev0 has been validated for SynapseAI v1.17.0 but habana-frameworks v1.16.0.526 was found, this could lead to undefined behavior!
[WARNING|utils.py:225] 2024-08-11 22:12:20,124 >> optimum-habana v1.13.0.dev0 has been validated for SynapseAI v1.17.0 but the driver version is v1.16.2, this could lead to undefined behavior!
[WARNING|utils.py:225] 2024-08-11 22:12:20,588 >> optimum-habana v1.13.0.dev0 has been validated for SynapseAI v1.17.0 but the driver version is v1.16.2, this could lead to undefined behavior!
[WARNING|utils.py:225] 2024-08-11 22:12:20,747 >> optimum-habana v1.13.0.dev0 has been validated for SynapseAI v1.17.0 but the driver version is v1.16.2, this could lead to undefined behavior!
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
2024-08-11 22:12:20.920 | INFO | __main__:main:10 - TGIService: starting tgi service ....
2024-08-11 22:12:20.921 | INFO | __main__:main:11 - TGIService: --model_id hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4, --revision None, --sharded True, --speculate None, --dtype bfloat16, --trust_remote_code True, --uds_path /tmp/text-generation-server
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
2024-08-11 22:12:21.398 | INFO | __main__:main:10 - TGIService: starting tgi service ....
2024-08-11 22:12:21.399 | INFO | __main__:main:11 - TGIService: --model_id hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4, --revision None, --sharded True, --speculate None, --dtype bfloat16, --trust_remote_code True, --uds_path /tmp/text-generation-server
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
2024-08-11 22:12:21.561 | INFO | __main__:main:10 - TGIService: starting tgi service ....
2024-08-11 22:12:21.561 | INFO | __main__:main:11 - TGIService: --model_id hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4, --revision None, --sharded True, --speculate None, --dtype bfloat16, --trust_remote_code True, --uds_path /tmp/text-generation-server
[WARNING|utils.py:225] 2024-08-11 22:12:21,580 >> optimum-habana v1.13.0.dev0 has been validated for SynapseAI v1.17.0 but the driver version is v1.16.2, this could lead to undefined behavior!
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
[WARNING|utils.py:225] 2024-08-11 22:12:21,759 >> optimum-habana v1.13.0.dev0 has been validated for SynapseAI v1.17.0 but the driver version is v1.16.2, this could lead to undefined behavior!
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
2024-08-11 22:12:22.487 | INFO | __main__:main:10 - TGIService: starting tgi service ....
2024-08-11 22:12:22.487 | INFO | __main__:main:11 - TGIService: --model_id hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4, --revision None, --sharded True, --speculate None, --dtype bfloat16, --trust_remote_code True, --uds_path /tmp/text-generation-server
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
2024-08-11 22:12:22.611 | INFO | __main__:main:10 - TGIService: starting tgi service ....
2024-08-11 22:12:22.612 | INFO | __main__:main:11 - TGIService: --model_id hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4, --revision None, --sharded True, --speculate None, --dtype bfloat16, --trust_remote_code True, --uds_path /tmp/text-generation-server
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 0
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM : 1056375276 KB

Loading 9 checkpoint shards: 0%| | 0/9 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/tgi_service.py", line 37, in <module>
main(args)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/tgi_service.py", line 16, in main
server.serve(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 223, in serve
asyncio.run(
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 189, in serve_inner
model = get_model(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/__init__.py", line 101, in get_model
return CausalLM(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 629, in __init__
model = self.get_deepspeed_model(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 783, in get_deepspeed_model
model = deepspeed.init_inference(model, **ds_inference_kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 340, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 165, in __init__
self._apply_injection_policy(config, client_module)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 412, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 345, in replace_transformer_layer
replaced_module = replace_module(model=model,
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 603, in replace_module
replaced_module, _ = _replace_module(model, policy, state_dict=sd)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 663, in _replace_module
_, layer_id = _replace_module(child,
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 663, in _replace_module
_, layer_id = _replace_module(child,
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 639, in _replace_module
replaced_module = policies[child.__class__][0](child,
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 316, in replace_fn
new_module = replace_wo_policy(child, _policy, prefix=prefix, state_dict=state_dict)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 299, in replace_wo_policy
return _autotp._replace_module(module)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/auto_tp.py", line 464, in _replace_module
self._replace_module(child, name, class_name)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/auto_tp.py", line 443, in _replace_module
Loading.load(child, self.state_dict, checking_key, self.mp_group)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/auto_tp.py", line 161, in load
module.weight = mp_replace.copy(module.weight.data, state_dict[prefix + 'weight'])
KeyError: 'model.layers.3.self_attn.q_proj.weight'
[The remaining four shard processes fail with the same traceback, interleaved in the log; each ends in KeyError: 'model.layers.3.self_attn.q_proj.weight']
Loading 9 checkpoint shards: 0%| | 0/9 [00:00<?, ?it/s] rank=0
2024-08-11T22:12:44.083532Z ERROR text_generation_launcher: Shard 0 failed to start
2024-08-11T22:12:44.083544Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
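
The failing call is visible at the bottom of the traceback: DeepSpeed's AutoTP loader does state_dict[prefix + 'weight'], but a GPTQ checkpoint stores the quantized projections as packed tensors (typically qweight, qzeros, scales and g_idx) rather than a plain weight, and this run also never passed --quantize gptq (the launcher logged quantize: None). A minimal sketch, assuming the model is cached under the mounted /data volume, to confirm what the checkpoint actually contains; the snapshot path, revision placeholder, and shard file name below are assumptions, adjust them to your cache layout:

# Hypothetical check: list the tensor names for the layer that raised the
# KeyError. If q_proj only exposes qweight/qzeros/scales/g_idx, any loader
# that assumes dense weights will fail exactly as above.
python3 - <<'EOF'
from safetensors import safe_open

# Assumed path under the mounted /data HF cache; fill in the real revision.
path = ("/data/models--hugging-quants--Meta-Llama-3.1-70B-Instruct-GPTQ-INT4"
        "/snapshots/<revision>/model-00001-of-00009.safetensors")
with safe_open(path, framework="pt") as f:
    for key in f.keys():
        if "layers.3.self_attn.q_proj" in key:
            print(key)
EOF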

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

docker run -p 8081:80 -v $volume:/data --runtime=habana -e HUGGING_FACE_HUB_TOKEN=$hf_token -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=2,3,4,5,6,7 -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi_gaudi --model-id $model --sharded true --num-shard 5 --max-input-tokens 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 8242

Expected behavior

It should run.

@yuanwu2017
Collaborator

Wait for the next release. It should work.
#217

@yuanwu2017
Collaborator

2.0.6 is released. TGI-gaudi can now support GPTQ uint4.
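
For anyone landing here, a sketch of the relaunch on the 2.0.6 release; the image tag below is an assumption (substitute whatever 2.0.6 build you pull or build locally), and unlike the failing run it passes --quantize gptq, which the launcher above reported as quantize: None:

# Assumed image tag; the flags otherwise mirror the original reproduction.
docker run -p 8081:80 -v $volume:/data --runtime=habana \
  -e HUGGING_FACE_HUB_TOKEN=$hf_token \
  -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
  -e HABANA_VISIBLE_DEVICES=2,3,4,5,6,7 \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.6 \
  --model-id $model --quantize gptq \
  --sharded true --num-shard 5 \
  --max-input-tokens 4096 --max-total-tokens 8192 \
  --max-batch-prefill-tokens 8242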
