Error loading Llama-2-70b gptq weights from local directory #728
### System Info

Docker deployment, version 0.9.4
Hardware: AWS g5.12xlarge

### Information

- [x] Docker
- [ ] The CLI directly

### Tasks

- [x] An officially supported command
- [ ] My own modifications

### Reproduction
Running using docker-compose with the following compose file:
```yaml
version: "3.5"

services:
  text-generation-inference:
    image: ghcr.io/huggingface/text-generation-inference:0.9.4
    container_name: text-generation-inference
    entrypoint: text-generation-launcher
    restart: always
    stdin_open: true
    tty: true
    env_file:
      - tgi.env
    shm_size: '1gb'
    ports:
      - 8080:80
    volumes:
      - type: bind
        source: /home/ubuntu/efs/llm_downloads
        target: /llm_downloads
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              # device_ids: ['0', '3']
              capabilities: [gpu]

networks:
  default:
    driver: bridge
```
and the following env variables in the tgi.env file:
```
MODEL_ID=/llm_downloads/TheBloke/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True
QUANTIZE=gptq
GPTQ_BITS=4
GPTQ_GROUPSIZE=128
SHARDED=true
NUM_SHARD=4
MAX_CONCURRENT_REQUESTS=128
MAX_BEST_OF=5
MAX_STOP_SEQUENCES=4
MAX_INPUT_LENGTH=4000
MAX_TOTAL_TOKENS=8192
WAITING_SERVED_RATIO=1.2
MAX_BATCH_TOTAL_TOKENS=16000
MAX_WAITING_TOKENS=20
MAX_BATCH_PREFILL_TOKENS=4096
HUGGINGFACE_HUB_CACHE=/llm_downloads/tgi_hf_cache
```
This gives the following error:

```text
2023-07-28T13:20:04.621775Z INFO text_generation_launcher: Args { model_id: "/llm_downloads/TheBloke/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True", revision: None, validation_workers: 2, sharded: Some(true), num_shard: Some(4), quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 5, max_stop_sequences: 4, max_input_length: 4000, max_total_tokens: 8192, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: Some(16000), max_waiting_tokens: 20, hostname: "fdd9e32f6611", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/llm_downloads/tgi_hf_cache"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-07-28T13:20:04.621814Z INFO text_generation_launcher: Sharding model on 4 processes
2023-07-28T13:20:04.621894Z INFO download: text_generation_launcher: Starting download process.
2023-07-28T13:20:06.114282Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2023-07-28T13:20:06.423909Z INFO download: text_generation_launcher: Successfully downloaded weights.
2023-07-28T13:20:06.424273Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-07-28T13:20:06.424931Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
2023-07-28T13:20:06.424404Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
2023-07-28T13:20:06.424931Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-07-28T13:20:11.906010Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 216, in _get_gptq_params
    bits = self.gptq_bits
AttributeError: 'Weights' object has no attribute 'gptq_bits'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 67, in __init__
    model = FlashLlamaForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 456, in __init__
    self.model = FlashLlamaModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in __init__
    [
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 395, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 331, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 204, in __init__
    self.query_key_value = _load_gqa(config, prefix, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 154, in _load_gqa
    weight = weights.get_multi_weights_col(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 133, in get_multi_weights_col
    bits, groupsize = self._get_gptq_params()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 219, in _get_gptq_params
    raise e
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 212, in _get_gptq_params
    bits = self.get_tensor("gptq_bits").item()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 65, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 52, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight gptq_bits does not exist

[... the same "Error when initializing model" traceback is logged three more times, once for each of the remaining shards ...]

2023-07-28T13:20:12.431682Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
[W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).
You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
[... same traceback as above ...]
RuntimeError: weight gptq_bits does not exist
rank=3
2023-07-28T13:20:12.530175Z ERROR text_generation_launcher: Shard 3 failed to start
2023-07-28T13:20:12.530205Z INFO text_generation_launcher: Shutting down shards
2023-07-28T13:20:12.555460Z INFO shard-manager: text_generation_launcher: Shard terminated rank=1
2023-07-28T13:20:12.555658Z INFO shard-manager: text_generation_launcher: Shard terminated rank=2
2023-07-28T13:20:12.614893Z INFO shard-manager: text_generation_launcher: Shard terminated rank=0
Error: ShardCannotStart
```
### Expected behavior
Expect the model to load correctly.
I did a little digging into where the error happens: it occurs when TGI tries to load the GPTQ config settings in the `_get_gptq_params` method in `server/text_generation_server/utils/weights.py`. I'm not entirely sure why it doesn't pick these settings up from the local directory, since the `quantize_config.json` file does exist there.
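For context, a sketch of what I mean: AutoGPTQ-style checkpoints ship a `quantize_config.json` next to the weights, and it already contains the values the loader is looking for (field names `bits` and `group_size` are what I see in this model's file; other exports may differ). The directory path below is a temporary stand-in, not the real model directory:

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the local model directory
# (in the real setup: /llm_downloads/TheBloke/Llama-2-70B-chat-GPTQ-...)
model_dir = Path(tempfile.mkdtemp())

# A typical AutoGPTQ quantize_config.json for this model
(model_dir / "quantize_config.json").write_text(
    json.dumps({"bits": 4, "group_size": 128, "desc_act": True})
)

# The GPTQ parameters can be recovered straight from the file
config = json.loads((model_dir / "quantize_config.json").read_text())
bits, groupsize = int(config["bits"]), int(config["group_size"])
print(bits, groupsize)  # 4 128
```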
I modified the `_get_gptq_params` method to fall back to the `GPTQ_BITS`/`GPTQ_GROUPSIZE` environment variables when the lookups fail (see below), as was the behaviour before this last release. After rebuilding the image, this loads the model successfully:
```python
def _get_gptq_params(self) -> Tuple[int, int]:
    try:
        bits = self.get_tensor("gptq_bits").item()
        groupsize = self.get_tensor("gptq_groupsize").item()
    except (SafetensorError, RuntimeError) as e:
        try:
            bits = self.gptq_bits
            groupsize = self.gptq_groupsize
        except Exception:
            # Fall back to the GPTQ_BITS / GPTQ_GROUPSIZE env variables,
            # as earlier releases did
            try:
                import os

                bits = int(os.getenv("GPTQ_BITS"))
                groupsize = int(os.getenv("GPTQ_GROUPSIZE"))
            except Exception:
                raise e
    return bits, groupsize
```
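To show just the new branch in isolation, here is a standalone sketch of the env-variable fallback (the helper name is mine, for illustration only): when `GPTQ_BITS`/`GPTQ_GROUPSIZE` are set, as they are in the `tgi.env` above, their values are used; if they are missing or malformed, the original error is re-raised.

```python
import os


def gptq_params_from_env(original_error: Exception):
    """Read GPTQ bits/groupsize from the environment, else re-raise."""
    try:
        bits = int(os.environ["GPTQ_BITS"])
        groupsize = int(os.environ["GPTQ_GROUPSIZE"])
    except (KeyError, ValueError):
        # Nothing usable in the environment: surface the original error
        raise original_error
    return bits, groupsize


# Mirrors the settings in tgi.env
os.environ["GPTQ_BITS"] = "4"
os.environ["GPTQ_GROUPSIZE"] = "128"
params = gptq_params_from_env(RuntimeError("weight gptq_bits does not exist"))
print(params)  # (4, 128)
```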