
Error loading Llama-2-70b gptq weights from local directory #728

Closed
@hmcp22

Description

System Info

Docker deployment version 0.9.4
Hardware: AWS g5.12xlarge

Information

  • [x] Docker
  • [ ] The CLI directly

Tasks

  • [x] An officially supported command
  • [ ] My own modifications

Reproduction

Running with docker-compose, using the following compose file:

version: "3.5"
services:
  text-generation-inference:
    image: ghcr.io/huggingface/text-generation-inference:0.9.4
    container_name: text-generation-inference
    entrypoint: text-generation-launcher
    restart: always
    stdin_open: true 
    tty: true 
    env_file:
      - tgi.env
    shm_size: '1gb'
    ports:
      - 8080:80
    volumes:
      - type: bind
        source: /home/ubuntu/efs/llm_downloads
        target: /llm_downloads
    deploy:
      resources:
        reservations:
          devices: 
            - driver: nvidia
              count: all
              # device_ids: ['0', '3']
              capabilities: [gpu]
networks:
  default:
    driver: bridge

and the following env variables in the tgi.env file:

MODEL_ID=/llm_downloads/TheBloke/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True
QUANTIZE=gptq
GPTQ_BITS=4
GPTQ_GROUPSIZE=128
SHARDED=true
NUM_SHARD=4
MAX_CONCURRENT_REQUESTS=128
MAX_BEST_OF=5
MAX_STOP_SEQUENCES=4 
MAX_INPUT_LENGTH=4000 
MAX_TOTAL_TOKENS=8192
WAITING_SERVED_RATIO=1.2 
MAX_BATCH_TOTAL_TOKENS=16000 
MAX_WAITING_TOKENS=20

MAX_BATCH_PREFILL_TOKENS=4096

HUGGINGFACE_HUB_CACHE=/llm_downloads/tgi_hf_cache

This gives the following error:

2023-07-28T13:20:04.621775Z  INFO text_generation_launcher: Args { model_id: "/llm_downloads/TheBloke/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True", revision: None, validation_workers: 2, sharded: Some(true), num_shard: Some(4), quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 5, max_stop_sequences: 4, max_input_length: 4000, max_total_tokens: 8192, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: Some(16000), max_waiting_tokens: 20, hostname: "fdd9e32f6611", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/llm_downloads/tgi_hf_cache"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-07-28T13:20:04.621814Z  INFO text_generation_launcher: Sharding model on 4 processes
2023-07-28T13:20:04.621894Z  INFO download: text_generation_launcher: Starting download process.
2023-07-28T13:20:06.114282Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-07-28T13:20:06.423909Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-07-28T13:20:06.424273Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-07-28T13:20:06.424931Z  INFO shard-manager: text_generation_launcher: Starting shard rank=3
2023-07-28T13:20:06.424404Z  INFO shard-manager: text_generation_launcher: Starting shard rank=2
2023-07-28T13:20:06.424931Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-07-28T13:20:11.906010Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 216, in _get_gptq_params
    bits = self.gptq_bits
AttributeError: 'Weights' object has no attribute 'gptq_bits'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 67, in __init__
    model = FlashLlamaForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 456, in __init__
    self.model = FlashLlamaModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in __init__
    [
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 395, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 331, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 204, in __init__
    self.query_key_value = _load_gqa(config, prefix, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 154, in _load_gqa
    weight = weights.get_multi_weights_col(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 133, in get_multi_weights_col
    bits, groupsize = self._get_gptq_params()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 219, in _get_gptq_params
    raise e
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 212, in _get_gptq_params
    bits = self.get_tensor("gptq_bits").item()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 65, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 52, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight gptq_bits does not exist

(The identical traceback is logged three more times, once for each of the remaining three shards.)

2023-07-28T13:20:12.431682Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

[W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).
You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(same RuntimeError: weight gptq_bits does not exist traceback as above)
 rank=3
2023-07-28T13:20:12.530175Z ERROR text_generation_launcher: Shard 3 failed to start
2023-07-28T13:20:12.530205Z  INFO text_generation_launcher: Shutting down shards
2023-07-28T13:20:12.555460Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=1
2023-07-28T13:20:12.555658Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=2
2023-07-28T13:20:12.614893Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
Error: ShardCannotStart

Expected behavior

I expect the model to load correctly.
I did a little digging into where the error happens: it is raised when the GPTQ config settings are loaded in the _get_gptq_params method in server/text_generation_server/utils/weights.py. I'm not entirely sure why these settings aren't picked up from the local directory, since the quantize_config.json file does exist there.
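From the traceback, it looks like 0.9.4 now expects the quantization parameters to be stored as scalar tensors named gptq_bits and gptq_groupsize inside the safetensors files themselves, rather than read from quantize_config.json. So one possible workaround might be to embed those two tensors into the checkpoint so the existing lookup succeeds. A rough, untested sketch (assuming a single model.safetensors shard; add_gptq_metadata is just an illustrative helper, not part of TGI):

    # Untested sketch: add the scalar metadata tensors that
    # Weights.get_tensor("gptq_bits") / get_tensor("gptq_groupsize") look for.
    import torch
    from safetensors.torch import load_file, save_file

    def add_gptq_metadata(path: str, bits: int = 4, groupsize: int = 128) -> None:
        tensors = load_file(path)  # load every tensor in the shard
        tensors["gptq_bits"] = torch.tensor(bits)
        tensors["gptq_groupsize"] = torch.tensor(groupsize)
        save_file(tensors, path)  # rewrite the shard in place

    add_gptq_metadata("model.safetensors")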
What I did instead was modify the _get_gptq_params method to fall back to reading these values from env variables when the tensor lookup errors (see below), as was the behavior before this last release. I rebuilt the image with this change and the model now loads successfully:

    def _get_gptq_params(self) -> Tuple[int, int]:
        try:
            # Preferred path: metadata tensors stored in the checkpoint itself.
            bits = self.get_tensor("gptq_bits").item()
            groupsize = self.get_tensor("gptq_groupsize").item()
        except (SafetensorError, RuntimeError) as e:
            try:
                # Attributes set on the Weights object by the loader, if any.
                bits = self.gptq_bits
                groupsize = self.gptq_groupsize
            except Exception:
                try:
                    # Last resort: the env variables used before this release.
                    import os

                    bits = int(os.getenv("GPTQ_BITS"))
                    groupsize = int(os.getenv("GPTQ_GROUPSIZE"))
                except Exception:
                    raise e

        return bits, groupsize
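Alternatively, since quantize_config.json ships alongside TheBloke's GPTQ checkpoints, the fallback could read the parameters from that file instead of from env variables. A sketch of that idea (assuming the AutoGPTQ key names bits and group_size; gptq_params_from_config is just illustrative):

    import json
    import os
    from typing import Tuple

    def gptq_params_from_config(model_dir: str) -> Tuple[int, int]:
        # quantize_config.json is written by AutoGPTQ next to the weights
        with open(os.path.join(model_dir, "quantize_config.json")) as f:
            cfg = json.load(f)
        return int(cfg["bits"]), int(cfg["group_size"])

Either approach would avoid needing GPTQ_BITS / GPTQ_GROUPSIZE in tgi.env for local models.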
