
GPTQ Formats that work (and don't) #601

Closed
ssmi153 opened this issue Jul 13, 2023 · 30 comments
@ssmi153
Contributor

ssmi153 commented Jul 13, 2023

Now that we can load GPTQ files that haven't been quantized by TGI's quantization script, I thought I'd do a set of tests to see which formats work and which don't. I'm using https://huggingface.co/TheBloke/OpenOrca-Preview1-13B-GPTQ as an example set.

  1. The 'most compatible' format ([main] branch) doesn't work. This throws the following error: RuntimeError: weight model.layers.0.self_attn.q_proj.g_idx does not exist
  2. Fortunately, the other formats provided by TheBloke do seem to work. In particular: gptq-4bit-128g-actorder_True definitely loads correctly. To use this, you need to set the following environment variables: GPTQ_BITS = 4, GPTQ_GROUPSIZE = 128 (matching the groupsize of the quantized model). Additionally, you need to pass in REVISION = gptq-4bit-128g-actorder_True to pull the correct version of this model (rather than the default version which still doesn't work).

So overall this is great news - we can now load GPTQ files that other people have converted rather than relying on the inbuilt quantizer in TGI!

@TheBloke

TheBloke commented Jul 13, 2023

That's great to hear. I started adding those extra quant formats recently with software like TGI and ExLlama in mind.

To the developers of the TGI GPTQ code I'd like to ask: is there any chance you could add support for the quantize_config.json file? It's produced automatically by AutoGPTQ when making a quantisation, and I provide it with every one of my GPTQ files, even the ones made with GPTQ-for-LLaMa. It contains all the GPTQ parameters, and it could easily be used as a source for the GPTQ params, saving the user the need to set them manually via env vars.

Here's an example quantize_config.json:

{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": true,
  "sym": true,
  "true_sequential": true,
  "model_name_or_path": null,
  "model_file_base_name": null
}

Pretty self-explanatory. You'd just need to read bits and group_size from this file, which is found in the model folder, and it could work automatically.
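For illustration, here's a minimal sketch (not TGI's actual code) of what reading those two fields could look like, assuming the file sits next to the model weights:

```python
import json
from pathlib import Path

def read_quantize_config(model_dir: str):
    """Read bits and group_size from AutoGPTQ's quantize_config.json, if present."""
    path = Path(model_dir) / "quantize_config.json"
    if not path.exists():
        return None  # fall back to env vars / manual settings
    with path.open() as f:
        config = json.load(f)
    return config["bits"], config["group_size"]
```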

I'd be happy to PR a change myself, if someone confirms it would be merged.

@TheBloke

TheBloke commented Jul 13, 2023

PS. Now that I have confirmation that TGI works with the other formats, I will mention this fact in my READMEs.

And yes, maybe calling main the 'most compatible' branch is no longer correct in light of TGI. I called it that because using the GPTQ-for-LLaMa CUDA branch - which is what I use to make the GPTQ in main - used to ensure the GPTQ would work with every local UI (text-generation-webui, KoboldAI, etc.), including when partially offloaded to CPU. I briefly tried moving to AutoGPTQ for all quants a few weeks back and got complaints from some users that they could no longer CPU offload. Hence I stuck with GPTQ-for-LLaMa and regarded it as 'most compatible'.

Maybe I'll call main the 'old' format.

Either way, I will clarify in the README which ones work with TGI.

@kaleko

kaleko commented Jul 13, 2023

Similar issue for me, trying to get Vicuna 7B GPTQ models to run with TGI
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

So far I haven't gotten it to work. Looking forward to the README updates with more info about using these models with TGI, @TheBloke!

@TheBloke

TheBloke commented Jul 13, 2023

@kaleko I updated the Vicuna 7B v1.3 GPTQs yesterday. Download one of the models from the other branches, as listed under Provided Files. They will work if you manually set the GPTQ_BITS and GPTQ_GROUPSIZE as ssmi153 mentioned in the first post.

You can download from alternate branches using the REVISION parameter in TGI.

https://huggingface.co/TheBloke/vicuna-7B-v1.3-GPTQ
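If you'd rather pre-download a specific branch into the local cache than let TGI fetch it, something like this works (using huggingface_hub; just an illustration):

```python
from huggingface_hub import snapshot_download

# Pre-download a specific GPTQ branch into the local HF cache
# (the same cache TGI mounts at /data in the Docker examples in this thread).
snapshot_download(
    repo_id="TheBloke/vicuna-7B-v1.3-GPTQ",
    revision="gptq-4bit-128g-actorder_True",
)
```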


@Narsil
Collaborator

Narsil commented Jul 14, 2023

is there any chance you could add support for the quantize_config.json file?

This is actually much cleaner than the ENV variables I added. I'm more than happy to switch to it.
Is it correct to assume that for inference we can discard all config other than bits and groupsize?
I just want to avoid loading/running a model and outputting garbage if we can raise an error early.

I'm actually fine with aligning on and outputting such a config instead of putting the values in the weights. Wdyt @OlivierDehaene?
(If we can reduce the current split around GPTQ, it's all for the best.)

We also have some work in progress to use a better GPTQ kernel: #553, if that's interesting.

The reason for the missing g_idx is the use of AutoGPTQ, correct?

@ssmi153
Contributor Author

ssmi153 commented Jul 14, 2023

The AutoGPTQ quants from TheBloke seem to work actually. It's the ones that TheBloke is converting using GPTQ-for-Llama that are causing the missing g_idx error. Weirdly though, the GPTQ model file that I've created using GPTQ-for-Llama loads correctly, so there must be a difference somewhere in the way TheBloke is processing those files, or in the settings he's chosen for them.

@TheBloke, out of interest, how do your GPTQ-for-LLaMa conversion settings compare to this?:
%run -i 'GPTQ-for-LLaMa/llama.py' {INPUT_MODEL_FULL_NAME} wikitext2 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors {OUTPUT_MODEL_FULL_NAME_SAFETENSORS}
My conversion of a 33B Llama model with these settings works with the current TGI implementation. Maybe we can identify settings that differ on your side, which might allow us to isolate where the problem is.

@TheBloke

TheBloke commented Jul 14, 2023

My settings are effectively identical. The reason that my GPTQ-for-LLaMA quants aren't working is that they're using the "old" GPTQ format. There's never been any official GPTQ version naming, but some implementations call it "v1", versus the current "v2" format which has g_idx.

AutoGPTQ produces the new format and can load either format, but TGI uses an implementation that can only load the newer v2 format.
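As a rough illustration (not how TGI actually checks), for a single-file safetensors checkpoint you can tell the two layouts apart just by looking for g_idx tensors:

```python
from safetensors import safe_open

def is_v2_gptq(checkpoint_path: str) -> bool:
    """True if the checkpoint contains g_idx tensors (the newer 'v2' GPTQ layout)."""
    with safe_open(checkpoint_path, framework="pt") as f:
        return any(name.endswith(".g_idx") for name in f.keys())
```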

Going forward I will always be making multiple GPTQs available, nearly all of which will be made with AutoGPTQ and will work. For now I am still also providing an old GPTQ-for-LLaMa-produced version which doesn't work. I haven't decided yet if I will continue making that old-format version as well - I need to survey the current state of all the various UIs out there and confirm they can all load the v2 format.

I wanted to switch to making all quants with AutoGPTQ a while back, but as soon as I did I got complaints from users using KoboldAI that the newer-format models didn't work with CPU offload. So to avoid hassle (and because I lacked the time to test it on KoboldAI myself), I just went back to GPTQ-for-LLaMa for making Llama GPTQs.

That issue may even be fixed now, so I should re-evaluate that and hope to do so soon.

Over the last few days I've uploaded multiple options of AutoGPTQ-produced quants to 62 GPTQ repos. And all my non-Llama models (Falcon, MPT, Bloom, Starcoder) were already produced with AutoGPTQ, so they should already be TGI compatible. Most of the repos I haven't done are ones I don't plan to do, because they're old and superseded. E.g. I didn't do Vicuna v1.1, just v1.3.

My next step is to update my SuperHOT GPTQ repos also - that should happen this weekend.

@TheBloke

TheBloke commented Jul 14, 2023

is there any chance you could add support for the quantize_config.json file?

This is actually much cleaner than the ENV variables I added. I'm more than happy to switch to it. Is it correct to assume that for inference we can discard all config other than bits and groupsize? I just want to avoid loading/running a model and outputting garbage if we can raise an error early.

Fantastic, thanks!

The other parameter that sometimes matters is desc_act, which in GPTQ-for-LLaMa is called Act Order. With AutoGPTQ, it needs to know whether that's true or false, or else inference will produce gibberish.

But as you don't have an ENV var for it, I assume that your GPTQ code is able to auto-detect that somehow, so presumably it isn't required. It isn't required in ExLlama either, so that's further confirmation that it's possible to auto-detect it.

And yes the other params in the file can be ignored, they're irrelevant during inference.
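For what it's worth, here's one plausible way to auto-detect desc_act - a sketch only, not necessarily what TGI or ExLlama actually do. Without act-order, g_idx is just the trivial mapping i // group_size, so anything else implies act-order was used:

```python
import torch

def infer_desc_act(g_idx: torch.Tensor, group_size: int) -> bool:
    """Guess whether act-order (desc_act) was used, from the g_idx tensor.

    Without act-order, g_idx[i] == i // group_size for every input channel;
    any deviation from that trivial layout implies act-order.
    """
    trivial = torch.arange(g_idx.numel(), device=g_idx.device) // group_size
    return not torch.equal(g_idx.to(trivial.dtype), trivial)
```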

I'm actually fine with aligning on and outputting such a config instead of putting the values in the weights. Wdyt @OlivierDehaene? (If we can reduce the current split around GPTQ, it's all for the best.)

We also have some work in progress to use a better GPTQ kernel: #553, if that's interesting.

The reason for the missing g_idx is the use of AutoGPTQ, correct?

Sorry, I missed your earlier reply when I replied to ssmi's ping, else I'd have replied to you directly, but this is answered above: AutoGPTQ outputs g_idx and those files work fine; it's the old GPTQ-for-LLaMa CUDA version I have been using that doesn't output g_idx.

I would really love to stop using that old GPTQ-for-LLaMa code and will do as soon as I've confirmed there's no need to do so any more.

But either way, I'll always provide AutoGPTQ-produced GPTQs in future, which it's confirmed TGI can load OK.

We also have some work in progress to use a better GPTQ kernel: #553, if that's interesting.

Excellent! ExLlama's kernels are really amazing for performance and VRAM usage.

@TheBloke

I've just tried to use TGI with GPTQs for myself for the first time, using the Docker container on a Lambda Labs H100 system.

Running unquantised models works fine, so TGI itself seems to be OK.

But so far I can't load any GPTQ models, because the server keeps crashing. There are no logs to help me debug this - is there any way I can get more logs shown when using Docker?

Here's the full output of my attempt to run TheBloke/OpenOrca-Preview1-13B-GPTQ, which @ssmi153 said was working for them. I tried the gptq-4bit-128g-actorder_True branch.

ᐅ docker run --rm --name tgi --shm-size=1gb -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=128 --runtime=nvidia --gpus all -p 8080:8080/tcp -v /workspace/data:/data ghcr.io/huggingface/text-generation-inference:0.9.2 --model-id TheBloke/OpenOrca-Preview1-13B-GPTQ --revision gptq-4bit-128g-actorder_True --hostname 0.0.0.0 --port 8080 --max-concurrent-requests 20  --quantize gptq
2023-07-17T13:39:01.039097Z  INFO text_generation_launcher: Args { model_id: "TheBloke/OpenOrca-Preview1-13B-GPTQ", revision: Some("gptq-4bit-128g-actorder_True"), validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 20, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-17T13:39:01.039343Z  INFO text_generation_launcher: Starting download process.
2023-07-17T13:39:03.386135Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.

2023-07-17T13:39:03.844196Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-07-17T13:39:03.844670Z  INFO text_generation_launcher: Starting shard 0
2023-07-17T13:39:07.059936Z  WARN shard-manager: text_generation_launcher: We're not using custom kernels.
 rank=0
2023-07-17T13:39:13.066283Z  INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
 rank=0
2023-07-17T13:39:13.163861Z  INFO text_generation_launcher: Shard 0 ready in 9.317614619s
2023-07-17T13:39:13.252940Z  INFO text_generation_launcher: Starting Webserver
2023-07-17T13:39:13.527964Z  WARN text_generation_router: router/src/main.rs:165: Could not find a fast tokenizer implementation for TheBloke/OpenOrca-Preview1-13B-GPTQ
2023-07-17T13:39:13.528017Z  WARN text_generation_router: router/src/main.rs:168: Rust input length validation and truncation is disabled
2023-07-17T13:39:13.768428Z  INFO text_generation_router: router/src/main.rs:346: Serving revision 0b78faa0d35ea4386acafbf12dbbd6c014df25c0 of model TheBloke/OpenOrca-Preview1-13B-GPTQ
2023-07-17T13:39:13.778367Z  INFO text_generation_router: router/src/main.rs:212: Warming up model
2023-07-17T13:39:14.960712Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=16000}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
Error: Warmup(Generation("transport error"))
2023-07-17T13:39:15.060420Z ERROR text_generation_launcher: Webserver Crashed
2023-07-17T13:39:15.060477Z  INFO text_generation_launcher: Shutting down shards
2023-07-17T13:39:15.068091Z ERROR text_generation_launcher: Shard process was signaled to shutdown with signal 6
Error: WebserverFailed

I can see with nvtop that it does start loading the model, then it crashes with no further info.

Here's a log of me loading an unquantised model, just to show that works fine:

Log of loading unquantised model (lmsys/vicuna-33b-v1.3) without problems
ᐅ docker run --rm --name tgi --shm-size=1gb --runtime=nvidia --gpus all -p 8080:8080/tcp -v /workspace/data:/data ghcr.io/huggingface/text-generation-inference:0.9.2 --model-id lmsys/vicuna-33b-v1.3 --hostname 0.0.0.0 --port 8080 --max-concurrent-requests 20 --max-batch-total-tokens 8192
2023-07-17T13:41:07.153247Z  INFO text_generation_launcher: Args { model_id: "lmsys/vicuna-33b-v1.3", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 20, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 8192, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-17T13:41:07.153553Z  INFO text_generation_launcher: Starting download process.
2023-07-17T13:41:09.634012Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.

2023-07-17T13:41:10.059871Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-07-17T13:41:10.060361Z  INFO text_generation_launcher: Starting shard 0
2023-07-17T13:41:13.323937Z  WARN shard-manager: text_generation_launcher: We're not using custom kernels.
 rank=0
2023-07-17T13:41:20.082090Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-07-17T13:41:27.101627Z  INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
 rank=0
2023-07-17T13:41:27.195299Z  INFO text_generation_launcher: Shard 0 ready in 17.133052272s
2023-07-17T13:41:27.285728Z  INFO text_generation_launcher: Starting Webserver
2023-07-17T13:41:27.601753Z  WARN text_generation_router: router/src/main.rs:165: Could not find a fast tokenizer implementation for lmsys/vicuna-33b-v1.3
2023-07-17T13:41:27.601809Z  WARN text_generation_router: router/src/main.rs:168: Rust input length validation and truncation is disabled
2023-07-17T13:41:27.601825Z  WARN text_generation_router: router/src/main.rs:324: `--revision` is not set
2023-07-17T13:41:27.601833Z  WARN text_generation_router: router/src/main.rs:325: We strongly advise to set it to a known supported commit.
2023-07-17T13:41:27.858779Z  INFO text_generation_router: router/src/main.rs:346: Serving revision 7d7373f8b7c3ad92f7377562ad6a56938786faef of model lmsys/vicuna-33b-v1.3
2023-07-17T13:41:27.871117Z  INFO text_generation_router: router/src/main.rs:212: Warming up model
2023-07-17T13:41:30.834414Z  INFO text_generation_router: router/src/main.rs:221: Connected
2023-07-17T13:41:48.649128Z  INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=127.0.0.1:8080 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=curl/7.68.0 otel.kind=server trace_id=f52482293b2f3e1b4cf6122aaeb11ab4}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: 17, return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="740.908633ms" validation_time="137.399µs" queue_time="124.915µs" inference_time="740.647103ms" time_per_token="43.567476ms" seed="None"}: text_generation_router::server: router/src/server.rs:289: Success

Thanks in advance

@Narsil
Collaborator

Narsil commented Jul 17, 2023

I would really love to stop using that old GPTQ-for-LLaMa code and will do as soon as I've confirmed there's no need to do so any more.

You mean using https://github.com/PanQiWei/AutoGPTQ instead of https://github.com/qwopqwop200/GPTQ-for-LLaMa, correct?
I specifically used https://github.com/qwopqwop200/GPTQ-for-LLaMa because I found the code easier to reason about.

Notably, we don't need the modeling code at all from either lib.
I was able to refactor the code to load nothing on the CPU by default, and pull the weights layer by layer (onto CUDA, but it could be CPU) for quantization. This makes EVERY model from transformers work too (afaik at least; we're really just using AutoModelForCausalLM.from_pretrained).
That makes everything a bit easier to work with. We currently don't provide a way to select the data sent during quantization, but I'm not sure how much that really matters (it didn't seem to matter much for Llama-derived models).

All that to say that I'm hesitant to pull from either codebase, because they are quite large and we only use a very tiny fraction of the code. Trying to find common ground would be nice.

@OlivierDehaene
Member

@TheBloke, TGI seems to have issues with H100s; I'm not sure why yet.
Any chance you could test on another device? I was able to launch the model on 1xA10, for example.

You can also use ghcr.io/huggingface/text-generation-inference:sha-44acf72 with the env var LOG_LEVEL=info,text_generation_launcher=debug for more logs.

@TheBloke

TheBloke commented Jul 17, 2023

@Narsil Sorry, I didn't mean you should stop using GPTQ-for-LLaMa. I was just talking about my own quantising processes. I meant I would like to stop using the old CUDA fork of GPTQ-for-LLaMa for making new quants and uploading them to HF.

For a long time I have used an old CUDA fork for my 'main' branch GPTQs because there were UIs out there that couldn't support the g_idx GPTQ format in all scenarios. But that might not be the case any more, and I plan to check that.

If you don't even use the modelling code, then it sounds like you have a very lightweight implementation, so that's great.

@OlivierDehaene Thank you. And I just noticed there's already an issue posted for this, so I'll move there.

@TheBloke

TheBloke commented Jul 17, 2023

I just realised the H100 issue is already reported here: #613. I'll post there and stop derailing this thread, as it's obviously H100-specific.

(For completeness on this thread: it looks like it's because the PyTorch 2.0 in the Docker image doesn't support compute capability 9.0. If I make a new container with PyTorch 2.1 nightly, I think it will work.)

@ssmi153
Contributor Author

ssmi153 commented Jul 18, 2023

@TheBloke - I just did some benchmarking of TGI on Runpod instances using a range of GPU combinations (https://docs.google.com/spreadsheets/d/1Ph_GeybAtNVoTs7w4mkCfd7p1lGywsNhJf9z-8fTcUE/edit?usp=sharing - if you're interested). This benchmarking was designed to reflect how I would use TGI and may not be as robust as some of the more formal benchmarks. One thing to note when benchmarking using the Docker image is that I get (and it looks like you also got) a warning saying WARN shard-manager: text_generation_launcher: We're not using custom kernels. I think this is because, by default, the combination of Runpod + the Docker image doesn't have NVIDIA NVCC installed (it requires the developer version of the NVIDIA container toolkit rather than the standard one), so it can't build the custom kernels or vLLM for PagedAttention. I'm struggling to work out whether this warning is a red herring or not, as TGI is still impressively fast. If it is indeed correct, then there should be a further uplift in performance above and beyond what I've seen so far. I haven't yet created an issue about the warning because it felt like I was raising a million Runpod-related issues, but if other people are also seeing this (and on other platforms using the Docker image) then it might be worth exploring further.

@OlivierDehaene
Member

@ssmi153, this warning is a bit misleading. If you don't see import errors and your architecture is one of the optimized architectures (as displayed in the README), you are using flash and paged attention.
This warning only applies to BLOOM and non-flash NeoX.

@Ichigo3766

@ssmi153 Can you confirm the new Llama 2 GPTQ versions are working? I am getting an error with TheBloke's AutoGPTQ branch:
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

WizardCoder is also giving an error when trying to load the GPTQ version of it:

raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight gptq_bits does not exist

I am running the docker command with the variables you mentioned at the start, and there are models that do work, so I'm wondering why this one is throwing an error when it works fine in fp16.

@Ichigo3766

Ichigo3766 commented Jul 19, 2023

The fix seems to work for Llama 70B, but it is pretty slow. Thank you!

I was wondering if you got time to check WizardCoder. I'm still having issues loading the GPTQ version:

raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight gptq_bits does not exist

@fxmarty
Contributor

fxmarty commented Jul 20, 2023

@bloodsucker99 You need to pass the environment variable GPTQ_BITS (though I think gptq_bits and gptq_groupsize could be directly inferred from the shapes of qweights, qzeros, g_idx?)
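A rough sketch of that inference, assuming the usual GPTQ packing (qweight stores bits-bit values packed into int32 rows, qzeros has one row per group); this is not necessarily how TGI would implement it:

```python
import torch

def infer_gptq_params(qweight: torch.Tensor, qzeros: torch.Tensor, g_idx: torch.Tensor):
    """Infer (bits, groupsize) from packed GPTQ tensors.

    Assumed layout: qweight is (in_features * bits // 32, out_features) int32,
    qzeros is (in_features // groupsize, out_features * bits // 32) int32,
    g_idx is (in_features,).
    """
    in_features = g_idx.shape[0]
    bits = qweight.shape[0] * 32 // in_features
    groupsize = in_features // qzeros.shape[0]
    return bits, groupsize
```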

@TheBloke

Narsil was also looking into adding automatic support for quantize_config.json which would provide the bits and group_size without the user needing to specify any env vars.

But yeah, I think ExLlama can automatically detect the quantise config, so maybe neither env vars nor quantize_config.json is needed and it can just be auto-detected? That would be the ideal scenario.

Narsil pushed a commit that referenced this issue Jul 20, 2023
As per title & reported
#601 (comment)
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5

Test it:

```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```
&
```
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
    -H 'Content-Type: application/json'
```
@keelezibel

@TheBloke I am trying to run your quantized model TheBloke/Llama-2-70B-chat-GPTQ on a V100, but TGI is complaining that compute capability < 7.5 was detected. Is it that flash attention is not supported on V100 GPUs? Does that mean all GPTQ models can only run on Ampere or later cards as well?

@Narsil
Collaborator

Narsil commented Jul 21, 2023

Indeed flash isn't supported on V100, and sharding requires flash for llama.

@keelezibel

Indeed flash isn't supported on V100, and sharding requires flash for llama.

I tried to set the env var for sharding to false, but that doesn't work either. Does that mean I definitely have to load the full model on the V100? I remember I tried bitsandbytes and it worked, but it was relatively slow since there is offloading to RAM.

@keelezibel

@Narsil, don't mind if I clarify: this PR won't help with running GPTQ models on older models? It's just for reading the GPTQ config from the model folder?

@Narsil
Collaborator

Narsil commented Jul 21, 2023

older models?

What do you mean? All TheBloke models have this quantization configuration, no?

@keelezibel

keelezibel commented Jul 21, 2023

older models?

What do you mean? All TheBloke models have this quantization configuration, no?

Sorry, I meant that this PR won't help to resolve running GPTQ models on older GPU cards such as the V100, right?

@Ichigo3766

So the problem is in the flash santacoder modeling code: it's trying to use the bits values from the weights and not the environment values. Manually forcing the use of the env variables fixed the issue :)
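Something along these lines is presumably what the fix amounts to (a hypothetical helper, not the actual modeling code): read the values stored in the checkpoint if present, otherwise fall back to the GPTQ_BITS / GPTQ_GROUPSIZE env vars:

```python
import os

def get_gptq_params(weights):
    """Prefer gptq_bits/gptq_groupsize stored in the checkpoint, else use env vars.

    `weights.get_tensor` is a stand-in for whatever accessor the server uses;
    it is assumed to raise if the tensor is missing.
    """
    try:
        bits = weights.get_tensor("gptq_bits").item()
        groupsize = weights.get_tensor("gptq_groupsize").item()
    except Exception:
        bits = int(os.environ["GPTQ_BITS"])
        groupsize = int(os.environ["GPTQ_GROUPSIZE"])
    return bits, groupsize
```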

Narsil added a commit that referenced this issue Jul 25, 2023
…ariables. (#671)

- Current PR is not great because we're side-stepping the
  `Weights.__init__`, but Weights shouldn't require anything related
  to the config or the model_id, as it aims to be a simple wrapper
  over multi-file loading.
- Ideal solution would be to use something like Rust enum
  ```
  enum Quantize {
      Bitsandbytes(Bitsandbytes),
      GPTQ { bits: usize, groupsize: usize },
  }
  ```
  And passing that around during load. Unfortunately we don't
  have access to this, so for now, side-stepping seems easier.

- Re-enabling groupsize<0 with exllama (confirmed it works.)

Helps #601 

In next steps we should make sure our quantization script uses that
format and make it standard.


@taoari

taoari commented Jul 27, 2023

@TheBloke When I tried to run with TheBloke/Llama-2-7b-Chat-GPTQ, I got the following error:

warmup{max_input_length=4096 max_prefill_tokens=4096}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens"))

Actually, I am able to serve the official meta-llama/Llama-2-7b-chat-hf with --quantize bitsandbytes on a single T4 GPU. When I change the model to TheBloke/Llama-2-7b-Chat-GPTQ with --quantize gptq, I get the not-enough-memory error. Even when I changed --max-batch-prefill-tokens to 2048, this error still happens. Since the bitsandbytes-quantized version can be served on a single T4 GPU, the GPTQ-quantized version should be no problem - do you happen to know why?

@AIApprentice101

I have the same error when trying to load TheBloke/Llama-2-7b-Chat-GPTQ

@samos123

I'm hitting the same issue as @taoari when trying Llama 2 70B chat. The error message:

"level":"ERROR","message":"Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`","target":"text_generation_client","filename":"router/client/src/lib.rs","line_number":33,"span":{"name":"warmup"},"spans":[{"max_input_length":1024,"max_prefill_tokens":4096,"name":"warmup"},{"name":"warmup"}]}

This is my YAML manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: ghcr.io/huggingface/text-generation-inference:1.0.3
        resources:
          limits:
            nvidia.com/gpu: 2
        env:
        - name: MODEL_ID
          value: TheBloke/Llama-2-70B-chat-GPTQ
        - name: NUM_SHARD
          value: "2"
        - name: QUANTIZE
          value: gptq
        - name: GPTQ_BITS
          value: "4"
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Apr 30, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 5, 2024
tjluyao added a commit to mlsys-io/kv.run that referenced this issue Jul 7, 2024
Init

fix: cleanup

Add load testing

Refactored gRPC interface
Added validation logic

ValidationError was not correctly handled

Use axum

feat: Docker image

feat: Add AML deployment

Update aml deployment

feat: Improve error handling

feat: Add arguments to CLI

v0.1.0

fix(validation): Fix error messages

feat(router): Add max_waiting_tokens

Create LICENSE (#2)

feat(server): Use safetensors

Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>

feat(client): Simplify sharded logic

feat(server): Support bitsandbytes

feat(server): Support all AutoModelForCausalLM on a best effort basis

feat: Use json formatter by default in docker image

fix(models): Revert buggy support for AutoModel

feat(server): Support generic AutoModelForCausalLM

feat(server): Support AutoModelForSeq2SeqLM

feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard

feat(server): Improved doc

fix(server): Fix Transformers fork version

feat(server): Clarify CausalLMBatch concatenate method

feat(rust): Update to 1.65

fix(router): Fix HTTP status codes

fix(readme): Typo

fix(router): Handle tokenizer errors

feat(server): Support Galactica (#4)

fix(batching): Avoid theoretical hang in batcher loop (#5)

- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute

Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>

feat(server): Add model tests (#6)

fix(server): Only pad to multiple of 8 on GPUs

feat: Support stop sequences (#7)

feat: Return logprobs (#8)

feat(launcher): Add integration tests (#9)

fix(server): Fix stop sequences (#11)

fix(server): Check for device type correctly when determining initial padding (#16)

AFAIK there is no torch device type called "gpu".

fix(router): Include special tokens when tokenizing (#14)

There's currently a discrepancy in the tokenization between the router
and python server code. The latter includes special tokens but former
does not.

This results in a token count mismatch for seq2seq models such as mt0
where the tokenizer emits an EOS token at the end.

This in turn results in some unexpected/incorrect output, in particular
when batch concatenation is involved, because the python code uses the
input length passed from the router for each row.

As far as I can tell, it is better to include this token in the encoder
`input_ids`, so I guess it's best to just adjust on the router side.

feat(router): Add const parameters to validation logic  (#15)

I noticed some opportunity to collapse some of the logic, in case you
are interested.

fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13)

Fixes #12 in the easiest way I could think of.

feat(launcher): Log server stdout (#19)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

fix(server): Minor refactorization using new_zeros (#24)

- Fix some type hints, in particular base tokenizer class
- Make use of `tensor.new_zero/empty` methods
- Simplify env var string parsing in launcher

fix(router): Obey max batch size (#23)

feat(server): Support SantaCoder (#26)

fix(server): Fix position ids (#28)

feat(docker): Make the image compatible with api-inference (#29)

fix(docker): fix api-inference deployment (#30)

fix(router): fix api-inference deployment (#31)

fix(dockerfile): fix docker build (#32)

feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33)

feat(router): Remove second lock from batcher hot path (#27)

@njhill

feat: Support sampling seeding (#37)

Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>

feat: Add token streaming using ServerSideEvents support (#36)

Add token streaming using ServerSideEvents (SSE).

The signature of the SSE events is:

```rust
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

struct ErrorResponse {
    error: String,
}
```

Revert "feat: Add token streaming using ServerSideEvents support" (#40)

Reverts huggingface/text-generation-inference#36

fix(server): fix seeding on gpu (#42)

fix(server): fix seeding with multiple shards (#44)

feat: Add token streaming using ServerSideEvents support (#41)

fix(server): fix quantization for sharded models (#45)

feat(server): Support GPT-Neox (#39)

feat(ci): Docker build and push (#46)

feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48)

feat(server): support repetition penalty (#47)

feat(server): allow the server to use a local weight cache (#49)

fix(server): allow greedy repetition penalty (#51)

feat(router): use background task to manage request queue (#52)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

breaking(router): modify /generate API to only return generated text (#50)

@njhill, @yk FYI

generated_text was concatenated to the user prompt for legacy reason. We
want to remove this behaviour as we don't think it is useful and even
detrimonial to usability.

We also remove the unused Vec.

feat(router): refactor API and add openAPI schemas (#53)

feat(docs): Clarify installation steps (#54)

Adds some bits for first-time users (like me 😄 )

feat(ci): push to AML registry (#56)

fix(server): better handling of inference mode (#57)

V0.2.1 (#58)

feat(server): support t5 (#59)

fix(docker): increase shm size (#60)

fixed SSE naming (#61)

https://en.wikipedia.org/wiki/Server-sent_events

feat: add distributed tracing (#62)

feat: add safetensors conversion (#63)

feat(server): improve download logging (#66)

feat(launcher): add disable_custom_kernels arg (#67)

feat(router): add max_total_tokens and empty_input validation (#68)

closes #65

fix(launcher): copy current env vars to subprocesses (#70)

closes #69

feat(router): add prometheus metrics scrape endpoint (#71)

v0.3.0 (#72)

feat(router): add cors allow origin options (#73)

feat(server): enable hf-transfer (#76)

fix(server): remove position_ids from galactica forward (#82)

closes #80

feat(server): pre-allocate max attention mask (#75)

v0.3.1 (#84)

feat(server): add special token bool (#85)

fix(docs): fix openapi schema (#86)

fix(server): fix token_is_special (#87)

feat(router): add legacy route for api-inference support (#88)

feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89)

feat(router): add api-inference headers (#91)

feat(server): add logits watermark (#90)

feat(server): update to hf_transfer==0.1.2 (#93)

feat(ci): improve CI speed (#94)

fix(launcher): add router parameters to launcher (#95)

feat(server): fix transformers commit (#96)

v0.3.2 (#97)

fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100)

feat: allow local models (#101)

closes #99

feat: add supported models (#102)

feat(clients): Python client (#103)

fix(server): fix galactica batch (#106)

closes #105

feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107)

feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108)

fix(python-client): stream not set on the sync client (#109)

fix(server): fix index out of range for watermarking (#110)

feat: support typical sampling (#114)

closes #112

fix(server): do not warp prefill logits (#116)

feat(router): support left truncation (#115)

closes #111

feat(router): add best_of parameter (#117)

feat(python-client): add new parameters (#118)

v0.4.0 (#119)

feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122)

…ed models

fix(server): revert gpt-neox optims (#123)

fix(server): add position ids to neox (#126)

fix(server): use server tokenizer as gt (#128)

fix(python-client): relax dependencies (#129)

feat(python-client): add cookies to Client constructors and requests (#132)

I have a use case where we need to pass cookies (for auth reasons) to an
internally hosted server.

Note: I couldn't get the client tests to pass - do you need to have an
HF token?

```python
FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid
```

feat(ci): add ci paths (#134)

feat: Add note about NVIDIA drivers (#64)

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

feat(python-client): release v0.4.0 (#135)

feat(python-client): add CI (#136)

feat(server): flash neoX (#133)

fix(server): fix flash-neox scores warping (#137)

feat(server): cleanup flash neox loading (#139)

v0.4.1 (#140)

fix(server): Avoid using try/except to determine kind of AutoModel (#142)

feat(server): Add mypy-protobuf (#141)

Generates .pyi files for protobuf stubs which provide strong typing
information. Very helpful for IDE auto-completion, etc.

feat(server): clear cache on error (#143)

feat(server): reduce mlp and attn in one op for flash neox (#145)

feat: aws sagemaker compatible image (#147)

The only difference is that now it pushes to
registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:...
instead of
registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-...

---------

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

fix(ci): fix sagemaker action (#148)

feat(benchmark): tui based benchmarking tool (#149)

fix(server): fix flash neox rotary embeddings (#150)

v0.4.2 (#151)

v0.4.3 (#152)

feat(server): flash santacoder (#153)

docs(readme): provide link Logits Warper README (#154)

fix(server): fix escape characters in stop sequence (#155)

feat(docker): improve flash_attention caching (#160)

feat(launcher): allow disabling hf_transfer (#161)

fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162)

fix(router): use buckets for metrics histograms (#163)

feat(router): make router input validation optional (#164)

feat(server): add flash attention llama (#144)

feat(server): support OPT models (#55)

OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.

v0.5.0 (#168)

feat(server): optimize decode for sane tokenizers (#170)

feat(server): support sharded santacoder (#167)

fix(launcher): revert change on shard errors (#173)

fix(ci): fix CVE in github-slug-action (#174)

feat(ci): add image signing with cosign (#175)

feat(ci): add Trivy and scan docker image (#178)

feat(ci): use large runners (#179)

feat(ci): faster scanning (#180)

fix(ci): fix ci permissions (#181)

fea(dockerfile): better layer caching (#159)

fix(ci): fix cosign error (#183)

fix(docker): fix docker image (#184)

fix(docker): fix image (#185)

fix(docker): revert dockerfile changes (#186)

fix(docker): fix docker image dependencies (#187)

fix(router): fix truncation (#190)

closes #189

feat(python-client): get list of currently deployed tgi models using the inference API (#191)

feat(router): add info route (#196)

close #125

feat(server): support quantization for flash models (#200)

closes #197

feat(server): check cuda capability when importing flash models (#201)

close #198

fix(server): fix hf_transfer issue with private repos (#203)

fix(docker): remove unused dependencies (#205)

fix(router): add auth token to get model info (#207)

feat(router): add git sha to info route (#208)

feat(router): drop requests when client closes the channel (#202)

fix(ci): fix sha in docker image (#212)

feat(server): flash attention past key value optimizations (#213)

feat(router): add device and dtype info (#215)

fix(server): fix past key values logic (#216)

@njhill fyi

fix(server): cleanup new flash past_key_values logic (#217)

fix(server): fix flash causal (#218)

fix(server): fix flash causal (#219)

fix(server): fix flash batch filtering (#220)

misc: update to rust 1.69 (#221)

v0.6.0 (#222)

feat(server): reduce memory requirement (#214)

chore(server): update huggingface-hub (#227)

feat(router): use number of tokens in batch as input for dynamic batching (#226)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

feat(router): add endpoint info to /info route (#228)

chore(server): update safetensors version (#235)

fix(python-client): add auth headers to is supported requests (#234)

Starting some routing tests. (#233)

fix(benchmarking): fix benchmarking tool

chore(launcher): refactor logic (#242)

Hopefully it's cleaner

feat(router): add tests to validation (#237)

feat(router): new healthcheck that skips the queue (#244)

Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

fix(server): fix reshaping of bloom past_key_values in concatenate() (#252)

Introduced in #214

Fixes #249

fix(server): Small tidy of code from recent changes (#251)

remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter()

chore(server): update transformers (#250)

feat(server): add watermarking tests (#248)

feat(docker): add nvidia env vars (#255)

doc(launcher): add more docs to the `launcher` itself and link in the README (#257)

feat(benchmark): add support for private tokenizers (#262)

Adding docs on how dynamic batching works. (#258)

This PR starts the minimal possible amount of explanation I could think
of. It tries to explain how dynamic batching occurs, the interactions
with past key values and ignores the padding problem.

Maybe some drawings could help too but I kept it to text for now.

chore(github): add templates (#264)

fix(server): fix typo in tokenizers decode (#269)

closes #268

feat(server): support hf endpoint weight layout (#266)

fix(launcher): pass weights cache override to the download process (#274)

closes #273

fix(launcher): handle hub branches (#278)

fix(server): Removes the parallelism in file convertion (during download) (#275)

feat(launcher): Improve error message when download process fails. (#276)

fix(server): fix convert (#284)

chore: add `flash-attention` to docker ignore (#287)

included when building docker locally.
(Where the local dirs might have the flash-attention folder.)

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @

@OlivierDehaene OR @Narsil

 -->

fea(server): decrease convert RAM requirements (#286)

fix(dockerfile): fix nvidia env vars (#297)

Fixes #291

feat(router): Adding response schema for compat_generate (#292)

feat(docker): add benchmarking tool to docker image (#298)

fix(docker): fix docker build (#299)

feat(server): optim flash causal lm decode_token (#285)

fix(docker): fix nvidia env vars (#305)

fix(docker): remove nvidia require cuda env (#310)

feat(server): shard token decode (#303)

feat(server): use float16 (#304)

fix(docker): remove CUDA_VERSION

feat(server): use cuda graph in logits warping (#302)

fix(server): fix multinomial implem in Sampling

feat(server): GPTQ quantization (step1) (#277)

Changes only the type from `bool` to `Option<Enum>` pretty much
everywhere.
- Use `Optional[str]` in Python (easier to manage than importing type
everywhere). Except for the cli to get proper validation
- Updated all models to handle gracefully new values. (Error out if
unknown value, or gptq since not implemented).

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @

@OlivierDehaene OR @Narsil

 -->

chore(docker): use nvidia base image (#318)

fix(docker): remove quantize default

fix(docker): use ubuntu20.04

Hotfixes for santacoder/bigcode. (#294)

Hotfixes:

- Uses `model_type`=`gpt_bigcode` for more general usage.
- Hotfixes linked lm_head vs wte_embedding (safetensors file do not
contain the key, correctly when the file is sharded, where as pytorch
copies the tensor)

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @

@OlivierDehaene OR @Narsil

 -->

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

Lifting check_unitialized. (#325)

Lifting check_unitialized.

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @

@OlivierDehaene OR @Narsil

 -->

Removing dead variables. (#327)


feat(ci): custom gpu runners (#328)

Single place for TP layers + Dropout Layer Norm + FastLinear (#329)


feat: add snapshot testing (#282)

feat(integration-tests): improve comparison and health checks (#336)

fix(server): fix decode token (#334)

Fixes #333

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

fix: set MODEL_ID in sagemaker-entrypoint script (#343)

feat(server): Support BLOOMChat-176B (#348) (#351)

@njhill,
temporary workaround to be able to run our CI as secrets are not
available to runners run by external contributors. I will ask around to
see if there is a better way.

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

fix(server): fix init for flash causal lm (#352)

Fixes #347

fix(server): t5 cannot run in f16 (#356)

Fix #349

fix(ci): fix security group (#359)

Switch security group used for ci
(open outbound rules)

Signed-off-by: Raphael <oOraph@users.noreply.github.com>
Co-authored-by: Raphael <oOraph@users.noreply.github.com>

feat: add nightly load testing (#358)

chore(server): update requirements (#357)

Fixes #338

feat(server): support fp16 for t5 (#360)

Fixes #349

feat(server): do not use device_map auto on single GPU (#362)

feat(server): support trust_remote_code (#363)

feat(router): log input/output at debug level (#364)

@njhill FYI

v0.7.0 (#353)

feat: decrease IPC proto size (#367)

Closes #307 #308

feat(benchmarker): add summary tables (#368)

feat(server): support vectorized warpers in flash causal lm (#317)

Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>

Fix issue when loading AutoModelForSeq2SeqLM model (#370)

fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES

fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES

fix(server): fix quantization

feat(server): support RefinedWeb models (#379)

v0.8.0

increase health checks

feat(server): add retry on download (#384)

fix(server): fix bnb quantization for CausalLM models (#385)

v0.8.1

fix(server): fix has_position_ids (#395)

Fix #389

feat(server): remove trust_remote_code requirement for falcon models (#396)

feat(server): load santacoder/starcoder models with safetensors (#393)

Fix #366

v0.8.2

feat(sagemaker): add trust remote code to entrypoint (#394)

feat(launcher): parse oom signal (#404)

feat(server): only compute prefill logprobs when asked (#406)

Close #288

feat(server): batch tokenization for flash causal lm (#411)

chore: update openapi schema

feat(server): Rework model loading (#344)

Reworked the loading logic. The idea is to use cleaner loading code:

- Remove the need for `no_init_weights`
- Remove all the weird `bnb_linear`, `load_weights` and
`post_load_weights`.

New code layout:

- New class `Weights` in charge of loading the weights from
multiple files into appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code to know what kind of
sharding we need plus an eventual `all_reduce`. They do not inherit from
Linear, but contain some kind of Linear instead
- The contained linear can be FastLinear, BnbLinear or GPTQ Linear
- All modeling code is explicitly made for sharding; the process group is
just a no-op for non-sharded code (removes a lot of test cases)

![Screenshot from 2023-05-19
23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)
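For illustration, here is a minimal sketch (not the actual TGI class) of what such a `Weights`-style wrapper over multi-file safetensors loading can look like, assuming tensor names are unique across files:

```python
# Minimal sketch of a Weights-style wrapper: route tensor names to the
# safetensors file that contains them, then load on demand.
from pathlib import Path
from typing import Dict, List

import torch
from safetensors import safe_open


class Weights:
    def __init__(self, filenames: List[Path], device: torch.device, dtype: torch.dtype):
        routing: Dict[str, Path] = {}
        for filename in filenames:
            with safe_open(filename, framework="pt") as f:
                for name in f.keys():
                    routing[name] = filename
        self.routing = routing
        self.device = device
        self.dtype = dtype

    def get_tensor(self, name: str) -> torch.Tensor:
        filename = self.routing[name]
        with safe_open(filename, framework="pt") as f:
            tensor = f.get_tensor(name)
        return tensor.to(dtype=self.dtype, device=self.device)
```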

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>

feat(server): optimize dist ops (#434)

docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441)

It fixes a typo in the comment sections referencing the environment
variable `CUDA_VISIBLE_DEVICES`. The misspelling only appears in
comments; no occurrence was found in code logic that could lead to
undefined behaviour or bugs, so this PR does not modify any code logic.

fix(makefile): Fix typo and use POSIX comparison in the makefile (#443)

This PR fixes:
- The usage of a non-POSIX comparison which may fail depending on the
shell used (`=` always works, `==` only with bash)
- A typo in the env variable name displayed in the error message:
`BUILD_EXTENSION` instead of `BUILD_EXTENSIONS`

<!-- Remove if not applicable -->

Fixes #422

feat(server): pre-allocate past key values for flash causal LM (#412)

feat(router): add ngrok integration (#453)

feat(server): improve flash attention import errors (#465)

@lewtun, is this enough?

Closes #458
Closes #456

fix(server): fix warpers on CPU (#472)

Closes #471

fix(server): Fixing T5 in case the names are mixed up. (#475)

feat(server): Update convert logic. (#483)

Should be more robust to shared tensors (ok when using
      `from_pretrained`). But it forces us to add new checks in our loading
      code (since the chosen key to keep might be different from
      `transformers`).

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>

feat(server): Adding new ignore_rule for conversion. (#485)

fix(router): add timeout on flume sends (#488)

feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438)

Let's start discussing implementation.

- Need to expose the quantization scripts (either included here or add
doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).

Currently it means checking for quantization in every place we use
`get_{tensor|sharded}`.

My idea is to reintegrate as much as possible into `utils/layer.py` by
expanding `load_multi` to be a bit more generic.
This might require some thinking, but ultimately
`qweight,qzeros,scales,g_idx` should live in a single place,
independent of bias presence.
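For illustration, a hypothetical helper that gathers those four tensors for one quantized linear layer (the `weights` object and the `GPTQWeight` container below are assumptions made for the sketch, not TGI's actual API):

```python
# Hypothetical sketch: collect the GPTQ tensors that make up one quantized
# linear layer, e.g. prefix="model.layers.0.self_attn.q_proj".
from typing import NamedTuple

import torch


class GPTQWeight(NamedTuple):
    qweight: torch.Tensor  # packed quantized weights
    qzeros: torch.Tensor   # packed zero points
    scales: torch.Tensor   # per-group scales
    g_idx: torch.Tensor    # group index per input column


def load_gptq_linear(weights, prefix: str) -> GPTQWeight:
    return GPTQWeight(
        qweight=weights.get_tensor(f"{prefix}.qweight"),
        qzeros=weights.get_tensor(f"{prefix}.qzeros"),
        scales=weights.get_tensor(f"{prefix}.scales"),
        g_idx=weights.get_tensor(f"{prefix}.g_idx"),
    )
```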


---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

fix(server): Do not init process group if already initialized (#388)

feat(router): add header option to disable buffering for the generate_stream response (#498)

Problem: If a model is run behind a proxy server such as nginx that has
buffering enabled, then the response stream from generate_stream gets
aggregated into a single response, which basically disables streaming.
Instead of getting a chunked response where each token is presented over
time, the response presents everything all at once.

Solution: This change adds the `X-Accel-Buffering` http header which
disables buffering for the generate_stream response, allowing the
response to stream properly.
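TGI's router is written in Rust, but the idea is framework-agnostic; here is a minimal Python/FastAPI sketch of a streaming endpoint that sets the header (purely illustrative, not the actual router code):

```python
# Illustrative only: an SSE-style endpoint that sets `X-Accel-Buffering: no`
# so proxies such as nginx do not buffer the stream into one big response.
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def token_stream():
    for token in ["Hello", " ", "world"]:
        yield f"data: {token}\n\n"
        await asyncio.sleep(0.1)


@app.get("/generate_stream")
async def generate_stream():
    return StreamingResponse(
        token_stream(),
        media_type="text/event-stream",
        headers={"X-Accel-Buffering": "no"},  # disable proxy buffering
    )
```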

feat(server): add paged attention to flash models (#516)

Closes #478

feat(router): arg validation (#519)

feat: Add the option to force another dtype than `f16`. (#513)

fix(launcher): fix issue where launcher does not properly report shard failures (#522)

v0.9.0 (#525)

feat(server): Add Non flash MPT. (#514)

This adds a non-flash version of MPT.
Flash is harder because we need to create a bias-ready CUDA kernel for
flash attention.

Fixes
https://github.com/huggingface/text-generation-inference/issues/361
Fixes
https://github.com/huggingface/text-generation-inference/issues/491
Fixes
https://github.com/huggingface/text-generation-inference/issues/290

fix: Update server/Makefile to include Makefile-vllm (#520)

For consistency and ease of use (you can just run `make` to install vllm
without any extra steps).


docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462)

fix(server): Handle loading from local files for MPT (#534)

This PR allows the MPT model to be loaded from local files. Without this
change, an exception will be thrown by `hf_hub_download` function if
`model_id` is a local path.

fix(server): avoid errors for very small top_p values (#544)

See https://github.com/huggingface/transformers/pull/24111

I didn't add validation to the `__init__` method since it's not done for
other values/warpers.
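For context, the usual shape of such a fix is to make nucleus filtering always keep at least one token; a hedged sketch of the idea (not the exact transformers code):

```python
# Sketch: top-p (nucleus) filtering that always keeps the most probable token,
# so a very small top_p cannot leave an empty sampling set.
import torch


def top_p_filter(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
    cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)

    sorted_indices_to_remove = cumulative_probs > top_p
    # Shift right so the token that crosses the threshold is kept,
    # and always keep at least the most probable token.
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    indices_to_remove = sorted_indices_to_remove.scatter(-1, sorted_indices, sorted_indices_to_remove)
    return logits.masked_fill(indices_to_remove, -float("inf"))
```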

feat(server): use latest flash attention commit (#543)

@njhill FYI

feat(router): add argument for hostname in router (#545) (#550)

In title. Adds argument `--hostname` in router to support something like
`--hostname ::`. Tested with

```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health'  # failed before this commit
```

Trigger CI

---------

Co-authored-by: Phil Chen <philchen2000@gmail.com>

fix(server): decrease memory fragmentation (#557)

v0.9.1 (#558)

fix(server): harden the weights choice to save on disk. (#561)

- Look at the `transformers` base class to check for
  `_keys_to_ignore_on_load_missing` or `_tied_weights`, which are the
  standard attributes used to select the keys NOT to save on disk (since
  they are ignored)

- Modified safetensors code (to be reflected in safetensors even if it's
  an internal function).

- Will not work for trust_remote_code=True repos (like santacoder).

Should help with :
https://github.com/huggingface/text-generation-inference/issues/555
and : https://github.com/huggingface/text-generation-inference/pull/501
and https://github.com/huggingface/text-generation-inference/issues/556
and
https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593

feat: better errors for warmup and TP (#575)

Close #571

fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579)

Fixes #555

feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580)

Some models are already converted and do not have those values in the
file; this enables users to use them with less friction.

Went for a pure env-based approach because adding flags would end up
(imo) very tedious to maintain. There's a lot of sanitation to do: those
flags would be errors if not used in conjunction with `--quantize gptq`,
and the flags would need to exist in the launcher and the server and be
passed throughout all function calls.

This PR is intended as an easy escape hatch, not the de facto method to
use gptq in TGI.

Fixes #500
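A minimal sketch of the escape hatch described above (the variable names GPTQ_BITS / GPTQ_GROUPSIZE come from this PR; the surrounding function is hypothetical):

```python
# Sketch: read the GPTQ parameters from the environment when the converted
# checkpoint does not carry them itself.
import os
from typing import Tuple


def gptq_params_from_env() -> Tuple[int, int]:
    try:
        bits = int(os.environ["GPTQ_BITS"])
        groupsize = int(os.environ["GPTQ_GROUPSIZE"])
    except KeyError as e:
        raise RuntimeError(
            f"{e.args[0]} must be set when the GPTQ checkpoint does not "
            "contain gptq_bits / gptq_groupsize values"
        ) from e
    return bits, groupsize
```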

chore: migrate ci region for more availability. (#581)

fix(server): T5 weights names. (#582)

Fixes #541

fix(server): Adding logger import to t5_modeling.py (#585)

The logger is referenced when importing apex but is never imported
itself, causing a NameError.

fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590)

This fixes a typo and extends the GPTQ_BITS environment variable
passthrough to the second method, which requires the same logic. Please
let me know if there's anything I've misunderstood in this change.

Thanks @Narsil for the original fix.

feat(server): Implements sharding for non divisible `vocab_size`. (#583)

- The code is relatively easy (just disable the checks on Embedding and
Head)

This cannot be done in the same easy fashion for hidden_dim/head_dim.
It's relatively easy on some models (classic MHA) but it would make the
other models (MQA) much more complex, and GPTQ quantization another
quite hairy piece of code.
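A minimal sketch of what the relaxed check amounts to: slice the vocab dimension with a ceiling split so the last rank simply gets the remainder (illustrative, not TGI's actual layer code):

```python
# Sketch: shard an embedding / lm-head weight over TP ranks even when
# vocab_size is not divisible by world_size.
import torch


def shard_vocab(weight: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    vocab_size = weight.shape[0]
    block = (vocab_size + world_size - 1) // world_size  # ceil division
    start = rank * block
    stop = min(start + block, vocab_size)
    return weight[start:stop]
```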

feat(server): empty cache on errors

GPTQ Env vars: catch correct type of error (#596)

When passing in environment variables like gptq_bits, we still get
errors thrown from TGI because the try/catch block is catching the wrong
type of error. This PR aims to fix that.

@Narsil - let me know if this is how you want this formatted. My Python
is a little shaky, so I hope this syntax is correct.
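A hypothetical sketch of the intended pattern: fall back to the environment variables only for the specific "tensor not found" failure instead of a broad except (the exception class here is a stand-in, not whatever TGI's weights loader actually raises):

```python
# Hypothetical: read GPTQ params from the checkpoint if present, otherwise
# fall back to GPTQ_BITS / GPTQ_GROUPSIZE from the environment.
import os


class TensorNotFound(Exception):
    """Stand-in for the loader's 'missing tensor' error."""


def get_gptq_params(weights):
    try:
        bits = weights.get_tensor("gptq_bits").item()
        groupsize = weights.get_tensor("gptq_groupsize").item()
    except TensorNotFound:
        bits = int(os.environ["GPTQ_BITS"])
        groupsize = int(os.environ["GPTQ_GROUPSIZE"])
    return bits, groupsize
```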

feat(launcher): add arg validation and drop subprocess (#595)

feat(router): explicit warning if revision is not set (#608)

docs: README: Add logo + baseline (#611)

![image](https://github.com/huggingface/text-generation-inference/assets/3841370/58177321-479f-4ad1-b3bc-cec027423984)

fix(server): blacklist local files (#609)

Close #589 #602

v0.9.2 (#616)

fix(server): empty_cache when stopped

fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621)

fea(launcher): debug logs (#623)

feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587)

Should work on more configurations (no need for 2 GPUs, less RAM usage).

Still need to investigate the potential differences in quantization
results.


feat(server): flash attention v2 (#624)

feat(server): add support for llamav2 (#633)

v0.9.3 (#634)

fix(server): fix llamav2 config (#635)

feat(server): auto max_batch_total_tokens for flash att models (#630)

feat(router): ngrok edge (#642)

docs: Update README.md (#639)

docs: Update README.md (#643)

Add trust_remote_code to quantize script (#647)


Fixes a bug that appeared with MR #587 fixing issue #552.
See the discussion in #552.

With MR #587 the trust_remote_code variable is not passed to
AutoModelForCausalLM, although it is present in the function signature.
This prevents models like falcon from being quantized, because
trust_remote_code is required. This MR fixes the issue.


fix(server): llama v2 GPTQ (#648)

As per title & reported
https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5

Test it:

```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```
&
```
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
    -H 'Content-Type: application/json'
```

fix(server): Fixing non-parameters in the quantize script; `bigcode/starcoder` was an example. (#661)

fix(server): use mem_get_info to get kv cache size (#664)

Close
https://github.com/huggingface/text-generation-inference/issues/649
Close
https://github.com/huggingface/text-generation-inference/issues/651
Close
https://github.com/huggingface/text-generation-inference/issues/653
Close #636
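The gist, as a hedged sketch (the helper name and the memory fraction are illustrative, not TGI's actual values):

```python
# Sketch: size the paged-attention KV cache from the memory that is actually
# free on the device, instead of guessing from the total.
import torch


def free_kv_cache_blocks(block_bytes: int, memory_fraction: float = 0.9) -> int:
    free_bytes, _total_bytes = torch.cuda.mem_get_info()  # (free, total) in bytes
    usable = int(free_bytes * memory_fraction)
    return usable // block_bytes
```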

feat(server): Add exllama GPTQ CUDA kernel support #553 (#666)

Just trying to get the integration tests to pass.


---------

Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>

Directly load GPTBigCode to specified device (#618)

This PR directly loads GPTBigCode to the specified device, avoiding
moving the model between devices.


feat(server): add local prom and health routes if running w/ ngrok

feat: add cuda memory fraction (#659)

Close #673

fix(server): fix exllama buffers (#689)

Close #683

feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671)

- The current PR is not great because we're side-stepping
  `Weights.__init__`, but Weights shouldn't require anything related
  to the config or the model_id as it aims to be a simple wrapper
  over multi-file loading.
- The ideal solution would be to use something like a Rust enum
  ```rust
  enum Quantize {
      Bitsandbytes(Bitsandbytes),
      Gptq { bits: usize, groupsize: usize },
  }
  ```
  and pass that around during load. Unfortunately we don't
  have access to this, so for now, side-stepping seems easier.

- Re-enabling groupsize<0 with exllama (confirmed it works.)

Helps #601

In next steps we should make sure our quantization script uses that
format and make it standard.
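A minimal sketch of reading those two values from `quantize_config.json` (assuming the AutoGPTQ-style file layout; the helper itself is illustrative, not TGI's actual loader):

```python
# Sketch: read bits / group_size from quantize_config.json in a local model dir.
import json
from pathlib import Path
from typing import Tuple


def read_quantize_config(model_path: str) -> Tuple[int, int]:
    config = json.loads((Path(model_path) / "quantize_config.json").read_text())
    return int(config["bits"]), int(config["group_size"])
```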


docs(README): update readme

fix(server): fix quantization python requirements (#708)

fix(server): fix missing datasets in quantize

feat(server): support new falcon config (#712)

v0.9.4 (#713)

Add section about TGI on other AI hardware accelerators in README (#715)


As per title.


docs: Add hardware section to TOC in README (#721)

feat(server): update vllm version (#723)

chore: update license to HFOIL (#725)

v1.0.0 (#727)

Local gptq support. (#738)

Redoes #719


Fix typing in `Model.generate_token` (#733)

This PR fixes a minor type annotation issue in the signature of
`Model.generate_token`.

All existing overrides of `Model.generate_token` return
`Tuple[List[Generation], Optional[B]]`:

https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537

https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804

https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591

I suspect that back in 017a2a8c when `GeneratedText` and `Generation`
were separated, the function signature was not updated.
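Roughly, the corrected annotation looks like this (names simplified for illustration):

```python
# Simplified view of the corrected signature.
from typing import Generic, List, Optional, Tuple, TypeVar

B = TypeVar("B")  # batch type used by each model family


class Generation:
    """Stand-in for TGI's Generation class."""


class Model(Generic[B]):
    def generate_token(self, batch: B) -> Tuple[List[Generation], Optional[B]]:
        # Returns this step's generations plus the (possibly filtered) next
        # batch, or None when every request in the batch has finished.
        raise NotImplementedError
```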


CC @OlivierDehaene

Adding Rope scaling. (#741)

- Adds Rope NTK scaling.

Done because
https://github.com/huggingface/text-generation-inference/pull/529 was
closed. Took some code from
https://github.com/huggingface/transformers/pull/24653

- `--rope-scaling` and `--rope-factor` are added separately. I
considered having a single one and parsing something like "linear:4.0"
or "dynamic", but decided against it because it would push more
parsing+validation a bit everywhere (both in the launcher and the
server).

Fixes #512
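For reference, a hedged sketch of the two scaling modes (the formulas follow the common "linear" and "dynamic NTK" RoPE-scaling implementations, not TGI's kernels):

```python
# Sketch of linear vs dynamic-NTK RoPE scaling of the inverse frequencies.
import torch


def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))


def linear_scaled_inv_freq(dim: int, factor: float, base: float = 10000.0) -> torch.Tensor:
    # Linear scaling: positions are effectively divided by `factor`.
    return rope_inv_freq(dim, base) / factor


def dynamic_ntk_inv_freq(
    dim: int, factor: float, seq_len: int, max_position: int, base: float = 10000.0
) -> torch.Tensor:
    # Dynamic NTK scaling: grow the rotary base once the sequence exceeds the
    # original max_position; shorter sequences are left untouched.
    if seq_len <= max_position:
        return rope_inv_freq(dim, base)
    new_base = base * ((factor * seq_len / max_position) - (factor - 1)) ** (dim / (dim - 2))
    return rope_inv_freq(dim, new_base)
```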


chore: fix typo in mpt_modeling.py (#737)

Fixed typo.

implemetation -> implementation

tjluyao added a commit to mlsys-io/kv.run that referenced this issue Jul 7, 2024
Init

fix: cleanup

Add load testing

Refactored gRPC interface
Added validation logic

ValidationError was not correctly handled

Use axum

feat: Docker image

feat: Add AML deployment

Update aml deployment

feat: Improve error handling

feat: Add arguments to CLI

v0.1.0

fix(validation): Fix error messages

feat(router): Add max_waiting_tokens

Create LICENSE (#2)

feat(server): Use safetensors

Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>

feat(client): Simplify sharded logic

feat(server): Support bitsandbytes

feat(server): Support all AutoModelForCausalLM on a best effort basis

feat: Use json formatter by default in docker image

fix(models): Revert buggy support for AutoModel

feat(server): Support generic AutoModelForCausalLM

feat(server): Support AutoModelForSeq2SeqLM

feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard

feat(server): Improved doc

fix(server): Fix Transformers fork version

feat(server): Clarify CausalLMBatch concatenate method

feat(rust): Update to 1.65

fix(router): Fix HTTP status codes

fix(readme): Typo

fix(router): Handle tokenizer errors

feat(server): Support Galactica (#4)

fix(batching): Avoid theoretical hang in batcher loop (#5)

- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute

Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>

feat(server): Add model tests (#6)

fix(server): Only pad to multiple of 8 on GPUs

feat: Support stop sequences (#7)

feat: Return logprobs (#8)

feat(launcher): Add integration tests (#9)

fix(server): Fix stop sequences (#11)

fix(server): Check for device type correctly when determining initial padding (#16)

AFAIK there is no torch device type called "gpu".

fix(router): Include special tokens when tokenizing (#14)

There's currently a discrepancy in the tokenization between the router
and python server code. The latter includes special tokens but the
former does not.

This results in a token count mismatch for seq2seq models such as mt0
where the tokenizer emits an EOS token at the end.

This in turn results in some unexpected/incorrect output, in particular
when batch concatenation is involved, because the python code uses the
input length passed from the router for each row.

As far as I can tell, it is better to include this token in the encoder
`input_ids`, so I guess it's best to just adjust on the router side.
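A small sketch of the discrepancy (the model id is only an example):

```python
# With a seq2seq tokenizer such as mt0's, add_special_tokens=True appends the
# EOS token, so the router and server must agree on it when counting tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/mt0-small")  # example model
without = tokenizer.encode("Hello world", add_special_tokens=False)
with_special = tokenizer.encode("Hello world", add_special_tokens=True)
print(len(without), len(with_special))  # the second count includes the EOS token
```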

feat(router): Add const parameters to validation logic  (#15)

I noticed some opportunity to collapse some of the logic, in case you
are interested.

fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13)

Fixes #12 in the easiest way I could think of.
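The relevant `transformers` flag on `decode` is `clean_up_tokenization_spaces`; a small sketch (the model id is only an example):

```python
# Disabling tokenization-space cleanup keeps decode(encode(text)) lossless
# with respect to spacing around punctuation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example model
ids = tokenizer.encode("Hello , world !", add_special_tokens=False)
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))  # "Hello , world !"
print(tokenizer.decode(ids, clean_up_tokenization_spaces=True))   # may normalize to "Hello, world!"
```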

feat(launcher): Log server stdout (#19)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

fix(server): Minor refactorization using new_zeros (#24)

- Fix some type hints, in particular base tokenizer class
- Make use of `tensor.new_zero/empty` methods
- Simplify env var string parsing in launcher

fix(router): Obey max batch size (#23)

feat(server): Support SantaCoder (#26)

fix(server): Fix position ids (#28)

feat(docker): Make the image compatible with api-inference (#29)

fix(docker): fix api-inference deployment (#30)

fix(router): fix api-inference deployment (#31)

fix(dockerfile): fix docker build (#32)

feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33)

feat(router): Remove second lock from batcher hot path (#27)

@njhill

feat: Support sampling seeding (#37)

Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>

feat: Add token streaming using ServerSideEvents support (#36)

Add token streaming using ServerSideEvents (SSE).

The signature of the SSE events is:

```rust
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

struct ErrorResponse {
    error: String,
}
```

Revert "feat: Add token streaming using ServerSideEvents support" (#40)

Reverts huggingface/text-generation-inference#36

fix(server): fix seeding on gpu (#42)

fix(server): fix seeding with multiple shards (#44)

feat: Add token streaming using ServerSideEvents support (#41)

fix(server): fix quantization for sharded models (#45)

feat(server): Support GPT-Neox (#39)

feat(ci): Docker build and push (#46)

feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48)

feat(server): support repetition penalty (#47)

feat(server): allow the server to use a local weight cache (#49)

fix(server): allow greedy repetition penalty (#51)

feat(router): use background task to manage request queue (#52)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

breaking(router): modify /generate API to only return generated text (#50)

@njhill, @yk FYI

generated_text was concatenated to the user prompt for legacy reasons.
We want to remove this behaviour as we don't think it is useful and it
is even detrimental to usability.

We also remove the unused Vec.

feat(router): refactor API and add openAPI schemas (#53)

feat(docs): Clarify installation steps (#54)

Adds some bits for first-time users (like me 😄 )

feat(ci): push to AML registry (#56)

fix(server): better handling of inference mode (#57)

V0.2.1 (#58)

feat(server): support t5 (#59)

fix(docker): increase shm size (#60)

fixed SSE naming (#61)

https://en.wikipedia.org/wiki/Server-sent_events

feat: add distributed tracing (#62)

feat: add safetensors conversion (#63)

feat(server): improve download logging (#66)

feat(launcher): add disable_custom_kernels arg (#67)

feat(router): add max_total_tokens and empty_input validation (#68)

closes #65

fix(launcher): copy current env vars to subprocesses (#70)

closes #69

feat(router): add prometheus metrics scrape endpoint (#71)

v0.3.0 (#72)

feat(router): add cors allow origin options (#73)

feat(server): enable hf-transfer (#76)

fix(server): remove position_ids from galactica forward (#82)

closes #80

feat(server): pre-allocate max attention mask (#75)

v0.3.1 (#84)

feat(server): add special token bool (#85)

fix(docs): fix openapi schema (#86)

fix(server): fix token_is_special (#87)

feat(router): add legacy route for api-inference support (#88)

feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89)

feat(router): add api-inference headers (#91)

feat(server): add logits watermark (#90)

feat(server): update to hf_transfer==0.1.2 (#93)

feat(ci): improve CI speed (#94)

fix(launcher): add router parameters to launcher (#95)

feat(server): fix transformers commit (#96)

v0.3.2 (#97)

fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100)

feat: allow local models (#101)

closes #99

feat: add supported models (#102)

feat(clients): Python client (#103)

fix(server): fix galactica batch (#106)

closes #105

feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107)

feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108)

fix(python-client): stream not set on the sync client (#109)

fix(server): fix index out of range for watermarking (#110)

feat: support typical sampling (#114)

closes #112

fix(server): do not warp prefill logits (#116)

feat(router): support left truncation (#115)

closes #111

feat(router): add best_of parameter (#117)

feat(python-client): add new parameters (#118)

v0.4.0 (#119)

feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122)


fix(server): revert gpt-neox optims (#123)

fix(server): add position ids to neox (#126)

fix(server): use server tokenizer as gt (#128)

fix(python-client): relax dependencies (#129)

feat(python-client): add cookies to Client constructors and requests (#132)

I have a use case where we need to pass cookies (for auth reasons) to an
internally hosted server.

Note: I couldn't get the client tests to pass - do you need to have an
HF token?

```python
FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid
```

feat(ci): add ci paths (#134)

feat: Add note about NVIDIA drivers (#64)

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

feat(python-client): release v0.4.0 (#135)

feat(python-client): add CI (#136)

feat(server): flash neoX (#133)

fix(server): fix flash-neox scores warping (#137)

feat(server): cleanup flash neox loading (#139)

v0.4.1 (#140)

fix(server): Avoid using try/except to determine kind of AutoModel (#142)

feat(server): Add mypy-protobuf (#141)

Generates .pyi files for protobuf stubs which provide strong typing
information. Very helpful for IDE auto-completion, etc.

feat(server): clear cache on error (#143)

feat(server): reduce mlp and attn in one op for flash neox (#145)

feat: aws sagemaker compatible image (#147)

The only difference is that now it pushes to
registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:...
instead of
registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-...

---------

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

fix(ci): fix sagemaker action (#148)

feat(benchmark): tui based benchmarking tool (#149)

fix(server): fix flash neox rotary embeddings (#150)

v0.4.2 (#151)

v0.4.3 (#152)

feat(server): flash santacoder (#153)

docs(readme): provide link Logits Warper README (#154)

fix(server): fix escape characters in stop sequence (#155)

feat(docker): improve flash_attention caching (#160)

feat(launcher): allow disabling hf_transfer (#161)

fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162)

fix(router): use buckets for metrics histograms (#163)

feat(router): make router input validation optional (#164)

feat(server): add flash attention llama (#144)

feat(server): support OPT models (#55)

OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.

v0.5.0 (#168)

feat(server): optimize decode for sane tokenizers (#170)

feat(server): support sharded santacoder (#167)

fix(launcher): revert change on shard errors (#173)

fix(ci): fix CVE in github-slug-action (#174)

feat(ci): add image signing with cosign (#175)

feat(ci): add Trivy and scan docker image (#178)

feat(ci): use large runners (#179)

feat(ci): faster scanning (#180)

fix(ci): fix ci permissions (#181)

fea(dockerfile): better layer caching (#159)

fix(ci): fix cosign error (#183)

fix(docker): fix docker image (#184)

fix(docker): fix image (#185)

fix(docker): revert dockerfile changes (#186)

fix(docker): fix docker image dependencies (#187)

fix(router): fix truncation (#190)

closes #189

feat(python-client): get list of currently deployed tgi models using the inference API (#191)

feat(router): add info route (#196)

close #125

feat(server): support quantization for flash models (#200)

closes #197

feat(server): check cuda capability when importing flash models (#201)

close #198

fix(server): fix hf_transfer issue with private repos (#203)

fix(docker): remove unused dependencies (#205)

fix(router): add auth token to get model info (#207)

feat(router): add git sha to info route (#208)

feat(router): drop requests when client closes the channel (#202)

fix(ci): fix sha in docker image (#212)

feat(server): flash attention past key value optimizations (#213)

feat(router): add device and dtype info (#215)

fix(server): fix past key values logic (#216)

@njhill fyi

fix(server): cleanup new flash past_key_values logic (#217)

fix(server): fix flash causal (#218)

fix(server): fix flash causal (#219)

fix(server): fix flash batch filtering (#220)

misc: update to rust 1.69 (#221)

v0.6.0 (#222)

feat(server): reduce memory requirement (#214)

chore(server): update huggingface-hub (#227)

feat(router): use number of tokens in batch as input for dynamic batching (#226)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

feat(router): add endpoint info to /info route (#228)

chore(server): update safetensors version (#235)

fix(python-client): add auth headers to is supported requests (#234)

Starting some routing tests. (#233)

fix(benchmarking): fix benchmarking tool

chore(launcher): refactor logic (#242)

Hopefully it's cleaner

feat(router): add tests to validation (#237)

feat(router): new healthcheck that skips the queue (#244)

Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

fix(server): fix reshaping of bloom past_key_values in concatenate() (#252)

Introduced in #214

Fixes #249

fix(server): Small tidy of code from recent changes (#251)

remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter()

chore(server): update transformers (#250)

feat(server): add watermarking tests (#248)

feat(docker): add nvidia env vars (#255)

doc(launcher): add more docs to the `launcher` itself and link in the README (#257)

feat(benchmark): add support for private tokenizers (#262)

Adding docs on how dynamic batching works. (#258)

This PR contains the minimal amount of explanation I could think of. It
tries to explain how dynamic batching occurs and its interactions with
past key values, and it ignores the padding problem.

Maybe some drawings could help too, but I kept it to text for now.

chore(github): add templates (#264)

fix(server): fix typo in tokenizers decode (#269)

closes #268

feat(server): support hf endpoint weight layout (#266)

fix(launcher): pass weights cache override to the download process (#274)

closes #273

fix(launcher): handle hub branches (#278)

fix(server): Removes the parallelism in file conversion (during download) (#275)

feat(launcher): Improve error message when download process fails. (#276)

fix(server): fix convert (#284)

chore: add `flash-attention` to docker ignore (#287)

Otherwise it gets included when building docker locally
(where the local dirs might have the flash-attention folder).


feat(server): decrease convert RAM requirements (#286)

fix(dockerfile): fix nvidia env vars (#297)

Fixes #291

feat(router): Adding response schema for compat_generate (#292)

feat(docker): add benchmarking tool to docker image (#298)

fix(docker): fix docker build (#299)

feat(server): optim flash causal lm decode_token (#285)

fix(docker): fix nvidia env vars (#305)

fix(docker): remove nvidia require cuda env (#310)

feat(server): shard token decode (#303)

feat(server): use float16 (#304)

fix(docker): remove CUDA_VERSION

feat(server): use cuda graph in logits warping (#302)

fix(server): fix multinomial implem in Sampling

feat(server): GPTQ quantization (step1) (#277)

Changes only the type from `bool` to `Option<Enum>` pretty much
everywhere.
- Use `Optional[str]` in Python (easier to manage than importing type
everywhere). Except for the cli to get proper validation
- Updated all models to handle gracefully new values. (Error out if
unknown value, or gptq since not implemented).
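
For illustration only, here is a minimal Python sketch of what "error out if unknown value, or gptq since not implemented" could look like at this stage; the names `SUPPORTED_QUANTIZE` and `validate_quantize` are hypothetical, not the actual TGI code:

```python
from typing import Optional

# Hypothetical names, for illustration only (not the actual TGI code).
SUPPORTED_QUANTIZE = {"bitsandbytes", "gptq"}

def validate_quantize(quantize: Optional[str]) -> Optional[str]:
    """Accept None or a known quantization scheme, error out otherwise."""
    if quantize is None:
        return None
    if quantize not in SUPPORTED_QUANTIZE:
        raise ValueError(f"Unknown quantization scheme: {quantize!r}")
    if quantize == "gptq":
        # At this step, gptq is recognised but not implemented yet.
        raise NotImplementedError("gptq quantization is not implemented yet")
    return quantize
```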

chore(docker): use nvidia base image (#318)

fix(docker): remove quantize default

fix(docker): use ubuntu20.04

Hotfixes for santacoder/bigcode. (#294)

Hotfixes:

- Uses `model_type`=`gpt_bigcode` for more general usage.
- Hotfixes the linked lm_head vs wte_embedding weights (the safetensors file does not contain the duplicated key, which is correct when the file is sharded, whereas pytorch copies the tensor)

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

Lifting check_unitialized. (#325)

Removing dead variables. (#327)

feat(ci): custom gpu runners (#328)

Single place for TP layers + Dropout Layer Norm + FastLinear (#329)

feat: add snapshot testing (#282)

feat(integration-tests): improve comparison and health checks (#336)

fix(server): fix decode token (#334)

Fixes #333

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

fix: set MODEL_ID in sagemaker-entrypoint script (#343)

feat(server): Support BLOOMChat-176B (#348) (#351)

@njhill,
temporary workaround to be able to run our CI as secrets are not
available to runners run by external contributors. I will ask around to
see if there is a better way.

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

fix(server): fix init for flash causal lm (#352)

Fixes #347

fix(server): t5 cannot run in f16 (#356)

Fix #349

fix(ci): fix security group (#359)

Switch security group used for ci
(open outbound rules)

Signed-off-by: Raphael <oOraph@users.noreply.github.com>
Co-authored-by: Raphael <oOraph@users.noreply.github.com>

feat: add nightly load testing (#358)

chore(server): update requirements (#357)

Fixes #338

feat(server): support fp16 for t5 (#360)

Fixes #349

feat(server): do not use device_map auto on single GPU (#362)

feat(server): support trust_remote_code (#363)

feat(router): log input/output at debug level (#364)

@njhill FYI

v0.7.0 (#353)

feat: decrease IPC proto size (#367)

Closes #307 #308

feat(benchmarker): add summary tables (#368)

feat(server): support vectorized warpers in flash causal lm (#317)

Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>

Fix issue when load AutoModelForSeq2SeqLM model (#370)

fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES

fix(server): fix quantization

feat(server): support RefinedWeb models (#379)

v0.8.0

increase health checks

feat(server): add retry on download (#384)

fix(server): fix bnb quantization for CausalLM models (#385)

v0.8.1

fix(server): fix has_position_ids (#395)

Fix #389

feat(server): remove trust_remote_code requirement for falcon models (#396)

feat(server): load santacoder/starcoder models with safetensors (#393)

Fix #366

v0.8.2

feat(sagemaker): add trust remote code to entrypoint (#394)

feat(launcher): parse oom signal (#404)

feat(server): only compute prefill logprobs when asked (#406)

Close #288

feat(server): batch tokenization for flash causal lm (#411)

chore: update openapi schema

feat(server): Rework model loading (#344)

Reworked the loading logic. Idea is to use cleaner loading code:

- Remove need for `no_init_weights`
- Remove all weird `bnb_linear` and `load_weights` and
`post_load_weights`.

New code layout:

- New class `Weights` in charge of loading the weights from multiple files into the appropriate tensors (potentially sharded); a rough sketch of the idea follows below.
- TP layers are now "shells": they contain the code to know what kind of sharding we need plus the eventual `all_reduce`. They do not inherit from Linear, but they contain some kind of Linear instead (the contained linear can be FastLinear, BnbLinear or, next, GPTQ Linear).
- All modeling code is explicitly made for sharding; the process group is just no-ops for non-sharded code (removes a lot of test cases)

![Screenshot from 2023-05-19
23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)
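
As a rough illustration of that layout, here is a minimal sketch of a multi-file `Weights` wrapper; method names and the sharding logic are simplified assumptions, not the actual implementation:

```python
from typing import Dict, List

import torch
from safetensors import safe_open

class Weights:
    """Simplified sketch: lazily load tensors from several safetensors files."""

    def __init__(self, filenames: List[str], device: str, rank: int, world_size: int):
        self.device = device
        self.rank = rank
        self.world_size = world_size
        # Map each tensor name to the file that contains it.
        self.routing: Dict[str, str] = {}
        for filename in filenames:
            with safe_open(filename, framework="pt") as f:
                for name in f.keys():
                    self.routing[name] = filename

    def get_tensor(self, name: str) -> torch.Tensor:
        with safe_open(self.routing[name], framework="pt") as f:
            return f.get_tensor(name).to(self.device)

    def get_sharded(self, name: str, dim: int = 0) -> torch.Tensor:
        # Naive equal split along `dim` (assumes divisibility, unlike the real code).
        tensor = self.get_tensor(name)
        block = tensor.shape[dim] // self.world_size
        return tensor.narrow(dim, self.rank * block, block).contiguous()
```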

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>

feat(server): optimize dist ops (#434)

docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441)

It fixes a typo in the comment sections referencing the environment variable `CUDA_VISIBLE_DEVICES`. No misspelled references to this variable were found in code logic, so no undefined behaviour or bugs are expected. This PR does not modify any code logic.

fix(makefile): Fix typo and use POSIX comparison in the makefile (#443)

This PR fixes:
- The usage of a non-POSIX comparison which may fail depending on the shell used (`=` will always work, `==` only with bash)
- A typo in the env variable name displayed in the error message: `BUILD_EXTENSION` instead of `BUILD_EXTENSIONS`

<!-- Remove if not applicable -->

Fixes #422

feat(server): pre-allocate past key values for flash causal LM (#412)

feat(router): add ngrok integration (#453)

feat(server): improve flash attention import errors (#465)

@lewtun, is this enough?

Closes #458
Closes #456

fix(server): fix warpers on CPU (#472)

Closes #471

fix(server): Fixing T5 in case the names are mixed up. (#475)

feat(server): Update convert logic. (#483)

Should be more robust to shared tensors (OK when using `from_pretrained`), but it forces us to add new checks in our loading code (since the chosen key to keep might be different from `transformers`).

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>

feat(server): Adding new ignore_rule for conversion. (#485)

fix(router): add timeout on flume sends (#488)

feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438)

Let's start discussing implementation.

- Need to expose the quantization scripts (either included here or add
doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).

Currently this means that every place where we use `get_{tensor|sharded}` has to check for quantization.

My idea is to reintegrate as much as possible into `utils/layer.py` by expanding `load_multi` to be a bit more generic.
This might require some thinking, but ultimately the `qweight,qzeros,scales,g_idx` tensors should live in a single place, independent of bias presence.
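
To make the idea concrete, here is a hedged sketch of such a single place for the GPTQ tensors; `load_gptq_weights` and the sharding dimension are illustrative assumptions built on the `Weights` sketch further up, not TGI's real API:

```python
def load_gptq_weights(weights, prefix: str, dim: int = 1):
    """Hypothetical helper: gather the GPTQ tensors for one linear layer."""
    qweight = weights.get_sharded(f"{prefix}.qweight", dim=dim)
    qzeros = weights.get_sharded(f"{prefix}.qzeros", dim=dim)
    scales = weights.get_sharded(f"{prefix}.scales", dim=dim)
    # g_idx maps each input column to its quantization group; kept whole here.
    g_idx = weights.get_tensor(f"{prefix}.g_idx")
    return qweight, qzeros, scales, g_idx
```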

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

fix(server): Do not init process group if already initialized (#388)

feat(router): add header option to disable buffering for the generate_stream response (#498)

This adds a header option to disable buffering for the generate_stream endpoint response stream.

Problem: if a model is run behind a proxy server such as nginx that has buffering enabled, the response stream from generate_stream gets aggregated into a single response, which basically disables streaming. Instead of getting a chunked response where each token is presented over time, the response presents everything all at once.

Solution: this change adds the `X-Accel-Buffering` http header, which disables buffering for the generate_stream response, allowing the response to stream properly.
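
For context, a minimal client-side sketch of consuming that stream (the URL, payload and event schema below are examples and may differ from the actual API):

```python
import json

import requests

def stream_tokens(url: str = "http://127.0.0.1:8080/generate_stream") -> None:
    """Print tokens as they arrive from the SSE stream (schema is an example)."""
    payload = {"inputs": "hey llama", "parameters": {"max_new_tokens": 32}}
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line.startswith(b"data:"):
                event = json.loads(line[len(b"data:"):])
                print(event["token"]["text"], end="", flush=True)
```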

feat(server): add paged attention to flash models (#516)

Closes #478

feat(router): arg validation (#519)

feat: Add the option to force another dtype than `f16`. (#513)

fix(launcher): fix issue where launcher does not properly report shard failures (#522)

v0.9.0 (#525)

feat(server): Add Non flash MPT. (#514)

This adds a non flash version of MPT.
Flash is harder because we would need a CUDA kernel for flash attention that supports attention biases.

Fixes
https://github.com/huggingface/text-generation-inference/issues/361
Fixes
https://github.com/huggingface/text-generation-inference/issues/491
Fixes
https://github.com/huggingface/text-generation-inference/issues/290

fix: Update server/Makefile to include Makefile-vllm (#520)

For consistency and ease of use (you can just run `make` to install vllm
without any extra steps).

docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462)

fix(server): Handle loading from local files for MPT (#534)

This PR allows the MPT model to be loaded from local files. Without this change, an exception is thrown by the `hf_hub_download` function if `model_id` is a local path.
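
A minimal sketch of the idea, assuming the standard `huggingface_hub` API; the helper name and the example filename are illustrative:

```python
import os

from huggingface_hub import hf_hub_download

def resolve_file(model_id: str, filename: str = "config.json") -> str:
    """Use the local file directly when model_id is a directory."""
    if os.path.isdir(model_id):
        return os.path.join(model_id, filename)
    return hf_hub_download(repo_id=model_id, filename=filename)
```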

fix(server): avoid errors for very small top_p values (#544)

See https://github.com/huggingface/transformers/pull/24111

I didn't add validation to the `__init__` method since it's not done for
other values/warpers.
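
For reference, a hedged sketch of top-p filtering that always keeps at least one token, which is the kind of guard that avoids errors for very small `top_p` values; this is illustrative, not the transformers/TGI warper:

```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    """Mask out low-probability tokens but always keep the top-1 token."""
    sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
    cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
    # Remove tokens once the cumulative probability exceeds top_p...
    sorted_to_remove = cumulative_probs > top_p
    # ...but never remove the highest-probability token, even for tiny top_p.
    sorted_to_remove[..., 0] = False
    to_remove = sorted_to_remove.scatter(-1, sorted_indices, sorted_to_remove)
    return logits.masked_fill(to_remove, float("-inf"))
```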

feat(server): use latest flash attention commit (#543)

@njhill FYI

feat(router): add argument for hostname in router (#545) (#550)

In title. Adds argument `--hostname` in router to support something like
`--hostname ::`. Tested with

```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health'  # failed before this commit
```

Trigger CI

---------

Co-authored-by: Phil Chen <philchen2000@gmail.com>

fix(server): decrease memory fragmentation (#557)

v0.9.1 (#558)

fix(server): harden the weights choice to save on disk. (#561)

- Look at the `transformers` base class to check for `_key_to_ignore_on_load_missing` or `_tied_weights`, which are the standard attributes used to select the keys NOT to save on disk (since they are ignored); see the sketch below.

- Modified safetensors code (to be reflected in safetensors even if it's an internal function).

- Will not work for trust_remote_code=True repos (like santacoder).

Should help with :
https://github.com/huggingface/text-generation-inference/issues/555
and : https://github.com/huggingface/text-generation-inference/pull/501
and https://github.com/huggingface/text-generation-inference/issues/556
and
https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593
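
A rough sketch of the key-filtering idea (the attribute name and the regex matching are assumptions, not the exact code):

```python
import re
from typing import Dict, List

import torch

def discard_ignored_keys(state_dict: Dict[str, torch.Tensor], model_class) -> Dict[str, torch.Tensor]:
    """Drop keys matching the model's ignore patterns before saving to disk."""
    # Attribute name is an assumption (recent transformers use the plural form).
    patterns: List[str] = getattr(model_class, "_keys_to_ignore_on_load_missing", None) or []
    kept = {}
    for name, tensor in state_dict.items():
        if any(re.search(pattern, name) for pattern in patterns):
            continue  # ignored on load, so no need to write it out
        kept[name] = tensor
    return kept
```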

feat: better errors for warmup and TP (#575)

Close #571

fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579)

Fixes #555

feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580)

Some models are already converted, and do not have those values in the
file, this enables users to use them with less friction.

Went for a purely env-based approach because adding flags would end up (imo) very tedious to maintain. There's a lot of sanitization to do: those flags would be errors if not used in conjunction with `--quantize gptq`. Then the flags would need to exist in both the launcher and the server, and be passed through all function calls.

This PR is intended as an easy escape hatch, not the de facto method to use gptq in TGI.

Fixes #500
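
A minimal sketch of the env-based escape hatch, assuming plain `os.environ` reads; the function name is illustrative:

```python
import os

def gptq_params_from_env():
    """Read GPTQ parameters from the environment (illustrative helper)."""
    try:
        bits = int(os.environ["GPTQ_BITS"])
        groupsize = int(os.environ["GPTQ_GROUPSIZE"])
    except KeyError as err:
        # os.environ[...] raises KeyError; catching the right exception type
        # is what the later "catch correct type of error" fix is about.
        raise RuntimeError(
            "GPTQ_BITS and GPTQ_GROUPSIZE must be set for this model"
        ) from err
    return bits, groupsize
```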

chore: migrate ci region for more availability. (#581)

fix(server): T5 weights names. (#582)

Fixes #541

fix(server): Adding logger import to t5_modeling.py (#585)

Logger is referenced during the apex importing but is not imported,
causing a NameError

fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590)

This fixes a typo and extends the GPTQ_BITS environment variable passthrough to the second method, which requires the same logic. Please let me know if there's anything I've misunderstood in this change.

Thanks @Narsil for the original fix.

feat(server): Implements sharding for non divisible `vocab_size`. (#583)

- The code is relatively easy (just disable the checks on Embedding and Head).

This cannot be done in the same easy fashion for hidden_dim/head_dim. It's relatively easy on some models (classic MHA) but it would make the other models (MQA) much more complex, and GPTQ quantization is another quite hairy piece of code.
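
An illustrative sketch of the embedding side of this (simplified, not the actual TGI sharding code):

```python
import torch

def shard_vocab(weight: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Give each rank a block of rows; the last rank gets the remainder."""
    vocab_size = weight.shape[0]
    block = (vocab_size + world_size - 1) // world_size  # ceil division
    start = rank * block
    stop = min(start + block, vocab_size)
    return weight[start:stop].contiguous()
```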

feat(server): empty cache on errors

GPTQ Env vars: catch correct type of error (#596)

When passing in environment variables like gptq_bits, we still get
errors thrown from TGI because the try/catch block is catching the wrong
type of error. This PR aims to fix that.

@Narsil - let me know if this is how you want this formatted. My Python
is a little shaky, so I hope this syntax is correct.

feat(launcher): add arg validation and drop subprocess (#595)

feat(router): explicit warning if revision is not set (#608)

docs: README: Add logo + baseline (#611)

![image](https://github.com/huggingface/text-generation-inference/assets/3841370/58177321-479f-4ad1-b3bc-cec027423984)

fix(server): blacklist local files (#609)

Close #589 #602

v0.9.2 (#616)

fix(server): empty_cache when stopped

fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621)

feat(launcher): debug logs (#623)

feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587)

but should work on more configurations (no need for 2 GPUs, less RAM
usage).

Still need to investigate the potential differences in quantization
results.

feat(server): flash attention v2 (#624)

feat(server): add support for llamav2 (#633)

v0.9.3 (#634)

fix(server): fix llamav2 config (#635)

feat(server): auto max_batch_total_tokens for flash att models (#630)

feat(router): ngrok edge (#642)

docs: Update README.md (#639)

docs: Update README.md (#643)

Add trust_remote_code to quantize script (#647)

Fixes a bug that appeared with MR #587, which fixed issue #552.
See the discussion in #552.

With MR #587 the trust_remote_code variable is not passed to AutoModelForCausalLM, even though it is in the function signature. This prevents models like falcon from being quantized, because trust_remote_code is required. This MR fixes the issue.

fix(server): llama v2 GPTQ (#648)

As per title & reported
https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5

Test it:

```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```
&
```
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
    -H 'Content-Type: application/json'
```

fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661)

fix(server): use mem_get_info to get kv cache size (#664)

Close
https://github.com/huggingface/text-generation-inference/issues/649
Close
https://github.com/huggingface/text-generation-inference/issues/651
Close
https://github.com/huggingface/text-generation-inference/issues/653
Close #636

feat(server): Add exllama GPTQ CUDA kernel support #553 (#666)

Just trying to get the integration tests to pass.

---------

Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>

Directly load GPTBigCode to specified device (#618)

This PR directly loads GPTBigCode to the specified device, avoiding moving the model between devices.

feat(server): add local prom and health routes if running w/ ngrok

feat: add cuda memory fraction (#659)

Close #673

fix(server): fix exllama buffers (#689)

Close #683

feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671)

- The current PR is not great because we're side-stepping `Weights.__init__`, but `Weights` shouldn't require anything related to the config or the model_id, as it aims to be a simple wrapper over multi-file loading.
- The ideal solution would be to use something like a Rust enum
  ```rust
  enum Quantize {
      Bitsandbytes(Bitsandbytes),
      Gptq { bits: usize, groupsize: usize },
  }
  ```
  and pass that around during load. Unfortunately we don't have access to this, so for now, side-stepping seems easier.

- Re-enabling groupsize<0 with exllama (confirmed it works).

Helps #601

In next steps we should make sure our quantization script uses that
format and make it standard.
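
A hedged sketch of what reading `quantize_config.json` can look like; the field names `bits` and `group_size` follow AutoGPTQ's convention, and the local/hub fallback is an assumption:

```python
import json
import os
from typing import Optional, Tuple

from huggingface_hub import hf_hub_download

def load_gptq_params(model_id: str, revision: Optional[str] = None) -> Tuple[int, int]:
    """Read bits/group_size from quantize_config.json (local dir or hub repo)."""
    if os.path.isdir(model_id):
        path = os.path.join(model_id, "quantize_config.json")
    else:
        path = hf_hub_download(model_id, filename="quantize_config.json", revision=revision)
    with open(path) as f:
        config = json.load(f)
    return config["bits"], config["group_size"]
```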

docs(README): update readme

fix(server): fix quantization python requirements (#708)

fix(server): fix missing datasets in quantize

feat(server): support new falcon config (#712)

v0.9.4 (#713)

Add section about TGI on other AI hardware accelerators in README (#715)

As per title.

docs: Add hardware section to TOC in README (#721)

feat(server): update vllm version (#723)

chore: update license to HFOIL (#725)

v1.0.0 (#727)

Local gptq support. (#738)

Redoes #719

Fix typing in `Model.generate_token` (#733)

This PR fixes a minor type annotation issue in the signature of
`Model.generate_token`.

All existing overrides of `Model.generate_token` return
`Tuple[List[Generation], Optional[B]]`:

https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537

https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804

https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591

I suspect that back in 017a2a8c when `GeneratedText` and `Generation`
were separated, the function signature was not updated.
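
For clarity, the corrected signature in isolation (types are placeholders, not the real classes):

```python
from typing import Generic, List, Optional, Tuple, TypeVar

B = TypeVar("B")  # the model-specific Batch type

class Generation:  # placeholder for the real dataclass
    ...

class Model(Generic[B]):
    # Every override returns the generations plus the (possibly exhausted,
    # hence Optional) next batch.
    def generate_token(self, batch: B) -> Tuple[List[Generation], Optional[B]]:
        raise NotImplementedError
```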

CC @OlivierDehaene

Adding Rope scaling. (#741)

- Adds Rope NTK scaling.

Done because https://github.com/huggingface/text-generation-inference/pull/529 was closed. Took some code from https://github.com/huggingface/transformers/pull/24653.

- `--rope-scaling` and `--rope-factor` are added separately. I considered having a single flag and parsing something like "linear:4.0" or "dynamic", but decided against it because it would push more parsing and validation a bit everywhere (both in the launcher and the server).

Fixes #512
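
A hedged sketch of the two scaling modes, using the commonly used linear and dynamic NTK formulas; this is illustrative, not the server's actual rotary implementation:

```python
from typing import Optional

import torch

def rope_inv_freq(dim: int, base: float = 10000.0, scaling: Optional[str] = None,
                  factor: float = 1.0, seq_len: int = 0, max_position: int = 2048) -> torch.Tensor:
    if scaling == "dynamic" and seq_len > max_position:
        # NTK-aware scaling: grow the base so the effective context stretches.
        base = base * ((factor * seq_len / max_position) - (factor - 1)) ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    if scaling == "linear":
        # Linear scaling: simply compress the positions by `factor`.
        inv_freq = inv_freq / factor
    return inv_freq
```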

chore: fix typo in mpt_modeling.py (#737)

Fixed typo.

implemetation -> implementation

tjluyao added a commit to mlsys-io/kv.run that referenced this issue Jul 7, 2024
Init

fix: cleanup

Add load testing

Refactored gRPC interface
Added validation logic

ValidationError was not correctly handled

Use axum

feat: Docker image

feat: Add AML deployment

Update aml deployment

feat: Improve error handling

feat: Add arguments to CLI

v0.1.0

fix(validation): Fix error messages

feat(router): Add max_waiting_tokens

Create LICENSE (#2)

feat(server): Use safetensors

Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>

feat(client): Simplify sharded logic

feat(server): Support bitsandbytes

feat(server): Support all AutoModelForCausalLM on a best effort basis

feat: Use json formatter by default in docker image

fix(models): Revert buggy support for AutoModel

feat(server): Support generic AutoModelForCausalLM

feat(server): Support AutoModelForSeq2SeqLM

feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard

feat(server): Improved doc

fix(server): Fix Transformers fork version

feat(server): Clarify CausalLMBatch concatenate method

feat(rust): Update to 1.65

fix(router): Fix HTTP status codes

fix(readme): Typo

fix(router): Handle tokenizer errors

feat(server): Support Galactica (#4)

fix(batching): Avoid theoretical hang in batcher loop (#5)

- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute

Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>

feat(server): Add model tests (#6)

fix(server): Only pad to multiple of 8 on GPUs

feat: Support stop sequences (#7)

feat: Return logprobs (#8)

feat(launcher): Add integration tests (#9)

fix(server): Fix stop sequences (#11)

fix(server): Check for device type correctly when determining initial padding (#16)

AFAIK there is no torch device type called "gpu".

fix(router): Include special tokens when tokenizing (#14)

There's currently a discrepancy in the tokenization between the router and the python server code. The latter includes special tokens but the former does not.

This results in a token count mismatch for seq2seq models such as mt0
where the tokenizer emits an EOS token at the end.

This in turn results in some unexpected/incorrect output, in particular
when batch concatenation is involved, because the python code uses the
input length passed from the router for each row.

As far as I can tell, it is better to include this token in the encoder
`input_ids`, so I guess it's best to just adjust on the router side.
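
A small example of the discrepancy, assuming a seq2seq tokenizer such as mt0 (the model name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/mt0-small")
with_special = tokenizer("Hello world", add_special_tokens=True)["input_ids"]
without_special = tokenizer("Hello world", add_special_tokens=False)["input_ids"]
# with_special is one token longer (the trailing EOS), which is what the
# python server counts, so the router should include special tokens too.
print(len(with_special), len(without_special))
```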

feat(router): Add const parameters to validation logic  (#15)

I noticed some opportunity to collapse some of the logic, in case you
are interested.

fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13)

Fixes #12 in the easiest way I could think of.
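
For illustration (note the actual `transformers` kwarg is spelled `clean_up_tokenization_spaces`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer("Hello , world !")["input_ids"]
# Lossless: decode(encode(text)) round-trips the original spacing.
print(repr(tokenizer.decode(ids, clean_up_tokenization_spaces=False)))  # 'Hello , world !'
# With cleanup, spaces before punctuation are silently removed.
print(repr(tokenizer.decode(ids, clean_up_tokenization_spaces=True)))   # 'Hello, world!'
```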

feat(launcher): Log server stdout (#19)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

fix(server): Minor refactorization using new_zeros (#24)

- Fix some type hints, in particular base tokenizer class
- Make use of `tensor.new_zero/empty` methods
- Simplify env var string parsing in launcher

fix(router): Obey max batch size (#23)

feat(server): Support SantaCoder (#26)

fix(server): Fix position ids (#28)

feat(docker): Make the image compatible with api-inference (#29)

fix(docker): fix api-inference deployment (#30)

fix(router): fix api-inference deployment (#31)

fix(dockerfile): fix docker build (#32)

feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33)

feat(router): Remove second lock from batcher hot path (#27)

@njhill

feat: Support sampling seeding (#37)

Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>

feat: Add token streaming using ServerSideEvents support (#36)

Add token streaming using ServerSideEvents (SSE).

The signature of the SSE events is:

```rust
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

struct ErrorResponse {
    error: String,
}
```

Revert "feat: Add token streaming using ServerSideEvents support" (#40)

Reverts huggingface/text-generation-inference#36

fix(server): fix seeding on gpu (#42)

fix(server): fix seeding with multiple shards (#44)

feat: Add token streaming using ServerSideEvents support (#41)

fix(server): fix quantization for sharded models (#45)

feat(server): Support GPT-Neox (#39)

feat(ci): Docker build and push (#46)

feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48)

feat(server): support repetition penalty (#47)

feat(server): allow the server to use a local weight cache (#49)

fix(server): allow greedy repetition penalty (#51)

feat(router): use background task to manage request queue (#52)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

breaking(router): modify /generate API to only return generated text (#50)

@njhill, @yk FYI

generated_text was concatenated to the user prompt for legacy reasons. We want to remove this behaviour as we don't think it is useful and it is even detrimental to usability.

We also remove the unused Vec.

feat(router): refactor API and add openAPI schemas (#53)

feat(docs): Clarify installation steps (#54)

Adds some bits for first-time users (like me 😄 )

feat(ci): push to AML registry (#56)

fix(server): better handling of inference mode (#57)

V0.2.1 (#58)

feat(server): support t5 (#59)

fix(docker): increase shm size (#60)

fixed SSE naming (#61)

https://en.wikipedia.org/wiki/Server-sent_events

feat: add distributed tracing (#62)

feat: add safetensors conversion (#63)

feat(server): improve download logging (#66)

feat(launcher): add disable_custom_kernels arg (#67)

feat(router): add max_total_tokens and empty_input validation (#68)

closes #65

fix(launcher): copy current env vars to subprocesses (#70)

closes #69

feat(router): add prometheus metrics scrape endpoint (#71)

v0.3.0 (#72)

feat(router): add cors allow origin options (#73)

feat(server): enable hf-transfer (#76)

fix(server): remove position_ids from galactica forward (#82)

closes #80

feat(server): pre-allocate max attention mask (#75)

v0.3.1 (#84)

feat(server): add special token bool (#85)

fix(docs): fix openapi schema (#86)

fix(server): fix token_is_special (#87)

feat(router): add legacy route for api-inference support (#88)

feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89)

feat(router): add api-inference headers (#91)

feat(server): add logits watermark (#90)

feat(server): update to hf_transfer==0.1.2 (#93)

feat(ci): improve CI speed (#94)

fix(launcher): add router parameters to launcher (#95)

feat(server): fix transformers commit (#96)

v0.3.2 (#97)

fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100)

feat: allow local models (#101)

closes #99

feat: add supported models (#102)

feat(clients): Python client (#103)

fix(server): fix galactica batch (#106)

closes #105

feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107)

feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108)

fix(python-client): stream not set on the sync client (#109)

fix(server): fix index out of range for watermarking (#110)

feat: support typical sampling (#114)

closes #112

fix(server): do not warp prefill logits (#116)

feat(router): support left truncation (#115)

closes #111

feat(router): add best_of parameter (#117)

feat(python-client): add new parameters (#118)

v0.4.0 (#119)

feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122)

fix(server): revert gpt-neox optims (#123)

fix(server): add position ids to neox (#126)

fix(server): use server tokenizer as gt (#128)

fix(python-client): relax dependencies (#129)

feat(python-client): add cookies to Client constructors and requests (#132)

I have a use case where we need to pass cookies (for auth reasons) to an
internally hosted server.

Note: I couldn't get the client tests to pass - do you need to have an
HF token?

```python
FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid
```
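A hedged usage sketch of the new argument; the endpoint URL and cookie name are placeholders, and the `cookies` parameter is as described in this PR:

```python
# Sketch only: pass auth cookies to an internally hosted TGI endpoint.
from text_generation import Client

client = Client(
    "https://tgi.internal.example.com",          # placeholder URL
    cookies={"session": "<auth-cookie-value>"},  # placeholder cookie
)
print(client.generate("Hello", max_new_tokens=8).generated_text)
```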

feat(ci): add ci paths (#134)

feat: Add note about NVIDIA drivers (#64)

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

feat(python-client): release v0.4.0 (#135)

feat(python-client): add CI (#136)

feat(server): flash neoX (#133)

fix(server): fix flash-neox scores warping (#137)

feat(server): cleanup flash neox loading (#139)

v0.4.1 (#140)

fix(server): Avoid using try/except to determine kind of AutoModel (#142)

feat(server): Add mypy-protobuf (#141)

Generates .pyi files for protobuf stubs which provide strong typing
information. Very helpful for IDE auto-completion, etc.

feat(server): clear cache on error (#143)

feat(server): reduce mlp and attn in one op for flash neox (#145)

feat: aws sagemaker compatible image (#147)

The only difference is that now it pushes to
registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:...
instead of
registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-...

---------

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

fix(ci): fix sagemaker action (#148)

feat(benchmark): tui based benchmarking tool (#149)

fix(server): fix flash neox rotary embeddings (#150)

v0.4.2 (#151)

v0.4.3 (#152)

feat(server): flash santacoder (#153)

docs(readme): provide link Logits Warper README (#154)

fix(server): fix escape characters in stop sequence (#155)

feat(docker): improve flash_attention caching (#160)

feat(launcher): allow disabling hf_transfer (#161)

fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162)

fix(router): use buckets for metrics histograms (#163)

feat(router): make router input validation optional (#164)

feat(server): add flash attention llama (#144)

feat(server): support OPT models (#55)

OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.

v0.5.0 (#168)

feat(server): optimize decode for sane tokenizers (#170)

feat(server): support sharded santacoder (#167)

fix(launcher): revert change on shard errors (#173)

fix(ci): fix CVE in github-slug-action (#174)

feat(ci): add image signing with cosign (#175)

feat(ci): add Trivy and scan docker image (#178)

feat(ci): use large runners (#179)

feat(ci): faster scanning (#180)

fix(ci): fix ci permissions (#181)

fea(dockerfile): better layer caching (#159)

fix(ci): fix cosign error (#183)

fix(docker): fix docker image (#184)

fix(docker): fix image (#185)

fix(docker): revert dockerfile changes (#186)

fix(docker): fix docker image dependencies (#187)

fix(router): fix truncation (#190)

closes #189

feat(python-client): get list of currently deployed tgi models using the inference API (#191)

feat(router): add info route (#196)

close #125

feat(server): support quantization for flash models (#200)

closes #197

feat(server): check cuda capability when importing flash models (#201)

close #198

fix(server): fix hf_transfer issue with private repos (#203)

fix(docker): remove unused dependencies (#205)

fix(router): add auth token to get model info (#207)

feat(router): add git sha to info route (#208)

feat(router): drop requests when client closes the channel (#202)

fix(ci): fix sha in docker image (#212)

feat(server): flash attention past key value optimizations (#213)

feat(router): add device and dtype info (#215)

fix(server): fix past key values logic (#216)

@njhill fyi

fix(server): cleanup new flash past_key_values logic (#217)

fix(server): fix flash causal (#218)

fix(server): fix flash causal (#219)

fix(server): fix flash batch filtering (#220)

misc: update to rust 1.69 (#221)

v0.6.0 (#222)

feat(server): reduce memory requirement (#214)

chore(server): update huggingface-hub (#227)

feat(router): use number of tokens in batch as input for dynamic batching (#226)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

feat(router): add endpoint info to /info route (#228)

chore(server): update safetensors version (#235)

fix(python-client): add auth headers to is supported requests (#234)

Starting some routing tests. (#233)

fix(benchmarking): fix benchmarking tool

chore(launcher): refactor logic (#242)

Hopefully it's cleaner

feat(router): add tests to validation (#237)

feat(router): new healthcheck that skips the queue (#244)

Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

fix(server): fix reshaping of bloom past_key_values in concatenate() (#252)

Introduced in #214

Fixes #249

fix(server): Small tidy of code from recent changes (#251)

remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter()

chore(server): update transformers (#250)

feat(server): add watermarking tests (#248)

feat(docker): add nvidia env vars (#255)

doc(launcher): add more docs to the `launcher` itself and link in the README (#257)

feat(benchmark): add support for private tokenizers (#262)

Adding docs on how dynamic batching works. (#258)

This PR starts the minimal possible amount of explanation I could think
of. It tries to explain how dynamic batching occurs, the interactions
with past key values and ignores the padding problem.

Maybe some drawings could help too but I kept it to text for now.

chore(github): add templates (#264)

fix(server): fix typo in tokenizers decode (#269)

closes #268

feat(server): support hf endpoint weight layout (#266)

fix(launcher): pass weights cache override to the download process (#274)

closes #273

fix(launcher): handle hub branches (#278)

fix(server): Removes the parallelism in file convertion (during download) (#275)

feat(launcher): Improve error message when download process fails. (#276)

fix(server): fix convert (#284)

chore: add `flash-attention` to docker ignore (#287)

Otherwise it gets included when building the Docker image locally
(where the local dirs might contain the flash-attention folder).

fea(server): decrease convert RAM requirements (#286)

fix(dockerfile): fix nvidia env vars (#297)

Fixes #291

feat(router): Adding response schema for compat_generate (#292)

feat(docker): add benchmarking tool to docker image (#298)

fix(docker): fix docker build (#299)

feat(server): optim flash causal lm decode_token (#285)

fix(docker): fix nvidia env vars (#305)

fix(docker): remove nvidia require cuda env (#310)

feat(server): shard token decode (#303)

feat(server): use float16 (#304)

fix(docker): remove CUDA_VERSION

feat(server): use cuda graph in logits warping (#302)

fix(server): fix multinomial implem in Sampling

feat(server): GPTQ quantization (step1) (#277)

Changes only the type from `bool` to `Option<Enum>` pretty much
everywhere.
- Use `Optional[str]` in Python (easier to manage than importing type
everywhere). Except for the cli to get proper validation
- Updated all models to gracefully handle the new values (error out on an
unknown value, or on gptq since it is not implemented yet). A sketch of the idea follows below.
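A minimal Python sketch of the idea (illustrative names, not the actual TGI signatures): the quantize flag becomes an optional string that is validated up front.

```python
# Sketch: quantize moves from a bool to Optional[str]; unknown values error out,
# and "gptq" errors too at this step because it is not implemented yet.
from typing import Optional

KNOWN = {"bitsandbytes", "gptq"}

def check_quantize(quantize: Optional[str]) -> Optional[str]:
    if quantize is None:
        return None
    if quantize not in KNOWN:
        raise ValueError(f"Unknown quantization scheme: {quantize}")
    if quantize == "gptq":
        raise NotImplementedError("gptq is not implemented yet at this step")
    return quantize
```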

chore(docker): use nvidia base image (#318)

fix(docker): remove quantize default

fix(docker): use ubuntu20.04

Hotfixes for santacoder/bigcode. (#294)

Hotfixes:

- Uses `model_type`=`gpt_bigcode` for more general usage.
- Hotfixes the linked lm_head vs wte_embedding (the safetensors file does not
contain the key, correctly so when the file is sharded, whereas pytorch
copies the tensor).

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

Lifting check_unitialized. (#325)

Removing dead variables. (#327)

feat(ci): custom gpu runners (#328)

Single place for TP layers + Dropout Layer Norm + FastLinear (#329)

feat: add snapshot testing (#282)

feat(integration-tests): improve comparison and health checks (#336)

fix(server): fix decode token (#334)

Fixes #333

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

fix: set MODEL_ID in sagemaker-entrypoint script (#343)

feat(server): Support BLOOMChat-176B (#348) (#351)

@njhill,
temporary workaround to be able to run our CI as secrets are not
available to runners run by external contributors. I will ask around to
see if there is a better way.

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

fix(server): fix init for flash causal lm (#352)

Fixes #347

fix(server): t5 cannot run in f16 (#356)

Fix #349

fix(ci): fix security group (#359)

Switch security group used for ci
(open outbound rules)

Signed-off-by: Raphael <oOraph@users.noreply.github.com>
Co-authored-by: Raphael <oOraph@users.noreply.github.com>

feat: add nightly load testing (#358)

chore(sever): update requirements (#357)

Fixes #338

feat(server): support fp16 for t5 (#360)

Fixes #349

feat(server): do not use device_map auto on single GPU (#362)

feat(server): support trust_remote_code (#363)

feat(router): log input/ouput at debug level (#364)

@njhill FYI

v0.7.0 (#353)

feat: decrease IPC proto size (#367)

Closes #307 #308

feat(benchmarker): add summary tables (#368)

feat(server): support vectorized warpers in flash causal lm (#317)

Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>

Fix issue when load AutoModelForSeq2SeqLM model (#370)

fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES

fix(server): fix quantization

feat(server): support RefinedWeb models (#379)

v0.8.0

increase health checks

feat(server): add retry on download (#384)

fix(server): fix bnb quantization for CausalLM models (#385)

v0.8.1

fix(server): fix has_position_ids (#395)

Fix #389

feat(server): remove trust_remote_code requirement for falcon models (#396)

feat(server): load santacoder/starcoder models with safetensors (#393)

Fix #366

v0.8.2

feat(sagemaker): add trust remote code to entrypoint (#394)

feat(launcher): parse oom signal (#404)

feat(server): only compute prefill logprobs when asked (#406)

Close #288

feat(server): batch tokenization for flash causal lm (#411)

chore: update openapi schema

feat(server): Rework model loading (#344)

Reworked the loading logic. Idea is to use cleaner loading code:

- Remove need for `no_init_weights`
- Remove all weird `bnb_linear` and `load_weights` and
`post_load_weights`.

New code layout:

- New class `Weights` in charge of loading the weights from multiple files
into appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code to know what kind of
sharding we need plus the eventual `all_reduce`. They do not inherit from
Linear, but they contain some kind of Linear instead
- the contained linear can be either FastLinear, BnbLinear, or GPTQ
Linear next
- All modeling code is explicitly made for sharding; the process group is
just a no-op for non-sharded code (removes a lot of test cases). A rough
sketch of the `Weights` idea follows below.

![Screenshot from 2023-05-19
23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)
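A minimal sketch of what such a `Weights` wrapper could look like; this is illustrative only, not the actual TGI class, and it assumes safetensors files and torch tensors.

```python
# Illustrative sketch of a multi-file weight wrapper; not the real implementation.
from safetensors import safe_open

class Weights:
    def __init__(self, filenames, device="cpu", framework="pt"):
        self.routing = {}  # tensor name -> open file handle
        for filename in filenames:
            handle = safe_open(filename, framework=framework, device=device)
            for name in handle.keys():
                self.routing[name] = handle

    def get_tensor(self, name):
        return self.routing[name].get_tensor(name)

    def get_sharded(self, name, dim, rank, world_size):
        # Naive even split along `dim`; real code also handles remainders, etc.
        tensor = self.get_tensor(name)
        block = tensor.shape[dim] // world_size
        return tensor.narrow(dim, rank * block, block).contiguous()
```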

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>

feat(server): optimize dist ops (#434)

docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441)

It fixes a typo in the comment sections referencing the environment
variable `CUDA_VISIBLE_DEVICES`. No misspelled references to this
variable were found in code logic, so no undefined behaviour or bugs are
expected. This PR does not modify any code logic.

fix(makefile): Fix typo and use POSIX comparison in the makefile (#443)

This PR fixes:
- The usage of a non-POSIX comparison, which may fail depending on the
shell used (`=` always works, `==` only with bash)
- A typo in the env variable name displayed in the error message:
`BUILD_EXTENSION` instead of `BUILD_EXTENSIONS`

Fixes #422

feat(server): pre-allocate past key values for flash causal LM (#412)

feat(router): add ngrok integration (#453)

feat(server): improve flash attention import errors (#465)

@lewtun, is this enough?

Closes #458
Closes #456

fix(server): fix warpers on CPU (#472)

Closes #471

fix(server): Fixing T5 in case the names are mixed up. (#475)

feat(server): Update convert logic. (#483)

Should be more robust to shared tensors (ok when using
      `from_pretrained`), but it forces us to add new checks in our loading
      code (since the chosen key to keep might be different from
      `transformers`).

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>

feat(server): Adding new ignore_rule for conversion. (#485)

fix(router): add timeout on flume sends (#488)

feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438)

Let's start discussing implementation.

- Need to expose the quantization scripts (either included here or add
doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).

Currently it means checking for quantization in every place we use
`get_{tensor|sharded}`.

My idea is to reintegrate as much as possible into `utils/layer.py` by
expanding `load_multi` to be a bit more generic.
This might require some thinking, but ultimately
`qweight, qzeros, scales, g_idx` should live in a single place,
independent of bias presence.
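As a purely illustrative sketch, the four GPTQ tensors mentioned above could be grouped into one container so layer code handles them as a unit; the tensor names follow the commit message, but the structure itself is an assumption.

```python
# Sketch: bundle the GPTQ tensors (qweight, qzeros, scales, g_idx) together
# with their quantization parameters.
from dataclasses import dataclass
import torch

@dataclass
class GPTQWeight:
    qweight: torch.Tensor
    qzeros: torch.Tensor
    scales: torch.Tensor
    g_idx: torch.Tensor
    bits: int
    groupsize: int
```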

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

fix(server): Do not init process group if already initialized (#388)

feat(router): add header option to disable buffering for the generate_stream response (#498)

Adds a header option to disable buffering for the generate_stream endpoint
response stream.

Problem: If a model is run behind a proxy server such as nginx with
buffering enabled, the response stream from generate_stream gets
aggregated into a single response, which basically disables streaming.
Instead of a chunked response where each token is presented over time,
the response presents everything all at once.

Solution: This change adds the `X-Accel-Buffering` http header which
disables buffering for the generate_stream response, allowing the
response to stream properly.
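A quick hedged check of the behaviour; the URL and payload are placeholders, and the header name is as described above.

```python
# Sketch: confirm the response carries the header that disables proxy buffering.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate_stream",
    json={"inputs": "hello", "parameters": {"max_new_tokens": 4}},
    stream=True,
)
print(resp.headers.get("x-accel-buffering"))  # expected: "no"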

feat(server): add paged attention to flash models (#516)

Closes #478

feat(router): arg validation (#519)

feat: Add the option to force another dtype than `f16`. (#513)

fix(launcher): fix issue where launcher does not properly report shard failures (#522)

v0.9.0 (#525)

feat(server): Add Non flash MPT. (#514)

This adds a non-flash version of MPT.
Flash is harder because we would need a bias-ready CUDA kernel for flash
attention.

Fixes
https://github.com/huggingface/text-generation-inference/issues/361
Fixes
https://github.com/huggingface/text-generation-inference/issues/491
Fixes
https://github.com/huggingface/text-generation-inference/issues/290

fix: Update server/Makefile to include Makefile-vllm (#520)

For consistency and ease of use (you can just run `make` to install vllm
without any extra steps).

docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462)

fix(server): Handle loading from local files for MPT (#534)

This PR allows the MPT model to be loaded from local files. Without this
change, an exception will be thrown by `hf_hub_download` function if
`model_id` is a local path.

fix(server): avoid errors for very small top_p values (#544)

See https://github.com/huggingface/transformers/pull/24111

I didn't add validation to the `__init__` method since it's not done for
other values/warpers.

feat(server): use latest flash attention commit (#543)

@njhill FYI

feat(router): add argument for hostname in router (#545) (#550)

In title. Adds argument `--hostname` in router to support something like
`--hostname ::`. Tested with

```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health'  # failed before this commit
```

Trigger CI

---------

Co-authored-by: Phil Chen <philchen2000@gmail.com>

fix(server): decrease memory fragmentation (#557)

v0.9.1 (#558)

fix(server): harden the weights choice to save on disk. (#561)

- Look at `transformers` base class to check for
  `_key_to_ignore_on_load_missing` or `_tied_weights` which are the
  standard attributes to select the keys to NOT save on disk (since they
  are ignored)

- Modified safetensors code (to be reflected in safetensors even if it's
  an internal function).

- Will not work for trust_remote_code=True repos (like santacoder).

Should help with :
https://github.com/huggingface/text-generation-inference/issues/555
and : https://github.com/huggingface/text-generation-inference/pull/501
and https://github.com/huggingface/text-generation-inference/issues/556
and
https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593

feat: better errors for warmup and TP (#575)

Close #571

fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579)

Fixes #555

feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580)

Some models are already converted and do not have those values in the
file; this enables users to use them with less friction.

Went for a pure env-based approach because adding flags would end up (imo)
very tedious to maintain. There's a lot of sanitation to do: those flags
would be errors if not used in conjunction with `--quantize gptq`.
Then the flags would need to exist in the launcher and the server, passed
throughout all function calls.

This PR is intended as an easy escape hatch, not the de facto method to
use gptq in TGI.

Fixes #500
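A small sketch of the escape hatch; names follow the env variables in this PR, but the helper itself is illustrative, not the exact server code.

```python
# Sketch: read GPTQ parameters from the environment when the checkpoint
# itself does not carry them.
import os
from typing import Tuple

def gptq_params_from_env() -> Tuple[int, int]:
    try:
        return int(os.environ["GPTQ_BITS"]), int(os.environ["GPTQ_GROUPSIZE"])
    except KeyError as exc:
        raise RuntimeError(
            f"{exc.args[0]} must be set when loading a GPTQ checkpoint "
            "without embedded quantization metadata"
        ) from None
```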

chore: migrate ci region for more availability. (#581)

fix(server): T5 weights names. (#582)

Fixes #541

fix(server): Adding logger import to t5_modeling.py (#585)

The logger is referenced during the apex import but is not itself imported,
causing a NameError.

fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590)

This fixes a typo and extends the GPTQ_BITS environment variable
through to the second method, which requires the same logic. Please let
me know if there's anything I've misunderstood in this change.

Thanks @Narsil for the original fix.

feat(server): Implements sharding for non divisible `vocab_size`. (#583)

- The code is relatively easy (just disable the checks on Embedding and
Head)

This cannot be done in the same easy fashion for hidden_dim/head_dim.
It's relatively easy on some models (classic MHA), but it would make the
other models (MQA) much more complex, and GPTQ quantization another
quite hairy piece of code.
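For illustration, the vocab sharding can be thought of as a ceil-divided row split where the last rank simply gets a smaller block; this is a sketch, not the actual code.

```python
# Sketch: shard an embedding/LM-head weight whose vocab_size is not divisible
# by world_size; the last rank gets the (smaller) remainder.
import torch

def shard_rows(weight: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    vocab_size = weight.shape[0]
    block = (vocab_size + world_size - 1) // world_size  # ceil division
    start = rank * block
    stop = min(start + block, vocab_size)
    return weight[start:stop]
```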

feat(server): empty cache on errors

GPTQ Env vars: catch correct type of error (#596)

When passing in environment variables like gptq_bits, we still get
errors thrown from TGI because the try/catch block is catching the wrong
type of error. This PR aims to fix that.

@Narsil - let me know if this is how you want this formatted. My Python
is a little shaky, so I hope this syntax is correct.

feat(launcher): add arg validation and drop subprocess (#595)

feat(router): explicit warning if revision is not set (#608)

docs: README: Add logo + baseline (#611)

![image](https://github.com/huggingface/text-generation-inference/assets/3841370/58177321-479f-4ad1-b3bc-cec027423984)

fix(server): blacklist local files (#609)

Close #589 #602

v0.9.2 (#616)

fix(server): empty_cache when stopped

fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621)

fea(launcher): debug logs (#623)

feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587)

but should work on more configurations (no need for 2 GPUs, less RAM
usage).

Still need to investigate the potential differences in quantization
results.

feat(server): flash attention v2 (#624)

feat(server): add support for llamav2 (#633)

v0.9.3 (#634)

fix(server): fix llamav2 config (#635)

feat(server): auto max_batch_total_tokens for flash att models (#630)

feat(router): ngrok edge (#642)

docs: Update README.md (#639)

docs: Update README.md (#643)

Add trust_remote_code to quantize script (#647)

Fixes a bug that appeared with MR #587, which fixed issue #552.
See the discussion in #552.

With MR #587 the trust_remote_code variable is not passed to
AutoModelForCausalLM, even though it is present in the function signature.
This prevents models like falcon from being quantized, because
trust_remote_code is required. This MR fixes the issue.

fix(server): llama v2 GPTQ (#648)

As per title & reported
https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5

Test it:

```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```
&
```
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
    -H 'Content-Type: application/json'
```

fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661)

fix(server): use mem_get_info to get kv cache size (#664)

Close
https://github.com/huggingface/text-generation-inference/issues/649
Close
https://github.com/huggingface/text-generation-inference/issues/651
Close
https://github.com/huggingface/text-generation-inference/issues/653
Close #636

feat(server): Add exllama GPTQ CUDA kernel support #553 (#666)

Just trying to get the integration tests to pass.

---------

Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>

Directly load GPTBigCode to specified device (#618)

This PR directly loads GPTBigCode onto the specified device, avoiding
moving the model between devices.

feat(server): add local prom and health routes if running w/ ngrok

feat: add cuda memory fraction (#659)

Close #673

fix(server): fix exllama buffers (#689)

Close #683

feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671)

- The current PR is not great because we're side-stepping
  `Weights.__init__`, but Weights shouldn't require anything related
  to the config or the model_id as it aims to be a simple wrapper
  over multi-file loading.
- The ideal solution would be to use something like a Rust enum
  ```rust
  enum Quantize {
      Bitsandbytes(Bitsandbytes),
      Gptq { bits: usize, groupsize: usize },
  }
  ```
  And passing that around during load. Unfortunately we don't
  have access to this, so for now, side-stepping seems easier.

- Re-enabling groupsize<0 with exllama (confirmed it works.)

Helps #601

In next steps we should make sure our quantization script uses that
format and make it standard.

docs(README): update readme

fix(server): fix quantization python requirements (#708)

fix(server): fix missing datasets in quantize

feat(server): support new falcon config (#712)

v0.9.4 (#713)

Add section about TGI on other AI hardware accelerators in README (#715)

As per title.

docs: Add hardware section to TOC in README (#721)

feat(server): update vllm version (#723)

chore: update license to HFOIL (#725)

v1.0.0 (#727)

Local gptq support. (#738)

Redoes #719

Fix typing in `Model.generate_token` (#733)

This PR fixes a minor type annotation issue in the signature of
`Model.generate_token`.

All existing overrides of `Model.generate_token` return
`Tuple[List[Generation], Optional[B]]`:

https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537

https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804

https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591

I suspect that back in 017a2a8c when `GeneratedText` and `Generation`
were separated, the function signature was not updated.
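Written out, the corrected annotation described here looks roughly like this; it is a sketch with simplified stand-in types, not the real classes.

```python
# Sketch of the fixed signature: generate_token returns the generations plus
# the (possibly exhausted) next batch.
from typing import Generic, List, Optional, Tuple, TypeVar

B = TypeVar("B")  # batch type

class Generation:  # stand-in for the real Generation class
    pass

class Model(Generic[B]):
    def generate_token(self, batch: B) -> Tuple[List[Generation], Optional[B]]:
        raise NotImplementedError
```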

CC @OlivierDehaene

Adding Rope scaling. (#741)

- Adds Rope NTK scaling.

Done because
https://github.com/huggingface/text-generation-inference/pull/529 was
closed.
Took some code from
https://github.com/huggingface/transformers/pull/24653.

- `--rope-scaling` and `--rope-factor` are added separately. I
considered having a single flag and parsing something like ("linear:4.0",
or "dynamic") but decided against
it because it would push more parsing+validation a bit everywhere (both
in the launcher and the server).

Fixes #512
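For reference, a sketch of dynamic NTK scaling in the spirit of the referenced transformers PR; variable names are illustrative, not TGI's.

```python
# Sketch: recompute rope inverse frequencies with dynamic NTK scaling once the
# sequence grows past the trained context length.
import torch

def ntk_scaled_inv_freq(dim, base, max_position_embeddings, seq_len, factor):
    if seq_len > max_position_embeddings:
        base = base * (
            (factor * seq_len / max_position_embeddings) - (factor - 1)
        ) ** (dim / (dim - 2))
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
```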

chore: fix typo in mpt_modeling.py (#737)

Fixed typo.

implemetation -> implementation

tjluyao added a commit to mlsys-io/kv.run that referenced this issue Jul 7, 2024
Init

fix: cleanup

Add load testing

Refactored gRPC interface
Added validation logic

ValidationError was not correctly handled

Use axum

feat: Docker image

feat: Add AML deployment

Update aml deployment

feat: Improve error handling

feat: Add arguments to CLI

v0.1.0

fix(validation): Fix error messages

feat(router): Add max_waiting_tokens

Create LICENSE (#2)

feat(server): Use safetensors

Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>

feat(client): Simplify sharded logic

feat(server): Support bitsandbytes

feat(server): Support all AutoModelForCausalLM on a best effort basis

feat: Use json formatter by default in docker image

fix(models): Revert buggy support for AutoModel

feat(server): Support generic AutoModelForCausalLM

feat(server): Support AutoModelForSeq2SeqLM

feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard

feat(server): Improved doc

fix(server): Fix Transformers fork version

feat(server): Clarify CausalLMBatch concatenate method

feat(rust): Update to 1.65

fix(router): Fix HTTP status codes

fix(readme): Typo

fix(router): Handle tokenizer errors

feat(server): Support Galactica (#4)

fix(batching): Avoid theoretical hang in batcher loop (#5)

- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute

Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>

feat(server): Add model tests (#6)

fix(server): Only pad to multiple of 8 on GPUs

feat: Support stop sequences (#7)

feat: Return logprobs (#8)

feat(launcher): Add integration tests (#9)

fix(server): Fix stop sequences (#11)

fix(server): Check for device type correctly when determining initial padding (#16)

AFAIK there is no torch device type called "gpu".

fix(router): Include special tokens when tokenizing (#14)

There's currently a discrepancy in the tokenization between the router
and the python server code. The latter includes special tokens but the
former does not.

This results in a token count mismatch for seq2seq models such as mt0
where the tokenizer emits an EOS token at the end.

This in turn results in some unexpected/incorrect output, in particular
when batch concatenation is involved, because the python code uses the
input length passed from the router for each row.

As far as I can tell, it is better to include this token in the encoder
`input_ids`, so I guess it's best to just adjust on the router side.
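
A quick way to see the off-by-one this causes (hedged example; the model id is chosen only for illustration):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/mt0-small")

with_special = tok("Hello world", add_special_tokens=True)["input_ids"]
without_special = tok("Hello world", add_special_tokens=False)["input_ids"]

# The first list is one token longer (the trailing EOS). The python server
# counts the longer version, so the router should tokenize the same way.
print(len(with_special), len(without_special))
```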

feat(router): Add const parameters to validation logic  (#15)

I noticed some opportunity to collapse some of the logic, in case you
are interested.

fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13)

Fixes #12 in the easiest way I could think of.
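
(The transformers parameter is spelled `clean_up_tokenization_spaces`.) A hedged illustration of why disabling it matters for lossless, incremental decoding:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok("Hello , world !")["input_ids"]

# With cleanup enabled, decode() may merge spaces around punctuation, so
# decoding token-by-token does not reproduce the original text exactly.
print(repr(tok.decode(ids, clean_up_tokenization_spaces=True)))
print(repr(tok.decode(ids, clean_up_tokenization_spaces=False)))
```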

feat(launcher): Log server stdout (#19)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

fix(server): Minor refactorization using new_zeros (#24)

- Fix some type hints, in particular base tokenizer class
- Make use of `tensor.new_zero/empty` methods
- Simplify env var string parsing in launcher

fix(router): Obey max batch size (#23)

feat(server): Support SantaCoder (#26)

fix(server): Fix position ids (#28)

feat(docker): Make the image compatible with api-inference (#29)

fix(docker): fix api-inference deployment (#30)

fix(router): fix api-inference deployment (#31)

fix(dockerfile): fix docker build (#32)

feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33)

feat(router): Remove second lock from batcher hot path (#27)

@njhill

feat: Support sampling seeding (#37)

Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>

feat: Add token streaming using ServerSideEvents support (#36)

Add token streaming using ServerSideEvents (SSE).

The signature of the SSE events is:

```rust
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

struct ErrorResponse {
    error: String,
}
```
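
A hedged sketch of consuming that stream from Python (endpoint path and payload shape as documented at the time; this is not part of the PR itself):

```python
import json
import requests

with requests.post(
    "http://127.0.0.1:8080/generate_stream",
    json={"inputs": "Hello", "parameters": {"max_new_tokens": 20}},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        # SSE frames look like `data:{...json...}`; skip keep-alives.
        if not line or not line.startswith(b"data:"):
            continue
        event = json.loads(line[len(b"data:"):])
        print(event["token"], event.get("generated_text"))
```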

Revert "feat: Add token streaming using ServerSideEvents support" (#40)

Reverts huggingface/text-generation-inference#36

fix(server): fix seeding on gpu (#42)

fix(server): fix seeding with multiple shards (#44)

feat: Add token streaming using ServerSideEvents support (#41)

fix(server): fix quantization for sharded models (#45)

feat(server): Support GPT-Neox (#39)

feat(ci): Docker build and push (#46)

feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48)

feat(server): support repetition penalty (#47)

feat(server): allow the server to use a local weight cache (#49)

fix(server): allow greedy repetition penalty (#51)

feat(router): use background task to manage request queue (#52)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

breaking(router): modify /generate API to only return generated text (#50)

@njhill, @yk FYI

generated_text was concatenated to the user prompt for legacy reasons. We
want to remove this behaviour as we don't think it is useful and it is even
detrimental to usability.

We also remove the unused Vec.

feat(router): refactor API and add openAPI schemas (#53)

feat(docs): Clarify installation steps (#54)

Adds some bits for first-time users (like me 😄 )

feat(ci): push to AML registry (#56)

fix(server): better handling of inference mode (#57)

V0.2.1 (#58)

feat(server): support t5 (#59)

fix(docker): increase shm size (#60)

fixed SSE naming (#61)

https://en.wikipedia.org/wiki/Server-sent_events

feat: add distributed tracing (#62)

feat: add safetensors conversion (#63)

feat(server): improve download logging (#66)

feat(launcher): add disable_custom_kernels arg (#67)

feat(router): add max_total_tokens and empty_input validation (#68)

closes #65

fix(launcher): copy current env vars to subprocesses (#70)

closes #69

feat(router): add prometheus metrics scrape endpoint (#71)

v0.3.0 (#72)

feat(router): add cors allow origin options (#73)

feat(server): enable hf-transfer (#76)

fix(server): remove position_ids from galactica forward (#82)

closes #80

feat(server): pre-allocate max attention mask (#75)

v0.3.1 (#84)

feat(server): add special token bool (#85)

fix(docs): fix openapi schema (#86)

fix(server): fix token_is_special (#87)

feat(router): add legacy route for api-inference support (#88)

feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89)

feat(router): add api-inference headers (#91)

feat(server): add logits watermark (#90)

feat(server): update to hf_transfer==0.1.2 (#93)

feat(ci): improve CI speed (#94)

fix(launcher): add router parameters to launcher (#95)

feat(server): fix transformers commit (#96)

v0.3.2 (#97)

fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100)

feat: allow local models (#101)

closes #99

feat: add supported models (#102)

feat(clients): Python client (#103)

fix(server): fix galactica batch (#106)

closes #105

feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107)

feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108)

fix(python-client): stream not set on the sync client (#109)

fix(server): fix index out of range for watermarking (#110)

feat: support typical sampling (#114)

closes #112

fix(server): do not warp prefill logits (#116)

feat(router): support left truncation (#115)

closes #111

feat(router): add best_of parameter (#117)

feat(python-client): add new parameters (#118)

v0.4.0 (#119)

feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122)

…ed models

fix(server): revert gpt-neox optims (#123)

fix(server): add position ids to neox (#126)

fix(server): use server tokenizer as gt (#128)

fix(python-client): relax dependencies (#129)

feat(python-client): add cookies to Client constructors and requests (#132)

I have a use case where we need to pass cookies (for auth reasons) to an
internally hosted server.

Note: I couldn't get the client tests to pass - do you need to have an
HF token?

```python
FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid
```
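
A hedged usage sketch of the new option (hostname and cookie values are placeholders, and the exact constructor signature may differ between client versions):

```python
from text_generation import Client

client = Client(
    "https://tgi.internal.example.com",
    cookies={"session": "<auth-cookie>"},  # forwarded with every request
)
print(client.generate("Hello", max_new_tokens=20).generated_text)
```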

feat(ci): add ci paths (#134)

feat: Add note about NVIDIA drivers (#64)

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

feat(python-client): release v0.4.0 (#135)

feat(python-client): add CI (#136)

feat(server): flash neoX (#133)

fix(server): fix flash-neox scores warping (#137)

feat(server): cleanup flash neox loading (#139)

v0.4.1 (#140)

fix(server): Avoid using try/except to determine kind of AutoModel (#142)

feat(server): Add mypy-protobuf (#141)

Generates .pyi files for protobuf stubs which provide strong typing
information. Very helpful for IDE auto-completion, etc.

feat(server): clear cache on error (#143)

feat(server): reduce mlp and attn in one op for flash neox (#145)

feat: aws sagemaker compatible image (#147)

The only difference is that now it pushes to
registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:...
instead of
registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-...

---------

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

fix(ci): fix sagemaker action (#148)

feat(benchmark): tui based benchmarking tool (#149)

fix(server): fix flash neox rotary embeddings (#150)

v0.4.2 (#151)

v0.4.3 (#152)

feat(server): flash santacoder (#153)

docs(readme): provide link Logits Warper README (#154)

fix(server): fix escape characters in stop sequence (#155)

feat(docker): improve flash_attention caching (#160)

feat(launcher): allow disabling hf_transfer (#161)

fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162)

fix(router): use buckets for metrics histograms (#163)

feat(router): make router input validation optional (#164)

feat(server): add flash attention llama (#144)

feat(server): support OPT models (#55)

OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.

v0.5.0 (#168)

feat(server): optimize decode for sane tokenizers (#170)

feat(server): support sharded santacoder (#167)

fix(launcher): revert change on shard errors (#173)

fix(ci): fix CVE in github-slug-action (#174)

feat(ci): add image signing with cosign (#175)

feat(ci): add Trivy and scan docker image (#178)

feat(ci): use large runners (#179)

feat(ci): faster scanning (#180)

fix(ci): fix ci permissions (#181)

fea(dockerfile): better layer caching (#159)

fix(ci): fix cosign error (#183)

fix(docker): fix docker image (#184)

fix(docker): fix image (#185)

fix(docker): revert dockerfile changes (#186)

fix(docker): fix docker image dependencies (#187)

fix(router): fix truncation (#190)

closes #189

feat(python-client): get list of currently deployed tgi models using the inference API (#191)

feat(router): add info route (#196)

close #125

feat(server): support quantization for flash models (#200)

closes #197

feat(server): check cuda capability when importing flash models (#201)

close #198

fix(server): fix hf_transfer issue with private repos (#203)

fix(docker): remove unused dependencies (#205)

fix(router): add auth token to get model info (#207)

feat(router): add git sha to info route (#208)

feat(router): drop requests when client closes the channel (#202)

fix(ci): fix sha in docker image (#212)

feat(server): flash attention past key value optimizations (#213)

feat(router): add device and dtype info (#215)

fix(server): fix past key values logic (#216)

@njhill fyi

fix(server): cleanup new flash past_key_values logic (#217)

fix(server): fix flash causal (#218)

fix(server): fix flash causal (#219)

fix(server): fix flash batch filtering (#220)

misc: update to rust 1.69 (#221)

v0.6.0 (#222)

feat(server): reduce memory requirement (#214)

chore(server): update huggingface-hub (#227)

feat(router): use number of tokens in batch as input for dynamic batching (#226)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

feat(router): add endpoint info to /info route (#228)

chore(server): update safetensors version (#235)

fix(python-client): add auth headers to is supported requests (#234)

Starting some routing tests. (#233)

fix(benchmarking): fix benchmarking tool

chore(launcher): refactor logic (#242)

Hopefully it's cleaner

feat(router): add tests to validation (#237)

feat(router): new healthcheck that skips the queue (#244)

Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

fix(server): fix reshaping of bloom past_key_values in concatenate() (#252)

Introduced in #214

Fixes #249

fix(server): Small tidy of code from recent changes (#251)

remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter()

chore(server): update transformers (#250)

feat(server): add watermarking tests (#248)

feat(docker): add nvidia env vars (#255)

doc(launcher): add more docs to the `launcher` itself and link in the README (#257)

feat(benchmark): add support for private tokenizers (#262)

Adding docs on how dynamic batching works. (#258)

This PR adds the minimal amount of explanation I could think of.
It tries to explain how dynamic batching occurs and how it interacts
with past key values; it ignores the padding problem for now.

Maybe some drawings could help too, but I kept it to text for now.

chore(github): add templates (#264)

fix(server): fix typo in tokenizers decode (#269)

closes #268

feat(server): support hf endpoint weight layout (#266)

fix(launcher): pass weights cache override to the download process (#274)

closes #273

fix(launcher): handle hub branches (#278)

fix(server): Removes the parallelism in file convertion (during download) (#275)

feat(launcher): Improve error message when download process fails. (#276)

fix(server): fix convert (#284)

chore: add `flash-attention` to docker ignore (#287)

Otherwise it gets included when building the Docker image locally
(where the local dirs might have the flash-attention folder).


fea(server): decrease convert RAM requirements (#286)

fix(dockerfile): fix nvidia env vars (#297)

Fixes #291

feat(router): Adding response schema for compat_generate (#292)

feat(docker): add benchmarking tool to docker image (#298)

fix(docker): fix docker build (#299)

feat(server): optim flash causal lm decode_token (#285)

fix(docker): fix nvidia env vars (#305)

fix(docker): remove nvidia require cuda env (#310)

feat(server): shard token decode (#303)

feat(server): use float16 (#304)

fix(docker): remove CUDA_VERSION

feat(server): use cuda graph in logits warping (#302)

fix(server): fix multinomial implem in Sampling

feat(server): GPTQ quantization (step1) (#277)

Changes only the type from `bool` to `Option<Enum>` pretty much
everywhere.
- Use `Optional[str]` in Python (easier to manage than importing the type
everywhere), except in the CLI, to get proper validation.
- Updated all models to handle the new values gracefully (error out on an
unknown value, or on gptq since it is not implemented yet).


chore(docker): use nvidia base image (#318)

fix(docker): remove quantize default

fix(docker): use ubuntu20.04

Hotfixes for santacoder/bigcode. (#294)

Hotfixes:

- Uses `model_type`=`gpt_bigcode` for more general usage.
- Hotfixes the linked lm_head vs wte_embedding issue (safetensors files do not
contain the key, correctly so when the file is sharded, whereas pytorch
copies the tensor)


---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

Lifting check_unitialized. (#325)

Lifting check_unitialized.


Removing dead variables. (#327)


feat(ci): custom gpu runners (#328)

Single place for TP layers + Dropout Layer Norm + FastLinear (#329)


feat: add snapshot testing (#282)

feat(integration-tests): improve comparison and health checks (#336)

fix(server): fix decode token (#334)

Fixes #333

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

fix: set MODEL_ID in sagemaker-entrypoint script (#343)

feat(server): Support BLOOMChat-176B (#348) (#351)

@njhill,
temporary workaround to be able to run our CI as secrets are not
available to runners run by external contributors. I will ask around to
see if there is a better way.

Co-authored-by: Nick Hill <nickhill@us.ibm.com>

fix(server): fix init for flash causal lm (#352)

Fixes #347

fix(server): t5 cannot run in f16 (#356)

Fix #349

fix(ci): fix security group (#359)

Switch security group used for ci
(open outbound rules)

Signed-off-by: Raphael <oOraph@users.noreply.github.com>
Co-authored-by: Raphael <oOraph@users.noreply.github.com>

feat: add nightly load testing (#358)

chore(sever): update requirements (#357)

Fixes #338

feat(server): support fp16 for t5 (#360)

Fixes #349

feat(server): do not use device_map auto on single GPU (#362)

feat(server): support trust_remote_code (#363)

feat(router): log input/ouput at debug level (#364)

@njhill FYI

v0.7.0 (#353)

feat: decrease IPC proto size (#367)

Closes #307 #308

feat(benchmarker): add summary tables (#368)

feat(server): support vectorized warpers in flash causal lm (#317)

Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>

Fix issue when load AutoModelForSeq2SeqLM model (#370)

fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES

fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES

fix(server): fix quantization

feat(server): support RefinedWeb models (#379)

v0.8.0

increase health checks

feat(server): add retry on download (#384)

fix(server): fix bnb quantization for CausalLM models (#385)

v0.8.1

fix(server): fix has_position_ids (#395)

Fix #389

feat(server): remove trust_remote_code requirement for falcon models (#396)

feat(server): load santacoder/starcoder models with safetensors (#393)

Fix #366

v0.8.2

feat(sagemaker): add trust remote code to entrypoint (#394)

feat(launcher): parse oom signal (#404)

feat(server): only compute prefill logprobs when asked (#406)

Close #288

feat(server): batch tokenization for flash causal lm (#411)

chore: update openapi schema

feat(server): Rework model loading (#344)

Reworked the loading logic. Idea is to use cleaner loading code:

- Remove need for `no_init_weights`
- Remove all weird `bnb_linear` and `load_weights` and
`post_load_weights`.

New code layout:

- New class `Weights` in charge of loading the weights from
multiple files into appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code to know what kind of
sharding we need + the eventual `all_reduce`. They do not inherit from
Linear, but they contain some kind of Linear instead
- the contained linear can be either FastLinear, BnbLinear or a GPTQ
Linear next.
- All modeling code is explicitly made for sharding; the process group is
just a no-op for non-sharded code (removes a lot of test cases)

![Screenshot from 2023-05-19
23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)
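
A hedged, much-simplified sketch of the `Weights` idea (illustrative only, not the actual class):

```python
from safetensors import safe_open

class Weights:
    """Route tensor names to the safetensors file that owns them."""

    def __init__(self, filenames, device, dtype):
        self.dtype = dtype
        self.handles = {}
        self.routing = {}
        for filename in filenames:
            handle = safe_open(filename, framework="pt", device=str(device))
            self.handles[filename] = handle
            for name in handle.keys():
                self.routing[name] = filename

    def get_tensor(self, name):
        handle = self.handles[self.routing[name]]
        return handle.get_tensor(name).to(self.dtype)

    def get_sharded(self, name, dim, rank, world_size):
        # Each rank takes a contiguous slice along `dim`.
        tensor = self.get_tensor(name)
        size = tensor.shape[dim] // world_size
        return tensor.narrow(dim, rank * size, size).contiguous()
```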

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>

feat(server): optimize dist ops (#434)

docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441)

It solves a typo in the comment sections referencing the environment
variable `CUDA_VISIBLE_DEVICES`. No misspelling references to this
variable have been found in code logic leading to undefined behaviour or
bugs. This PR is not expected to perform any code logic modification.

fix(makefile): Fix typo and use POSIX comparison in the makefile (#443)

This PR fixes:
- The usage of non posix comparison which may fail depending on the
shell used (`=` will always work, `==` only with bash)
- Typo in the env variable name displayed in the error message
`BUILD_EXTENSION` instead of `BUILD_EXTENSIONS`


Fixes #422

feat(server): pre-allocate past key values for flash causal LM (#412)

feat(router): add ngrok integration (#453)

feat(server): improve flash attention import errors (#465)

@lewtun, is this enough?

Closes #458
Closes #456

fix(server): fix warpers on CPU (#472)

Closes #471

fix(server): Fixing T5 in case the names are mixed up. (#475)

feat(server): Update convert logic. (#483)

Should be more robust to shared tensors (ok when using
`from_pretrained`), but it forces us to add new checks in our loading
code (since the chosen key to keep might be different from
`transformers`).

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>

feat(server): Adding new ignore_rule for conversion. (#485)

fix(router): add timeout on flume sends (#488)

feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438)

Let's start discussing implementation.

- Need to expose the quantization scripts (either included here or add
doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).

Currently it means that every place where we use `get_{tensor|sharded}`
needs to check for quantization.

My idea is to reintegrate as much as possible into `utils/layer.py` by
expanding `load_multi` to be a bit more generic.
This might require some thinking, but ultimately the
`qweight,qzeros,scales,g_idx` should live in a single place, and be
independent of bias presence.
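
As a hedged sketch of "keeping `qweight,qzeros,scales,g_idx` in a single place" (helper name and return shape are made up for illustration, building on a `Weights`-style loader):

```python
def load_linear(weights, prefix, quantize=None):
    # One place that knows which tensors a (possibly quantized) linear needs,
    # independent of whether a bias is present.
    if quantize == "gptq":
        return {
            "qweight": weights.get_tensor(f"{prefix}.qweight"),
            "qzeros": weights.get_tensor(f"{prefix}.qzeros"),
            "scales": weights.get_tensor(f"{prefix}.scales"),
            "g_idx": weights.get_tensor(f"{prefix}.g_idx"),
        }
    return {"weight": weights.get_tensor(f"{prefix}.weight")}
```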


---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

fix(server): Do not init process group if already initialized (#388)

feat(router): add header option to disable buffering for the generate_stream response (#498)

generate_stream endpoint response stream.

Problem: If a model is run behind a proxy server such as nginx that has
buffering enabled then the response stream from generate_stream gets
aggregated into a single response which basically disables streaming.
Instead of getting a chunked response where each token is presented over
time the response presents everything all at once.

Solution: This change adds the `X-Accel-Buffering` http header which
disables buffering for the generate_stream response, allowing the
response to stream properly.
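
The router is Rust, but the mechanism is easy to show with a hedged Python sketch of a streaming endpoint that opts out of nginx buffering:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/generate_stream")
def generate_stream():
    def events():
        for token in ["Hello", " ", "world"]:
            yield f"data:{token}\n\n"
    return StreamingResponse(
        events(),
        media_type="text/event-stream",
        # Tells nginx (and compatible proxies) not to buffer this response.
        headers={"X-Accel-Buffering": "no"},
    )
```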

feat(server): add paged attention to flash models (#516)

Closes #478

feat(router): arg validation (#519)

feat: Add the option to force another dtype than `f16`. (#513)

fix(launcher): fix issue where launcher does not properly report shard failures (#522)

v0.9.0 (#525)

feat(server): Add Non flash MPT. (#514)

This adds a non-flash version of MPT.
Flash is harder because we would need a bias-ready CUDA kernel for
flash attention.

Fixes
https://github.com/huggingface/text-generation-inference/issues/361
Fixes
https://github.com/huggingface/text-generation-inference/issues/491
Fixes
https://github.com/huggingface/text-generation-inference/issues/290

fix: Update server/Makefile to include Makefile-vllm (#520)

For consistency and ease of use (you can just run `make` to install vllm
without any extra steps).


docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462)

fix(server): Handle loading from local files for MPT (#534)

This PR allows the MPT model to be loaded from local files. Without this
change, an exception is thrown by the `hf_hub_download` function if
`model_id` is a local path.

fix(server): avoid errors for very small top_p values (#544)

See https://github.com/huggingface/transformers/pull/24111

I didn't add validation to the `__init__` method since it's not done for
other values/warpers.
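
For reference, a hedged sketch of the usual guard (loosely mirroring the fix referenced above): even with a tiny `top_p`, keep at least one token so the filtered distribution is never empty.

```python
import torch

def top_p_filter(logits, top_p, min_tokens_to_keep=1, filter_value=float("-inf")):
    sorted_logits, sorted_indices = torch.sort(logits, descending=False)
    cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)

    # Remove tokens whose cumulative probability stays below 1 - top_p ...
    sorted_indices_to_remove = cumulative_probs <= (1 - top_p)
    # ... but never the last `min_tokens_to_keep` (highest-probability) tokens.
    sorted_indices_to_remove[..., -min_tokens_to_keep:] = False

    indices_to_remove = sorted_indices_to_remove.scatter(
        -1, sorted_indices, sorted_indices_to_remove
    )
    return logits.masked_fill(indices_to_remove, filter_value)
```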

feat(server): use latest flash attention commit (#543)

@njhill FYI

feat(router): add argument for hostname in router (#545) (#550)

In title. Adds argument `--hostname` in router to support something like
`--hostname ::`. Tested with

```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health'  # failed before this commit
```

Trigger CI

---------

Co-authored-by: Phil Chen <philchen2000@gmail.com>

fix(server): decrease memory fragmentation (#557)

v0.9.1 (#558)

fix(server): harden the weights choice to save on disk. (#561)

- Look at the `transformers` base class to check for
  `_key_to_ignore_on_load_missing` or `_tied_weights`, which are the
  standard attributes used to select the keys NOT to save on disk (since they
  are ignored)

- Modified the safetensors code (to be reflected in safetensors even if it's
  an internal function).

- Will not work for trust_remote_code=True repos (like santacoder).

Should help with :
https://github.com/huggingface/text-generation-inference/issues/555
and : https://github.com/huggingface/text-generation-inference/pull/501
and https://github.com/huggingface/text-generation-inference/issues/556
and
https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593
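
A hedged, simplified sketch of the selection logic (substring matching instead of the real regex handling; attribute names follow the `transformers` conventions and may differ across versions):

```python
def select_keys_to_save(model_cls, state_dict):
    ignore_patterns = []
    for attr in ("_keys_to_ignore_on_load_missing", "_tied_weights_keys"):
        ignore_patterns.extend(getattr(model_cls, attr, None) or [])

    kept = {}
    for name, tensor in state_dict.items():
        # Drop keys the architecture marks as tied/ignorable so shared tensors
        # are written to disk only once.
        if any(pattern in name for pattern in ignore_patterns):
            continue
        kept[name] = tensor
    return kept
```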

feat: better errors for warmup and TP (#575)

Close #571

fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579)

Fixes #555

feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580)

Some models are already converted and do not have those values in the
file; this enables users to use them with less friction.

Went for a purely env-based approach because adding flags would end up (imo) very
tedious to maintain. There's a lot of sanitation to do: those flags
would be errors if not used in conjunction with `--quantize gptq`.
The flags would also need to exist in the launcher and the server, and be passed
through all function calls.

This PR is intended as an easy escape hatch, not the de facto method to
use gptq in TGI.

Fixes #500
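
A hedged sketch of the escape hatch (error message and function name are illustrative):

```python
import os

def gptq_params_from_env():
    try:
        bits = int(os.environ["GPTQ_BITS"])
        groupsize = int(os.environ["GPTQ_GROUPSIZE"])
    except KeyError as missing:
        raise RuntimeError(
            f"{missing} must be set when using --quantize gptq with a "
            "checkpoint that does not embed its GPTQ parameters"
        )
    return bits, groupsize
```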

chore: migrate ci region for more availability. (#581)

fix(server): T5 weights names. (#582)

Fixes #541

fix(server): Adding logger import to t5_modeling.py (#585)

Logger is referenced during the apex importing but is not imported,
causing a NameError

fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590)

This fixes a typo and extends the GPTQ_BITS environment variable handling
through to the second method, which requires the same logic. Please let
me know if there's anything I've misunderstood in this change.

Thanks @Narsil for the original fix.

feat(server): Implements sharding for non divisible `vocab_size`. (#583)

- The code is relatively easy (just disable the checks on Embedding and
Head)

This cannot be done in the same easy fashion for hidden_dim/head_dim.
It's relatively easy on some models (classic MHA) but it would make the
other models (MQA) much more complex, and GPTQ quantization adds another
quite hairy piece of code.
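
A hedged sketch of the arithmetic for a vocab that does not divide evenly by the world size (the real change is mostly about disabling the equality checks):

```python
def vocab_shard_bounds(vocab_size, rank, world_size):
    # Ceil-divide so every rank gets a block; the last rank's block may be smaller.
    block = (vocab_size + world_size - 1) // world_size
    start = rank * block
    stop = min(start + block, vocab_size)
    return start, stop

# e.g. vocab_size=32003, world_size=4 -> blocks of 8001, 8001, 8001, 8000
```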

feat(server): empty cache on errors

GPTQ Env vars: catch correct type of error (#596)

When passing in environment variables like gptq_bits, we still get
errors thrown from TGI because the try/catch block is catching the wrong
type of error. This PR aims to fix that.

@Narsil - let me know if this is how you want this formatted. My Python
is a little shaky, so I hope this syntax is correct.

feat(launcher): add arg validation and drop subprocess (#595)

feat(router): explicit warning if revision is not set (#608)

docs: README: Add logo + baseline (#611)

![image](https://github.com/huggingface/text-generation-inference/assets/3841370/58177321-479f-4ad1-b3bc-cec027423984)

fix(server): blacklist local files (#609)

Close #589 #602

v0.9.2 (#616)

fix(server): empty_cache when stopped

fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621)

fea(launcher): debug logs (#623)

feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587)

but should work on more configurations (no need for 2 GPUs, less RAM
usage).


Still need to investigate the potential differences in quantization
results.


feat(server): flash attention v2 (#624)

feat(server): add support for llamav2 (#633)

v0.9.3 (#634)

fix(server): fix llamav2 config (#635)

feat(server): auto max_batch_total_tokens for flash att models (#630)

feat(router): ngrok edge (#642)

docs: Update README.md (#639)

docs: Update README.md (#643)

Add trust_remote_code to quantize script (#647)


Fixes a bug that appeared with MR #587 fixing issue #552.
See the discussion in #552.

With MR #587 the trust_remote_code variable is not passed to
AutoModelForCausalLM, even though it is in the function signature. This
prevents models like falcon from being quantized, because trust_remote_code
is required. This MR fixes the issue.


fix(server): llama v2 GPTQ (#648)

As per title & reported
https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5

Test it:

```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```
&
```
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
    -H 'Content-Type: application/json'
```

fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661)

fix(server): use mem_get_info to get kv cache size (#664)

Close
https://github.com/huggingface/text-generation-inference/issues/649
Close
https://github.com/huggingface/text-generation-inference/issues/651
Close
https://github.com/huggingface/text-generation-inference/issues/653
Close #636
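
A hedged sketch of the idea (block size and fraction are placeholders):

```python
import torch

def num_kv_cache_blocks(block_bytes, memory_fraction=0.9):
    # Size the paged-attention KV cache from the memory actually free on the
    # device right now, rather than from the device total.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return int(free_bytes * memory_fraction) // block_bytes
```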

feat(server): Add exllama GPTQ CUDA kernel support #553 (#666)

Just trying to get the integration tests to pass.


---------

Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>

Directly load GPTBigCode to specified device (#618)

This PR directly loads GPTBigCode to the specified device, avoiding moving
the model between devices.


feat(server): add local prom and health routes if running w/ ngrok

feat: add cuda memory fraction (#659)

Close #673

fix(server): fix exllama buffers (#689)

Close #683

feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671)

- The current PR is not great because we're side-stepping
  `Weights.__init__`, but `Weights` shouldn't require anything related
  to the config or the model_id as it aims to be a simple wrapper
  over multi-file loading.
- The ideal solution would be to use something like a Rust enum
  ```rust
  enum Quantize {
      Bitsandbytes(Bitsandbytes),
      Gptq { bits: usize, groupsize: usize },
  }
  ```
  and pass that around during load. Unfortunately we don't
  have access to this, so for now, side-stepping seems easier.

- Re-enabling groupsize<0 with exllama (confirmed it works.)

Helps #601

In next steps we should make sure our quantization script uses that
format and make it standard.
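
A hedged sketch of the loading order this enables (file and field names as produced by AutoGPTQ, env vars as the fallback; the real code differs):

```python
import json
import os
from pathlib import Path

def gptq_params(model_dir):
    config_path = Path(model_dir) / "quantize_config.json"
    if config_path.exists():
        data = json.loads(config_path.read_text())
        return int(data["bits"]), int(data["group_size"])
    # Fall back to the env-var escape hatch for older conversions.
    return int(os.environ["GPTQ_BITS"]), int(os.environ["GPTQ_GROUPSIZE"])
```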


docs(README): update readme

fix(server): fix quantization python requirements (#708)

fix(server): fix missing datasets in quantize

feat(server): support new falcon config (#712)

v0.9.4 (#713)

Add section about TGI on other AI hardware accelerators in README (#715)


As per title.


docs: Add hardware section to TOC in README (#721)

feat(server): update vllm version (#723)

chore: update license to HFOIL (#725)

v1.0.0 (#727)

Local gptq support. (#738)

Redoes #719


Fix typing in `Model.generate_token` (#733)

This PR fixes a minor type annotation issue in the signature of
`Model.generate_token`.

All existing overrides of `Model.generate_token` return
`Tuple[List[Generation], Optional[B]]`:

https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537

https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804

https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591

I suspect that back in 017a2a8c when `GeneratedText` and `Generation`
were separated, the function signature was not updated.
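
For reference, a minimal sketch of the corrected annotation; `Generation` is stubbed out and the class layout is simplified, so this mirrors the shape of the fix rather than the actual server code.

```
from typing import Generic, List, Optional, Tuple, TypeVar


class Generation:
    """Stand-in for TGI's Generation type (details omitted)."""


B = TypeVar("B")  # batch type, bound by each concrete model


class Model(Generic[B]):
    # Corrected signature: every override returns the generations plus
    # the next batch, or None once the batch is exhausted.
    def generate_token(self, batch: B) -> Tuple[List[Generation], Optional[B]]:
        raise NotImplementedError
```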


CC @OlivierDehaene

Adding Rope scaling. (#741)

- Adds Rope NTK scaling.

Done because
https://github.com/huggingface/text-generation-inference/pull/529 was
closed. Took some code from
https://github.com/huggingface/transformers/pull/24653.

- `--rope-scaling` and `--rope-factor` are added as separate flags. I
considered having a single flag and parsing something like "linear:4.0"
or "dynamic", but decided against it because it would push more parsing
and validation everywhere (both in the launcher and the server). A brief
sketch of the two scaling modes follows below.

Fixes #512
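
A rough sketch of how the two modes differ, under the assumption that linear scaling divides the rotary frequencies by the factor while dynamic NTK grows the rotary base once the sequence exceeds the trained length. Parameter names are illustrative and this is not the TGI implementation.

```
import torch


def rope_inv_freq(dim, base=10000.0, scaling=None, factor=1.0,
                  seq_len=2048, max_position_embeddings=2048):
    """Sketch of linear vs. dynamic NTK rope scaling (illustrative only)."""
    if scaling == "dynamic" and seq_len > max_position_embeddings:
        # Dynamic NTK: increase the rotary base as the context grows
        # past the length the model was trained on.
        base = base * (
            (factor * seq_len / max_position_embeddings) - (factor - 1)
        ) ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    if scaling == "linear":
        # Linear scaling: equivalent to dividing the position ids by `factor`.
        inv_freq = inv_freq / factor
    return inv_freq
```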


chore: fix typo in mpt_modeling.py (#737)

Fixed typo.

implemetation -> implementation
