Merge 0922 (#6)
* Remove hardcode flash-attn disable setting (lm-sys#2342)

* Document turning off proxy_buffering when api is streaming (lm-sys#2337)

* Simplify huggingface api example (lm-sys#2355)

* Update sponsor logos (lm-sys#2367)

* If LOGDIR is empty, don't try to write logs to a local file (lm-sys#2357)

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

* add best_of and use_beam_search for completions interface (lm-sys#2348)

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

* Extract upvote/downvote from log files (lm-sys#2369)

* Revert "add best_of and use_beam_search for completions interface" (lm-sys#2370)

* Improve doc (lm-sys#2371)

* add best_of and use_beam_search for completions interface (lm-sys#2372)

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

* update monkey patch for llama2 (lm-sys#2379)

* Make E5 adapter more restrictive to reduce mismatches (lm-sys#2381)

* Update UI and sponsors (lm-sys#2387)

* Use FSDP API for model saving (lm-sys#2390)

* Release v0.2.27

* Spicyboros + airoboros 2.2 template update. (lm-sys#2392)

Co-authored-by: Jon Durbin <jon.durbin@onna.com>

* bugfix of openai_api_server for fastchat.serve.vllm_worker (lm-sys#2398)

Co-authored-by: wuyongyu <wuyongyu@atomecho.xyz>

* Revert "bugfix of openai_api_server for fastchat.serve.vllm_worker" (lm-sys#2400)

* Revert "add best_of and use_beam_search for completions interface" (lm-sys#2401)

* Release v0.2.28 with bug fixes and more test cases

* Fix model_worker error (lm-sys#2404)

* Added google/flan models and fixed AutoModelForSeq2SeqLM when loading T5 compression models (lm-sys#2402)

* Rename twitter to X (lm-sys#2406)

* Update huggingface_api.py (lm-sys#2409)

* Add support for baichuan2 models (lm-sys#2408)

* Fixed character overlap issue in API streaming output (lm-sys#2431)

* Support custom conversation template in multi_model_worker (lm-sys#2434)

* Add Ascend NPU support (lm-sys#2422)

* Add raw conversation template (lm-sys#2417) (lm-sys#2418)

* Improve docs & UI (lm-sys#2436)

* Fix Salesforce xgen inference (lm-sys#2350)

* Add support for Phind-CodeLlama models (lm-sys#2415) (lm-sys#2416)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* Add falcon 180B chat conversation template (lm-sys#2384)

* Improve docs (lm-sys#2438)

* add dtype and seed (lm-sys#2430)

* Data cleaning scripts for dataset release (lm-sys#2440)

* Merge google/flan-based adapters: T5Adapter, CodeT5pAdapter, FlanAdapter (lm-sys#2411)

* Fix docs

* Update UI (lm-sys#2446)

* Add Optional SSL Support to controller.py (lm-sys#2448)

* Format & Improve docs

* Release v0.2.29 (lm-sys#2450)

* Show terms of use as a JS alert (lm-sys#2461)

* vLLM worker AWQ quantization update (lm-sys#2463)

Co-authored-by: 董晓龙 <dongxiaolong@shiyanjia.com>

* Fix falcon chat template (lm-sys#2464)

---------

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Trangle <kw_w@foxmail.com>
Co-authored-by: Nathan Stitt <nathan@stitt.org>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: leiwen83 <leiwen83@users.noreply.github.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Jon Durbin <jon@jondurbin.com>
Co-authored-by: Jon Durbin <jon.durbin@onna.com>
Co-authored-by: Rayrtfr <2384172887@qq.com>
Co-authored-by: wuyongyu <wuyongyu@atomecho.xyz>
Co-authored-by: wangxiyuan <wangxiyuan@huawei.com>
Co-authored-by: Jeff (Zhen) Wang <wangzhen263@gmail.com>
Co-authored-by: karshPrime <94996251+karshPrime@users.noreply.github.com>
Co-authored-by: obitolyz <obitoquilt@qq.com>
Co-authored-by: Shangwei Chen <109785802+Somezak1@users.noreply.github.com>
Co-authored-by: HyungJin Ahn <crushed7@o.cnu.ac.kr>
Co-authored-by: zhangsibo1129 <134488188+zhangsibo1129@users.noreply.github.com>
Co-authored-by: Tobias Birchler <tobias@birchlerfamily.ch>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Mingdao Liu <joshua@btlmd.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Brandon Biggs <brandonsbiggs@gmail.com>
Co-authored-by: dongxiaolong <774848421@qq.com>
Co-authored-by: 董晓龙 <dongxiaolong@shiyanjia.com>
1 parent 9b11481 commit a887de7
Showing 62 changed files with 1,226 additions and 673 deletions.
5 changes: 4 additions & 1 deletion README.md
@@ -16,6 +16,10 @@ We are focused to support Llama2 at scale now. If you want any other models, ple

## Dev Log

### 2023-09

Sync upstream changes.

### 2023-08

Support llama2 at scale.
@@ -37,4 +41,3 @@ Support "Llama-2-13b-chat-hf" and make it the default for API.

* API key database and rate limit enforcement
* Deployable on Kubernetes

13 changes: 12 additions & 1 deletion docs/commands/leaderboard.md
@@ -11,5 +11,16 @@ python3 clean_battle_data.py

### Run Elo analysis
```
python3 elo_analysis.py --clean-battle-file clean_battle_20230523.json
python3 elo_analysis.py --clean-battle-file clean_battle_20230905.json
```

### Copy files to HF space
1. update plots
```
scp atlas:/data/lmzheng/FastChat/fastchat/serve/monitor/elo_results_20230905.pkl .
```

2. update table
```
wget https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/raw/main/leaderboard_table_20230905.csv
```
3 changes: 3 additions & 0 deletions docs/commands/test_process.md
@@ -1,3 +1,6 @@
## Unit tests for FastChat
The scripts are under [FastChat/tests](../../tests).

### Test CLI Inference

```
2 changes: 1 addition & 1 deletion docs/commands/webserver.md
@@ -27,7 +27,7 @@ cd fastchat_logs/server0
export OPENAI_API_KEY=
export ANTHROPIC_API_KEY=
python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 10 --add-chatgpt --add-claude --add-palm --anony-only --elo ~/elo_results/elo_results_20230802.pkl --leaderboard-table-file ~/elo_results/leaderboard_table_20230802.csv --register ~/elo_results/register_oai_models.json
python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 10 --add-chatgpt --add-claude --add-palm --anony-only --elo ~/elo_results/elo_results.pkl --leaderboard-table-file ~/elo_results/leaderboard_table.csv --register ~/elo_results/register_oai_models.json --show-terms
python3 backup_logs.py
```
4 changes: 3 additions & 1 deletion docs/model_support.md
@@ -31,13 +31,15 @@
- [openaccess-ai-collective/manticore-13b-chat-pyg](https://huggingface.co/openaccess-ai-collective/manticore-13b-chat-pyg)
- [OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5](https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5)
- [VMware/open-llama-7b-v2-open-instruct](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)
- [Phind/Phind-CodeLlama-34B-v2](https://huggingface.co/Phind/Phind-CodeLlama-34B-v2)
- [project-baize/baize-v2-7b](https://huggingface.co/project-baize/baize-v2-7b)
- [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat)
- [Salesforce/codet5p-6b](https://huggingface.co/Salesforce/codet5p-6b)
- [StabilityAI/stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b)
- [THUDM/chatglm-6b](https://huggingface.co/THUDM/chatglm-6b)
- [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
- [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b)
- [tiiuae/falcon-180B-chat](https://huggingface.co/tiiuae/falcon-180B-chat)
- [timdettmers/guanaco-33b-merged](https://huggingface.co/timdettmers/guanaco-33b-merged)
- [togethercomputer/RedPajama-INCITE-7B-Chat](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Chat)
- [WizardLM/WizardLM-13B-V1.0](https://huggingface.co/WizardLM/WizardLM-13B-V1.0)
@@ -71,7 +73,7 @@ You can add `--debug` to see the actual prompt sent to the model.

FastChat uses the `Conversation` class to handle prompt templates and `BaseModelAdapter` class to handle model loading.

1. Implement a conversation template for the new model at [fastchat/conversation.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py). You can follow existing examples and use `register_conv_template` to add a new one.
1. Implement a conversation template for the new model at [fastchat/conversation.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py). You can follow existing examples and use `register_conv_template` to add a new one (see the sketch after this list). Please also add a link to the official reference code if possible.
2. Implement a model adapter for the new model at [fastchat/model/model_adapter.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_adapter.py). You can follow existing examples and use `register_model_adapter` to add a new one.
3. (Optional) add the model name to the "Supported models" [section](#supported-models) above and add more information in [fastchat/model/model_registry.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_registry.py).
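
A minimal sketch of steps 1 and 2 follows. The model name, roles, and separator values are placeholders, not a real model's spec; copy the exact values from the model's official reference code.

```python
from fastchat.conversation import (
    Conversation,
    SeparatorStyle,
    get_conv_template,
    register_conv_template,
)
from fastchat.model.model_adapter import BaseModelAdapter, register_model_adapter

# Step 1: register a conversation template (hypothetical values).
register_conv_template(
    Conversation(
        name="my-model",
        system_message="A chat between a user and an assistant.",
        roles=("USER", "ASSISTANT"),
        sep_style=SeparatorStyle.ADD_COLON_TWO,
        sep="\n",
        sep2="</s>",
    )
)

# Step 2: register an adapter that matches the model path and returns the template.
class MyModelAdapter(BaseModelAdapter):
    def match(self, model_path: str):
        return "my-model" in model_path.lower()

    def get_default_conv_template(self, model_path: str):
        return get_conv_template("my-model")

register_model_adapter(MyModelAdapter)
```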

2 changes: 1 addition & 1 deletion docs/openai_api.md
@@ -62,7 +62,7 @@ completion = openai.ChatCompletion.create(
print(completion.choices[0].message.content)
```

Streaming is also supported. See [test_openai_api.py](../tests/test_openai_api.py).
Streaming is also supported. See [test_openai_api.py](../tests/test_openai_api.py). If your API server is behind a proxy, you'll need to turn off buffering; in Nginx, you can do so by setting `proxy_buffering off;` in the `location` block for the proxy.
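
For example, a minimal Nginx `location` block might look like the sketch below. The upstream port assumes the API server's default of 8000; adjust both the path and port for your deployment.

```
location /v1 {
    proxy_pass http://localhost:8000;  # FastChat OpenAI-compatible API server
    proxy_buffering off;               # required for token-by-token streaming
}
```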

### cURL
cURL is another good tool for observing the output of the api.
29 changes: 29 additions & 0 deletions docs/training.md
@@ -87,3 +87,32 @@ deepspeed fastchat/train/train_lora_t5.py \
--deepspeed playground/deepspeed_config_s2.json

```

### Fine-tuning Vicuna-7B with Local NPUs

You can use the following command to train Vicuna-7B with 8 x Ascend 910B NPUs (60GB each). Use `--nproc_per_node` to specify the number of NPUs.
```bash
torchrun --nproc_per_node=8 --master_port=20001 fastchat/train/train.py \
--model_name_or_path ~/vicuna-7b-v1.5-16k \
--data_path data/dummy_conversation.json \
--fp16 True \
--output_dir output_vicuna \
--num_train_epochs 3 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True
```
5 changes: 5 additions & 0 deletions docs/vllm_integration.md
@@ -18,3 +18,8 @@ See the supported models [here](https://vllm.readthedocs.io/en/latest/models/sup
```
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3 --tokenizer hf-internal-testing/llama-tokenizer
```

If you use an AWQ quantized model, try:
```
python3 -m fastchat.serve.vllm_worker --model-path TheBloke/vicuna-7B-v1.5-AWQ --quantization awq
```
2 changes: 1 addition & 1 deletion fastchat/__init__.py
@@ -1 +1 @@
__version__ = "0.2.26"
__version__ = "0.2.29"
2 changes: 1 addition & 1 deletion fastchat/constants.py
@@ -15,7 +15,7 @@
CONVERSATION_LIMIT_MSG = "YOU HAVE REACHED THE CONVERSATION LENGTH LIMIT. PLEASE CLEAR HISTORY AND START A NEW CONVERSATION."
INACTIVE_MSG = "THIS SESSION HAS BEEN INACTIVE FOR TOO LONG. PLEASE REFRESH THIS PAGE."
# Maximum input length
INPUT_CHAR_LEN_LIMIT = int(os.getenv("FASTCHAT_INPUT_CHAR_LEN_LIMIT", 2560))
INPUT_CHAR_LEN_LIMIT = int(os.getenv("FASTCHAT_INPUT_CHAR_LEN_LIMIT", 3072))
# Maximum conversation turns
CONVERSATION_TURN_LIMIT = 50
# Session expiration time
84 changes: 80 additions & 4 deletions fastchat/conversation.py
@@ -27,6 +27,7 @@ class SeparatorStyle(IntEnum):
RWKV = auto()
PHOENIX = auto()
ROBIN = auto()
FALCON_CHAT = auto()


@dataclasses.dataclass
@@ -200,6 +201,17 @@ def get_prompt(self) -> str:
else:
ret += role + ":\n"
return ret
elif self.sep_style == SeparatorStyle.FALCON_CHAT:
ret = ""
if self.system_message:
ret += system_prompt + self.sep
for role, message in self.messages:
if message:
ret += role + ": " + message + self.sep
else:
ret += role + ":"

return ret
else:
raise ValueError(f"Invalid style: {self.sep_style}")

@@ -285,6 +297,17 @@ def get_conv_template(name: str) -> Conversation:
return conv_templates[name].copy()


# An empty template for raw conversation.
register_conv_template(
Conversation(
name="raw",
system_message="",
roles=("", ""),
sep_style=SeparatorStyle.NO_COLON_SINGLE,
sep="",
)
)
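
# Usage sketch (assuming this module's helpers): the raw template passes
# content through unchanged, with no roles or separators added.
#   conv = get_conv_template("raw")
#   conv.append_message(conv.roles[0], "Once upon a time")
#   conv.append_message(conv.roles[1], None)
#   conv.get_prompt()  # -> "Once upon a time"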

# A template with a one-shot conversation example
register_conv_template(
Conversation(
@@ -357,6 +380,17 @@ def get_conv_template(name: str) -> Conversation:
)
)

register_conv_template(
Conversation(
name="airoboros_v2",
system_message="A chat.",
roles=("USER", "ASSISTANT"),
sep_style=SeparatorStyle.ADD_COLON_TWO,
sep="\n",
sep2="</s>",
)
)

# Koala default template
register_conv_template(
Conversation(
@@ -743,11 +777,10 @@ def get_conv_template(name: str) -> Conversation:
Conversation(
name="xgen",
system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
roles=("### Human: ", "###"),
sep_style=SeparatorStyle.NO_COLON_SINGLE,
roles=("### Human", "### Assistant"),
sep_style=SeparatorStyle.ADD_COLON_SINGLE,
sep="\n",
stop_token_ids=[50256, 0, 1, 2],
stop_str="<|endoftext|>",
stop_token_ids=[50256],
)
)

@@ -793,6 +826,20 @@ def get_conv_template(name: str) -> Conversation:
)
)

# Baichuan2-13B-Chat template
register_conv_template(
# source: https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/c6f8592a60b4ad73c210b28dd2ab3cca51abbf93/modeling_baichuan.py#L773
# https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/generation_config.json
# https://github.com/baichuan-inc/Baichuan2/issues/62
Conversation(
name="baichuan2-chat",
roles=("<reserved_106>", "<reserved_107>"),
sep_style=SeparatorStyle.NO_COLON_SINGLE,
sep="",
stop_token_ids=[],
)
)

# llama2 template
# reference: https://huggingface.co/blog/codellama#conversational-instructions
# reference: https://github.com/facebookresearch/llama/blob/1a240688810f8036049e8da36b073f63d2ac552c/llama/generation.py#L212
@@ -905,6 +952,35 @@ def get_conv_template(name: str) -> Conversation:
)
)

# Falcon 180B chat template
# source: https://huggingface.co/spaces/tiiuae/falcon-180b-demo/blob/d1590ee7fae9b6ce331ba7808e61a29dcce9239f/app.py#L28-L37
register_conv_template(
Conversation(
name="falcon-chat",
roles=("User", "Falcon"),
system_template="System: {system_message}",
messages=[],
sep_style=SeparatorStyle.FALCON_CHAT,
sep="\n",
sep2="<|endoftext|>",
stop_str="\nUser:", # use stop_str to stop generation after stop_token_ids; it is also removed from the generated text
)
)
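
# Usage sketch (assuming this module's helpers) of the prompt this template renders:
#   conv = get_conv_template("falcon-chat")
#   conv.set_system_message("You are a helpful assistant.")
#   conv.append_message(conv.roles[0], "Hello!")
#   conv.append_message(conv.roles[1], None)
#   conv.get_prompt()
#   -> "System: You are a helpful assistant.\nUser: Hello!\nFalcon:"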

# Phind template
# source: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2
register_conv_template(
Conversation(
name="phind",
system_message="### System Prompt\nYou are an intelligent programming assistant.",
roles=("### User Message", "### Assistant"),
messages=(),
offset=0,
sep_style=SeparatorStyle.ADD_COLON_SINGLE,
sep="\n\n",
)
)


if __name__ == "__main__":
print("Vicuna template:")
1 change: 0 additions & 1 deletion fastchat/data/merge.py
@@ -6,7 +6,6 @@

import argparse
import json
from typing import Dict, Sequence, Optional


if __name__ == "__main__":
7 changes: 4 additions & 3 deletions fastchat/llm_judge/README.md
@@ -1,5 +1,5 @@
# LLM Judge
| [Paper](https://arxiv.org/abs/2306.05685) | [Leaderboard](https://chat.lmsys.org/?leaderboard) |
| [Paper](https://arxiv.org/abs/2306.05685) | [Leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) |

In this package, you can use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge.
MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants.
@@ -10,7 +10,7 @@ To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as j
- [Review Pre-Generated Model Answers and Judgments](#review-pre-generated-model-answers-and-judgments)
- [MT-Bench](#mt-bench)
- [Agreement Computation](#agreement-computation)
- [Dataset](#dataset)
- [Datasets](#datasets)
- [Citation](#citation)

## Install
@@ -64,6 +64,7 @@ This mode asks GPT-4 to grade and give a score to model's answer directly withou
For each turn, GPT-4 will give a score on a scale of 10. We then compute the average score on all turns.

```
export OPENAI_API_KEY=XXXXXX # set the OpenAI API key
python gen_judgment.py --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
```

@@ -133,7 +134,7 @@ We released 3.3K human annotations for model responses generated by 6 models in

This Colab [notebook](https://colab.research.google.com/drive/1ctgygDRJhVGUJTQy8-bRZCl1WNcT8De6?usp=sharing) shows how to compute the agreement between humans and the GPT-4 judge with the dataset. Our results show that humans and the GPT-4 judge achieve over 80\% agreement, the same level of agreement as between humans.

## Dataset
## Datasets
- [Chatbot Arena Conversation Dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)
- [MT-bench Human Annotation Dataset](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)

29 changes: 29 additions & 0 deletions fastchat/llm_judge/common.py
@@ -418,6 +418,35 @@ def chat_compeletion_openai(model, conv, temperature, max_tokens):
return output


def chat_compeletion_openai_azure(model, conv, temperature, max_tokens):
openai.api_type = "azure"
openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]
openai.api_key = os.environ["AZURE_OPENAI_KEY"]
openai.api_version = "2023-05-15"

if "azure-" in model:
model = model[6:]

output = API_ERROR_OUTPUT
for _ in range(API_MAX_RETRY):
try:
messages = conv.to_openai_api_messages()
response = openai.ChatCompletion.create(
engine=model,
messages=messages,
n=1,
temperature=temperature,
max_tokens=max_tokens,
)
output = response["choices"][0]["message"]["content"]
break
except openai.error.OpenAIError as e:
print(type(e), e)
time.sleep(API_RETRY_SLEEP)

return output
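
# Usage sketch (hypothetical deployment name; requires AZURE_OPENAI_ENDPOINT and
# AZURE_OPENAI_KEY to be set). The "azure-" prefix is stripped before the call:
#   output = chat_compeletion_openai_azure("azure-gpt-4", conv, temperature=0.0, max_tokens=512)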


def chat_compeletion_anthropic(model, conv, temperature, max_tokens):
output = API_ERROR_OUTPUT
for _ in range(API_MAX_RETRY):