Merge 0922 (#6)
* Remove hardcode flash-attn disable setting (lm-sys#2342)

* Document turning off proxy_buffering when api is streaming (lm-sys#2337)

* Simplify huggingface api example (lm-sys#2355)

* Update sponsor logos (lm-sys#2367)

* If LOGDIR is empty, don't try to write logs to a local file (lm-sys#2357)

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

* add best_of and use_beam_search for completions interface (lm-sys#2348)

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

* Extract upvote/downvote from log files (lm-sys#2369)

* Revert "add best_of and use_beam_search for completions interface" (lm-sys#2370)

* Improve doc (lm-sys#2371)

* add best_of and use_beam_search for completions interface (lm-sys#2372)

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

* update monkey patch for llama2 (lm-sys#2379)

* Make E5 adapter more restrictive to reduce mismatches (lm-sys#2381)

* Update UI and sponsors (lm-sys#2387)

* Use FSDP API for model saving (lm-sys#2390)

* Release v0.2.27

* Spicyboros + airoboros 2.2 template update. (lm-sys#2392)

Co-authored-by: Jon Durbin <jon.durbin@onna.com>

* bugfix of openai_api_server for fastchat.serve.vllm_worker (lm-sys#2398)

Co-authored-by: wuyongyu <wuyongyu@atomecho.xyz>

* Revert "bugfix of openai_api_server for fastchat.serve.vllm_worker" (lm-sys#2400)

* Revert "add best_of and use_beam_search for completions interface" (lm-sys#2401)

* Release v0.2.28 with bug fixes and more test cases

* Fix model_worker error (lm-sys#2404)

* Added google/flan models and fixed AutoModelForSeq2SeqLM when loading T5 compression models (lm-sys#2402)

* Rename twitter to X (lm-sys#2406)

* Update huggingface_api.py (lm-sys#2409)

* Add support for baichuan2 models (lm-sys#2408)

* Fixed character overlap issue in API streaming output (lm-sys#2431)

* Support custom conversation template in multi_model_worker (lm-sys#2434)

* Add Ascend NPU support (lm-sys#2422)

* Add raw conversation template (lm-sys#2417) (lm-sys#2418)

* Improve docs & UI (lm-sys#2436)

* Fix Salesforce xgen inference (lm-sys#2350)

* Add support for Phind-CodeLlama models (lm-sys#2415) (lm-sys#2416)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* Add falcon 180B chat conversation template (lm-sys#2384)

* Improve docs (lm-sys#2438)

* add dtype and seed (lm-sys#2430)

* Data cleaning scripts for dataset release (lm-sys#2440)

* Merge google/flan-based adapters: T5Adapter, CodeT5pAdapter, FlanAdapter (lm-sys#2411)

* Fix docs

* Update UI (lm-sys#2446)

* Add Optional SSL Support to controller.py (lm-sys#2448)

* Format & Improve docs

* Release v0.2.29 (lm-sys#2450)

* Show terms of use as a JS alert (lm-sys#2461)

* vLLM worker AWQ quantization update (lm-sys#2463)

Co-authored-by: 董晓龙 <dongxiaolong@shiyanjia.com>

* Fix falcon chat template (lm-sys#2464)

---------

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Trangle <kw_w@foxmail.com>
Co-authored-by: Nathan Stitt <nathan@stitt.org>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: leiwen83 <leiwen83@users.noreply.github.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Jon Durbin <jon@jondurbin.com>
Co-authored-by: Jon Durbin <jon.durbin@onna.com>
Co-authored-by: Rayrtfr <2384172887@qq.com>
Co-authored-by: wuyongyu <wuyongyu@atomecho.xyz>
Co-authored-by: wangxiyuan <wangxiyuan@huawei.com>
Co-authored-by: Jeff (Zhen) Wang <wangzhen263@gmail.com>
Co-authored-by: karshPrime <94996251+karshPrime@users.noreply.github.com>
Co-authored-by: obitolyz <obitoquilt@qq.com>
Co-authored-by: Shangwei Chen <109785802+Somezak1@users.noreply.github.com>
Co-authored-by: HyungJin Ahn <crushed7@o.cnu.ac.kr>
Co-authored-by: zhangsibo1129 <134488188+zhangsibo1129@users.noreply.github.com>
Co-authored-by: Tobias Birchler <tobias@birchlerfamily.ch>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Mingdao Liu <joshua@btlmd.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Brandon Biggs <brandonsbiggs@gmail.com>
Co-authored-by: dongxiaolong <774848421@qq.com>
Co-authored-by: 董晓龙 <dongxiaolong@shiyanjia.com>
1 parent 9b11481 commit a887de7
Showing 62 changed files with 1,226 additions and 673 deletions.
5 changes: 4 additions & 1 deletion README.md
@@ -16,6 +16,10 @@ We are focused to support Llama2 at scale now. If you want any other models, ple

## Dev Log

### 2023-09

Sync upstream changes.

### 2023-08

Support llama2 at scale.
@@ -37,4 +41,3 @@ Support "Llama-2-13b-chat-hf" and make it the default for API.

* API key database and rate limit enforcement
* Deployable on Kubernetes

13 changes: 12 additions & 1 deletion docs/commands/leaderboard.md
@@ -11,5 +11,16 @@ python3 clean_battle_data.py

### Run Elo analysis
```
python3 elo_analysis.py --clean-battle-file clean_battle_20230523.json
python3 elo_analysis.py --clean-battle-file clean_battle_20230905.json
```

### Copy files to HF space
1. update plots
```
scp atlas:/data/lmzheng/FastChat/fastchat/serve/monitor/elo_results_20230905.pkl .
```

2. update table
```
wget https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/raw/main/leaderboard_table_20230905.csv
```
3 changes: 3 additions & 0 deletions docs/commands/test_process.md
@@ -1,3 +1,6 @@
## Unit tests for FastChat
The scripts are under [FastChat/tests](../../tests).

### Test CLI Inference

```
2 changes: 1 addition & 1 deletion docs/commands/webserver.md
@@ -27,7 +27,7 @@ cd fastchat_logs/server0
export OPENAI_API_KEY=
export ANTHROPIC_API_KEY=
python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 10 --add-chatgpt --add-claude --add-palm --anony-only --elo ~/elo_results/elo_results_20230802.pkl --leaderboard-table-file ~/elo_results/leaderboard_table_20230802.csv --register ~/elo_results/register_oai_models.json
python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 10 --add-chatgpt --add-claude --add-palm --anony-only --elo ~/elo_results/elo_results.pkl --leaderboard-table-file ~/elo_results/leaderboard_table.csv --register ~/elo_results/register_oai_models.json --show-terms
python3 backup_logs.py
```
4 changes: 3 additions & 1 deletion docs/model_support.md
@@ -31,13 +31,15 @@
- [openaccess-ai-collective/manticore-13b-chat-pyg](https://huggingface.co/openaccess-ai-collective/manticore-13b-chat-pyg)
- [OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5](https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5)
- [VMware/open-llama-7b-v2-open-instruct](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)
- [Phind/Phind-CodeLlama-34B-v2](https://huggingface.co/Phind/Phind-CodeLlama-34B-v2)
- [project-baize/baize-v2-7b](https://huggingface.co/project-baize/baize-v2-7b)
- [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat)
- [Salesforce/codet5p-6b](https://huggingface.co/Salesforce/codet5p-6b)
- [StabilityAI/stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b)
- [THUDM/chatglm-6b](https://huggingface.co/THUDM/chatglm-6b)
- [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
- [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b)
- [tiiuae/falcon-180B-chat](https://huggingface.co/tiiuae/falcon-180B-chat)
- [timdettmers/guanaco-33b-merged](https://huggingface.co/timdettmers/guanaco-33b-merged)
- [togethercomputer/RedPajama-INCITE-7B-Chat](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Chat)
- [WizardLM/WizardLM-13B-V1.0](https://huggingface.co/WizardLM/WizardLM-13B-V1.0)
@@ -71,7 +73,7 @@ You can add `--debug` to see the actual prompt sent to the model.

FastChat uses the `Conversation` class to handle prompt templates and `BaseModelAdapter` class to handle model loading.

1. Implement a conversation template for the new model at [fastchat/conversation.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py). You can follow existing examples and use `register_conv_template` to add a new one.
1. Implement a conversation template for the new model at [fastchat/conversation.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py). You can follow existing examples and use `register_conv_template` to add a new one (see the sketch after this list). Please also add a link to the official reference code if possible.
2. Implement a model adapter for the new model at [fastchat/model/model_adapter.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_adapter.py). You can follow existing examples and use `register_model_adapter` to add a new one.
3. (Optional) add the model name to the "Supported models" [section](#supported-models) above and add more information in [fastchat/model/model_registry.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_registry.py).
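
A minimal sketch of steps 1 and 2 follows. The model name, roles, and separator values are placeholders, not a real model's spec; copy the exact values from the model's official reference code.

```python
from fastchat.conversation import (
    Conversation,
    SeparatorStyle,
    get_conv_template,
    register_conv_template,
)
from fastchat.model.model_adapter import BaseModelAdapter, register_model_adapter

# Step 1: register a conversation template (hypothetical values).
register_conv_template(
    Conversation(
        name="my-model",
        system_message="A chat between a user and an assistant.",
        roles=("USER", "ASSISTANT"),
        sep_style=SeparatorStyle.ADD_COLON_TWO,
        sep="\n",
        sep2="</s>",
    )
)

# Step 2: register an adapter that matches the model path and returns the template.
class MyModelAdapter(BaseModelAdapter):
    def match(self, model_path: str):
        return "my-model" in model_path.lower()

    def get_default_conv_template(self, model_path: str):
        return get_conv_template("my-model")

register_model_adapter(MyModelAdapter)
```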

2 changes: 1 addition & 1 deletion docs/openai_api.md
@@ -62,7 +62,7 @@ completion = openai.ChatCompletion.create(
print(completion.choices[0].message.content)
```

Streaming is also supported. See [test_openai_api.py](../tests/test_openai_api.py).
Streaming is also supported. See [test_openai_api.py](../tests/test_openai_api.py). If your API server is behind a proxy, you'll need to turn off buffering; in Nginx, you can do so by setting `proxy_buffering off;` in the `location` block for the proxy.
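
For example, a minimal Nginx `location` block might look like the sketch below. The upstream port assumes the API server's default of 8000; adjust both the path and port for your deployment.

```
location /v1 {
    proxy_pass http://localhost:8000;  # FastChat OpenAI-compatible API server
    proxy_buffering off;               # required for token-by-token streaming
}
```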

### cURL
cURL is another good tool for observing the output of the api.
29 changes: 29 additions & 0 deletions docs/training.md
@@ -87,3 +87,32 @@ deepspeed fastchat/train/train_lora_t5.py \
--deepspeed playground/deepspeed_config_s2.json

```

### Fine-tuning Vicuna-7B with Local NPUs

You can use the following command to train Vicuna-7B with 8 x Ascend 910B NPUs (60GB each). Use `--nproc_per_node` to specify the number of NPUs.
```bash
torchrun --nproc_per_node=8 --master_port=20001 fastchat/train/train.py \
--model_name_or_path ~/vicuna-7b-v1.5-16k \
--data_path data/dummy_conversation.json \
--fp16 True \
--output_dir output_vicuna \
--num_train_epochs 3 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True
```
5 changes: 5 additions & 0 deletions docs/vllm_integration.md
@@ -18,3 +18,8 @@ See the supported models [here](https://vllm.readthedocs.io/en/latest/models/sup
```
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3 --tokenizer hf-internal-testing/llama-tokenizer
```

If you use an AWQ quantized model, try:
```
python3 -m fastchat.serve.vllm_worker --model-path TheBloke/vicuna-7B-v1.5-AWQ --quantization awq
```
2 changes: 1 addition & 1 deletion fastchat/__init__.py
@@ -1 +1 @@
__version__ = "0.2.26"
__version__ = "0.2.29"
2 changes: 1 addition & 1 deletion fastchat/constants.py
@@ -15,7 +15,7 @@
CONVERSATION_LIMIT_MSG = "YOU HAVE REACHED THE CONVERSATION LENGTH LIMIT. PLEASE CLEAR HISTORY AND START A NEW CONVERSATION."
INACTIVE_MSG = "THIS SESSION HAS BEEN INACTIVE FOR TOO LONG. PLEASE REFRESH THIS PAGE."
# Maximum input length
INPUT_CHAR_LEN_LIMIT = int(os.getenv("FASTCHAT_INPUT_CHAR_LEN_LIMIT", 2560))
INPUT_CHAR_LEN_LIMIT = int(os.getenv("FASTCHAT_INPUT_CHAR_LEN_LIMIT", 3072))
# Maximum conversation turns
CONVERSATION_TURN_LIMIT = 50
# Session expiration time
84 changes: 80 additions & 4 deletions fastchat/conversation.py
@@ -27,6 +27,7 @@ class SeparatorStyle(IntEnum):
RWKV = auto()
PHOENIX = auto()
ROBIN = auto()
FALCON_CHAT = auto()


@dataclasses.dataclass
@@ -200,6 +201,17 @@ def get_prompt(self) -> str:
else:
ret += role + ":\n"
return ret
elif self.sep_style == SeparatorStyle.FALCON_CHAT:
ret = ""
if self.system_message:
ret += system_prompt + self.sep
for role, message in self.messages:
if message:
ret += role + ": " + message + self.sep
else:
ret += role + ":"

return ret
else:
raise ValueError(f"Invalid style: {self.sep_style}")

@@ -285,6 +297,17 @@ def get_conv_template(name: str) -> Conversation:
return conv_templates[name].copy()


# An empty template for raw conversation.
register_conv_template(
Conversation(
name="raw",
system_message="",
roles=("", ""),
sep_style=SeparatorStyle.NO_COLON_SINGLE,
sep="",
)
)
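
# Usage sketch (assuming this module's helpers): the raw template passes
# content through unchanged, with no roles or separators added.
#   conv = get_conv_template("raw")
#   conv.append_message(conv.roles[0], "Once upon a time")
#   conv.append_message(conv.roles[1], None)
#   conv.get_prompt()  # -> "Once upon a time"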

# A template with a one-shot conversation example
register_conv_template(
Conversation(
@@ -357,6 +380,17 @@ def get_conv_template(name: str) -> Conversation:
)
)

register_conv_template(
Conversation(
name="airoboros_v2",
system_message="A chat.",
roles=("USER", "ASSISTANT"),
sep_style=SeparatorStyle.ADD_COLON_TWO,
sep="\n",
sep2="</s>",
)
)

# Koala default template
register_conv_template(
Conversation(
@@ -743,11 +777,10 @@ def get_conv_template(name: str) -> Conversation:
Conversation(
name="xgen",
system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
roles=("### Human: ", "###"),
sep_style=SeparatorStyle.NO_COLON_SINGLE,
roles=("### Human", "### Assistant"),
sep_style=SeparatorStyle.ADD_COLON_SINGLE,
sep="\n",
stop_token_ids=[50256, 0, 1, 2],
stop_str="<|endoftext|>",
stop_token_ids=[50256],
)
)

@@ -793,6 +826,20 @@ def get_conv_template(name: str) -> Conversation:
)
)

# Baichuan2-13B-Chat template
register_conv_template(
# source: https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/c6f8592a60b4ad73c210b28dd2ab3cca51abbf93/modeling_baichuan.py#L773
# https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/generation_config.json
# https://github.com/baichuan-inc/Baichuan2/issues/62
Conversation(
name="baichuan2-chat",
roles=("<reserved_106>", "<reserved_107>"),
sep_style=SeparatorStyle.NO_COLON_SINGLE,
sep="",
stop_token_ids=[],
)
)

# llama2 template
# reference: https://huggingface.co/blog/codellama#conversational-instructions
# reference: https://github.com/facebookresearch/llama/blob/1a240688810f8036049e8da36b073f63d2ac552c/llama/generation.py#L212
@@ -905,6 +952,35 @@ def get_conv_template(name: str) -> Conversation:
)
)

# Falcon 180B chat template
# source: https://huggingface.co/spaces/tiiuae/falcon-180b-demo/blob/d1590ee7fae9b6ce331ba7808e61a29dcce9239f/app.py#L28-L37
register_conv_template(
Conversation(
name="falcon-chat",
roles=("User", "Falcon"),
system_template="System: {system_message}",
messages=[],
sep_style=SeparatorStyle.FALCON_CHAT,
sep="\n",
sep2="<|endoftext|>",
stop_str="\nUser:", # use stop_str to stop generation after stop_token_ids; it is also removed from the generated text
)
)
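
# Usage sketch (assuming this module's helpers) of the prompt this template renders:
#   conv = get_conv_template("falcon-chat")
#   conv.set_system_message("You are a helpful assistant.")
#   conv.append_message(conv.roles[0], "Hello!")
#   conv.append_message(conv.roles[1], None)
#   conv.get_prompt()
#   -> "System: You are a helpful assistant.\nUser: Hello!\nFalcon:"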

# Phind template
# source: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2
register_conv_template(
Conversation(
name="phind",
system_message="### System Prompt\nYou are an intelligent programming assistant.",
roles=("### User Message", "### Assistant"),
messages=(),
offset=0,
sep_style=SeparatorStyle.ADD_COLON_SINGLE,
sep="\n\n",
)
)


if __name__ == "__main__":
print("Vicuna template:")
1 change: 0 additions & 1 deletion fastchat/data/merge.py
@@ -6,7 +6,6 @@

import argparse
import json
from typing import Dict, Sequence, Optional


if __name__ == "__main__":
7 changes: 4 additions & 3 deletions fastchat/llm_judge/README.md
@@ -1,5 +1,5 @@
# LLM Judge
| [Paper](https://arxiv.org/abs/2306.05685) | [Leaderboard](https://chat.lmsys.org/?leaderboard) |
| [Paper](https://arxiv.org/abs/2306.05685) | [Leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) |

In this package, you can use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge.
MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants.
@@ -10,7 +10,7 @@ To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as j
- [Review Pre-Generated Model Answers and Judgments](#review-pre-generated-model-answers-and-judgments)
- [MT-Bench](#mt-bench)
- [Agreement Computation](#agreement-computation)
- [Dataset](#dataset)
- [Datasets](#datasets)
- [Citation](#citation)

## Install
@@ -64,6 +64,7 @@ This mode asks GPT-4 to grade and give a score to model's answer directly withou
For each turn, GPT-4 will give a score on a scale of 10. We then compute the average score on all turns.

```
export OPENAI_API_KEY=XXXXXX # set the OpenAI API key
python gen_judgment.py --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
```

@@ -133,7 +134,7 @@ We released 3.3K human annotations for model responses generated by 6 models in

This Colab [notebook](https://colab.research.google.com/drive/1ctgygDRJhVGUJTQy8-bRZCl1WNcT8De6?usp=sharing) shows how to compute the agreement between humans and the GPT-4 judge with the dataset. Our results show that humans and the GPT-4 judge achieve over 80\% agreement, the same level of agreement as between humans.

## Dataset
## Datasets
- [Chatbot Arena Conversation Dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)
- [MT-bench Human Annotation Dataset](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)

29 changes: 29 additions & 0 deletions fastchat/llm_judge/common.py
@@ -418,6 +418,35 @@ def chat_compeletion_openai(model, conv, temperature, max_tokens):
return output


def chat_compeletion_openai_azure(model, conv, temperature, max_tokens):
openai.api_type = "azure"
openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]
openai.api_key = os.environ["AZURE_OPENAI_KEY"]
openai.api_version = "2023-05-15"

if "azure-" in model:
model = model[6:]

output = API_ERROR_OUTPUT
for _ in range(API_MAX_RETRY):
try:
messages = conv.to_openai_api_messages()
response = openai.ChatCompletion.create(
engine=model,
messages=messages,
n=1,
temperature=temperature,
max_tokens=max_tokens,
)
output = response["choices"][0]["message"]["content"]
break
except openai.error.OpenAIError as e:
print(type(e), e)
time.sleep(API_RETRY_SLEEP)

return output
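
# Usage sketch (hypothetical deployment name; requires AZURE_OPENAI_ENDPOINT and
# AZURE_OPENAI_KEY to be set). The "azure-" prefix is stripped before the call:
#   output = chat_compeletion_openai_azure("azure-gpt-4", conv, temperature=0.0, max_tokens=512)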


def chat_compeletion_anthropic(model, conv, temperature, max_tokens):
output = API_ERROR_OUTPUT
for _ in range(API_MAX_RETRY):