Add Hugging face client #42

philschmid · 2024-03-28T12:36:58Z

What does this PR do?

This PR adds a dedicated Hugging Face client, which allows llmperf users to benchmark Hugging Face models served with TGI on the Inference API, on Inference Endpoints, or locally (any URL).

Below is a simple example.

Run TGI:

docker run --gpus all -ti -p 8080:80 -e MODEL_ID=HuggingFaceH4/zephyr-7b-beta ghcr.io/huggingface/text-generation-inference:latest

Run the benchmark:

export HUGGINGFACE_API_BASE="http://localhost:8080"
export MODEL_ID="HuggingFaceH4/zephyr-7b-beta"

python token_benchmark_ray.py \
--model $MODEL_ID \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api huggingface \
--additional-sampling-params '{}'
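
Before running the benchmark, it may be worth confirming that the TGI container from the example above is reachable. The sketch below is not part of this PR. It assumes the server exposes TGI's `/generate` route at `HUGGINGFACE_API_BASE` and returns a JSON object with a `generated_text` field. The helper names `build_generate_request` and `smoke_test` are hypothetical.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=20, token=None):
    """Build a POST request for TGI's /generate route (hypothetical helper)."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed for protected endpoints; a local container usually has no auth.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def smoke_test(base_url, prompt="Hello"):
    """Send one short request and return the generated text."""
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["generated_text"]


# Example (requires a running server): smoke_test("http://localhost:8080")
```

If `smoke_test` raises a connection error, the container is probably still loading the model or the port mapping differs from `-p 8080:80`.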

philschmid · 2024-03-28T12:38:59Z

cc @waleedkadous

Signed-off-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

slyt · 2024-07-08T20:51:29Z

@philschmid The README mentions HUGGINGFACE_API_KEY, but I couldn't get your fork to benchmark Llama3 on an instance of a text-generation-inference server without specifying HUGGINGFACE_API_TOKEN. Is there a difference between HUGGINGFACE_API_TOKEN and HUGGINGFACE_API_KEY? Should all references be one or the other?

src/llmperf/ray_clients/huggingface_client.py is using HUGGINGFACE_API_TOKEN
litellm is using HUGGINGFACE_API_KEY
The Hugging Face Hub Python library has HF_TOKEN, which supersedes the deprecated HUGGING_FACE_HUB_TOKEN
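
One way to reconcile these names would be to accept any of them and use the first one that is set. The sketch below is only an illustration of that idea, not what the PR currently does. The function `resolve_hf_token` and the priority order are assumptions.

```python
import os

# Candidate variable names from the list above, in an assumed priority order.
TOKEN_ENV_VARS = (
    "HUGGINGFACE_API_TOKEN",   # read by the client in this PR
    "HUGGINGFACE_API_KEY",     # name mentioned in the README and used by litellm
    "HF_TOKEN",                # current huggingface_hub name
    "HUGGING_FACE_HUB_TOKEN",  # deprecated huggingface_hub name
)


def resolve_hf_token(env=None):
    """Return the first non-empty token found, or None if none is set."""
    env = os.environ if env is None else env
    for name in TOKEN_ENV_VARS:
        value = env.get(name)
        if value:
            return value
    return None
```

Passing a plain dict as `env` makes the lookup easy to check without touching the real environment, for example `resolve_hf_token({"HF_TOKEN": "hf_xxx"})`.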

If HUGGINGFACE_API_TOKEN is not set, you get this error when trying to benchmark meta-llama/Meta-Llama-3-70B-Instruct. The tokenizer can't be pulled without the token because the Llama3 repo is gated behind an agreement acknowledgment page:

OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct.
401 Client Error. (Request ID: Root=1-668c4b2e-082a7cbe6986c4514589204c;528c624d-4cfa-42f0-bd0f-d3f2e1431fbf)

Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct/resolve/main/config.json.
Access to model meta-llama/Meta-Llama-3-70B-Instruct is restricted. You must be authenticated to access it.
  0%|                                                               | 0/2 [00:06<?, ?it/s]
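
For reference, a gated tokenizer can usually be loaded by passing the token explicitly. The sketch below assumes a recent `transformers` release in which `AutoTokenizer.from_pretrained` accepts a `token` argument (older releases used `use_auth_token`). It reads `HUGGINGFACE_API_TOKEN` as described above. The helper names are hypothetical and this is not the code from the PR.

```python
import os


def tokenizer_kwargs(model_id, env=None):
    """Build from_pretrained arguments, adding the token only when one is set."""
    env = os.environ if env is None else env
    kwargs = {"pretrained_model_name_or_path": model_id}
    token = env.get("HUGGINGFACE_API_TOKEN")
    if token:
        # Required for gated repos such as meta-llama/Meta-Llama-3-70B-Instruct.
        kwargs["token"] = token
    return kwargs


def load_tokenizer(model_id):
    """Load the tokenizer, authenticating if a token is available."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(**tokenizer_kwargs(model_id))
```

Without a token the call still works for public models; for gated repos the account behind the token must also have accepted the model's access agreement.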

philschmid added 2 commits March 28, 2024 12:30

add hugging face client

d507f5c

add token for gated and private models

febbcb1

philschmid and others added 19 commits March 28, 2024 16:47

fix to make sure base models work as well

73e6f59

removed unnecessary res

75187d2

Update README.md

7ad763f

Signed-off-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

Update README.md

6c2ae61

Signed-off-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

Update README.md

0411dc2

Signed-off-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

add hf endpoint to sm client

2cbacf9

Merge branch 'main' of https://github.com/philschmid/llmperf

6c6f598

add messages client

9c844e0

clean up readme

c6abd52

update

5be9eee

init

acbee87

first results

e22c830

Merge pull request #1 from philschmid/benchmark

9321dc7

Benchmark

x

61362d9

Merge pull request #2 from philschmid/benchmark

cc7753b

x

updatae

eb77e63

wip

e5833c0

updated csv

bd517ba

inf2

5c1c321
