This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Benchmark : Update benchmark configs for Nightly #126

Merged 6 commits into main on Mar 15, 2024

Conversation

varun-sundar-rabindranath

@varun-sundar-rabindranath varun-sundar-rabindranath commented Mar 14, 2024

SUMMARY:
Update the benchmark configs such that the Nightly runs the following models,

  • 7b Mistral

    • Base : teknium/OpenHermes-2.5-Mistral-7B
    • GPTQ : TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ
    • Marlin : neuralmagic/OpenHermes-2.5-Mistral-7B-marlin
    • Sparse : neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50
    • Sparse 2:4 : neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4
  • Llama 7b fp16

    • NousResearch/Llama-2-7b-chat-hf #fp16
  • Update benchmark_serving num_prompts and qps pairs.

  • Minor update to the benchmark_throughput prefill and decode cases.
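The model matrix above can be summarized in a small mapping like the following sketch. The structure and key names here are illustrative only; the actual benchmark config format in the repo may differ.

```python
# Hypothetical sketch of the nightly model matrix described above.
# Keys and structure are illustrative; the repo's real config format may differ.
NIGHTLY_MODELS = {
    "mistral-7b": {
        "base": "teknium/OpenHermes-2.5-Mistral-7B",
        "gptq": "TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ",
        "marlin": "neuralmagic/OpenHermes-2.5-Mistral-7B-marlin",
        "sparse": "neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50",
        "sparse-2:4": "neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4",
    },
    "llama-7b-fp16": {
        "base": "NousResearch/Llama-2-7b-chat-hf",
    },
}

# Total distinct models covered by the nightly run.
total_models = sum(len(variants) for variants in NIGHTLY_MODELS.values())
print(total_models)  # 6
```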

TEST PLAN:
Manual testing

@varun-sundar-rabindranath
Author

varun-sundar-rabindranath commented Mar 15, 2024

The new config takes 6.5 hrs on an A10G x4 machine: 4.5 hrs for the serving case and 2 hrs for the throughput case.
Extrapolating from previous benchmark runs, it will likely take around 10 hrs on an A10G x1 machine.

Also note that this is with "benchmark iterations" set to 1.

We can split the runs into two to speed up the process, but I am not sure about the implications on cost. @robertgshaw2-neuralmagic @mgoin @dhuangnm any suggestions?

@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Mar 15, 2024

Do you know why it takes so long for the serving case?

Looks like from the config we have 5 models and 5 scenarios, and each scenario takes 5 minutes --> 5 * 5 * 5 = 125 minutes / 60 ≈ 2 hrs?

I think we should adjust to ~2.5 minutes per scenario.

@varun-sundar-rabindranath
Author

@robertgshaw2-neuralmagic we have 6 models and 5 scenarios, so a total of 6 * 5 * 5 = 150 minutes = 2.5 hrs, which is still considerably smaller than what we are seeing. Some reasons I can think of:

  1. we generate the dataset from scratch for each run. We should cache it.
  2. model loading time
  3. engine warmup time (after creating the engine, we run 1000 prompts at infinity qps)

I can look at this as a follow-up.
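The arithmetic above, and the dataset-caching idea from point 1, can be sketched as follows. The `build_prompt_dataset` helper is a hypothetical stand-in for the actual dataset generator in the benchmark scripts, shown only to illustrate caching the generation step.

```python
import functools

# Back-of-envelope serving time: 6 models x 5 scenarios x ~5 min each.
models, scenarios, minutes_each = 6, 5, 5
estimated_hrs = models * scenarios * minutes_each / 60
print(estimated_hrs)  # 2.5 -- well under the ~4.5 hrs actually observed

# Sketch of caching the generated dataset (point 1) so it is built once
# per process instead of regenerated for every benchmark invocation.
# `build_prompt_dataset` is hypothetical, not the real generator's name.
@functools.lru_cache(maxsize=None)
def build_prompt_dataset(num_prompts: int) -> tuple:
    # The expensive generation runs only on the first call per num_prompts;
    # subsequent calls return the cached tuple.
    return tuple(f"prompt-{i}" for i in range(num_prompts))
```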

@varun-sundar-rabindranath varun-sundar-rabindranath merged commit ac8f242 into main Mar 15, 2024
2 checks passed
@varun-sundar-rabindranath varun-sundar-rabindranath deleted the varun/update-nightly-configs branch March 15, 2024 20:43