This repository was archived by the owner on Oct 11, 2024. It is now read-only.

Add lm-eval correctness test #210

Merged 32 commits on May 10, 2024
32 commits
952a0db
Add test framework for server
dbarbuzzi Apr 22, 2024
6178aea
Update docstring
dbarbuzzi Apr 22, 2024
c13b5c2
Add missing '__init__.py'
dbarbuzzi Apr 22, 2024
df48eef
In-line updated `ServerRunner` implementation
dbarbuzzi Apr 23, 2024
09f7161
Restore logging of server command args
dbarbuzzi Apr 24, 2024
2b32a92
Add lm-eval correctness test
dbarbuzzi Apr 24, 2024
74d0293
Add "--max-model-len" arg
dbarbuzzi Apr 24, 2024
4f6a5cf
Adjust relative tolerance value to 0.05
dbarbuzzi Apr 24, 2024
7392992
Change '--max-model-len' to 2048
dbarbuzzi Apr 25, 2024
3ebcc81
Fix comment length, remove outdated comment
dbarbuzzi Apr 25, 2024
a790a1f
Update comment
dbarbuzzi Apr 25, 2024
431f051
Skip if `lm_eval` is not available
dbarbuzzi Apr 29, 2024
44d781f
Merge branch 'main' into add-lm-eval-correctness-test
dbarbuzzi May 3, 2024
6856f24
Skip test in remote push jobs
dbarbuzzi May 3, 2024
dc33cee
Fix check in lm-eval smoke test
dbarbuzzi May 3, 2024
9bf3a71
Update lm-eval smoke job to use prebuilt wheel
dbarbuzzi May 3, 2024
c914b36
Fix typing in test
dbarbuzzi May 3, 2024
da1adf2
Add lm-eval-full job on release runs
dbarbuzzi May 3, 2024
473f8ee
Skip full test in nightly
dbarbuzzi May 3, 2024
f316375
Fix style
dbarbuzzi May 3, 2024
c61d6b2
Update eval task configs
dbarbuzzi May 3, 2024
44df6ad
Add support for configurable `rtol`
dbarbuzzi May 3, 2024
7a1ecdf
Mark 'chat-marlin' model as xfail
dbarbuzzi May 3, 2024
49d115b
Use correct label for TEST-LM-EVAL-FULL
dbarbuzzi May 3, 2024
d6571d4
Only run full lm-eval on a weekly cadence
dbarbuzzi May 6, 2024
3b25154
Update naming
dbarbuzzi May 6, 2024
5308642
Add manual release workflow
dbarbuzzi May 7, 2024
4471031
Remove xfail logic
dbarbuzzi May 7, 2024
e972635
Fix release workflow category
dbarbuzzi May 7, 2024
73adc9f
Disable marlin models
dbarbuzzi May 9, 2024
638d924
Separate nightly/weekly workflows
dbarbuzzi May 9, 2024
9828633
Additional fix for lm-eval smoke check
dbarbuzzi May 10, 2024
6 changes: 5 additions & 1 deletion .github/actions/nm-lm-eval-accuracy/action.yml
@@ -12,6 +12,10 @@ runs:
steps:
- id: lm-eval
run: |
# move source directories
mv vllm vllm-ignore || echo "no 'vllm' folder to move"
mv csrc csrc-ignore || echo "no 'csrc' folder to move"

COMMIT=${{ github.sha }}
VENV="${{ inputs.venv }}-${COMMIT:0:7}"
source $(pyenv root)/versions/${{ inputs.python }}/envs/${VENV}/bin/activate
@@ -20,7 +24,7 @@ runs:
pip3 install pytest openai==1.3.9

SUCCESS=0
pytest .github/scripts/test_lm_eval_sweep.py -s -v || SUCCESS=$?
pytest -v tests/accuracy/test_lm_eval_correctness.py || SUCCESS=$?
echo "test=${SUCCESS}" >> "$GITHUB_OUTPUT"
exit ${SUCCESS}
shell: bash
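The action above runs the correctness test through pytest. Per the commit history, the test compares lm-eval task scores against stored references with a relative tolerance (adjusted to 0.05 in this PR and later made configurable per model). A minimal sketch of that comparison, assuming hypothetical metric names and reference values — the actual ground truths live in the eval task configs, not here:

```python
# Hypothetical sketch of an rtol-based score check; GROUND_TRUTH and the
# metric names are illustrative, not taken from the real test file.
import math

RTOL = 0.05  # relative tolerance, per "Adjust relative tolerance value to 0.05"

# Hypothetical reference scores for two eval tasks.
GROUND_TRUTH = {"gsm8k/acc": 0.74, "hellaswag/acc_norm": 0.82}

def within_tolerance(measured: dict, expected: dict, rtol: float = RTOL) -> bool:
    """Return True when every measured metric is within rtol of its reference."""
    return all(
        math.isclose(measured[name], ref, rel_tol=rtol)
        for name, ref in expected.items()
    )
```

A score drifting outside the tolerance band (e.g. a gsm8k accuracy of 0.60 against a 0.74 reference) would fail the check and, in turn, the pytest run.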
4 changes: 4 additions & 0 deletions .github/actions/nm-lm-eval-smoke/action.yml
@@ -12,6 +12,10 @@ runs:
steps:
- id: lm-eval
run: |
# move source directories
mv vllm vllm-ignore || echo "no 'vllm' folder to move"
mv csrc csrc-ignore || echo "no 'csrc' folder to move"

COMMIT=${{ github.sha }}
VENV="${{ inputs.venv }}-${COMMIT:0:7}"
source $(pyenv root)/versions/${{ inputs.python }}/envs/${VENV}/bin/activate
5 changes: 5 additions & 0 deletions .github/data/nm_benchmark_weekly_configs_list.txt
@@ -0,0 +1,5 @@
neuralmagic/benchmarks/configs/benchmark_serving.json
neuralmagic/benchmarks/configs/benchmark_throughput.json
neuralmagic/benchmarks/configs/benchmark_throughput_decode.json
neuralmagic/benchmarks/configs/benchmark_throughput_prefill.json
neuralmagic/benchmarks/configs/benchmark_remote_push.json
4 changes: 2 additions & 2 deletions .github/scripts/lm_eval_compare_hf_vs_vllm.py
@@ -38,7 +38,7 @@ def print_results(data_to_print: List = None,
def check_passing_score(results_dict: Dict = None,
alpha: float = None) -> bool:
for task in results_dict:
p_value = task["p_value"]
p_value = results_dict[task]["p_value"]
if p_value <= alpha:
return False
return True
@@ -120,6 +120,6 @@ def parse_args():
all_res[task1[0]] = {"z": z, "p_value": p_value}
print_results([results_hf["results"], results_vllm["results"]], all_res,
args.alpha)
if not check_passing_score:
if not check_passing_score(all_res, args.alpha):
print("Accuracy test failed!")
exit(1)
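The two hunks above fix a pair of classic Python bugs: iterating a dict yields its keys (strings), so indexing `task["p_value"]` was wrong, and `if not check_passing_score:` tested the truthiness of the function object itself, which is always truthy, so the failure branch could never run. A minimal illustrative sketch of both, with a `results` shape mirroring the script's `all_res` mapping (task name to z/p_value dict):

```python
# Illustrative sketch of the two bugs this diff fixes; the data is
# hypothetical, shaped like the script's `all_res` mapping.
results = {"arc_easy": {"z": 1.1, "p_value": 0.27}}

# Bug 1: iterating a dict yields its keys, so `task` is the string
# "arc_easy" and task["p_value"] fails. The fix indexes the dict:
for task in results:
    p_value = results[task]["p_value"]  # correct: look the key up

def check_passing_score(results_dict, alpha):
    # Passing means no task's p_value falls at or below alpha.
    return all(r["p_value"] > alpha for r in results_dict.values())

# Bug 2: `not check_passing_score` (no parentheses) is always False,
# because a function object is truthy. The fix calls it:
passed = check_passing_score(results, alpha=0.05)
```

The second bug is particularly insidious because the script still ran cleanly; it just silently skipped the accuracy gate.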
223 changes: 0 additions & 223 deletions .github/scripts/test_lm_eval_sweep.py

This file was deleted.

43 changes: 28 additions & 15 deletions .github/workflows/build-test.yml
@@ -4,7 +4,7 @@ on:
workflow_call:
inputs:
wf_category:
description: "categories: REMOTE, NIGHTLY, RELEASE"
description: "categories: REMOTE, NIGHTLY, WEEKLY, RELEASE"
type: string
default: "REMOTE"
python:
@@ -177,17 +177,30 @@ jobs:
push_benchmark_results_to_gh_pages: "${{ github.event_name == 'schedule' || inputs.push_benchmark_results_to_gh_pages }}"
secrets: inherit

# TODO: decide if this should build or use the whl
# single gpu
# TODO: this should only run if doing a NIGHTLY or RELEASE
# Accuracy-Smoke-AWS-AVX2-32G-A10G-24G:
# if: ${{ inputs.wf_category == 'NIGHTLY' || inputs.wf_category == 'RELEASE' }}
# uses: ./.github/workflows/nm-lm-eval-smoke.yml
# with:
# label: ${{ inputs.test_label_solo }}
# timeout: ${{ inputs.benchmark_timeout }}
# gitref: ${{ github.ref }}
# Gi_per_thread: ${{ inputs.Gi_per_thread }}
# nvcc_threads: ${{ inputs.nvcc_threads }}
# python: ${{ inputs.python }}
# secrets: inherit
TEST-ACCURACY-SMOKE:
needs: [BUILD]
if: inputs.wf_category == 'NIGHTLY'
uses: ./.github/workflows/nm-lm-eval-smoke.yml
with:
label: ${{ inputs.test_label_solo }}
timeout: ${{ inputs.benchmark_timeout }}
gitref: ${{ inputs.gitref }}
Gi_per_thread: ${{ inputs.Gi_per_thread }}
nvcc_threads: ${{ inputs.nvcc_threads }}
python: ${{ inputs.python }}
whl: ${{ needs.BUILD.outputs.whl }}
secrets: inherit

TEST-ACCURACY-FULL:
needs: [BUILD]
if: ${{ inputs.wf_category == 'WEEKLY' || inputs.wf_category == 'RELEASE' }}
uses: ./.github/workflows/nm-lm-eval-accuracy.yml
with:
label: ${{ inputs.test_label_multi }}
timeout: ${{ inputs.benchmark_timeout }}
gitref: ${{ inputs.gitref }}
Gi_per_thread: ${{ inputs.Gi_per_thread }}
nvcc_threads: ${{ inputs.nvcc_threads }}
python: ${{ inputs.python }}
whl: ${{ needs.BUILD.outputs.whl }}
secrets: inherit
4 changes: 2 additions & 2 deletions .github/workflows/nightly.yml
@@ -3,7 +3,7 @@ run-name: ${{ github.actor }} triggered nightly on ${{ github.ref }}
on:
schedule:
# * is a special character in YAML so you have to quote this string
- cron: '0 1 * * *'
- cron: '0 1 * * 1-6' # nightly run (Mon-Sat)

workflow_dispatch:
inputs:
@@ -27,7 +27,7 @@ jobs:
test_label_solo: aws-avx2-32G-a10g-24G
test_label_multi: aws-avx2-192G-4-a10g-96G
test_timeout: 480
test_skip_list:
test_skip_list: neuralmagic/tests/skip-for-nightly.txt

benchmark_label: aws-avx2-32G-a10g-24G
benchmark_config_list_file: ./.github/data/nm_benchmark_nightly_configs_list.txt
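The cron tweak in the nightly workflow above narrows the schedule from every day (`0 1 * * *`) to Monday through Saturday (`0 1 * * 1-6`), leaving Sunday for the separate weekly full lm-eval workflow this PR splits out. A tiny illustrative matcher for just the day-of-week field — it handles only `*` and simple ranges, not full cron syntax:

```python
# Toy matcher for the cron day-of-week field (0=Sunday..6=Saturday, the
# standard cron convention); supports only "*" and "lo-hi" ranges.
def dow_matches(field: str, dow: int) -> bool:
    if field == "*":
        return True  # '0 1 * * *' fires every day
    lo, hi = (int(x) for x in field.split("-"))
    return lo <= dow <= hi  # '0 1 * * 1-6' fires Mon-Sat only
```

So with the new schedule, Sunday (day 0) no longer matches the nightly trigger, which keeps the weekly full run from contending with a nightly run for the same runners.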