
Add lm-eval full accuracy sweep using GSM8k #166

Merged
merged 4 commits into main from lm-eval-gsm8k on Apr 5, 2024

Conversation

@mgoin (Member) commented Apr 4, 2024

Using the OpenAI backend of lm-eval (model="local-completions"), this adds a pytest that spins up a vLLM OpenAI server for various models (Llama 2, Mistral, Phi 2, Mixtral) and runs GSM8k evals against the server, comparing the results with known accuracy values. This should be a good guard against accuracy regressions for fp16, sparse, and marlin models as we make releases or upstream syncs. For now, we will leave this as a manually triggered workflow.
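
Roughly, the per-model flow looks like the sketch below. This is a minimal illustration only: the exact server flags, readiness check, tolerance, and lm-eval arguments live in the test itself and may differ.

# Minimal sketch of the per-model flow, assuming the lm-eval Python API
# (lm_eval.simple_evaluate) and the vLLM OpenAI-compatible server entrypoint.
# The real test wires this up through pytest fixtures; flags, the /health
# poll, and the tolerance below are assumptions for illustration.
import subprocess
import time

import lm_eval
import requests

MODEL = "NousResearch/Llama-2-7b-chat-hf"  # one of the models listed below

# 1. Spin up the vLLM OpenAI-compatible server for this model.
server = subprocess.Popen([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", MODEL, "--port", "8000",
])
try:
    # 2. Wait until the server responds before starting the eval.
    for _ in range(300):
        try:
            if requests.get("http://localhost:8000/health").status_code == 200:
                break
        except requests.ConnectionError:
            pass
        time.sleep(1)

    # 3. Run GSM8k through lm-eval's OpenAI "local-completions" backend,
    #    pointing it at the local server.
    results = lm_eval.simple_evaluate(
        model="local-completions",
        model_args=(
            f"model={MODEL},"
            "base_url=http://localhost:8000/v1/completions"
        ),
        tasks=["gsm8k"],
    )

    # 4. Compare the measured metric against the pre-recorded value
    #    (the tolerance here is a placeholder, not the test's actual value).
    measured = results["results"]["gsm8k"]["exact_match,strict-match"]
    assert abs(measured - 0.2266868840030326) <= 0.05
finally:
    server.terminate()
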

These are the models and evals set up for this PR:

# Each entry in this list pairs a model id with an EvalDefinition.
# The EvalDefinition holds a list of Tasks to evaluate the model on,
# each with its own pre-recorded Metrics
MODEL_TEST_POINTS = [
    # Llama 2 7B: FP16, FP16 sparse, marlin
    ("NousResearch/Llama-2-7b-chat-hf",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.2266868840030326),
                  Metric("exact_match,flexible-extract", 0.22820318423047764)
              ])
     ])),
    ("neuralmagic/Llama-2-7b-pruned50-retrained-ultrachat",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.09855951478392722),
                  Metric("exact_match,flexible-extract", 0.10083396512509477)
              ])
     ],
                    extra_args=["--sparsity", "sparse_w16a16"])),
    ("neuralmagic/llama-2-7b-chat-marlin",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.14101592115238817),
                  Metric("exact_match,flexible-extract", 0.1652767247915087)
              ])
     ],
                    enable_tensor_parallel=False)),
    # Mistral 7B: FP16, FP16 sparse, marlin
    ("teknium/OpenHermes-2.5-Mistral-7B",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.6004548900682335),
                  Metric("exact_match,flexible-extract", 0.6482183472327521)
              ])
     ])),
    ("neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.4935557240333586),
                  Metric("exact_match,flexible-extract", 0.5269143290371494)
              ])
     ],
                    extra_args=["--sparsity", "sparse_w16a16"])),
    ("neuralmagic/OpenHermes-2.5-Mistral-7B-marlin",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.4935557240333586),
                  Metric("exact_match,flexible-extract", 0.5868081880212282)
              ])
     ],
                    enable_tensor_parallel=False)),
    # Phi 2: marlin
    ("neuralmagic/phi-2-super-marlin",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.49962092494313876),
                  Metric("exact_match,flexible-extract", 0.5041698256254739)
              ])
     ],
                    enable_tensor_parallel=False)),
    # Mixtral: FP16
    ("mistralai/Mixtral-8x7B-Instruct-v0.1",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.6550416982562547),
                  Metric("exact_match,flexible-extract", 0.6603487490523123)
              ])
     ],
                    enable_tensor_parallel=True)),
]
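
For reference, EvalDefinition, Task, and Metric are small helper containers defined alongside the test. A rough sketch of their shape is below; the field defaults here are assumptions for illustration, not the exact definitions from the PR.

# Rough sketch of the containers used above; the actual definitions live in
# the PR's test file.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Metric:
    # lm-eval metric key, e.g. "exact_match,strict-match"
    name: str
    # pre-recorded reference value for this model/task
    value: float


@dataclass
class Task:
    # lm-eval task name, e.g. "gsm8k"
    name: str
    metrics: List[Metric] = field(default_factory=list)


@dataclass
class EvalDefinition:
    tasks: List[Task] = field(default_factory=list)
    # extra CLI args passed to the vLLM server, e.g. ["--sparsity", "sparse_w16a16"]
    extra_args: List[str] = field(default_factory=list)
    # whether to shard the model across the available GPUs
    enable_tensor_parallel: bool = False
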

@mgoin mgoin merged commit 802bca1 into main Apr 5, 2024
2 checks passed
@mgoin mgoin deleted the lm-eval-gsm8k branch April 5, 2024 18:39