
Add lm-eval full accuracy sweep using GSM8k #166

Merged
merged 4 commits into main from lm-eval-gsm8k on Apr 5, 2024

Conversation

@mgoin (Member) commented Apr 4, 2024

Using the OpenAI backend of lm-eval (model="local-completions"), this adds a pytest that spins up a vLLM OpenAI server for various models (Llama 2, Mistral, Phi 2, Mixtral) and runs GSM8k evals against the server, comparing the results with known accuracy values. This should be a good guard against accuracy regressions for fp16, sparse, and marlin models as we make releases or upstream syncs. For now, we will leave this as a manually triggered workflow.
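
Roughly, the per-model flow looks like the sketch below. This is a minimal illustration only: the exact server flags, readiness check, tolerance, and lm-eval arguments live in the test itself and may differ.

# Minimal sketch of the per-model flow, assuming the lm-eval Python API
# (lm_eval.simple_evaluate) and the vLLM OpenAI-compatible server entrypoint.
# The real test wires this up through pytest fixtures; flags, the /health
# poll, and the tolerance below are assumptions for illustration.
import subprocess
import time

import lm_eval
import requests

MODEL = "NousResearch/Llama-2-7b-chat-hf"  # one of the models listed below

# 1. Spin up the vLLM OpenAI-compatible server for this model.
server = subprocess.Popen([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", MODEL, "--port", "8000",
])
try:
    # 2. Wait until the server responds before starting the eval.
    for _ in range(300):
        try:
            if requests.get("http://localhost:8000/health").status_code == 200:
                break
        except requests.ConnectionError:
            pass
        time.sleep(1)

    # 3. Run GSM8k through lm-eval's OpenAI "local-completions" backend,
    #    pointing it at the local server.
    results = lm_eval.simple_evaluate(
        model="local-completions",
        model_args=(
            f"model={MODEL},"
            "base_url=http://localhost:8000/v1/completions"
        ),
        tasks=["gsm8k"],
    )

    # 4. Compare the measured metric against the pre-recorded value
    #    (the tolerance here is a placeholder, not the test's actual value).
    measured = results["results"]["gsm8k"]["exact_match,strict-match"]
    assert abs(measured - 0.2266868840030326) <= 0.05
finally:
    server.terminate()
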

These are the models and evals set up for this PR:

# Each entry in this list pairs a model id with an EvalDefinition.
# The EvalDefinition holds a list of Tasks to evaluate the model on,
# each with its own pre-recorded Metrics
MODEL_TEST_POINTS = [
    # Llama 2 7B: FP16, FP16 sparse, marlin
    ("NousResearch/Llama-2-7b-chat-hf",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.2266868840030326),
                  Metric("exact_match,flexible-extract", 0.22820318423047764)
              ])
     ])),
    ("neuralmagic/Llama-2-7b-pruned50-retrained-ultrachat",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.09855951478392722),
                  Metric("exact_match,flexible-extract", 0.10083396512509477)
              ])
     ],
                    extra_args=["--sparsity", "sparse_w16a16"])),
    ("neuralmagic/llama-2-7b-chat-marlin",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.14101592115238817),
                  Metric("exact_match,flexible-extract", 0.1652767247915087)
              ])
     ],
                    enable_tensor_parallel=False)),
    # Mistral 7B: FP16, FP16 sparse, marlin
    ("teknium/OpenHermes-2.5-Mistral-7B",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.6004548900682335),
                  Metric("exact_match,flexible-extract", 0.6482183472327521)
              ])
     ])),
    ("neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.4935557240333586),
                  Metric("exact_match,flexible-extract", 0.5269143290371494)
              ])
     ],
                    extra_args=["--sparsity", "sparse_w16a16"])),
    ("neuralmagic/OpenHermes-2.5-Mistral-7B-marlin",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.4935557240333586),
                  Metric("exact_match,flexible-extract", 0.5868081880212282)
              ])
     ],
                    enable_tensor_parallel=False)),
    # Phi 2: marlin
    ("neuralmagic/phi-2-super-marlin",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.49962092494313876),
                  Metric("exact_match,flexible-extract", 0.5041698256254739)
              ])
     ],
                    enable_tensor_parallel=False)),
    # Mixtral: FP16
    ("mistralai/Mixtral-8x7B-Instruct-v0.1",
     EvalDefinition(tasks=[
         Task("gsm8k",
              metrics=[
                  Metric("exact_match,strict-match", 0.6550416982562547),
                  Metric("exact_match,flexible-extract", 0.6603487490523123)
              ])
     ],
                    enable_tensor_parallel=True)),
]
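
For reference, EvalDefinition, Task, and Metric are small helper containers defined alongside the test. A rough sketch of their shape is below; the field defaults here are assumptions for illustration, not the exact definitions from the PR.

# Rough sketch of the containers used above; the actual definitions live in
# the PR's test file.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Metric:
    # lm-eval metric key, e.g. "exact_match,strict-match"
    name: str
    # pre-recorded reference value for this model/task
    value: float


@dataclass
class Task:
    # lm-eval task name, e.g. "gsm8k"
    name: str
    metrics: List[Metric] = field(default_factory=list)


@dataclass
class EvalDefinition:
    tasks: List[Task] = field(default_factory=list)
    # extra CLI args passed to the vLLM server, e.g. ["--sparsity", "sparse_w16a16"]
    extra_args: List[str] = field(default_factory=list)
    # whether to shard the model across the available GPUs
    enable_tensor_parallel: bool = False
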

@mgoin mgoin merged commit 802bca1 into main Apr 5, 2024
2 checks passed
@mgoin mgoin deleted the lm-eval-gsm8k branch April 5, 2024 18:39