This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

[Rel Eng] Dial In LM Eval Tests Phase 1 #289

Merged
merged 26 commits on Jun 21, 2024

Conversation

@robertgshaw2-neuralmagic robertgshaw2-neuralmagic commented Jun 8, 2024

WAIT UNTIL UPSTREAM SYNC LANDS TO MERGE

SUMMARY:

  • refactored the lm-eval workflows to use a single script for generating a baseline
  • refactored the lm-eval workflows to accept a config file, so runs of different lengths can be parameterized
  • added a configuration for remote-push that runs llama-3-8b on 250 GSM prompts
  • removed lm-eval-smoke so that there is a single pathway for running lm-eval tests
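
As a rough illustration of the config-driven approach described above, the sketch below turns one config object into an lm-eval command line. It is not the code from this PR: the `LmEvalConfig` fields, the `build_lm_eval_command` helper, the `gsm8k` task name, the few-shot count, and the use of `llama-3-8b` as a literal model identifier are all assumptions made for the example. Only the model name and the 250-prompt limit come from the summary.

```python
from dataclasses import dataclass
import shlex


@dataclass
class LmEvalConfig:
    """Hypothetical config schema; the real config file's fields are not shown in this PR page."""
    model_name: str   # model identifier to evaluate (placeholder value below)
    task: str         # lm-eval task name (assumed)
    num_fewshot: int  # few-shot examples per prompt (assumed)
    limit: int        # number of prompts to run


def build_lm_eval_command(cfg: LmEvalConfig) -> list[str]:
    """Build an lm_eval CLI invocation from a config, so one script serves every run length."""
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", f"pretrained={cfg.model_name}",
        "--tasks", cfg.task,
        "--num_fewshot", str(cfg.num_fewshot),
        "--limit", str(cfg.limit),
    ]


# Assumed stand-in for the remote-push configuration: llama-3-8b on 250 GSM prompts.
REMOTE_PUSH = LmEvalConfig(model_name="llama-3-8b", task="gsm8k", num_fewshot=5, limit=250)

if __name__ == "__main__":
    print(shlex.join(build_lm_eval_command(REMOTE_PUSH)))
```

A longer run would then only need a different config instance (for example a larger `limit`), not a separate script.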

@robertgshaw2-neuralmagic robertgshaw2-neuralmagic changed the title Rel eng/dial in accuracy tests Rel eng/dial in accuracy tests part 1 Jun 8, 2024
@robertgshaw2-neuralmagic robertgshaw2-neuralmagic changed the title Rel eng/dial in accuracy tests part 1 [ REL ENG] Dial In Accuracy Tests Phase 1 Jun 9, 2024
@robertgshaw2-neuralmagic robertgshaw2-neuralmagic changed the title [ REL ENG] Dial In Accuracy Tests Phase 1 [ Rel Eng ] Dial In Accuracy Tests Phase 1 Jun 9, 2024
@robertgshaw2-neuralmagic robertgshaw2-neuralmagic changed the title [ Rel Eng ] Dial In Accuracy Tests Phase 1 [Rel Eng] Dial In Accuracy Tests Phase 1 Jun 9, 2024
@robertgshaw2-neuralmagic robertgshaw2-neuralmagic changed the title [Rel Eng] Dial In Accuracy Tests Phase 1 [Rel Eng] Dial In LM Eval Tests Phase 1 Jun 10, 2024

@derekk-nm derekk-nm left a comment

Looks like it should work.
Just raised a potential concern about the type of arg values passed to lm-eval, and a nit about the units on an arg.

@dbarbuzzi dbarbuzzi left a comment

One question about using a new input (that seems otherwise unused) prior to approval.

.github/workflows/nm-build-test.yml (outdated)
.github/scripts/nm-run-lm-eval-gsm-hf-baseline (outdated)
.github/scripts/nm-run-lm-eval-gsm-hf-baseline (outdated)
.github/scripts/nm-run-lm-eval-vllm (outdated)
.github/scripts/nm-run-lm-eval-gsm-hf-baseline (outdated)

@dhuangnm dhuangnm left a comment

LGTM, thanks

@robertgshaw2-neuralmagic robertgshaw2-neuralmagic merged commit 7c46a95 into main Jun 21, 2024
37 checks passed
@robertgshaw2-neuralmagic robertgshaw2-neuralmagic deleted the rel-eng/dial-in-accuracy-tests branch June 21, 2024 16:40
derekk-nm pushed a commit that referenced this pull request Jun 24, 2024
5 participants