eval tui #735
Merged
70 commits
746ea9c simple multi eval scaffolding via toml config (mikasenghaas)
8be1b4a add debug config (mikasenghaas)
138d30e demote to debug log (mikasenghaas)
b43e337 move around logs (mikasenghaas)
64e68f5 fix tests (mikasenghaas)
8e9f335 support comma-separated list (mikasenghaas)
f84ac0e fix precedence (mikasenghaas)
681ebfb minor (mikasenghaas)
f7118f6 fix schema validation (mikasenghaas)
8a1da80 minor fix (mikasenghaas)
1a1f278 update tests (mikasenghaas)
e49c648 add unit tests (mikasenghaas)
9716fc4 revert pbar desc (mikasenghaas)
26416b8 update docs (mikasenghaas)
27b65fa typo (mikasenghaas)
3979923 fix mutation (mikasenghaas)
ce63f9b validation for env ids (mikasenghaas)
ad47e3f fix resolution issue (mikasenghaas)
6c047c9 move debug config (mikasenghaas)
92b7a77 poc vf-eval tui (mikasenghaas)
8a2d8ba exit on input (mikasenghaas)
9790f68 streaming works (mikasenghaas)
d6243d1 full width boxes (mikasenghaas)
f6f5e27 make env id part of border (mikasenghaas)
f1eb16b remove header (mikasenghaas)
a363847 use static env config (mikasenghaas)
ab716f6 remove redundant info (mikasenghaas)
4be27f2 show running avg of all metrics (mikasenghaas)
c9fac9d spacing (mikasenghaas)
c39d265 ckpt (mikasenghaas)
4e488c1 fix (mikasenghaas)
10f491b final summary + stack (mikasenghaas)
fdf7072 remove global progress (mikasenghaas)
02d894c spacing (mikasenghaas)
d6ae255 unify progress callback behavior (mikasenghaas)
087aab9 show gen/sem concurrency (mikasenghaas)
3a86bc1 show sampling args (mikasenghaas)
8e59a13 show saved results path (mikasenghaas)
acb7174 formatting (mikasenghaas)
bdf5415 remove print_results (mikasenghaas)
6b2d641 show -1 concurrency with infinite (mikasenghaas)
e6dae48 fix (mikasenghaas)
51de374 on log callback (mikasenghaas)
e8122ab show save every (mikasenghaas)
d3b8743 fix tests (mikasenghaas)
76ee16e resolve num_examples=-1 (mikasenghaas)
62a8423 show error (mikasenghaas)
7f956c3 cosmetics (mikasenghaas)
3e4143d remove global pbar (mikasenghaas)
b503341 refactor progress (mikasenghaas)
449346b refactor accums (mikasenghaas)
d78c6d7 fix progress bar (mikasenghaas)
0b9cf1d minor (mikasenghaas)
0ccb5e0 minor (mikasenghaas)
352b86b cleanup (mikasenghaas)
d910314 fix linter (mikasenghaas)
c5144da cleanup (mikasenghaas)
41001dc resolve num examples diff (mikasenghaas)
7c35822 fix (mikasenghaas)
3a7fe48 mc (willccbb)
b3e7364 tweaks to rendering to avoid scroll issues; configs (willccbb)
305da44 remove old config (willccbb)
76c969d merge bug fixes (willccbb)
b70a15c docs; logging tweak (willccbb)
edcadd2 revert logging change (willccbb)
b50c827 do not exit if no metrics (mikasenghaas)
4525353 show avg reward correctly (mikasenghaas)
f175aa5 use env_idx to allow eval'ing the same env_id multiple times (mikasenghaas)
7c0ee87 guarantee metrics is dict (mikasenghaas)
e2da252 simplify (mikasenghaas)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1,8 +1,16 @@ |
| | | model = "openai/gpt-5-mini" |
| | | model = "openai/gpt-4.1-mini" |
| | | save_results = true |
| | | save_every = 10 |
| | | |
| | | [[eval]] |
| | | env_id = "primeintellect/wiki-search" |
| | | |
| | | [[eval]] |
| | | env_id = "gsm8k" |
| | | num_examples = 20 |
| | | rollouts_per_example = 1 |
| | | sampling_args = { max_tokens = 1024 } |
| | | independent_scoring = true |
| | | |
| | | [[eval]] |
| | | env_id = "primeintellect/math-python" |
| | | env_id = "alphabet-sort" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -0,0 +1,6 @@ |
| | | [[eval]] |
| | | env_id = "alphabet-sort" |
| | | |
| | | [[eval]] |
| | | env_id = "alphabet-sort" |
| | | max_concurrent = 1 |
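
This config lists `alphabet-sort` twice, which lines up with the commit message "use env_idx to allow eval'ing the same env_id multiple times". The sketch below shows why per-eval state keyed by `env_id` would lose one of the two entries, while keying by position keeps both. It is an illustration only: `env_idx` is treated here as a plain list index, which is an assumption about the project's internals.

```python
evals = [
    {"env_id": "alphabet-sort"},
    {"env_id": "alphabet-sort", "max_concurrent": 1},
]

# Keying by env_id collapses the two entries into one (the later entry wins).
by_id = {entry["env_id"]: entry for entry in evals}

# Keying by position keeps both; "env_idx" here is just the list index.
by_idx = {env_idx: entry for env_idx, entry in enumerate(evals)}

print(len(by_id), len(by_idx))  # prints "1 2"
```
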
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -0,0 +1,20 @@ |
| | | [[eval]] |
| | | env_id = "math500" |
| | | num_examples = -1 |
| | | rollouts_per_example = 1 |
| | | |
| | | [[eval]] |
| | | env_id = "aime2024" |
| | | num_examples = -1 |
| | | rollouts_per_example = 8 |
| | | |
| | | [[eval]] |
| | | env_id = "gpqa" |
| | | num_examples = -1 |
| | | rollouts_per_example = 1 |
| | | |
| | | [[eval]] |
| | | env_id = "livecodebench" |
| | | num_examples = -1 |
| | | rollouts_per_example = 1 |
| | | max_concurrent = 16 # to limit sandbox usage |
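
Every entry in this config sets `num_examples = -1`, and the commit list includes "resolve num_examples=-1". A rough sketch of what such a resolution step might look like follows. It assumes that `-1` means "use every example in the dataset"; both function names are hypothetical, and the dataset size used in the example calls is made up.

```python
def resolve_num_examples(num_examples: int, dataset_size: int) -> int:
    """Treat -1 as "all examples" (assumption); otherwise cap at the dataset size."""
    if num_examples == -1:
        return dataset_size
    return min(num_examples, dataset_size)


def total_rollouts(num_examples: int, rollouts_per_example: int, dataset_size: int) -> int:
    """Number of rollouts a progress display would need to account for."""
    return resolve_num_examples(num_examples, dataset_size) * rollouts_per_example


# dataset_size=50 is a made-up number for illustration only.
print(total_rollouts(-1, 8, dataset_size=50))  # prints 400
print(total_rollouts(20, 1, dataset_size=50))  # prints 20
```

Resolving the sentinel before the run starts gives any progress display a concrete total, rather than having to special-case `-1` at render time.
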