Fix eval in regression test #1305
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1305
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 56ce980 with merge base 00bbd53.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```diff
@@ -50,7 +50,7 @@ def test_finetune_and_eval(self, tmpdir, caplog, monkeypatch):
     runpy.run_path(TUNE_PATH, run_name="__main__")
     eval_cmd = f"""
     tune run eleuther_eval \
-    --config eleuther_eval \
+    --config eleuther_evaluation \
```
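For context, here is a minimal sketch of the pattern this test uses to drive the CLI in-process (the fixture wiring and the corrected config name come from the diff above; the value of TUNE_PATH and the test name are assumptions for illustration):

```python
import runpy
import sys

# Hypothetical path to the `tune` CLI entrypoint script.
TUNE_PATH = "torchtune/_cli/tune.py"

def test_eval_runs_in_process(monkeypatch):
    # Trailing backslashes continue the string across lines, and
    # .split() turns it into an argv-style token list.
    eval_cmd = """
    tune run eleuther_eval \
        --config eleuther_evaluation \
    """.split()
    # Replace sys.argv and execute the entrypoint as if invoked from a shell.
    monkeypatch.setattr(sys, "argv", eval_cmd)
    runpy.run_path(TUNE_PATH, run_name="__main__")
```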
How was this test working before??
It wasn’t; it’s been failing in CI for a while now. But it doesn’t run on PRs or anything, only nightly.
I missed this, but we're no longer running integration tests in our CI?
So we actually have two types of integration tests: recipe tests and regression tests. Recipe tests always run on PRs but only use small checkpoints. Regression tests use the full-size model and run nightly. Currently this is the only regression test we have, but we’ve been wanting to add more and just haven’t had time (e.g. it’d be nice if we could test memory or perf of some of our models too).
I'll raise an issue : )
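As a hedged illustration of that two-tier split, tests can be separated with pytest markers so the PR job and the nightly job select different subsets. The marker names below are assumptions for illustration, not necessarily the ones torchtune actually uses.

```python
import pytest

# Recipe test: small checkpoints, cheap enough to run on every PR
# (e.g., selected via `pytest -m integration_test`).
@pytest.mark.integration_test
def test_recipe_with_small_checkpoint():
    ...

# Regression test: full-size model, run only by the nightly job
# (e.g., selected via `pytest -m slow_integration_test`).
@pytest.mark.slow_integration_test
def test_finetune_and_eval_full_model():
    ...
```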
The config name and the results parsing in our regression test job are incorrect; this PR fixes both (a sketch of the results-parsing pattern follows below).
This is actually a bit awkward to test now that we (a) don't allow creating PRs from a fork, and (b) don't let forks access the S3 bucket containing regression test artifacts.
So for now I've tested it locally, which I guess is better than nothing?
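For reference, a minimal sketch of what the results-parsing side of such a regression test can look like: scan the captured eval log for the reported metric and assert it clears a threshold. The log format, metric name, sample values, and threshold here are all assumptions for illustration, not torchtune's actual output.

```python
import re

# Hypothetical fragment of eval log output (assumed format, not torchtune's).
SAMPLE_LOG = "... | acc: 0.512 | acc_stderr: 0.004 | ..."

def parse_metric(log_text: str, metric: str) -> float:
    """Extract a named numeric metric from eval log output."""
    match = re.search(rf"{re.escape(metric)}:\s*([\d.]+)", log_text)
    if match is None:
        raise ValueError(f"Metric {metric!r} not found in eval output")
    return float(match.group(1))

# In the actual test this would read from pytest's caplog instead.
accuracy = parse_metric(SAMPLE_LOG, "acc")
assert accuracy > 0.5, f"Regression: accuracy dropped to {accuracy}"
```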