Create an eval-only script for existing ckpts #736

Open · liujch1998 wants to merge 58 commits into main

Conversation

liujch1998 commented Oct 20, 2024

This PR adds scripts/eval.py, which evaluates one or more existing checkpoints while bypassing the training loop.

It seems impossible to backfill evals into the original wandb run, because wandb requires "step" to be monotonically increasing, and rewinding the run would truncate its log, which we don't want. This script therefore logs to a new wandb run.

Starting from a training setup:

  • You can keep using the same yaml config file.
  • Make a copy of the XXX.sh file as XXX-eval.sh, point it at scripts/eval.py, add the flag --wandb.group=XXX so it logs to the same wandb group, and set --load_path to either a single checkpoint or a directory of checkpoints (all of which will be evaluated).
  • Make a copy of the XXX-launch.sh file as XXX-eval-launch.sh, change --task-name to XXX-eval, and change the command so it runs XXX-eval.sh.

See an example in peteish1-eval.sh and peteish1-eval-launch.sh.
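
As a rough sketch (not the PR's actual code) of why a separate run is needed and how grouping keeps it next to the training run: the project name, run name, steps, and metrics below are all hypothetical.

import wandb

# Within a single wandb run, `step` must be monotonically increasing, so
# evals for old checkpoints cannot be appended to the original training
# run. A fresh run in the same group sits beside it in the wandb UI.
run = wandb.init(
    project="olmo",   # hypothetical project name
    group="XXX",      # same group as the training run (--wandb.group=XXX)
    name="XXX-eval",  # hypothetical run name
)
for step, metrics in [(1000, {"eval/ppl": 12.3}), (2000, {"eval/ppl": 11.1})]:
    wandb.log(metrics, step=step)  # toy metrics; steps can start anywhere here
run.finish()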

liujch1998 marked this pull request as ready for review October 24, 2024 18:18
configs/peteish1-weka.yaml — two resolved review threads
olmo/train.py (outdated) — comment on lines 1371 to 1372:
if wandb.run is not None:
wandb.finish(exit_code=exit_code, quiet=True)
# if wandb.run is not None:
# wandb.finish(exit_code=exit_code, quiet=True)
Member: Debug code?

Author (liujch1998): Will fix this.

scripts/eval.py (outdated) — comment on lines 119 to 120:
# train_loader = build_train_dataloader(cfg)
train_loader = None
Member: Is this always going to be None? If so, we don't need it.

scripts/eval.py (outdated):
if 'step' in cfg.load_path.split('/')[-1]:
load_paths = [cfg.load_path]
else:
# This globbing does not work with remote paths.
Member: How is that problem handled then?

Author (liujch1998): Not handled. I will assume the checkpoints are on WEKA.

Member: At least throw an exception then.

scripts/eval.py (outdated):
log.info(f"Number of non-embedding parameters: {olmo_model.num_params(include_embedding=False):,d}")
log.info(f"Peak GPU Memory (MB) before {cfg.distributed_strategy}: {int(peak_gpu_memory() or 0)}")

olmo_model.set_activation_checkpointing(cfg.activation_checkpointing)
Member: If we only ever eval, we don't need this.

Comment on lines +225 to +226:
optim = build_optimizer(cfg, dist_model)
scheduler = build_scheduler(cfg)
Member: We don't need optimizers and schedulers if we're just evaluating.

Member: So you're creating these only so that you can produce a Trainer object?

How hard is it to pull the stuff you need out of the Trainer object, so we don't have to do so many things we don't need? It makes me particularly uncomfortable that you're creating a trainer with a None data loader, which isn't supposed to work. It just happens to work.

Author (liujch1998): The Trainer class has too many precious helper functions, and it's kinda dumb to unroll them all. I do want to keep at least a dummy Trainer object. Let me see if I can create it without the optim/scheduler/etc. stuff.

Member: I think you might find that you don't need most of that stuff when you're doing inference only.
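
For context, here is a minimal sketch of the inference-only path being suggested, written against plain PyTorch rather than OLMo's Trainer; the batch layout and the .logits output attribute are assumptions for illustration.

import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, eval_loader, device="cuda"):
    # No optimizer, scheduler, or train loader is needed for evaluation:
    # just the model in eval mode and one forward pass per batch.
    model.eval()
    total_loss, n_batches = 0.0, 0
    for batch in eval_loader:
        input_ids = batch["input_ids"].to(device)  # assumed batch layout
        logits = model(input_ids).logits           # assumed output attribute
        # Next-token cross-entropy: predict token t+1 from tokens up to t.
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )
        total_loss += loss.item()
        n_batches += 1
    return total_loss / max(n_batches, 1)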

dirkgr commented Nov 1, 2024

Let me know when this is ready for another review?

if cfg.load_path is None:
raise OLMoConfigurationError("To run eval you must provide a load_path")
elif "://" in cfg.load_path:
raise OLMoConfigurationError("Eval does not support remote paths. Please specify a local path or WEKA mounted path.")
Author (liujch1998): Throwing an exception for remote paths here.
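
Putting the quoted snippets together, the resulting load-path resolution might look roughly like this; cfg.load_path and OLMoConfigurationError come from the PR, while the step* glob pattern and the import path are assumptions.

import glob
import os

from olmo.exceptions import OLMoConfigurationError  # import path assumed

def resolve_load_paths(cfg):
    if cfg.load_path is None:
        raise OLMoConfigurationError("To run eval you must provide a load_path")
    if "://" in cfg.load_path:
        # Globbing below only works on local or WEKA-mounted filesystems,
        # so fail fast on remote (e.g. s3://) paths.
        raise OLMoConfigurationError(
            "Eval does not support remote paths. "
            "Please specify a local path or WEKA mounted path."
        )
    if "step" in cfg.load_path.split("/")[-1]:
        # load_path names a single checkpoint, e.g. .../step1000
        return [cfg.load_path]
    # Otherwise treat load_path as a directory containing many checkpoints.
    return sorted(glob.glob(os.path.join(cfg.load_path, "step*")))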
