diff --git a/README.md b/README.md
index ec14612..464ec26 100644
--- a/README.md
+++ b/README.md
@@ -24,6 +24,7 @@ projects with hardware acceleration featuring:
   imports.
 - Experiment management, tracking, and sharing with [Hydra](https://hydra.cc/)
   and [Weights & Biases](https://wandb.ai/site).
+- Checkpointing setup for research experiments compatible with Weights & Biases.
 - Code quality with [pre-commit](https://pre-commit.com) hooks.

 The template makes collaboration and open-sourcing straightforward, avoiding setup issues and
@@ -39,7 +40,7 @@ or [this paper](https://github.com/CLAIRE-Labo/no-representation-no-trust) whose

 Follow this README to get started with the template.

-For a brief discussion of the template's design choices and a Q&A check `template/README.md` file.
+For a brief discussion of the template's design choices, features, and a Q&A, check the `template/README.md` file.

 ## Getting started with the template

diff --git a/reproducibility-scripts/template-sweep.yaml b/reproducibility-scripts/template-sweep.yaml
index 09eb18b..959fa3f 100644
--- a/reproducibility-scripts/template-sweep.yaml
+++ b/reproducibility-scripts/template-sweep.yaml
@@ -11,7 +11,13 @@ parameters:
   wandb.mode:
     value: online
   job_subdir:
-    value: my-tagged-experiment
+    value: some-special-experiment
+  seed:
+    value: 1
+  resuming.resume:
+    value: True
+  resuming.use_commit:
+    value: True
   some_number:
     values: [1, 2, 3]
diff --git a/src/template_package_name/configs/override/template_experiment.yaml b/src/template_package_name/configs/override/template_experiment.yaml
index 3fde7d9..8999c5b 100644
--- a/src/template_package_name/configs/override/template_experiment.yaml
+++ b/src/template_package_name/configs/override/template_experiment.yaml
@@ -1,4 +1,4 @@
 # @package _global_
 # The above line should appear in the override configs so that they sit at the root of the config tree.

-is_this_overridden: yes
+is_this_key_overridden: yes
diff --git a/src/template_package_name/configs/setup.yaml b/src/template_package_name/configs/setup.yaml
index 35b2a5c..6cb1107 100644
--- a/src/template_package_name/configs/setup.yaml
+++ b/src/template_package_name/configs/setup.yaml
@@ -27,8 +27,8 @@ job_subdir: dev

 hydra:
   run:
-    # This is where the outputs of an individual run will be stored.
-    dir: outputs/${outputs_subdir}/${hydra.job.name}/${job_subdir}/${now:%Y-%m-%d_%H-%M-%S-%f}
+    # Finally, this is where the outputs of an individual run will be stored.
+    dir: outputs/${outputs_subdir}/${hydra.job.name}/${job_subdir}/${now:%Y-%m-%d--%H-%M-%S-%f}
   job:
     chdir: true
     verbose: false # Set to true for logging at debug level.
@@ -42,3 +42,20 @@ wandb:
   anonymous: allow
   tags:
     - development
+  run_id: null
+
+run_dir: ${hydra:run.dir}
+resuming_dir: null
+
+resuming:
+  resume: False
+  use_commit: False
+  wandb_cache_bust: 0 # Limitation of wandb: it cannot create runs with the same ID as a previously deleted run.
+  # Use this to refresh the id of the run and make it a "new" run.
+  exclude_keys: # Can be a deep key, e.g., model.optimizer.lr
+    - run_dir
+    - data_dir # To be able to resume by another user.
+    - outputs_dir # To be able to resume by another user.
+    - resuming_dir # To be able to force resume from anywhere.
+    - wandb # To be able to move a run and resume it.
+    - resuming.exclude_keys # To be able to add keys on the fly and force resume.
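
The `resuming.exclude_keys` list above feeds the checkpoint-directory hashing implemented in `src/template_package_name/utils/config.py` later in this diff: each entry is a dotted (possibly deep) key that is deleted from a copy of the config before hashing, so that runs differing only in those keys still map to the same checkpoint directory. A minimal sketch of the idea on a toy config (the key names here are illustrative, not the template's):

```python
from hashlib import blake2b

from omegaconf import OmegaConf, open_dict

# Toy config; the keys are illustrative only.
cfg = OmegaConf.create(
    {"seed": 1, "run_dir": "/tmp/run", "model": {"optimizer": {"lr": 3e-4}}}
)

# Remove excluded (possibly deep) keys from a copy before hashing.
to_hash = cfg.copy()
with open_dict(to_hash):
    for dotted in ["run_dir", "model.optimizer.lr"]:
        *parents, leaf = dotted.split(".")
        node = to_hash
        for parent in parents:
            node = node[parent]
        del node[leaf]

# Same surviving values => same hash => same checkpoint directory name.
print(blake2b(str(to_hash).encode(), digest_size=8).hexdigest())
```
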
diff --git a/src/template_package_name/configs/template_experiment.yaml b/src/template_package_name/configs/template_experiment.yaml
index 2b35641..f2823ab 100644
--- a/src/template_package_name/configs/template_experiment.yaml
+++ b/src/template_package_name/configs/template_experiment.yaml
@@ -13,4 +13,4 @@ defaults:

 some_arg: "some_default_value"
 some_number: 10
-is_this_overridden: no
+is_this_key_overridden: no
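
The rename is observable with Hydra's compose API; a hypothetical check (assuming the package and its configs are installed; depending on the defaults list, the group override may need a leading `+`, i.e. `+override=template_experiment`):

```python
from hydra import compose, initialize_config_module

from template_package_name import utils

utils.config.register_resolvers()

with initialize_config_module(
    version_base=None, config_module="template_package_name.configs"
):
    base = compose(config_name="template_experiment")
    print(base.is_this_key_overridden)  # False: YAML 1.1 reads `no` as a boolean.

    # The override group sets the key at the config root thanks to "# @package _global_".
    overridden = compose(
        config_name="template_experiment",
        overrides=["override=template_experiment"],
    )
    print(overridden.is_this_key_overridden)  # True (YAML `yes`).
```
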
diff --git a/src/template_package_name/template_experiment.py b/src/template_package_name/template_experiment.py
index 607941f..b17357e 100644
--- a/src/template_package_name/template_experiment.py
+++ b/src/template_package_name/template_experiment.py
@@ -1,56 +1,133 @@
-# An example file to run an experiment.
-# Keep this, it's used as an example to run the code after a user installs the project.
+"""An example file to run an experiment.
+Keep this, it's used as an example to run the code after a user installs the project.
+"""
 import logging
+import os
+import subprocess
+import sys
 from pathlib import Path
+from time import sleep

 import hydra
 import wandb
-from omegaconf import DictConfig, OmegaConf
+from omegaconf import DictConfig, OmegaConf, omegaconf

 from template_package_name import utils

+# Refer to utils.config for a description of the resolvers.
+utils.config.register_resolvers()
+
 # Hydra sets up the logger automatically.
 # https://hydra.cc/docs/tutorials/basic/running_your_app/logging/
 logger = logging.getLogger(__name__)

-# Resolvers can be used in the config files.
-# https://omegaconf.readthedocs.io/en/latest/custom_resolvers.html
-# They are useful when you want to make the default values of some config variables
-# result from direct computation of other config variables.
-# Only put variables meant to be edited by the user (as opposed to read-only variables described below)
-# and avoid making them too complicated, the point is not to write code in the config file.
-
-# Useful to evaluate expressions in the config file.
-OmegaConf.register_new_resolver("eval", eval, use_cache=True)
-# Generate a random seed and record it in the config of the experiment.
-OmegaConf.register_new_resolver(
-    "generate_random_seed", utils.seeding.generate_random_seed, use_cache=True
-)
-

 @hydra.main(version_base=None, config_path="configs", config_name="template_experiment")
 def main(config: DictConfig) -> None:
-    # Here you can make some computations with the config to add new keys, correct some values, etc.
-    # E.g., read-only variables that can be useful when navigating the experiments on wandb (filtering, sorting, etc.).
-    # Save the new config (as a file to record it) and pass it to wandb to record it with your experiment.
+    # The current working directory is a new directory unique to this run, made by Hydra and accessible via config.run_dir.
+    # A resuming directory uniquely identified by the config (and optionally the git sha),
+    # for storing checkpoints of the same experiment, can be accessed via config.resuming_dir.
+    logger.info(f"Init directory: {Path.cwd()}")
+    resuming_dir, resuming_hash = utils.config.setup_resuming_dir(config)
+    logger.info(f"Run can be resumed from the directory: {resuming_dir}")
+    if config.resuming.resume:
+        os.chdir(resuming_dir)
+        logger.info(f"Resuming from the directory: {Path.cwd()}")
+    # Even when not resuming, you can still access the checkpoint directory (e.g., for analysis) via config.resuming_dir.
+
+    postprocess_and_save_config(config)
+    # If wandb.init hangs, it's likely that you're resuming a run that you already deleted on wandb.
+    # Increment config.resuming.wandb_cache_bust to start a new run.
+
+    # To resume a run in a sweep, find its wandb run id and pass it to the script alongside the same arguments
+    # the sweep agent started the run with.
+
+    wandb_run_id = config.wandb.run_id
+    if wandb_run_id is None:
+        if config.resuming.resume:
+            wandb_run_id = resuming_hash
     wandb.init(
-        config=OmegaConf.to_container(config, resolve=True, throw_on_missing=True),
+        id=wandb_run_id,
+        resume="allow" if config.resuming.resume else "never",
+        config=OmegaConf.to_container(config),
         project=config.wandb.project,
         tags=config.wandb.tags,
-        anonymous=config.wandb.anonymous,
         mode=config.wandb.mode,
+        anonymous=config.wandb.anonymous,
         dir=Path(config.wandb.dir).absolute(),
     )
+
+    # Use a custom step key when you log so that you can resume logging anywhere.
+    # For example, if the checkpoint is earlier than the last logged step of the crashed run, you can resume
+    # from steps already logged, and they will be rewritten (with the same values, assuming reproducibility).
+    # E.g., wandb.log({"my_custom_step": i, "loss": loss})
+
+    # Re-log this info so that it is captured by wandb.
+    logger.info(f"Running command: {subprocess.list2cmdline(sys.argv)}")
+    logger.info(f"Init directory: {config.run_dir}")
     logger.info(f"Working directory: {Path.cwd()}")
-    logger.info(f"Running with config: \n{OmegaConf.to_yaml(config, resolve=True)}")
+    logger.info(f"Running with config: \n{OmegaConf.to_yaml(config)}")
+    if config.resuming.resume:
+        logger.info(f"Resuming from the directory: {Path.cwd()}")

     # Update this function whenever you have a library that needs to be seeded.
     utils.seeding.seed_everything(config)

-    wandb.log({"some_metric": config.some_number + 1})
+    # Example experiment:
+    # loop from 1 to 27 and checkpoint (write a file to disk) every 9 iterations.
+    n = 100
+
+    # Attempt to resume:
+    # find the latest checkpoint of format file_{i}.txt.
+    path = Path.cwd()
+    files = path.glob("file_*.txt")
+    files = sorted(files, key=lambda x: int(x.stem.split("_")[1]))
+    if files:
+        last_file = files[-1]
+        logger.info(f"Resuming from {last_file}")
+        j = int(last_file.stem.split("_")[1]) % (config.some_number * n)
+    else:
+        j = 0
+
+    for i in range(j + 1, 28):
+        wandb.log(
+            {
+                "iteration": i,
+                "file_written": i,
+                "some_metric": i + config.some_number * n,
+            }
+        )
+        print(i)
+        if i % 9 == 0:
+            with open(f"file_{i}.txt", "w") as f:
+                f.write(f"some_metric={i + config.some_number * n}")
+            print(f"Checkpointing at {i}")
+
+        if j == 0 and i % 15 == 0:
+            # Crash on the first run to test resuming.
+            raise ValueError("Crashing at i % 15 == 0")
+        sleep(1)
+
+    logger.info("Finished writing files")
+
+
+def postprocess_and_save_config(config):
+    """Here you can make some computations with the config to add new keys, correct some values, etc.
+    E.g., read-only variables that can be useful when navigating the experiments on wandb
+    for filtering, sorting, etc.
+    Save the new config (as a file to record it) and pass it to wandb to record it with your experiment.
+    """
+    Path("config/").mkdir(exist_ok=True)
+    # Save if it doesn't exist; otherwise (in case of resuming) assert that the config is the same.
+    utils.config.maybe_save_config(config, "config/config-before-postprocess.yaml")
+    with omegaconf.open_dict(config):
+        # Example of adding a new key to the config.
+        config.some_new_key = "bar"
+    OmegaConf.resolve(config)
+    utils.config.maybe_save_config(config, "config/config-resolved.yaml")


 if __name__ == "__main__":
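
The custom-step advice in the comments above can be made concrete with `wandb.define_metric`; a minimal sketch, assuming a wandb run is already initialized and that `train_one_step`, `start_iteration`, and `total_iterations` are hypothetical stand-ins:

```python
import wandb

# Route all metrics through a custom step key so that a resumed run can
# re-log earlier steps instead of being forced past wandb's internal step.
wandb.define_metric("my_custom_step")
wandb.define_metric("*", step_metric="my_custom_step")

for i in range(start_iteration, total_iterations):  # hypothetical bounds
    loss = train_one_step()  # hypothetical training step
    wandb.log({"my_custom_step": i, "loss": loss})
```
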
diff --git a/src/template_package_name/utils/__init__.py b/src/template_package_name/utils/__init__.py
index e865ca5..e4bb34f 100644
--- a/src/template_package_name/utils/__init__.py
+++ b/src/template_package_name/utils/__init__.py
@@ -1 +1 @@
-from template_package_name.utils import seeding
+from template_package_name.utils import config, seeding
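
The new `utils/config.py` module below centralizes the resolver registration that used to live in `template_experiment.py`. For reference, a minimal standalone sketch of what the `eval` resolver enables in a config (toy keys, not part of the template):

```python
from omegaconf import OmegaConf

# Useful to evaluate expressions in the config file.
OmegaConf.register_new_resolver("eval", eval, use_cache=True)

# A value can be computed from other config values at resolution time.
cfg = OmegaConf.create(
    {"batch_size": 64, "half_batch": "${eval:'${batch_size} // 2'}"}
)
print(cfg.half_batch)  # 32
```
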
diff --git a/src/template_package_name/utils/config.py b/src/template_package_name/utils/config.py
new file mode 100644
index 0000000..8b40c03
--- /dev/null
+++ b/src/template_package_name/utils/config.py
@@ -0,0 +1,93 @@
+# Resolvers can be used in the config files.
+# https://omegaconf.readthedocs.io/en/latest/custom_resolvers.html
+# They are useful when you want to make the default values of some config variables
+# result from direct computation of other config variables.
+# Only put variables meant to be edited by the user (as opposed to read-only variables)
+# and avoid making them too complicated; the point is not to write code in the config file.
+import logging
+import subprocess
+from hashlib import blake2b
+from pathlib import Path
+
+from omegaconf import DictConfig, OmegaConf, omegaconf
+
+from template_package_name import utils
+
+# Hydra sets up the logger automatically.
+# https://hydra.cc/docs/tutorials/basic/running_your_app/logging/
+logger = logging.getLogger(__name__)
+
+
+def register_resolvers():
+    if not OmegaConf.has_resolver("eval"):
+        # Useful to evaluate expressions in the config file.
+        OmegaConf.register_new_resolver("eval", eval, use_cache=True)
+    if not OmegaConf.has_resolver("generate_random_seed"):
+        # Generate a random seed and record it in the config of the experiment.
+        OmegaConf.register_new_resolver(
+            "generate_random_seed", utils.seeding.generate_random_seed, use_cache=True
+        )
+
+
+def maybe_save_config(config, path):
+    """Save if it doesn't exist; otherwise (in case of resuming) assert that the config is the same."""
+    if not Path(path).exists():
+        OmegaConf.save(config, path)
+    else:
+        new_config = config.copy()
+        remove_excluded_keys(new_config, config.resuming.exclude_keys)
+        existing_config = OmegaConf.load(path)
+        remove_excluded_keys(existing_config, config.resuming.exclude_keys)
+        try:
+            OmegaConf.resolve(new_config)
+            OmegaConf.resolve(existing_config)
+            assert new_config == existing_config
+        except AssertionError:
+            logger.error(f"Config to resume is different from the one saved in {path}")
+            raise
+
+
+def remove_excluded_keys(config: DictConfig, exclude_keys: list[str]):
+    """Remove the keys specified in exclude_keys from the config.
+    exclude_keys entries are of the form "key1.key2.key3" to remove key3 from the key1.key2 dictionary.
+    """
+    with omegaconf.open_dict(config):
+        for key in exclude_keys:
+            keys = key.split(".")
+            val = config
+            for key_ in keys[:-1]:
+                val = val[key_]
+            del val[keys[-1]]
+
+
+def setup_resuming_dir(config):
+    """Create a unique identifier of the experiment used to specify a resuming/checkpoint directory.
+    The identifier is a hash of the config, excluding keys specified in config.resuming.exclude_keys.
+    If config.resuming.use_commit is True, the commit hash is appended to the identifier.
+    I.e., the checkpoint directory is defined by: the config, minus the excluded config keys, plus the commit hash (if specified).
+    """
+    if config.resuming_dir is not None:
+        return Path(config.resuming_dir), Path(config.resuming_dir).name
+
+    resuming_hash = ""
+    config_to_hash = config.copy()
+    # Resolve the config so that the hash is computed on final values.
+    OmegaConf.resolve(config_to_hash)
+    remove_excluded_keys(config_to_hash, config.resuming.exclude_keys)
+    config_hash = blake2b(str(config_to_hash).encode(), digest_size=8).hexdigest()
+    resuming_hash += config_hash
+    if config.resuming.use_commit:
+        commit_hash = (
+            subprocess.check_output(["git", "rev-parse", "HEAD"])
+            .strip()
+            .decode("utf-8")
+        )
+        resuming_hash += f"-{commit_hash[:8]}"
+
+    resuming_dir = Path.cwd().parent / "checkpoints" / resuming_hash
+    resuming_dir.mkdir(parents=True, exist_ok=True)
+    with omegaconf.open_dict(config):
+        config.resuming_dir = str(resuming_dir)
+
+    return resuming_dir, resuming_hash
diff --git a/src/template_package_name/utils/seeding.py b/src/template_package_name/utils/seeding.py
index 14566c7..28f103c 100644
--- a/src/template_package_name/utils/seeding.py
+++ b/src/template_package_name/utils/seeding.py
@@ -11,18 +11,24 @@ def seed_everything(config):
     """Seed all random generators."""
     random.seed(config.seed)

-    # For numpy:
+    ## For numpy:
     # This is for legacy numpy:
     # np.random.seed(config.seed)
     # New code should make a Generator out of the config.seed directly:
     # https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html

-    # For PyTorch:
+    ## For PyTorch:
     # torch.manual_seed(config.seed)
+
     # Higher (e.g., on CUDA too) reproducibility with deterministic algorithms:
     # https://pytorch.org/docs/stable/notes/randomness.html
-    # torch.backends.cudnn.benchmark = False
-    # torch.use_deterministic_algorithms(True)
-    # os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
+    # Not supported for all operations though:
     # https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html
+    # torch.use_deterministic_algorithms(True)
+
+    # A lighter alternative to the above, as not all algorithms have a deterministic implementation:
+    # torch.backends.cudnn.deterministic = True
+
+    # torch.backends.cudnn.benchmark = False
+    # os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
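
As a companion to the commented-out hints in `seed_everything`, here is a hedged sketch of what a fully enabled version could look like in a project that actually depends on numpy and PyTorch (both imports are assumptions; the template deliberately ships them commented out):

```python
import os
import random

import numpy as np  # assumed dependency
import torch  # assumed dependency


def seed_everything_example(seed: int) -> np.random.Generator:
    """Seed Python and PyTorch, and return a numpy Generator (sketch only)."""
    random.seed(seed)
    torch.manual_seed(seed)
    # Deterministic CUDA kernels where available (may slow things down).
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)
    # Modern numpy: pass the Generator around instead of seeding global state.
    return np.random.default_rng(seed)
```
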
diff --git a/template/README.md b/template/README.md
index da8bce7..cfae273 100644
--- a/template/README.md
+++ b/template/README.md
@@ -11,11 +11,41 @@ This template ensures the reproducibility of your results through 3 artifacts:
 2. The project code.
    - Recorded in the git repository that you keep up to date.
    - Made reproducible (to a desired degree) by you correctly seeding the random number generators and
-     optionally removing non-deterministic operations or replicable by running enough seeds.
+     optionally removing non-deterministic operations, or replicable by running enough seeds.
 3. The data, outputs, model weights and other artifacts.
    - Recorded and uploaded by you.
    - (Virtually) placed in the placeholder directories abstracting away the user storage system.

+## Checkpointing
+
+The template provides an automatic setup of the checkpointing directory for an experiment.
+The unique identifier for the directory is created by hashing the config used and, optionally, the git commit sha.
+Running the same experiment with the same config will thus set its working directory to the same checkpoint
+directory every time (if the resuming option is enabled).
+
+To use this feature, pass `resuming.resume=True` and `resuming.use_commit=True` to your script, using a Hydra config
+that inherits from the `setup.yaml` config file, like the `template_experiment.py` script does.
+
+Even without passing `resuming.resume=True`, the path to the checkpoint directory will be computed, and you could,
+for example, read from it.
+
+You can also force a resuming directory by passing `resuming_dir=` to your script.
+
+### Compatibility with Weights & Biases
+
+For a non-sweep run, the run will use the identifier of its checkpoint directory as its wandb id, so the wandb run
+will stay the same and resume whenever your run is resumed.
+Make sure to use a custom step key when you log metrics so that you have full control over where rewriting starts
+when you resume (e.g., if you checkpoint less often than you log, you may re-log from the last checkpoint); otherwise,
+the default wandb step key will resume from the latest logged step and may be inconsistent with the checkpoint.
+
+For a sweep run, the run already has an id from the sweep, so to resume it you should manually get its id and restart
+the script with the same arguments the sweep agent started it with; this way, the config and the
+checkpoint directory will be the same
+(i.e., go to the wandb run UI, copy-paste the command it was run with, and add `wandb.run_id=`).
+This is a limitation of the wandb sweep system.
+See [this issue](https://github.com/wandb/wandb/issues/9143).
+
 ## Template Q&A

 ### I started my project from an older version of the template, how do I get updates?
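
To make the determinism of the identifier concrete, the template's own helper can be exercised on a toy config; a sketch (illustrative keys only; note that `setup_resuming_dir` also creates the directory under `../checkpoints` relative to the current working directory as a side effect):

```python
from omegaconf import OmegaConf

from template_package_name import utils

# A toy config carrying only the keys setup_resuming_dir needs.
cfg = OmegaConf.create(
    {
        "seed": 1,
        "some_number": 2,
        "run_dir": ".",
        "resuming_dir": None,
        "resuming": {
            "resume": False,
            "use_commit": False,
            "exclude_keys": ["run_dir", "resuming.exclude_keys"],
        },
    }
)

resuming_dir, resuming_hash = utils.config.setup_resuming_dir(cfg)
# Identical configs (modulo excluded keys) give the identical hash,
# which is also the wandb run id used when resuming a non-sweep run.
print(resuming_dir, resuming_hash)
```

Note that the real `setup.yaml` generates `seed` randomly by default, which is why the sweep config in this diff pins `seed: 1`: a random seed would change the hash, and with it the checkpoint directory.
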