
Snapshot for preemption #155
Open · knshnb wants to merge 16 commits into master
Conversation


@knshnb commented Aug 30, 2021

pfrl currently does not support taking snapshots of training, which is important in many job systems such as Kubernetes.
This PR supports saving and loading snapshots, including the replay buffer.

Done

  • save & load snapshot
    • agent, replay buffer, step_offset, max_score
  • tested locally with python examples/gym/train_dqn_gym.py --env CartPole-v0 --steps=5000 --eval-n-runs=10 --eval-interval=1000 --load_snapshot --checkpoint-freq=1000

Not Done

  • reflect the changes in examples/* (or do we just need to create one separate example for snapshotting?)
  • log (scores.txt)
  • test script

Could you check the current implementation strategy and give some ideas on how to implement the above points?
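
To make the scope concrete, here is a minimal sketch of what saving and loading such a snapshot could look like. This is not the PR's actual code: the helper names save_snapshot/load_snapshot and the checkpoint.json file are hypothetical, and it assumes pfrl's Agent.save/Agent.load and the replay buffer's save/load methods.

```python
import json
import os


def save_snapshot(dirname, agent, replay_buffer, step, max_score):
    # Hypothetical helper (not part of this PR): persist everything needed to resume.
    os.makedirs(dirname, exist_ok=True)
    agent.save(os.path.join(dirname, "agent"))  # model + optimizer via pfrl Agent.save
    replay_buffer.save(os.path.join(dirname, "replay_buffer.pkl"))
    with open(os.path.join(dirname, "checkpoint.json"), "w") as f:
        json.dump({"step": step, "max_score": max_score}, f)


def load_snapshot(dirname, agent, replay_buffer):
    # Hypothetical helper: returns (step_offset, max_score) to resume from.
    agent.load(os.path.join(dirname, "agent"))
    replay_buffer.load(os.path.join(dirname, "replay_buffer.pkl"))
    with open(os.path.join(dirname, "checkpoint.json")) as f:
        meta = json.load(f)
    return meta["step"], meta["max_score"]
```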

github-actions bot requested a review from ummavi on August 30, 2021 10:57
@knshnb requested a review from muupan on August 31, 2021 09:41
@muupan removed the request for review from ummavi on September 3, 2021 03:53

muupan commented Sep 3, 2021

Thanks for your PR! It is really good to have better resumability.

General comments on resumability

First, let me summarize what I think needs to be done to achieve resumability. Please comment if I missed something. I have checked the points that are supported by this PR.

Things that need to be snapshotted for resumability, excluding randomness:

  • Agent
    • Model (Agent.save)
    • Optimizer (Agent.save)
    • Replay buffer
    • recurrent states (only when recurrent models are used)*
    • any other internal states of Agent (e.g. PPO holds its own dataset)
    • statistics
  • Env
    • Env's internal state*
  • Experiment record
    • max_score
    • steps
    • episodes
    • past evaluation results (to reproduce scores.txt)

* indicates things needed only when you resume a training episode halfway through.

RNG-related things that need to be snapshotted for complete resumability:

  • Env's internal RNG states
  • Module-level RNG states of torch, random, and numpy (a sketch of capturing these follows this list)
  • Any other random states used in code (gym.spaces has its own random state, for example)
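
For reference, the module-level RNG states can be captured and restored with standard APIs. A minimal sketch (not part of this PR):

```python
import pickle
import random

import numpy as np
import torch


def save_rng_states(path):
    # Dump module-level RNG states of torch, random, and numpy.
    states = {
        "torch": torch.get_rng_state(),
        "torch_cuda": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
        "random": random.getstate(),
        "numpy": np.random.get_state(),
    }
    with open(path, "wb") as f:
        pickle.dump(states, f)


def load_rng_states(path):
    # Restore the states saved by save_rng_states.
    with open(path, "rb") as f:
        states = pickle.load(f)
    torch.set_rng_state(states["torch"])
    if states["torch_cuda"] is not None:
        torch.cuda.set_rng_state_all(states["torch_cuda"])
    random.setstate(states["random"])
    np.random.set_state(states["numpy"])
```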

This is a large list, and it would be a tough task to support all of it. I think it is ok to start supporting only part of it if

  • it is tested and confirmed that the snapshotting functionality works as users expect for some tasks, and
  • its limitations are clarified so that users can know when it won't work as expected.

Specific comments on this PR

  • max_score is saved separately, but I think it is better to load scores.txt to restore all the evaluation information.
  • Snapshotting after every checkpoint_freq steps is not always desirable since it would consume time and storage, mostly due to the replay buffer. It should be optional.
  • How fast is it to save and load a large replay buffer, e.g. with 1 million transitions of 10KB each? (This is roughly what you would expect when running DQN on Atari, so you will see it by running the experiments suggested below; a rough timing sketch also follows this list.)
  • Can you run experiments to demonstrate that this PR works in non-trivial settings? My suggestion is:
    • Run python examples/atari/reproduction/train_dqn.py --env SpaceInvadersNoFrameskip-v4 --steps 10000000 (which takes <1 day with a single GPU, a single CPU, and 14GB of CPU RAM) with snapshots saved. Run with five random seeds (--seed 0/1/2/3/4) since variance among runs is high.
    • Resume from 5000000 steps to 10000000 steps for each seed.
    • Check that they roughly match.
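
A rough sketch of how the save/load time could be measured (hypothetical benchmark code, not part of the PR; it assumes pfrl's ReplayBuffer.save/load and needs on the order of 10-20GB of RAM for 1 million ~10KB transitions):

```python
import time

import numpy as np
from pfrl.replay_buffers import ReplayBuffer

# Fill a buffer with 1 million transitions holding ~10KB dummy observations.
rbuf = ReplayBuffer(capacity=10 ** 6)
for _ in range(10 ** 6):
    obs = np.zeros(10 * 1024, dtype=np.uint8)  # ~10KB per observation
    rbuf.append(state=obs, action=0, reward=0.0, next_state=obs,
                is_state_terminal=False)

start = time.time()
rbuf.save("replay_buffer.pkl")
print("save:", time.time() - start, "s")

start = time.time()
rbuf.load("replay_buffer.pkl")
print("load:", time.time() - start, "s")
```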


knshnb commented Sep 14, 2021

Thank you for the detailed comments!
Below is a memo of the discussion with @muupan.

What I skip in this PR

  • rng-related things
  • recurrent states of the model and the env's internal state (needed only when you resume a training episode halfway through)
  • other internal states of Agent
  • Agent statistics

What I implement

  • save a snapshot instead of calling save_agent only when take_resumable_snapshot is True
  • save steps and episodes in a file (such as checkpoint.txt)
    • Agent statistics should be included here in the future
  • restore max_score from scores.txt (see the sketch after this list)
    • include scores.txt in the snapshot for the case where eval_interval != checkpoint_freq
  • add a test in pfrl/examples_tests/atari/reproduction/test_dqn.sh
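
A minimal sketch of how max_score could be restored from scores.txt (hypothetical helper, assuming the tab-separated format with a "mean" column that pfrl's evaluator writes):

```python
import csv


def restore_max_score(scores_path):
    # Take the maximum of the "mean" column over all past evaluations.
    max_score = float("-inf")
    with open(scores_path) as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            max_score = max(max_score, float(row["mean"]))
    return max_score
```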


knshnb commented Sep 14, 2021

I conducted the experiment you suggested with the following command:
python examples/atari/reproduction/dqn/train_dqn.py --env SpaceInvadersNoFrameskip-v4 --steps 10000000 --checkpoint-freq 2000000 --save-snapshot --load-snapshot --seed ${SEED} --exp-id ${SEED}

For each seed, I ran another training run resuming from the 6000000-step snapshot. As shown in the graph below, the score transitions after resuming from the snapshots were roughly the same as those without resumption.
[Figure: evaluation scores over training steps for each seed, comparing runs resumed from the 6000000-step snapshot with runs without resumption]

In this experiment, each snapshot was about 6.8GB and took around 60-100 seconds to save to an NFS server in my environment. You can check how many seconds it took to save each snapshot in snapshot_history.txt.

@knshnb changed the title from "[WIP] Snapshot for preemption" to "Snapshot for preemption" on Sep 14, 2021

muupan commented Sep 14, 2021

/test

@pfn-ci-bot

Successfully created a job for commit dde7ebf.


knshnb commented Sep 14, 2021

Sorry, I have fixed the linter problem.


knshnb commented Sep 15, 2021

(I forgot to write this earlier.)
Memo: saving snapshots requires about twice as much CPU memory (~30GB in the above experiment).


knshnb commented Feb 9, 2022

Hi! Is there any action required for this PR to be merged?
