
Rework the run loop #921


Merged: 40 commits merged into main from loop-traj, Jul 29, 2023
Conversation

@HenriDeh (Member) commented Jul 7, 2023

This PR attempts to rework the run loop in order to fix data storage misalignment. It relies on the RLTrajectories 0.2 release, which stores the data along with meta-information about each step and keeps the traces aligned.

The loop is now as follows (a short sketch in code is given after the list):

  • PreEpisodeStage: the state is pushed alone to the trajectory.
  • PreActStage: nothing.
  • plan!: simply queries the action for the current state.
  • act!
  • PostActStage: push the action, reward, terminal flag, and the new state to the trajectory.
  • check_stop: now only breaks the episode loop prematurely.
  • loop back to PreActStage unless the reset condition is met.
  • PostEpisodeStage: checks for missing traces in the trajectory, typically :next_action. An action is only queried if :next_action is one of the trajectory's keys, so the user must make sure to use the correct trajectory.
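A minimal sketch of that episode loop, with hooks and the experiment-level stages omitted and with approximate names (the merged implementation may differ):

function run_episode!(agent, env, stop_condition, reset_condition)
    reset!(env)
    push!(agent, PreEpisodeStage(), env)           # push the initial state alone
    while !reset_condition(agent, env)             # loop back to PreActStage unless reset
        push!(agent, PreActStage(), env)           # nothing for a plain Agent
        action = RLBase.plan!(agent, env)          # query the action for the current state
        act!(env, action)
        push!(agent, PostActStage(), env, action)  # push action, reward, terminal flag, new state
        stop_condition(agent, env) && break        # check_stop: only breaks the episode loop early
    end
    push!(agent, PostEpisodeStage(), env)          # fill in missing traces such as :next_action
end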

You'll notice that the agent no longer has a cache. This is because the reward, terminal flag, and next state can all be obtained at the PostActStage and immediately pushed to the trajectory. The action is given as an argument to push!.

I also renamed agent/base.jl to agent/agent_base.jl because it was bugging me to have a base.jl tab that was unrelated to RLBase.

PR Checklist

  • Update NEWS.md?
  • MultiAgent
  • Unit tests for all structs / functions?
  • Integration and correctness tests using a simple env?
  • PR Review?
  • Add or update documentation?
  • Write docstrings for new methods?

@HenriDeh linked an issue Jul 7, 2023 that may be closed by this pull request
@codecov bot commented Jul 7, 2023

Codecov Report

Merging #921 (aa55642) into main (2752420) will decrease coverage by 0.72%.
The diff coverage is 78.43%.


@@            Coverage Diff             @@
##             main     #921      +/-   ##
==========================================
- Coverage   24.29%   23.57%   -0.72%     
==========================================
  Files         222      222              
  Lines        7747     7699      -48     
==========================================
- Hits         1882     1815      -67     
- Misses       5865     5884      +19     
Files Changed Coverage Δ
...ementLearningCore/src/ReinforcementLearningCore.jl 100.00% <ø> (ø)
...LearningCore/src/policies/agent/agent_srt_cache.jl 0.00% <0.00%> (-100.00%) ⬇️
...forcementLearningCore/test/policies/multi_agent.jl 100.00% <ø> (ø)
...eps/experiments/experiments/DQN/DQN_CartPoleGPU.jl 0.00% <ø> (ø)
...ments/experiments/DQN/JuliaRL_BasicDQN_CartPole.jl 0.00% <ø> (ø)
...xperiments/experiments/DQN/JuliaRL_DQN_CartPole.jl 0.00% <ø> (ø)
...xperiments/experiments/DQN/JuliaRL_IQN_CartPole.jl 0.00% <ø> (ø)
...xperiments/experiments/DQN/JuliaRL_NFQ_CartPole.jl 0.00% <ø> (ø)
...experiments/DQN/JuliaRL_PrioritizedDQN_CartPole.jl 0.00% <ø> (ø)
...eriments/experiments/DQN/JuliaRL_QRDQN_CartPole.jl 0.00% <ø> (ø)
... and 10 more

... and 1 file with indirect coverage changes

@HenriDeh (Member, Author) commented Jul 7, 2023

@Mytolo have a look. It's WIP, but it gives an idea of what this could be. Feel free to review.

@Mytolo (Contributor) left a comment:

For the multi_agent_policy, the run loop also has to be adjusted (in the sequential-environment dispatch). I am also not sure whether the experience is stored the way I suppose it to be, so I would like to discuss it here:

Tuples in experience replay:

(s_0, a_0, r_1, s_1, d_1)
(s_1, a_1, r_2, s_2, d_2)
...
(s_t, a_t, r_t+1, s_t+1, d_t+1)

and so on. Is it like that? And for trajectories with next_action, there should also be an a_t+1: is it different from the action chosen in the next state, or is it the same? (Or does it even matter whether it is or not?)

push!(agent.cache, reward(env), is_terminated(env))
function Base.push!(agent::Agent, ::PostActStage, env::AbstractEnv, action)
    next_state = state(env)
    push!(agent.trajectory, (state = next_state, action = action, reward = reward(env), terminal = is_terminated(env)))
Contributor:

Maybe I got something wrong, but shouldn't next_state be stored in the next_state field of the trajectory? next_state is the successor of the state before the action was taken in the environment, right?

Member (Author):

It's the same. Both names point to the same Trace in the trajectory.

Contributor:

Right. It is the multiplex trace, right?

Member (Author):

Yes

function Base.push!(agent::Agent, ::PostExperimentStage, env::E, player::Symbol) where {E<:AbstractEnv}
    RLBase.reset!(agent.cache)
function Base.push!(agent::Agent, ::PostEpisodeStage, env::E)
    if haskey(agent.trajectory, :next_action)
Contributor:

So if the episode finished (whether truncated or terminated), we query the policy to plan another step. Shouldn't we also check that the environment is not terminated? If it is, it just makes no sense to plan an action.

Member (Author):

Indeed it wouldn't make sense, but if your environment has terminal states at all, then you should not use a trajectory that has a next_action key. That's how I thought about it. If we added that check, it would allow the user to use an incorrect trajectory without an error being thrown, and the buffer would silently accumulate mistakes.
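For reference, a hedged sketch of that behavior (approximate names, not the exact merged code): the extra action is only planned when the trajectory actually has a :next_action trace, so an ill-suited trajectory fails loudly instead of silently collecting misaligned data.

function Base.push!(agent::Agent, ::PostEpisodeStage, env::AbstractEnv)
    if haskey(agent.trajectory, :next_action)
        next_action = RLBase.plan!(agent.policy, env)      # plan one extra step for SARSA-style traces
        push!(agent.trajectory, (action = next_action,))   # exact partial-push API may differ
    end
end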

@HenriDeh (Member, Author) commented Jul 7, 2023

(s_0, a_0, r_1, s_1, d_1)
(s_1, a_1, r_2, s_2, d_2)
...
(s_t, a_t, r_t+1, s_t+1, d_t+1)

Yes, these are the tuples that indexing/sampling the buffer returns. They are not stored as tuples though; each trace is stored contiguously in a dedicated array, and state and next_state share that array.
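A tiny illustration of that layout (just the idea, not the RLTrajectories API): one backing array serves both the :state and :next_state traces, offset by one step.

# Illustration only: :state and :next_state as two views into one contiguous array.
states = Float32[0.0, 1.0, 2.0]            # s_0, s_1, s_2 pushed over an episode
state_trace      = @view states[1:end-1]   # s_0, s_1  -> :state
next_state_trace = @view states[2:end]     # s_1, s_2  -> :next_state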

@HenriDeh (Member, Author) commented Jul 7, 2023

For the multi_agent_policy, the run loop also has to be adjusted (in the sequential-environment dispatch)

I haven't looked at that yet, but I should not forget. Thank you.

Base.push!(p::AbstractPolicy, ::AbstractStage, ::AbstractEnv, ::Symbol) = nothing
Base.push!(p::AbstractPolicy, ::PostActStage, ::AbstractEnv, action, ::Symbol) = nothing
Member:

This is begging for us to create an action type, but that's something for another PR. :)

@jeremiahpslewis (Member) commented:

@HenriDeh This looks great! I'll run a couple of type stability checks after this gets closer to ✅

@jeremiahpslewis marked this pull request as ready for review July 29, 2023 15:48
@jeremiahpslewis enabled auto-merge (squash) July 29, 2023 20:06
@jeremiahpslewis disabled auto-merge July 29, 2023 21:08
@jeremiahpslewis merged commit cce387b into main Jul 29, 2023
@HenriDeh deleted the loop-traj branch August 7, 2023 09:12
Successfully merging this pull request may close these issues:

  • Executing RLBase.plan! after end of experiment