Rework the run loop #921
Conversation
@Mytolo have a look. It's a WIP, but it gives an idea of what this could be. Feel free to review.
For the multi_agent_policy, the run loop also has to be adjusted (in the sequential environment dispatch). I am also not sure whether the experience is stored the way I suppose it is, and I would like to discuss it here:
Tuples in experience replay:
(s_0, a_0, r_1, s_1, d_1)
(s_1, a_1, r_2, s_2, d_2)
...
(s_t, a_t, r_t+1, s_t+1, d_t+1)
and so on. Is it like that? And for trajectories with next_action, should there also be an a_t+1 that is different from the action chosen in the next state? Or is it the same? (Or does it even matter whether it is or not?)
push!(agent.cache, reward(env), is_terminated(env))
function Base.push!(agent::Agent, ::PostActStage, env::AbstractEnv, action)
    next_state = state(env)
    push!(agent.trajectory, (state = next_state, action = action, reward = reward(env), terminal = is_terminated(env)))
Maybe I got something wrong, but shouldn't next_state be stored in the next_state field of the trajectory? next_state is the successor of the state before the action was taken in the environment, right?
It's the same. Both names point to the same Trace in the trajectory.
Right. It is the multiplex trace, right?
Yes
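For anyone following along, here is a minimal, self-contained toy sketch of the multiplexing idea (not the actual RLTrajectories implementation): a single backing array with two named, offset views.

```julia
# Toy illustration of a multiplexed trace: :state and :next_state are two views
# over the same backing vector, offset by one step.
struct ToyMultiplexTrace{T}
    data::Vector{T}
end
ToyMultiplexTrace{T}() where {T} = ToyMultiplexTrace{T}(T[])

Base.push!(t::ToyMultiplexTrace, x) = push!(t.data, x)

function Base.getindex(t::ToyMultiplexTrace, name::Symbol)
    n = length(t.data)
    name === :state      && return view(t.data, 1:n-1)
    name === :next_state && return view(t.data, 2:n)
    throw(KeyError(name))
end

t = ToyMultiplexTrace{Int}()
push!(t, 1); push!(t, 2); push!(t, 3)
t[:state]       # 2-element view: [1, 2]
t[:next_state]  # 2-element view: [2, 3]
```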
function Base.push!(agent::Agent, ::PostExperimentStage, env::E, player::Symbol) where {E<:AbstractEnv}
    RLBase.reset!(agent.cache)
function Base.push!(agent::Agent, ::PostEpisodeStage, env::E)
    if haskey(agent.trajectory, :next_action)
This is: if the episode finished (whether truncated or terminated), we query the policy to plan another step. Should we also check that the environment is not terminated? If it is, it just makes no sense to plan an action.
It wouldn't make sense indeed, but if your environment has terminal states at all, then you should not use a trajectory that has a next_action key. That's how I thought about it. If we add that check, it allows the user to have an incorrect trajectory without an error being thrown, and the buffer will accumulate mistakes.
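For concreteness, a hypothetical sketch of where such a guard would sit; the function body and the way the policy is queried are assumptions, not the PR's literal code:

```julia
# Hypothetical sketch: the PostEpisodeStage push for trajectories with a
# :next_action trace. The commented-out guard is the check discussed above,
# deliberately left out so that a misconfigured trajectory fails loudly
# instead of silently accumulating misaligned data.
function Base.push!(agent::Agent, ::PostEpisodeStage, env::AbstractEnv)
    if haskey(agent.trajectory, :next_action)
        # is_terminated(env) && return agent   # rejected: would hide an incorrect setup
        next_action = agent.policy(env)        # assumption: however the policy is queried
        push!(agent.trajectory, (action = next_action,))
    end
    return agent
end
```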
Yes, these are the tuples that indexing/sampling the buffer would return. They are not stored as tuples though; each trace is stored contiguously in a dedicated array, and state and next_state share that array.
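A purely illustrative toy example of that storage layout (made-up values and helper names, not the package internals):

```julia
# Per-trace contiguous storage; :state and :next_state read the same array,
# at offsets i and i + 1.
states    = [0, 1, 2, 3]            # length T+1, shared by :state and :next_state
actions   = [10, 11, 12]            # length T
rewards   = [0.0, 1.0, 0.5]
terminals = [false, false, true]

# Indexing/sampling assembles the tuple (s_t, a_t, r_{t+1}, s_{t+1}, d_{t+1})
# on the fly; nothing is stored as a tuple.
transition(i) = (
    state      = states[i],
    action     = actions[i],
    reward     = rewards[i],
    terminal   = terminals[i],
    next_state = states[i + 1],
)

transition(1)  # (state = 0, action = 10, reward = 0.0, terminal = false, next_state = 1)
```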
Haven't looked at that yet, but I should not forget. Thank you.
Base.push!(p::AbstractPolicy, ::AbstractStage, ::AbstractEnv, ::Symbol) = nothing
Base.push!(p::AbstractPolicy, ::PostActStage, ::AbstractEnv, action, ::Symbol) = nothing
This is begging for us to create an action type, but that's something for another PR. :)
@HenriDeh This looks great! I'll run a couple of type stability checks after this gets closer to ✅
This PR attempts to rework the run loop in order to fix data storage misalignment. This relies on the RLTrajectories 0.2 release that stores the data along with meta information about each step and keeps the traces aligned.
The loop is now as follows
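(Hedged reconstruction of the new loop from the description and snippets in this thread; plan!, act! and optimise! are stand-in names for however the policy is queried, the environment is stepped, and the update is run, and the exact ordering may differ from the PR.)

```julia
function _run(policy::AbstractPolicy, env::AbstractEnv, stop_condition)
    push!(policy, PreExperimentStage(), env)
    is_stop = false
    while !is_stop
        reset!(env)
        push!(policy, PreEpisodeStage(), env)
        while !is_terminated(env)
            push!(policy, PreActStage(), env)
            action = plan!(policy, env)        # assumed: query the policy for an action
            act!(env, action)                  # assumed: step the environment
            # reward(env), is_terminated(env) and state(env) are all current here,
            # so the whole transition is pushed at once; no agent cache is needed.
            push!(policy, PostActStage(), env, action)
            optimise!(policy)                  # assumed placement of the update step
            if stop_condition(policy, env)
                is_stop = true
                break
            end
        end
        push!(policy, PostEpisodeStage(), env)
    end
    push!(policy, PostExperimentStage(), env)
end
```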
You'll notice that the agent no longer has a cache. This is because reward, terminal and next_state can all be obtained at the PostActStage and immediately pushed to the trajectory. The action is given as an argument to push!.
I also renamed agent/base.jl to agent/agent_base.jl because it was bugging me to have a base.jl tab that was unrelated to RLBase.
PR Checklist