
add batch environment #146

Draft · wants to merge 28 commits into master

Conversation

Sid-Bhatia-0 (Member)

  1. Create a separate module for a new batch environment SingleRoomUndirectedBatch.
  2. Redesign objects and actions. Try to keep things as simple as possible for now.

This is a work in progress. Several things need to be added:

  1. Parallelization using multi-threading / GPU.
  2. Interactive playability of batch environments in the REPL for testing.
  3. Tests.
  4. Update README.

codecov-commenter commented Jun 15, 2021

Codecov Report

Merging #146 (6d114c6) into master (6d87e18) will decrease coverage by 5.80%.
The diff coverage is 30.29%.


```diff
@@            Coverage Diff             @@
##           master     #146      +/-   ##
==========================================
- Coverage   78.21%   72.40%   -5.81%
==========================================
  Files          21       24       +3
  Lines        2226     2533     +307
==========================================
+ Hits         1741     1834      +93
- Misses        485      699     +214
```
| Impacted Files | Coverage Δ |
|---|---|
| src/GridWorlds.jl | 100.00% <ø> (ø) |
| src/envs/envs.jl | 100.00% <ø> (ø) |
| src/envs/single_room_undirected.jl | 0.00% <0.00%> (ø) |
| src/play.jl | 0.00% <0.00%> (ø) |
| src/envs/single_room_undirected_batch.jl | 57.05% <57.05%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Sid-Bhatia-0 (Member, Author)

  1. Added REPL-based playing.
  2. Added tests.
  3. Added a script for benchmarking various aspects of batch environments.

Here are some benchmarks for this environment with batch size 64 on a single thread without any parallelization. I'm sure the numbers can be improved. These are just preliminary benchmarks.

Right now I am copying the state, reward, and done arrays each time `RLBase.state`, `RLBase.reward`, or `RLBase.is_terminated` is called, respectively. @findmyway maybe we could have an `RLBase.state!` method in the RLBase API that writes the state into a pre-allocated array? Otherwise, there is always the question of whether or not to return a copy of the state. If we directly return `env.tile_map`, there is a risk of unintended consequences. If we generate a new copy each time, that is inefficient.
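To make the aliasing risk concrete, here is a minimal sketch (the `ToyEnv` type and `state` function are stand-ins for illustration, not the actual types in this PR):

```julia
struct ToyEnv
    tile_map::BitArray{3}
end

state(env::ToyEnv) = env.tile_map    # returns a reference, zero copies

env = ToyEnv(falses(2, 4, 4))
s = state(env)
s[1, 1, 1] = true                    # the caller mutates its "state"...
@assert env.tile_map[1, 1, 1]        # ...and the environment changed underneath it
```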

Date: 2021_06_17_23_41_30

List of Environments

  1. SingleRoomUndirectedBatch

Benchmarks

SingleRoomUndirectedBatch

```
Run uniformly random policy, NUM_ENVS = 64, NUM_RESETS = 100, STEPS_PER_RESET = 100, TOTAL_STEPS = 10000
  memory estimate:  118.94 MiB
  allocs estimate:  3256401
  minimum time:     245.287 ms (1.38% GC)
  median time:      248.607 ms (2.08% GC)
  mean time:        250.885 ms (1.88% GC)
  maximum time:     277.825 ms (1.28% GC)
  samples:          20
  evals/sample:     1

GridWorlds.ModuleSingleRoomUndirectedBatch.SingleRoomUndirectedBatch()
  memory estimate:  2.23 MiB
  allocs estimate:  928
  minimum time:     1.481 ms (0.00% GC)
  median time:      1.559 ms (0.00% GC)
  mean time:        1.689 ms (3.84% GC)
  maximum time:     4.262 ms (30.46% GC)
  samples:          2952
  evals/sample:     1

RLBase.reset!(env)
  memory estimate:  7.00 KiB
  allocs estimate:  64
  minimum time:     6.581 μs (0.00% GC)
  median time:      7.057 μs (0.00% GC)
  mean time:        8.222 μs (1.48% GC)
  maximum time:     197.964 μs (93.81% GC)
  samples:          10000
  evals/sample:     5

RLBase.state(env)
  memory estimate:  1.66 KiB
  allocs estimate:  2
  minimum time:     196.419 ns (0.00% GC)
  median time:      259.854 ns (0.00% GC)
  mean time:        291.363 ns (9.01% GC)
  maximum time:     2.413 μs (56.48% GC)
  samples:          10000
  evals/sample:     676

RLBase.action_space(env)
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     2.533 ns (0.00% GC)
  median time:      2.797 ns (0.00% GC)
  mean time:        2.943 ns (0.00% GC)
  maximum time:     35.666 ns (0.00% GC)
  samples:          10000
  evals/sample:     1000

RLBase.is_terminated(env)
  memory estimate:  128 bytes
  allocs estimate:  2
  minimum time:     61.851 ns (0.00% GC)
  median time:      65.807 ns (0.00% GC)
  mean time:        76.281 ns (9.40% GC)
  maximum time:     3.087 μs (96.85% GC)
  samples:          10000
  evals/sample:     975

RLBase.reward(env)
  memory estimate:  336 bytes
  allocs estimate:  1
  minimum time:     56.887 ns (0.00% GC)
  median time:      62.510 ns (0.00% GC)
  mean time:        70.044 ns (5.47% GC)
  maximum time:     1.226 μs (92.14% GC)
  samples:          10000
  evals/sample:     976

env(1)
  memory estimate:  10.00 KiB
  allocs estimate:  320
  minimum time:     20.442 μs (0.00% GC)
  median time:      22.449 μs (0.00% GC)
  mean time:        25.575 μs (1.55% GC)
  maximum time:     2.016 ms (98.68% GC)
  samples:          10000
  evals/sample:     1

env(2)
  memory estimate:  10.00 KiB
  allocs estimate:  320
  minimum time:     20.467 μs (0.00% GC)
  median time:      22.522 μs (0.00% GC)
  mean time:        25.235 μs (1.59% GC)
  maximum time:     2.075 ms (98.64% GC)
  samples:          10000
  evals/sample:     1

env(4)
  memory estimate:  10.00 KiB
  allocs estimate:  320
  minimum time:     20.011 μs (0.00% GC)
  median time:      21.992 μs (0.00% GC)
  mean time:        24.767 μs (1.61% GC)
  maximum time:     2.046 ms (98.64% GC)
  samples:          10000
  evals/sample:     1

env(3)
  memory estimate:  10.00 KiB
  allocs estimate:  320
  minimum time:     20.017 μs (0.00% GC)
  median time:      22.013 μs (0.00% GC)
  mean time:        24.846 μs (1.60% GC)
  maximum time:     2.060 ms (98.60% GC)
  samples:          10000
  evals/sample:     1
```

zsunberg (Member) commented Jun 19, 2021

> If we directly return `env.tile_map`, there is a risk of unintended consequences.

In my experience, you are right to worry about this. Returning a reference to the internal state of the environment will result in a lot of frustrating unexpected behavior. On the other hand, I don't have experience with "batch" environments, so I may be missing something important about how this will be used.

> If we generate a new copy each time, that is inefficient.

I am not sure that it is that inefficient. Copying memory is not too bad; the only thing that is really expensive is allocating memory on the heap (which happens when you create a new variable-sized array).

If the compiler knows the size of the grid world and you use static arrays, I think it should be pretty efficient to return a copy.
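For example, here is a quick sketch of the difference between the allocating and non-allocating variants, assuming the grid is stored as a BitArray (not code from this PR):

```julia
using BenchmarkTools

src  = falses(20, 20)
dest = falses(20, 20)

@btime copy($src)            # allocates a fresh BitMatrix on every call
@btime copyto!($dest, $src)  # reuses preallocated memory: 0 allocations
```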

Sid-Bhatia-0 (Member, Author)

Thank you for your comments, @zsunberg.

> On the other hand, I don't have experience with "batch" environments, so I may be missing something important about how this will be used.

From my understanding, this part shouldn't be too different for single vs. batch environments (I am imagining the simple case where all the state vectors are batched together and fed to a neural network as one input, vs. a single state vector fed to the network in the non-batch case).
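For instance, here is a minimal sketch of what I mean by batching (the shapes are invented for illustration):

```julia
# One state vector per environment in the batch...
states = [rand(Float32, 16) for _ in 1:64]

# ...stacked into a single 16×64 matrix that the network consumes in one
# forward pass, versus a single 16-element vector in the non-batch case.
batch = reduce(hcat, states)
```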

> Returning a reference to the internal state of the environment will result in a lot of frustrating unexpected behavior.

In order to avoid this safely, we need at least one memory copy operation anyway. This should hold true even for batch environments.

> I am not sure that it is that inefficient. Copying memory is not too bad; the only thing that is really expensive is allocating memory on the heap (which happens when you create a new variable-sized array).

Yes. Right now, in the current implementation of all the single environments, fetching the state looks like the following:

```julia
RLBase.state(env) = copy(env.tile_map)
```

As you pointed out, it is relatively expensive to (unnecessarily) allocate new memory on the heap for each step of the game loop.
We could do better by letting the algorithm allocate the desired memory for the state vector only once, and then do something like the following at each step:

```julia
RLBase.state!(state_cache, env) = copy!(state_cache, env.tile_map)
```

This requires only one memory allocation during initialization, and then one memory copy operation at each time step (which we inevitably need in order to avoid unintended mutations of the environment's internal state).
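Putting it together, the intended usage pattern would look something like this (a sketch only; `RLBase.state!` is hypothetical and not currently part of the RLBase API, and `env` is assumed to be an instantiated environment):

```julia
state_cache = similar(env.tile_map)           # allocate once, up front

for step in 1:10000
    action = rand(RLBase.action_space(env))   # uniformly random policy
    env(action)                               # step the environment
    RLBase.state!(state_cache, env)           # copy in place, no new allocation
    # ... hand state_cache to the learning algorithm ...
end
```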

> If the compiler knows the size of the grid world and you use static arrays, I think it should be pretty efficient to return a copy.

Right. I think it might still be slightly better to pre-allocate memory for a mutable static array than to create a new static array at each step. I'm not totally sure about this one, though; the two might be pretty close for static arrays. More importantly, the number of elements in the state vector could go well above 100 even for small grid worlds, so using static arrays might not be a scalable solution, especially for larger grid worlds. Maybe the maximum useful size could be higher for static BitArrays. I don't know if BitArrays are supported in static arrays though.

Another benefit of having something like `RLBase.state!` is consistency. As opposed to `RLBase.state`, `RLBase.state!` clearly communicates that the array has been preallocated, so one need not worry about whether to return a copy of the state vector or not.

@findmyway Please let us know your thoughts as well.

zsunberg (Member) commented Jun 19, 2021

> I don't know if BitArrays are supported in static arrays though.

Unfortunately no :( JuliaArrays/StaticArrays.jl#412

I did some benchmarks:

```julia
using StaticArrays
using BenchmarkTools

World = falses(20, 20)

display(@benchmark SMatrix{20,20}($World)) # mean: 103 ns; max:  248 ns (0 alloc)
display(@benchmark MMatrix{20,20}($World)) # mean: 239 ns; max: 3384 ns (1 alloc)
display(@benchmark copy($World))           # mean:  69 ns; max: 2297 ns (2 alloc)
```

It looks like the savings from keeping it in terms of bits rather than Bools are quite significant: in these tests the bit representation actually outweighs the cost of garbage collection/allocation on average.
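For intuition, the gap in raw data volume is easy to see (a sketch; exact byte counts vary with array overhead):

```julia
bits  = falses(20, 20)        # BitMatrix: 1 bit per cell
bools = fill(false, 20, 20)   # Matrix{Bool}: 1 byte per cell

Base.summarysize(bits)   # roughly 8x less payload to copy...
Base.summarysize(bools)  # ...than the Bool version
```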

It seems like `state!` would indeed be most efficient. However, I don't really like the idea of `state!` being part of the interface, because then environment implementers will have to implement it. (By the way, this is why I thought optional interface functions and `provided` would be useful in CommonRLInterface: algorithm writers can use `state!` if it is provided, but fall back to a slower `state` if it is not, and the `AutomaticDefault` submodule could make this automatic.)

findmyway (Member)

Thanks for @zsunberg's comments.

In summary, we are mainly discussing the following points here:

  1. Whether to copy the state?
  2. If so, how should the state be represented?

Re #1: for this package (GridWorlds.jl) alone, I prefer to expose the inner state directly. We don't know what package users will do with the environment states, and making decisions for users is not a good practice in my opinion. For example, one may simply transfer the state from CPU to GPU and do some heavy operations there, or one may want to resize the image; always copying the state is a waste in these cases. However, the main problem here is that this package is now tightly bound to RLBase.jl (which I warned @Sid-Bhatia-0 not to do at the early stage of this package...). `RLBase.state` doesn't specify whether the state is mutable or not, so one sometimes needs to add a wrapper that copies the state. But some other interfaces may have different assumptions from RLBase (e.g. CommonRLInterface.jl or the interfaces used in AlphaZero.jl).

Re #2: based on @zsunberg's benchmark results, directly copying seems to be the fastest.

As for how to implement the "batch" environment: this is still experimental, the goal being to leverage as many resources as possible to run multiple replicas at the same time.

@Sid-Bhatia-0, could you make a summary of the benchmark results you listed above? I'm afraid they're too verbose, making it hard for others to understand what you want to express here.

Sid-Bhatia-0 (Member, Author)

> Unfortunately no :( JuliaArrays/StaticArrays.jl#412

I see.

> However, I don't really like the idea of `state!` being part of the interface, because then environment implementers will have to implement it. (By the way, this is why I thought optional interface functions and `provided` would be useful in CommonRLInterface: algorithm writers can use `state!` if it is provided, but fall back to a slower `state` if it is not, and the `AutomaticDefault` submodule could make this automatic.)

Right. This is an interface concern. I don't have an informed opinion on this yet. I haven't looked into CommonRLInterface much, but maybe the solution could be along the lines of "optional interfaces", as you mentioned.

> We don't know what package users will do with the environment states, and making decisions for users is not a good practice in my opinion. For example, one may simply transfer the state from CPU to GPU and do some heavy operations there, or one may want to resize the image; always copying the state is a waste in these cases.

Yes. You are right that not all users may want to go with the defaults; some may want to do custom things to get the environment state and may need direct access to the internal state. But making that the default behavior might cause more problems than it solves (for most users). Moreover, isn't that what method overloading is for? If someone wants to retrieve the internals and do something fancy with them, they can always access the env struct and overload the default `RLBase.state` method like this:

```julia
RLBase.state(env) = env.tile_map
```

> However, the main problem here is that this package is now tightly bound to RLBase.jl (which I warned @Sid-Bhatia-0 not to do at the early stage of this package...)

Correct me if I am missing something, but here is what I understand... From a usability standpoint, using the RLBase interface for GridWorlds.jl seems like the obvious choice to me because of the ability to leverage the existing infrastructure (RLCore, RLZoo) around it. Sure, I could have created a separate interface for this package, but I don't see the point in creating yet another general/flexible RL environment interface from scratch, especially since it would not provide any algorithms/examples of its own to go along with the environments.

> But some other interfaces may have different assumptions from RLBase (e.g. CommonRLInterface.jl or the interfaces used in AlphaZero.jl).

Sure. And if one wants to use some other interface, they will have to write the conversion anyway. Compare the following:

```julia
MyCustomInterface.get_state(env) = RLBase.state(env)
```

and

```julia
MyCustomInterface.get_state(env) = HypotheticalGridWorldInterface.state(env)
```

Why would the first one be worse than the second one?

> `RLBase.state` doesn't specify whether the state is mutable or not.

Wouldn't this cause confusion if left unspecified?

> @Sid-Bhatia-0, could you make a summary of the benchmark results you listed above? I'm afraid they're too verbose, making it hard for others to understand what you want to express here.

The benchmarks I mentioned are just `BenchmarkTools.@benchmark` runs for all the common RL operations of an environment, like `env(action)`, `RLBase.state(env)`, `RLBase.reset!(env)`, etc., along with instantiation of an environment and an overall benchmark of running a random policy for 10000 steps. I can use the `@btime` macro instead of `@benchmark` to reduce verbosity, but other than that, compressing it further will result in a loss of relevant information.
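For reference, the difference is only in the printed output (a trivial, self-contained sketch):

```julia
using BenchmarkTools

x = rand(1000)

@benchmark sum($x)  # prints the full statistics block, like the output above
@btime sum($x)      # prints a single line: minimum time and allocations
```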

findmyway (Member)

> Moreover, isn't that what method overloading is for? If someone wants to retrieve the internals and do something fancy with them, they can always access the env struct and overload the default `RLBase.state` method like this:
>
> ```julia
> RLBase.state(env) = env.tile_map
> ```

Well, it's doable though...

> Why would the first one be worse than the second one?

This is just a matter of taste in software design. Personally, I prefer

```
  A
 / \
B   C
```

instead of

```
A
|
B
|
C
```

> > `RLBase.state` doesn't specify whether the state is mutable or not.
>
> Wouldn't this cause confusion if left unspecified?

Yes, I think so. Let me add a warning in the docstring first.

> I can use the `@btime` macro instead of `@benchmark` to reduce verbosity, but other than that, compressing it further will result in a loss of relevant information.

Yes, please do. And it would be great to make it into a table so that we can easily compare different implementations later.

findmyway added a commit to JuliaReinforcementLearning/ReinforcementLearning.jl that referenced this pull request Jun 21, 2021
findmyway added a commit to JuliaReinforcementLearning/ReinforcementLearning.jl that referenced this pull request Jun 21, 2021
Sid-Bhatia-0 (Member, Author)

> This is just a matter of taste in software design. Personally, I prefer
>
> ```
>   A
>  / \
> B   C
> ```
>
> instead of
>
> ```
> A
> |
> B
> |
> C
> ```

I see. Thank you for the clarification. I'll think more on it later. Here's an issue to track it: #148.

> And it would be great to make it into a table so that we can easily compare different implementations later.

Right. A table would be very nice to have. I'll see if I can fit them on a page. I'll do this in a separate PR. Here's an issue to track it: #147.

Sid-Bhatia-0 (Member, Author) commented Jun 24, 2021

Ok. So here are some benchmarks comparing optimized single-threaded CPU implementations of batch vs. non-batch for the SingleRoomUndirected environment.

For the random_policy column, NUM_RESETS = 100, STEPS_PER_EPISODE = 100.

SingleRoomUndirected (non-batch)

| | random_policy | reset | state | MOVE_UP | MOVE_DOWN | MOVE_LEFT | MOVE_RIGHT | action_space | is_terminated | reward |
|---|---|---|---|---|---|---|---|---|---|---|
| memory | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes |
| time | 316.639 μs | 64.814 ns | 1.875 ns | 12.734 ns | 14.151 ns | 13.275 ns | 14.229 ns | 2.185 ns | 1.842 ns | 1.828 ns |

SingleRoomUndirectedBatch (NUM_ENVS = 1)

| | random_policy | reset | state | MOVE_UP | MOVE_DOWN | MOVE_LEFT | MOVE_RIGHT | action_space | is_terminated | reward |
|---|---|---|---|---|---|---|---|---|---|---|
| memory | 96 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes |
| time | 629.735 μs | 86.335 ns | 2.914 ns | 43.705 ns | 43.630 ns | 44.786 ns | 44.370 ns | 2.426 ns | 2.432 ns | 2.799 ns |

SingleRoomUndirectedBatch (NUM_ENVS = 1024)

| | random_policy | reset | state | MOVE_UP | MOVE_DOWN | MOVE_LEFT | MOVE_RIGHT | action_space | is_terminated | reward |
|---|---|---|---|---|---|---|---|---|---|---|
| memory | 8.12 KiB | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes | 0 bytes |
| time | 383.722 ms | 111.034 μs | 1.879 ns | 15.101 μs | 14.594 μs | 14.129 μs | 14.021 μs | 1.854 ns | 2.200 ns | 1.887 ns |

The batch version with NUM_ENVS = 1 is worse than the non-batch version, which was expected.

The scaling from non-batch to batch is roughly linear (slightly worse) in NUM_ENVS for large NUM_ENVS, which is reasonable. Both the batch and non-batch versions are fairly optimized, so we don't observe any gains on a single CPU thread. Right now we are only benchmarking the RLBase operations on these environments without any other overheads; if we benchmark end-to-end with an actual algorithm, we should ideally see some benefit from the batch version even on a single CPU thread, since it uses a struct of arrays.

Note: the allocation during random_policy for the batch version is a one-time allocation for storing the batch of actions generated at each step.

findmyway (Member)

Hi @Sid-Bhatia-0 ,

After thinking for a while, I'm wondering maybe it is possible to have one unified codebase to implement the batch environments.

First, let's assume the game state and game spec are separated. Basically, a game state may look like this:

```julia
struct SomeSpecificGameState
    board::AbstractArray{Bool}
    agent_position::CartesianIndex
    reward::Float32
end
```

Then, ideally we can use StructArrays.jl to create a vector of game states. As long as our game logic is implemented in pure Julia code, the same code should work well on both CPU and GPU.

The only thing I haven't figured out is how to update scalar values under the StructArray context (the `agent_position` and `reward` fields in the above example), because each element is now an immutable struct.

Based on the example in StructArrays.jl, the whole element is updated with the `setindex!` method. Maybe we can modify the syntax slightly, from `dest[i].reward = src[i].reward` to `dest[:reward, i] = src[:reward, i]`, to overcome the immutability problem. You may give it a try, or propose some other approaches.
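For reference, here is a sketch of the component-wise update that the current StructArrays API already allows on CPU, using a simplified version of the struct above (the field layout is invented for illustration; whether this pattern works equally well on GPU is the open question):

```julia
using StructArrays

struct GameState
    agent_position::CartesianIndex{2}
    reward::Float32
end

states = StructArray([GameState(CartesianIndex(1, 1), 0.0f0) for _ in 1:64])

# Each element is an immutable GameState, but `states.reward` returns the
# underlying field vector, so scalar fields can be mutated in place:
states.reward[3] = 1.0f0
states.agent_position[3] = CartesianIndex(2, 2)
```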

Sid-Bhatia-0 (Member, Author)

Hmm... It'll take me a while before I can think more deeply about this and share an informed opinion.

As discussed in our last call, let us put the batch environments on hold for now. I will be focusing on getting all the existing single versions working perfectly (resolving all non-batch issues) before JuliaCon.
