My grid world project (Swen 711)
This project implements the grid-world domain we learned in class. It is done in three parts:
- Have the agent select actions uniformly at random. Run 10,000 episodes and report the mean, standard deviation, maximum, and minimum of the observed discounted returns.
- Implement the value iteration algorithm to find the optimal policy. In this case, the agent selects the actions that maximize the future discounted reward. Report the optimal policy.
- Run the optimal policy found in Part 2 for 10,000 episodes. Compare the mean, standard deviation, maximum, and minimum of the observed discounted returns with the results from Part 1.
The environment dynamics are stochastic. When the agent attempts an action:
- p = 0.8: the correct action is attempted
- p = 0.05: the agent is confused and moves +90°
- p = 0.05: the agent is confused and moves -90°
- p = 0.1: the agent is confused and does not move

The agent cannot move out of the world; an attempt to do so results in no movement.
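To make the dynamics concrete, here is a minimal sketch of the transition rule. It assumes a 5x5 grid indexed by (row, column) with the two obstacle cells from the value table below; the coordinate convention and helper names are illustrative, and initial.py contains the actual implementation.

```python
import random

ROWS, COLS = 5, 5                    # assumed world size (matches the value table below)
OBSTACLES = {(2, 2), (3, 2)}         # assumed coordinates of the blocked cells

def rotate(direction, clockwise=True):
    """Rotate a (row, col) movement vector by 90 degrees."""
    dr, dc = direction
    return (dc, -dr) if clockwise else (-dc, dr)

def move(state, direction):
    """Deterministic move; leaving the grid or entering an obstacle means no movement."""
    nr, nc = state[0] + direction[0], state[1] + direction[1]
    if not (0 <= nr < ROWS and 0 <= nc < COLS) or (nr, nc) in OBSTACLES:
        return state
    return (nr, nc)

def step(state, action):
    """One noisy transition: 0.8 intended, 0.05 rotated +90°, 0.05 rotated -90°, 0.1 no move."""
    roll = random.random()
    if roll < 0.80:
        return move(state, action)
    if roll < 0.85:
        return move(state, rotate(action, clockwise=True))
    if roll < 0.90:
        return move(state, rotate(action, clockwise=False))
    return state
```

For example, `step((0, 0), (0, 1))` attempts to move the agent one cell to the right from the top-left corner, and will usually succeed but may rotate or stay put.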
See initial.py for the Python code related to Part 1. The environment dynamics lead to some variability between runs; a sample of the observed statistics is shown below:
Mean: -26.196
Standard Deviation: 50.85970491459816
Maximum: 10
Minimum: -480
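For reference, a minimal sketch of the Part 1 experiment is below. It reuses step() from the dynamics sketch above; the start state, discount factor, episode cap, and reward placement (+10 at the goal cell, -10 at the penalty cell) are placeholder assumptions, and the actual settings live in initial.py.

```python
import random
import statistics

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
GOAL, PENALTY = (4, 4), (4, 2)                 # assumed terminal cells (+10 / -10 reward)
GAMMA = 0.9                                    # hypothetical discount factor

def run_random_episode(start=(0, 0), max_steps=100):
    """Follow uniformly random actions; return the discounted return of the episode."""
    state, discount = start, 1.0
    for _ in range(max_steps):
        state = step(state, random.choice(ACTIONS))   # step() from the dynamics sketch
        if state == GOAL:
            return discount * 10
        if state == PENALTY:
            return discount * -10
        discount *= GAMMA
    return 0.0

returns = [run_random_episode() for _ in range(10_000)]
print("Mean:", statistics.mean(returns))
print("Standard Deviation:", statistics.pstdev(returns))
print("Maximum:", max(returns))
print("Minimum:", min(returns))
```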
See optimal.py for the Python code related to Part 2. Below is the world showing the maximum future discounted reward for each state. These values were found using the value iteration method, iterating until every value change was less than 0.05, which took 14 iterations. Note that the xxx.xxx entries represent the obstacles (states that cannot be entered).
[+003.74] [+004.24] [+004.79] [+005.40] [+005.96]
[+004.06] [+004.67] [+005.37] [+006.13] [+006.79]
[+003.59] [+004.08] [xxx.xxx] [+006.95] [+007.73]
[+003.16] [+003.56] [xxx.xxx] [+007.82] [+008.79]
[+002.72] [+002.43] [-010.00] [+008.79] [+010.00]
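The sketch below shows a value-iteration update of the kind that produces a table like the one above, reusing move() and rotate() from the dynamics sketch and the constants from the Part 1 sketch. The discount factor and the terminal reward placement are assumptions; optimal.py holds the actual settings.

```python
THRESHOLD = 0.05                                   # convergence threshold mentioned above
STATES = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) not in OBSTACLES]

def reward(state):
    """Placeholder rewards: +10 at the goal, -10 at the penalty cell, 0 elsewhere."""
    return 10 if state == GOAL else -10 if state == PENALTY else 0

def outcomes(state, action):
    """(probability, next_state) pairs for one action under the noisy dynamics."""
    return [(0.80, move(state, action)),
            (0.05, move(state, rotate(action, clockwise=True))),
            (0.05, move(state, rotate(action, clockwise=False))),
            (0.10, state)]

def value_iteration():
    """Bellman backups until the largest value change falls below THRESHOLD."""
    values = {s: 0.0 for s in STATES}
    iterations = 0
    while True:
        iterations += 1
        new_values = {}
        for s in STATES:
            if s in (GOAL, PENALTY):
                new_values[s] = reward(s)          # terminal cells hold their own reward
            else:
                new_values[s] = max(
                    sum(p * GAMMA * values[s2] for p, s2 in outcomes(s, a))
                    for a in ACTIONS)
        if max(abs(new_values[s] - values[s]) for s in STATES) < THRESHOLD:
            return new_values, iterations
        values = new_values
```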
See gridworld.py for the Python code related to Part 3. The code applies the optimal policy derived from the values above to the setup from Part 1. It was run 10,000 times and the statistics are below:
Mean: 10.0
Standard Deviation: 0.0
Maximum: 10.0
Minimum: 10.0
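For comparison, here is a minimal sketch of Part 3 under the same assumptions as the earlier sketches: extract the greedy policy from the converged values, then roll it out 10,000 times under the noisy dynamics. gridworld.py contains the actual code and settings.

```python
import statistics

def greedy_policy(values):
    """Pick, for each non-terminal state, the action with the best expected value."""
    return {s: max(ACTIONS, key=lambda a: sum(p * values[s2] for p, s2 in outcomes(s, a)))
            for s in STATES if s not in (GOAL, PENALTY)}

def evaluate(policy, episodes=10_000, start=(0, 0), max_steps=100):
    """Roll out a policy under the noisy dynamics and collect discounted returns."""
    returns = []
    for _ in range(episodes):
        state, discount, ret = start, 1.0, 0.0
        for _ in range(max_steps):
            state = step(state, policy[state])
            if state in (GOAL, PENALTY):
                ret = discount * reward(state)
                break
            discount *= GAMMA
        returns.append(ret)
    return returns

values, _ = value_iteration()
returns = evaluate(greedy_policy(values))
print("Mean:", statistics.mean(returns))
print("Standard Deviation:", statistics.pstdev(returns))
print("Maximum:", max(returns))
print("Minimum:", min(returns))
```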