Skip to content

Commit

Permalink
Create README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
EsratMaria authored Jul 22, 2020
1 parent 36a998e commit 01e9ef8
Showing 1 changed file with 68 additions and 0 deletions.
68 changes: 68 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
# Monte Carlo Q Learning on Simple Self Defined Environment

## Environment

it is like below:
**(a simple grid based env)**

```
E . . .
. . . .
S . . .
. . . E
```
Where,
* E --> terminal / end point
* S --> start point --> initially at (2, 0)

### How it works

The agent starts from **S** and learns the optimal policy to reach the end point. the reward are set as below:
```
(0, 0) --> 0
(3, 3) --> 0
```
The overall grid with initial defined rewards look like below:
```
---------------------------
0.00|-1.00|-1.00|-1.00|
---------------------------
-1.00|-1.00|-1.00|-1.00|
---------------------------
-1.00|-1.00|-1.00|-1.00|
---------------------------
-1.00|-1.00|-1.00| 0.00|
```
Possible actions that the agent can take are:
* U (up)
* D (down)
* L (left)
* R (right)

### Final Q table after 10000 Episodes
```
Q table:
----------------------------------------------
| State | U | D | L | R |
----------------------------------------------
| (0, 0) | 0.00 | 0.00 | 0.00 | 0.00 |
| (0, 1) | -1.18 | -2.67 | 0.00 | -2.56 |
| (0, 2) | -2.83 | -4.45 | -1.52 | -4.76 |
| (0, 3) | -4.03 | -7.43 | -3.41 | -6.30 |
| (1, 0) | 0.00 | -2.71 | -1.46 | -2.65 |
| (1, 1) | -1.46 | -3.76 | -1.45 | -3.74 |
| (1, 2) | -2.70 | -4.87 | -3.69 | -4.93 |
| (1, 3) | -4.31 | -6.26 | -9.94 | -8.55 |
| (2, 0) | -1.45 | -3.63 | -2.64 | -3.62 |
| (2, 1) | -2.77 | -4.08 | -2.69 | -4.50 |
| (2, 2) | -3.81 | -4.37 | -4.89 | -5.44 |
| (2, 3) | -7.41 | 0.00 | -5.34 | -6.04 |
| (3, 0) | -2.65 | -3.76 | -3.69 | -4.39 |
| (3, 1) | -4.35 | -4.93 | -3.50 | -3.99 |
| (3, 2) | -10.00 | -9.99 | -5.64 | 0.00 |
| (3, 3) | 0.00 | 0.00 | 0.00 | 0.00 |
----------------------------------------------
```
The agent chooses the maximum value for each state to find the optimal policy.
## Reference
* I learned it from the repository described [here](https://github.com/ravasconcelos/monte_carlo) and [here](https://colab.research.google.com/drive/1yJwMgv3XSZ6mLcZ1W7JAxLRZVBjHOXea?usp=sharing)

0 comments on commit 01e9ef8

Please sign in to comment.