An interactive visualization of Q-Learning, a fundamental reinforcement learning algorithm, implemented in vanilla JavaScript. Watch an AI agent learn to navigate from start to goal while avoiding danger zones!
- Overview
- How It Works
- Q-Learning Algorithm
- Project Structure
- Features
- Usage
- Configuration
- Understanding the Visualization
This project demonstrates Q-Learning, a model-free reinforcement learning algorithm. The agent learns to navigate a grid from the start position (top-left) to the goal position (bottom-right) while avoiding danger zones that you can place interactively.
- Exploration vs Exploitation (ε-greedy policy)
- Temporal Difference Learning
- Q-Value Updates
- Reward Shaping
The environment is an n×n grid (default 5×5) where:
| Cell Type | Color | Description |
|---|---|---|
| Start | 🟦 Light Blue | Agent's starting position (user-selected) |
| Goal | 🟩 Light Green | Target destination (user-selected) |
| Danger Zone | 🟥 Red | Obstacles with negative reward |
| Agent | 🟧 Orange | Current position of the learning agent |
The agent can take 4 actions at each step:
- ⬆️ Up - Move one cell up
- ⬇️ Down - Move one cell down
- ⬅️ Left - Move one cell left
- ➡️ Right - Move one cell right
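A common way to encode these moves is a small table of coordinate deltas. The sketch below is illustrative only; the action names match the strings passed to the Agent constructor later in this README, but `ACTION_DELTAS` and `move` are assumptions, not the project's exact code:

```js
// Sketch: map each action string to a grid offset (assuming y grows downward).
const ACTION_DELTAS = {
  up:    { dx: 0,  dy: -1 },
  down:  { dx: 0,  dy: 1 },
  left:  { dx: -1, dy: 0 },
  right: { dx: 1,  dy: 0 },
};

// Apply an action, clamping to the grid so the agent cannot leave the board.
function move(state, action, gridSize) {
  const { dx, dy } = ACTION_DELTAS[action];
  return {
    x: Math.min(gridSize - 1, Math.max(0, state.x + dx)),
    y: Math.min(gridSize - 1, Math.max(0, state.y + dy)),
  };
}
```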
| Event | Reward | Purpose |
|---|---|---|
| Reaching Goal | +1.0 | Encourage goal-seeking behavior |
| Stepping on Danger | -1.0 | Discourage dangerous paths |
| Each Step | -0.01 | Encourage finding shortest path |
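As a minimal sketch of how these rewards could be assigned inside step-like logic, assuming `goal` is an `{x, y}` object and `dangerZones` is an array of `{x, y}` cells (the helper name and the choice to end the episode on a danger cell are assumptions, not the project's exact environment.js):

```js
// Sketch: decide the reward and termination flag for the cell the agent lands on.
// Ending the episode on a danger cell is an assumption; the real environment.js may differ.
function rewardFor(nextState, goal, dangerZones) {
  const atGoal = nextState.x === goal.x && nextState.y === goal.y;
  const inDanger = dangerZones.some(d => d.x === nextState.x && d.y === nextState.y);

  if (atGoal)   return { reward: 1.0, done: true };   // reaching the goal ends the episode
  if (inDanger) return { reward: -1.0, done: true };  // danger zones are heavily penalized
  return { reward: -0.01, done: false };              // small step cost favors short paths
}
```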
The agent updates its Q-values using the Bellman equation:
Q(s, a) ← Q(s, a) + α × [r + γ × max(Q(s', a')) - Q(s, a)]
Where:
- Q(s, a) = Q-value for state s and action a
- α (alpha) = Learning rate (how much new info overrides old)
- r = Reward received after taking action
- γ (gamma) = Discount factor (importance of future rewards)
- s' = Next state
- max(Q(s', a')) = Maximum Q-value for the next state
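Written out in JavaScript, the update is only a few lines. This sketch assumes the Q-table is a plain object keyed by "x,y" strings with one value per action (matching the Q-table example shown later); the real agent.js method may be structured differently:

```js
// Sketch of the Q-learning update; qTable[stateKey] is an object of action -> value.
function qLearningUpdate(qTable, stateKey, action, reward, nextStateKey, alpha, gamma) {
  const maxNextQ = Math.max(...Object.values(qTable[nextStateKey])); // max over a' of Q(s', a')
  const tdTarget = reward + gamma * maxNextQ;                        // r + γ · max Q(s', a')
  const tdError  = tdTarget - qTable[stateKey][action];              // temporal-difference error
  qTable[stateKey][action] += alpha * tdError;                       // move Q(s, a) toward the target
}
```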
The agent balances exploration and exploitation:
```js
if (Math.random() < epsilon) {
// EXPLORE: Choose random action
return randomAction();
} else {
// EXPLOIT: Choose best known action
return actionWithHighestQValue();
}
```

- High ε (epsilon): More exploration (random actions)
- Low ε: More exploitation (best-known actions)
- ε decay: Gradually reduces over time (0.99× per step)
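The decay itself can be a single line; the 0.99 factor and 0.05 floor below come from the values quoted in this README, though exactly where the project applies the decay is an implementation detail:

```js
// Sketch: multiplicative ε decay with a lower bound, applied once per step.
epsilon = Math.max(0.05, epsilon * 0.99);
```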
| Parameter | Default | Description |
|---|---|---|
| α (alpha) | 0.5 | Learning rate - Higher = faster learning but less stable |
| γ (gamma) | 0.9 | Discount factor - Higher = more weight on future rewards |
| ε (epsilon) | 0.6 | Initial exploration rate - Decays to 0.05 minimum |
```
rl-learning/
├── index.html # Main HTML with UI components
├── style.css # Modern responsive styling
├── main.js # Training loop, UI controls, event handlers
├── agent.js # Q-Learning agent implementation
├── environment.js # Grid world environment
└── README.md # This documentation
```

`agent.js` exposes:

```js
class Agent {
constructor(actions, { alpha, gamma, epsilon })
getStateKey(state) // Convert {x,y} to "x,y" string
initializeState(state) // Initialize Q-values for new states
chooseAction(state) // Epsilon-greedy action selection
updateQValue(...) // Q-learning update rule
}
```

`environment.js` exposes:

```js
class Environment {
constructor(gridSize, start, goal)
draw(ctx, cellSize, offsetX, offsetY) // Render the grid
reset() // Reset to start position
step(state, action) // Execute action, return {state, reward, done}
showCurrentState(...) // Draw agent position
}
```

`main.js` handles:

- Canvas setup and rendering
- Training loop with async/await
- UI event handlers (buttons, slider, clicks)
- Pause/Resume/Reset functionality
- Speed control
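Combining the Agent and Environment APIs above, the core of such a training loop might look like the following sketch; `env`, `agent`, `speedMs`, `sleep`, and the exact `updateQValue` signature are assumptions for illustration:

```js
// Sketch of an async training loop; pause/resume, rendering, ε decay, and stats are omitted.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function train(episodes) {
  for (let episode = 0; episode < episodes; episode++) {
    let state = env.reset();        // assumed to return the start state
    let done = false;

    while (!done) {
      const action = agent.chooseAction(state);   // ε-greedy action selection
      const result = env.step(state, action);     // returns {state, reward, done}
      agent.updateQValue(state, action, result.reward, result.state); // signature assumed
      state = result.state;
      done = result.done;

      await sleep(speedMs);                       // delay controlled by the speed slider
    }
  }
}
```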
UI controls:

- ▶️ Start Training - Begin the learning process
- ⏸️ Pause / ▶️ Resume - Pause and resume training
- 🔄 Reset - Clear the Q-table and restart
- 📍 Placement Mode - Switch between placing Danger Zones, Start, and Goal
- ▶️ Run Agent - Execute the learned policy (greedy run, no learning)
- Dropdown menu - Choose grid size from 3×3 to 10×10
- Automatically adjusts cell size to fit the canvas
- Goal position updates to bottom-right corner
- Available sizes: 3×3, 4×4, 5×5 (default), 6×6, 7×7, 8×8, 10×10
- Slider - Adjust execution speed from slow (500ms) to max (instant)
- Real-time adjustment during training
- Click on cells to toggle danger zones before training
- Design your own maze/obstacle course
- Start and goal cells are protected
- Use Placement Mode to set the Start and Goal cells directly
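Under the hood, a click handler along these lines converts a canvas click into a grid cell; the names (`offsetX`, `offsetY`, `toggleDangerZone`, `start`, `goal`) are assumptions rather than the project's exact handler:

```js
// Sketch: convert a click on the canvas into grid coordinates, then toggle a danger cell.
canvas.addEventListener('click', (event) => {
  const rect = canvas.getBoundingClientRect();
  const x = Math.floor((event.clientX - rect.left - offsetX) / cellSize);
  const y = Math.floor((event.clientY - rect.top - offsetY) / cellSize);

  // Start and goal cells are protected from being overwritten.
  const isProtected = (x === start.x && y === start.y) || (x === goal.x && y === goal.y);
  if (!isProtected) {
    toggleDangerZone(x, y); // assumed helper: add or remove a danger cell
  }
});
```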
- Episode - Current training episode (out of 1000)
- Steps - Steps taken in current episode
- Best Steps - Minimum steps achieved to reach goal
- Epsilon - Current exploration rate (decays over time)
- Open `index.html` in a web browser
- Select grid size from the dropdown (3×3 to 10×10)
- Choose Placement Mode: Danger Zones, Set Start, or Set Goal
- Click cells on the grid to place according to the mode (optional)
- Adjust speed using the slider (optional)
- Click "Start Training" to begin
- After training, click "Run Agent" to watch the agent follow the learned policy (no exploration)
- Early episodes (high ε): Agent explores randomly, often hitting danger zones
- Mid training: Agent starts finding paths but still explores
- Late episodes (low ε): Agent consistently takes optimal/near-optimal paths
- Last 10 episodes: Slower playback to observe final learned behavior
- Create challenging mazes to see how the agent adapts
- Watch the "Best Steps" metric decrease as learning improves
- Pause training to examine the agent's current position
- Reset and try different danger zone configurations
In `main.js`, adjust the agent initialization:

```js
let agent = new Agent(['up', 'down', 'left', 'right'], {
alpha: 0.5, // Learning rate (0.0 - 1.0)
gamma: 0.9, // Discount factor (0.0 - 1.0)
epsilon: 0.6 // Initial exploration rate (0.0 - 1.0)
});
```

In `main.js`:

```js
const cellSize = 50; // Pixel size of each cell
const gridSize = 5;  // 5x5 grid (change to 7 for 7x7, etc.)
```

In `main.js`, change the train function call:

```js
train(1000); // Number of episodes
```
During training, the agent typically moves through three phases:

- Random Movement Phase
  - High epsilon → Agent tries random directions
  - Often gets stuck or hits danger zones
  - Many steps per episode
- Learning Phase
  - Agent starts remembering good paths
  - Fewer danger zone hits
  - Step count begins decreasing
- Exploitation Phase
  - Low epsilon → Agent follows learned policy
  - Consistent, efficient paths
  - Near-optimal step counts
After training, view the learned Q-values in the browser console:
```js
console.log(agent.qTable);
```

Example output:

```js
{
  "0,0": { up: -0.1, down: 0.5, left: -0.1, right: 0.3 },
  "1,0": { up: -0.2, down: 0.6, left: 0.2, right: 0.4 },
  // ... more states
}
```

Higher Q-values indicate preferred actions for each state.
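Because the highest Q-value per state marks the preferred action, the greedy policy can be read straight out of the table. This hypothetical helper is not part of the project, but it hints at the policy-arrow visualization listed under future enhancements:

```js
// Sketch: pick the highest-valued action for every state the agent has visited.
function greedyPolicy(qTable) {
  const policy = {};
  for (const [stateKey, actionValues] of Object.entries(qTable)) {
    policy[stateKey] = Object.entries(actionValues)
      .reduce((best, current) => (current[1] > best[1] ? current : best))[0];
  }
  return policy;
}

// e.g. greedyPolicy(agent.qTable) → { "0,0": "down", "1,0": "down", ... }
```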
- Markov Decision Process (MDP) - The mathematical framework
- Value Iteration - Related dynamic programming approach
- Deep Q-Networks (DQN) - Neural network extension of Q-learning
- Add diagonal movement options
- Implement SARSA algorithm comparison
- Add Q-value heatmap visualization
- Save/load trained Q-tables
- Add multiple goal states
- Implement policy visualization (arrows showing best actions)
This project is open source and available for educational purposes.
Made with ❤️ for learning Reinforcement Learning