
Nim Misère Q-Learning Agent

A reinforcement learning agent that learns to play Nim Misère (reverse Nim) through self-play, using the Q-Learning algorithm with reward shaping. Built in TypeScript with no external ML libraries: the Q-learning update, reward shaping, and self-play training loop are all implemented from scratch.

🎮 What is Nim Misère?

Nim Misère is a variant of the classic game of Nim with a twist:

  • Setup: Start with 4 piles of stones: [1, 3, 5, 7]
  • Gameplay: Players alternate turns, removing any number of stones from a single pile
  • Win Condition: The player who takes the LAST stone LOSES (this is the "misère" twist)

This variant makes strategy somewhat more complex than standard Nim, where taking the last stone wins.
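
The rules are small enough to sketch in a few lines of TypeScript (illustrative only; the project's actual implementation lives in src/game/NimGame.ts):

// Illustrative sketch of the rules; the real implementation is src/game/NimGame.ts.
type Move = { pile: number; count: number };

const START_PILES = [1, 3, 5, 7];

// A legal move removes 1..n stones from a single non-empty pile.
function legalMoves(piles: number[]): Move[] {
  const moves: Move[] = [];
  piles.forEach((stones, pile) => {
    for (let count = 1; count <= stones; count++) {
      moves.push({ pile, count });
    }
  });
  return moves;
}

// Misère twist: if the board is empty after your move, you took the last stone and lost.
function moverJustLost(pilesAfterMove: number[]): boolean {
  return pilesAfterMove.every(p => p === 0);
}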

🧠 Q-Learning with Reward Shaping

This project implements an advanced Q-Learning algorithm with reward shaping for faster convergence:

Q-Learning Formula:

Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]

Current Parameters (Optimized for Reward Shaping):

  • α (alpha) = 0.05: Learning rate (lower for stability)
  • γ (gamma) = 0.99: Discount factor (higher for long-term strategy)
  • ε (epsilon) = 1.0 → 0.05: Exploration rate with slower decay
  • Epsilon decay = 0.99995: Maintains exploration longer
  • Min epsilon = 0.05: Maintains 5% exploration throughout training
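
As a rough sketch (state/action keys and helper names here are illustrative, not the exact QLearningAgent.ts API), the update and ε-greedy action selection look like this:

// Tabular Q-learning update; the real logic lives in src/agent/QLearningAgent.ts.
const ALPHA = 0.05;
const GAMMA = 0.99;

// Q-table keyed by serialized state (e.g. "1,3,5,7") mapping action keys to values.
type QTable = Map<string, Map<string, number>>;

function getQ(q: QTable, state: string, action: string): number {
  return q.get(state)?.get(action) ?? 0;
}

// Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)]
function update(
  q: QTable,
  state: string,
  action: string,
  reward: number,
  nextState: string,
  nextActions: string[]
): void {
  const maxNext = nextActions.length
    ? Math.max(...nextActions.map(a => getQ(q, nextState, a)))
    : 0; // terminal state: no future value
  const current = getQ(q, state, action);
  if (!q.has(state)) q.set(state, new Map());
  q.get(state)!.set(action, current + ALPHA * (reward + GAMMA * maxNext - current));
}

// ε-greedy selection: explore with probability ε, otherwise pick the highest-valued action.
function chooseAction(q: QTable, state: string, actions: string[], epsilon: number): string {
  if (Math.random() < epsilon) {
    return actions[Math.floor(Math.random() * actions.length)];
  }
  return actions.reduce((best, a) => (getQ(q, state, a) > getQ(q, state, best) ? a : best));
}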

🎯 Reward Shaping Innovation

Unlike traditional Q-learning, which only hands out a reward when the game is won or lost, this implementation uses Nim-sum based reward shaping:

Traditional Approach:

Every move before the last: reward = 0 (no feedback)
Final move: reward = ±1 (win/loss)

Our Approach with Reward Shaping:

Each move: reward = 0.5 if Nim-sum = 0 (good strategic position)
           reward = 0 if Nim-sum ≠ 0 (suboptimal)
Final move: reward = ±1 (win/loss)

This provides immediate feedback on every move, teaching the agent the optimal Nim-sum strategy implicitly through reinforcement rather than explicit programming.

📁 Project Structure

nim-q-learning/
├── src/
│   ├── game/
│   │   └── NimGame.ts          # Game state, rules, and Nim-sum calculation
│   ├── agent/
│   │   └── QLearningAgent.ts   # Q-learning + file I/O
│   ├── training/
│   │   └── trainer.ts          # Self-play with reward shaping
│   └── demo/
│       └── play.ts             # Human vs Agent gameplay
├── models/
│   ├── agent1.json         # Saved trained models
│   └── agent2.json
├── CLAUDE.md                    # Architecture documentation
├── README.md
├── package.json
└── tsconfig.json

🚀 Getting Started

Prerequisites

  • Node.js (v16 or higher)
  • npm or yarn

Installation

# Install dependencies
npm install

# Build the project
npm run build

📊 Training the Agent

Train the agent through self-play with reward shaping:

npm run train

Training Output:

╔════════════════════════════════════════════════════════════╗
║         Nim Misère Q-Learning Training System            ║
╚════════════════════════════════════════════════════════════╝

🤖 Initializing agents...
✅ Agents created with parameters:
   Learning rate (α): 0.05
   Discount factor (γ): 0.99
   Initial exploration (ε): 1.0
   Epsilon decay: 0.99995
   Min epsilon: 0.05
   🎯 Reward Shaping: ENABLED (Nim-sum based)

🎮 Starting Training...

📊 Progress Report - Game 500000/5000000
────────────────────────────────────────────────────────────
⏱️  Time: 125.3s | Speed: 3990 games/s
🏆 Agent 1 wins: 251234 (50.2%)
🏆 Agent 2 wins: 248766 (49.8%)
📈 Avg game length: 8.1 moves
🎲 Epsilon: Agent1=0.0821 | Agent2=0.0821
🗂️  Q-table: Agent1=2847 states | Agent2=2891 states
🔄 Updates: Agent1=4052341 | Agent2=4048123

Training Configuration:

  • Default: 5,000,000 games (adjustable in src/training/trainer.ts)
  • Progress reports: Every 500,000 games
  • Training time: ~2-3 minutes for 5M games (on Apple M3)
  • Alternating starts: Agents swap starting position each game for balanced learning
  • Auto-save: Models saved to models/agent1.json and models/agent2.json
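
The self-play loop has roughly the following shape (method names such as getState, chooseAction, applyMove, isOver, update, and decayEpsilon are illustrative assumptions, and shapedReward stands for the reward logic described in "How Reward Shaping Works" below; the real loop lives in src/training/trainer.ts):

// Rough shape of the self-play training loop; see src/training/trainer.ts for the real thing.
import { NimGame } from './src/game/NimGame';
import { QLearningAgent } from './src/agent/QLearningAgent';

const NUM_GAMES = 5_000_000;
const REPORT_INTERVAL = 500_000;

const agent1 = new QLearningAgent();
const agent2 = new QLearningAgent();

for (let i = 0; i < NUM_GAMES; i++) {
  const game = new NimGame();        // fresh [1,3,5,7] board
  let player = i % 2;                // alternate who starts for balanced learning

  while (!game.isOver()) {
    const agent = player === 0 ? agent1 : agent2;
    const stateBefore = game.getState();        // e.g. serialized piles
    const move = agent.chooseAction(game);      // ε-greedy over legal moves
    game.applyMove(move);
    // Nim-sum based shaped reward, ±1 on the terminal move
    agent.update(stateBefore, move, shapedReward(game), game);
    player = 1 - player;
  }

  agent1.decayEpsilon();
  agent2.decayEpsilon();

  if ((i + 1) % REPORT_INTERVAL === 0) {
    console.log(`Progress: ${i + 1}/${NUM_GAMES} games`);
  }
}

agent1.saveToFile('models/agent1.json');
agent2.saveToFile('models/agent2.json');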

Adjusting Training Duration

Edit src/training/trainer.ts:

const numGames = 5000000;  // Try 100000 (quick), 1000000 (standard), or 10000000 (expert)
const reportInterval = 500000;  // Adjust based on numGames

🎯 Playing Against the Agent

Play against the trained agent:

npm run play

Demo Mode Features:

  • Load pre-trained models from file
  • Quick training mode (30,000 games)
  • Interactive console interface with visual pile representation
  • Choose who starts each game (you or agent)
  • Agent shows its Q-values for transparency
  • Automatic winner detection (no manual input needed)
  • Score tracking across multiple games

Agent Setup Options:

🤖 Agent Setup Options:
   1. Load pre-trained agent from file (default)
   2. Train a new agent (quick training)
   3. Use untrained agent (random play)

Choose option (1/2/3, default: 1): 1

📂 Attempting to load from: models/agent1.json
✅ Q-table loaded from: models/agent1.json
   States: 2847
   Q-values: 18234
   Epsilon: 0.0500

Example Gameplay:

────────────────────────────────────────────────────────────
Who should start this game? (you/agent, default: you): you
👤 You will start this game.

🎮 Starting new game!
You are Player 0, Agent is Player 1

═══════════════════════════════════════════════════════════
🎮 Nim Misère - Remember: Taking the LAST stone LOSES!
═══════════════════════════════════════════════════════════

Current piles:
  Pile 0: [ 1] ●
  Pile 1: [ 3] ●●●
  Pile 2: [ 5] ●●●●●
  Pile 3: [ 7] ●●●●●●●

Total stones remaining: 16
────────────────────────────────────────────────────────────

👤 Your turn!
Choose pile (0-3): 1
How many stones? (1-3): 2

✅ You removed 2 stone(s) from pile 1

🤖 Agent is thinking...
🎯 Agent chose: Remove 1 stone(s) from pile 2
   (Q-value: 0.8734)

💾 Model Persistence

Saving Models

Models are automatically saved after training to models/ directory:

  • agent1.json - First agent's Q-table
  • agent2.json - Second agent's Q-table

Loading Models

The demo automatically loads agent1.json when you choose option 1.

Manual Save/Load

import { QLearningAgent } from './src/agent/QLearningAgent';

// Save a trained agent
const agent = new QLearningAgent();
// ... train the agent ...
agent.saveToFile('models/my-expert-agent.json');

// Load from file
const loadedAgent = QLearningAgent.fromFile('models/my-expert-agent.json');

// Or load into existing agent
agent.loadFromFile('models/my-expert-agent.json');

🔍 How Reward Shaping Works

Mid-Game Rewards (while at least one pile still has more than 1 stone)

if (nextNimSum === 0) {
    reward = 0.5;  // Good move! Left opponent in losing position
} else {
    reward = 0;    // Suboptimal move
}

Example:

Before: [1,3,5,7] → Nim-sum = 0 (losing position: no move can restore Nim-sum = 0)
Move: Take 2 from pile 1
After: [1,1,5,7] → Nim-sum = 2

Reward = 0 (the opponent now has the advantage)

Opponent's reply: Take 2 from pile 3
After: [1,1,5,5] → Nim-sum = 0

Reward = 0.5 (Nim-sum restored to 0, handing back a losing position)

Agent learns through these immediate rewards!

Endgame Rewards (all piles ≤1)

const onesCount = piles.filter(p => p === 1).length;
if (onesCount % 2 === 1) {  // Odd number of 1s
    reward = 0.5;  // Odd number of 1-piles left: opponent is forced to take the last stone
} else {
    reward = 0;
}
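
Putting the two cases together with the terminal win/loss reward, the complete shaped reward can be sketched as a single function (the same idea the training-loop sketch above calls shapedReward; not the exact trainer code):

// Sketch of the complete shaped reward for one move.
// `piles` is the board after the agent's move; the flags mark the terminal case.
function shapedReward(piles: number[], gameOver: boolean, agentLost: boolean): number {
  if (gameOver) {
    return agentLost ? -1 : 1;                     // terminal reward: ±1
  }
  if (piles.every(p => p <= 1)) {
    const onesCount = piles.filter(p => p === 1).length;
    return onesCount % 2 === 1 ? 0.5 : 0;          // endgame: leave an odd number of 1-piles
  }
  const nimSum = piles.reduce((acc, p) => acc ^ p, 0);
  return nimSum === 0 ? 0.5 : 0;                   // mid-game: leave Nim-sum = 0
}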

Optimal Strategy (Nim-Sum)

The mathematically optimal strategy uses XOR (Nim-sum):

Nim-sum = pile[0] ⊕ pile[1] ⊕ pile[2] ⊕ pile[3]

Example: [1,3,5,7]
Binary:   0001
          0011
          0101
        ⊕ 0111
        ------
          0000 = 0

Nim-sum = 0 → Losing position for current player

Strategy:

  • If Nim-sum = 0: You're in a losing position (against a perfect opponent)
  • If Nim-sum ≠ 0: Make a move that leaves Nim-sum = 0 for opponent

The agent discovers this strategy through reward shaping!
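
For reference, the Nim-sum is a one-line XOR fold, and finding a move that restores Nim-sum = 0 takes only a few lines more (a sketch; in the misère endgame with all piles ≤ 1, the odd-number-of-1s rule above takes over):

// Nim-sum = pile[0] ⊕ pile[1] ⊕ ... ⊕ pile[n-1]
function nimSum(piles: number[]): number {
  return piles.reduce((acc, p) => acc ^ p, 0);
}

// A move that leaves Nim-sum = 0, if one exists (only possible when the current Nim-sum ≠ 0).
function zeroingMove(piles: number[]): { pile: number; count: number } | null {
  const s = nimSum(piles);
  if (s === 0) return null;              // losing position: every move leaves Nim-sum ≠ 0
  for (let i = 0; i < piles.length; i++) {
    const target = piles[i] ^ s;         // reducing pile i to this value zeroes the XOR
    if (target < piles[i]) {
      return { pile: i, count: piles[i] - target };
    }
  }
  return null;
}

nimSum([1, 3, 5, 7]);      // 0 → losing position for the player to move
zeroingMove([1, 1, 5, 7]); // { pile: 3, count: 2 } → leaves [1,1,5,5], Nim-sum = 0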

🛠️ Customization

Modify Training Parameters

Edit src/training/trainer.ts:

// Training parameters optimized for reward shaping
const alpha = 0.05;      // Learning rate: Try 0.01 (slower) or 0.1 (faster)
const gamma = 0.99;      // Discount: Try 0.95 (short-term) or 0.999 (long-term)
const epsilon = 1.0;     // Initial exploration
const epsilonDecay = 0.99995;  // Try 0.9999 (faster) or 0.99999 (slower)
const epsilonMin = 0.05;       // Try 0.01 (less) or 0.1 (more exploration)

Modify Game Rules

Edit src/game/NimGame.ts:

// Change initial piles
constructor(piles: number[] = [1, 3, 5, 7]) {  // Try [2,4,6,8] or [1,2,3]
  // ...
}

🧪 Testing the Implementation

Quick Game Test

npx ts-node -e "
import { NimGame } from './src/game/NimGame';
const game = new NimGame();
console.log(game.toString());
console.log('Nim-sum:', game.getNimSum());
console.log('Legal moves:', game.getLegalMoves().length);
"

Agent Action Test

npx ts-node -e "
import { QLearningAgent } from './src/agent/QLearningAgent';
import { NimGame } from './src/game/NimGame';

const agent = QLearningAgent.fromFile('models/agent1.json');
const game = new NimGame();
const move = agent.getBestAction(game);
console.log('Agent recommends:', move);
"

📄 License

MIT
