A reinforcement learning agent that learns to play Nim Misère (reverse Nim) through self-play using the Q-Learning algorithm with reward shaping. Built in TypeScript with no external ML libraries - pure implementation of advanced Q-learning techniques.
Nim Misère is a variant of the classic game of Nim with a twist:
- Setup: Start with 4 piles of stones: [1, 3, 5, 7]
- Gameplay: Players alternate turns, removing any number of stones from a single pile
- Win Condition: The player who takes the LAST stone LOSES (this is the "misère" twist)
This variant makes strategy somewhat more complex than standard Nim, where taking the last stone wins.
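As a quick illustration of these rules, here is a standalone sketch (not the project's NimGame API): a move removes 1 to pile-size stones from one pile, and the game ends when every pile is empty, at which point the player who just moved has lost.
type Move = { pile: number; count: number };
// Every way to remove 1..size stones from a single non-empty pile
function legalMoves(piles: number[]): Move[] {
  return piles.flatMap((size, pile) =>
    Array.from({ length: size }, (_, k) => ({ pile, count: k + 1 }))
  );
}
// Misère twist: the game is over when all piles are empty, and the player
// who removed the last stone (the previous player) is the loser.
function isGameOver(piles: number[]): boolean {
  return piles.every(size => size === 0);
}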
This project implements an advanced Q-Learning algorithm with reward shaping for faster convergence:
Q-Learning Formula:
Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]
Current Parameters (Optimized for Reward Shaping):
- α (alpha) = 0.05: Learning rate (lower for stability)
- γ (gamma) = 0.99: Discount factor (higher for long-term strategy)
- ε (epsilon) = 1.0 → 0.05: Exploration rate with slower decay
- Epsilon decay = 0.99995: Maintains exploration longer
- Min epsilon = 0.05: Maintains 5% exploration throughout training
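For reference, here is a minimal sketch of a single Q-learning update using these parameters (standalone illustration; the project's actual implementation lives in src/agent/QLearningAgent.ts and may differ in details):
type QTable = Map<string, Map<string, number>>; // stateKey -> actionKey -> Q-value
const alpha = 0.05;  // learning rate α
const gamma = 0.99;  // discount factor γ
function qUpdate(q: QTable, state: string, action: string, reward: number,
                 nextState: string, nextActions: string[]): void {
  const qs = q.get(state) ?? new Map<string, number>();
  const current = qs.get(action) ?? 0;
  // max_a' Q(s', a') over legal actions in the next state (0 if terminal)
  const nextQs = q.get(nextState);
  const maxNext = nextActions.length > 0 && nextQs
    ? Math.max(...nextActions.map(a => nextQs.get(a) ?? 0))
    : 0;
  // Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]
  qs.set(action, current + alpha * (reward + gamma * maxNext - current));
  q.set(state, qs);
}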
Unlike traditional Q-learning, where the agent is only rewarded for winning or losing, this implementation uses Nim-sum-based reward shaping:
Traditional Approach:
Intermediate moves: reward = 0 (no feedback)
Final move: reward = ±1 (win/loss)
Our Approach with Reward Shaping:
Each move: reward = 0.5 if the resulting Nim-sum = 0 (good strategic position)
           reward = 0 if the resulting Nim-sum ≠ 0 (suboptimal)
Final move: reward = ±1 (win/loss)
This provides immediate feedback on every move, teaching the agent the optimal Nim-sum strategy implicitly through reinforcement rather than explicit programming.
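A sketch of what that shaped reward looks like in code (an assumed shape, for illustration only; the real logic lives in src/training/trainer.ts):
// XOR of all pile sizes
function nimSum(piles: number[]): number {
  return piles.reduce((acc, p) => acc ^ p, 0);
}
// `loserIsMover` is null for non-terminal moves; otherwise it tells whether the
// player who just moved took the last stone (and therefore lost).
function shapedReward(nextPiles: number[], loserIsMover: boolean | null): number {
  if (loserIsMover !== null) return loserIsMover ? -1 : 1; // terminal: win/loss
  return nimSum(nextPiles) === 0 ? 0.5 : 0;                // intermediate: Nim-sum bonus
}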
nim-q-learning/
├── src/
│ ├── game/
│ │ └── NimGame.ts # Game state, rules, and Nim-sum calculation
│ ├── agent/
│ │ └── QLearningAgent.ts # Q-learning + file I/O
│ ├── training/
│ │ └── trainer.ts # Self-play with reward shaping
│ └── demo/
│ └── play.ts # Human vs Agent gameplay
├── models/
│ ├── agent1.json # Saved trained models
│ └── agent2.json
├── CLAUDE.md # Architecture documentation
├── README.md
├── package.json
└── tsconfig.json
- Node.js (v16 or higher)
- npm or yarn
# Install dependencies
npm install
# Build the project
npm run build
Train the agent through self-play with reward shaping:
npm run train
Training Output:
╔════════════════════════════════════════════════════════════╗
║ Nim Misère Q-Learning Training System ║
╚════════════════════════════════════════════════════════════╝
🤖 Initializing agents...
✅ Agents created with parameters:
Learning rate (α): 0.05
Discount factor (γ): 0.99
Initial exploration (ε): 1.0
Epsilon decay: 0.99995
Min epsilon: 0.05
🎯 Reward Shaping: ENABLED (Nim-sum based)
🎮 Starting Training...
📊 Progress Report - Game 500000/5000000
────────────────────────────────────────────────────────────
⏱️ Time: 125.3s | Speed: 3990 games/s
🏆 Agent 1 wins: 251234 (50.2%)
🏆 Agent 2 wins: 248766 (49.8%)
📈 Avg game length: 8.1 moves
🎲 Epsilon: Agent1=0.0821 | Agent2=0.0821
🗂️ Q-table: Agent1=2847 states | Agent2=2891 states
🔄 Updates: Agent1=4052341 | Agent2=4048123
Training Configuration:
- Default: 5,000,000 games (adjustable in src/training/trainer.ts)
- Progress reports: Every 500,000 games
- Training time: ~2-3 minutes for 5M games (on Apple M3)
- Alternating starts: Agents swap starting position each game for balanced learning
- Auto-save: Models saved to models/agent1.json and models/agent2.json
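Conceptually, the outer training loop looks roughly like the sketch below (illustrative only; playOneSelfPlayGame is a hypothetical stand-in for a single training game, and the real loop in src/training/trainer.ts also prints the progress reports shown above):
import { QLearningAgent } from './src/agent/QLearningAgent';
declare function playOneSelfPlayGame(a: QLearningAgent, b: QLearningAgent, firstPlayer: 0 | 1): void;
const numGames = 5_000_000;
const reportInterval = 500_000;
const agent1 = new QLearningAgent();
const agent2 = new QLearningAgent();
for (let i = 0; i < numGames; i++) {
  playOneSelfPlayGame(agent1, agent2, (i % 2) as 0 | 1); // swap the starting player each game
  if ((i + 1) % reportInterval === 0) console.log(`Progress: ${i + 1}/${numGames} games`);
}
agent1.saveToFile('models/agent1.json'); // auto-save after training
agent2.saveToFile('models/agent2.json');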
Edit src/training/trainer.ts:
const numGames = 5000000; // Try 100000 (quick), 1000000 (standard), or 10000000 (expert)
const reportInterval = 500000; // Adjust based on numGames
Play against the trained agent:
npm run play
Demo Mode Features:
- Load pre-trained models from file
- Quick training mode (30,000 games)
- Interactive console interface with visual pile representation
- Choose who starts each game (you or agent)
- Agent shows its Q-values for transparency
- Automatic winner detection (no manual input needed)
- Score tracking across multiple games
Agent Setup Options:
🤖 Agent Setup Options:
1. Load pre-trained agent from file (default)
2. Train a new agent (quick training)
3. Use untrained agent (random play)
Choose option (1/2/3, default: 1): 1
📂 Attempting to load from: models/agent1.json
✅ Q-table loaded from: models/agent1.json
States: 2847
Q-values: 18234
Epsilon: 0.0500
Example Gameplay:
────────────────────────────────────────────────────────────
Who should start this game? (you/agent, default: you): you
👤 You will start this game.
🎮 Starting new game!
You are Player 0, Agent is Player 1
═══════════════════════════════════════════════════════════
🎮 Nim Misère - Remember: Taking the LAST stone LOSES!
═══════════════════════════════════════════════════════════
Current piles:
Pile 0: [ 1] ●
Pile 1: [ 3] ●●●
Pile 2: [ 5] ●●●●●
Pile 3: [ 7] ●●●●●●●
Total stones remaining: 16
────────────────────────────────────────────────────────────
👤 Your turn!
Choose pile (0-3): 1
How many stones? (1-3): 2
✅ You removed 2 stone(s) from pile 1
🤖 Agent is thinking...
🎯 Agent chose: Remove 1 stone(s) from pile 2
(Q-value: 0.8734)
Models are automatically saved after training to models/ directory:
- agent1.json - First agent's Q-table
- agent2.json - Second agent's Q-table
The demo automatically loads agent1.json when you choose option 1.
import { QLearningAgent } from './src/agent/QLearningAgent';
// Save a trained agent
const agent = new QLearningAgent();
// ... train the agent ...
agent.saveToFile('models/my-expert-agent.json');
// Load from file
const loadedAgent = QLearningAgent.fromFile('models/my-expert-agent.json');
// Or load into existing agent
agent.loadFromFile('models/my-expert-agent.json');
The core of the Nim-sum reward shaping (in src/training/trainer.ts):
if (nextNimSum === 0) {
reward = 0.5; // Good move! Left opponent in losing position
} else {
reward = 0; // Suboptimal move
}
Example:
Before: [1,3,5,7] → Nim-sum = 0
Move: Take 2 from pile 1
After: [1,1,5,7] → Nim-sum = 2
Reward = 0 (from a Nim-sum 0 position, every move hands the opponent the advantage)
Opponent's reply: Take 2 from pile 3
After: [1,1,5,5] → Nim-sum = 0
Opponent's reward = 0.5 (it left you back in a losing position)
Agent learns through these immediate rewards!
For the misère endgame, the shaping instead counts single-stone piles:
// Endgame adjustment: once every remaining pile has at most one stone,
// leaving the opponent an ODD number of 1-piles forces them to take the last stone.
const onesCount = piles.filter(p => p === 1).length;
if (onesCount % 2 === 1) { // Odd number of 1s left for the opponent
  reward = 0.5; // Good position for Misère
} else {
  reward = 0;
}
The mathematically optimal strategy uses XOR (Nim-sum):
Nim-sum = pile[0] ⊕ pile[1] ⊕ pile[2] ⊕ pile[3]
Example: [1,3,5,7]
Binary:   0001
          0011
          0101
        ⊕ 0111
        ------
          0000 = 0
Nim-sum = 0 → Losing position for current player
Strategy:
- If Nim-sum = 0: You're in a losing position (with perfect opponent)
- If Nim-sum ≠ 0: Make a move that leaves Nim-sum = 0 for opponent
The agent discovers this strategy through reward shaping!
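For comparison, here is what the explicit (normal-play) strategy looks like as a standalone sketch; the project itself only exposes the Nim-sum via NimGame.getNimSum() and lets the agent discover the rest, and misère play deviates from this only in the all-ones endgame covered above:
function nimSum(piles: number[]): number {
  return piles.reduce((acc, p) => acc ^ p, 0); // XOR of all pile sizes
}
// Find a move that leaves the opponent with Nim-sum = 0, if one exists.
function findZeroingMove(piles: number[]): { pile: number; remove: number } | null {
  const s = nimSum(piles);
  if (s === 0) return null;            // already a losing position: no zeroing move exists
  for (let i = 0; i < piles.length; i++) {
    const target = piles[i] ^ s;       // pile i would need to shrink to this size
    if (target < piles[i]) return { pile: i, remove: piles[i] - target };
  }
  return null;
}
console.log(findZeroingMove([1, 1, 5, 7])); // → { pile: 3, remove: 2 }, leaving [1,1,5,5] with Nim-sum 0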
Edit src/training/trainer.ts:
// Training parameters optimized for reward shaping
const alpha = 0.05; // Learning rate: Try 0.01 (slower) or 0.1 (faster)
const gamma = 0.99; // Discount: Try 0.95 (short-term) or 0.999 (long-term)
const epsilon = 1.0; // Initial exploration
const epsilonDecay = 0.99995; // Try 0.9999 (faster) or 0.99999 (slower)
const epsilonMin = 0.05; // Try 0.01 (less) or 0.1 (more exploration)
Edit src/game/NimGame.ts:
// Change initial piles
constructor(piles: number[] = [1, 3, 5, 7]) { // Try [2,4,6,8] or [1,2,3]
// ...
}
Inspect the game state from the command line:
npx ts-node -e "
import { NimGame } from './src/game/NimGame';
const game = new NimGame();
console.log(game.toString());
console.log('Nim-sum:', game.getNimSum());
console.log('Legal moves:', game.getLegalMoves().length);
"npx ts-node -e "
import { QLearningAgent } from './src/agent/QLearningAgent';
import { NimGame } from './src/game/NimGame';
const agent = QLearningAgent.fromFile('models/agent1.json');
const game = new NimGame();
const move = agent.getBestAction(game);
console.log('Agent recommends:', move);
"- Q-Learning Tutorial
- Reward Shaping in RL
- Nim Game Strategy
- Reinforcement Learning: An Introduction by Sutton & Barto
License: MIT