🎯 SuPeReInFoRzE | Turbo Edition


High-Order Intelligence & Strategic Dominance in Super Tic-Tac-Toe

A mathematically rigorous implementation of advanced game-playing AI agents

🚀 Features • 📐 Mathematical Foundation • ⚡ Quick Start • 🎮 Usage • 🧠 Architecture


🌟 Pure Mathematical Resonance

Where Clifford Attractors meet Monte Carlo Tree Search

x(n+1) = sin(a·y(n)) + c·cos(a·x(n))
y(n+1) = sin(b·x(n)) + d·cos(b·y(n))

The attractor visualizes the chaotic beauty underlying strategic gameplay


📖 Overview

SuPeReInFoRzE is a cutting-edge reinforcement learning framework for mastering Super Tic-Tac-Toe (also known as Ultimate Tic-Tac-Toe or Meta Tic-Tac-Toe). The system combines multiple AI paradigms to create agents capable of human-expert level play through:

  • ⚡ Temporal Difference Learning (Q-Learning with Experience Replay)
  • 🌲 Monte Carlo Tree Search (MCTS with UCB1 exploration)
  • 🎯 Minimax with Alpha-Beta Pruning (depth-limited adversarial search)
  • 🧮 Advanced Heuristic Evaluation (positional value estimation)

🎲 What is Super Tic-Tac-Toe?

Super Tic-Tac-Toe is a complex variant where:

  • The board consists of 9 smaller 3×3 tic-tac-toe boards arranged in a 3×3 meta-grid
  • Winning a small board claims it on the meta-board
  • Constraint: Your move determines which board your opponent must play in next
  • Victory requires winning three small boards in a row on the meta-board

State Space Complexity: approximately 3^81 ≈ 10^38 cell configurations (an upper bound; far fewer positions are legally reachable)
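The 3^81 figure is easy to sanity-check (it is an upper bound only, since it counts every assignment of X/O/empty to the 81 cells regardless of legality):

```python
# Upper bound: each of the 81 cells is empty, X, or O.
upper_bound = 3 ** 81
print(f"{upper_bound:.3e}")  # on the order of 10^38
```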


🚀 Features

🧠 Multi-Level Intelligence System

| Feature | Description | Mathematical Basis |
| --- | --- | --- |
| Q-Learning | Temporal difference value estimation | Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',a') - Q(s,a)] |
| MCTS Engine | Monte Carlo Tree Search with playouts | UCB1 = Q̄ + C_p·√(ln N / n) |
| Minimax Search | Adversarial game tree exploration | V(s) = max{min{V(children)}} |
| Experience Replay | 50k-transition memory buffer | Breaks temporal correlation |
| ε-Greedy Policy | Exploration-exploitation balance | ε(t) = max(ε_min, ε_0·λ^t) |

🎨 Visualization & Interface

  • Clifford Attractor Opening: Mathematical aesthetics with time-variant parameters
  • Real-time Board Rendering: Matplotlib-powered visualization with meta-board overlays
  • Training Analytics: Live performance metrics, epsilon decay curves, Q-table growth
  • Interactive Gameplay: Human vs AI with move validation and strategic hints

💾 Model Persistence

  • Ultra-Compact Serialization: Optimized Q-table storage (4-decimal precision)
  • ZIP Archive Format: Complete agent state, configuration, and training history
  • Cross-Session Compatibility: Load pre-trained agents and continue training

📐 Mathematical Foundation

1️⃣ Reinforcement Learning Core

The agent learns through Q-Learning, a model-free temporal difference algorithm:

Bellman Optimality Equation

Q*(s, a) = E[r + γ · max Q*(s', a') | s, a]
                     a'

Where:

  • Q*(s, a): Optimal action-value function
  • r: Immediate reward
  • γ ∈ [0, 1]: Discount factor (default: 0.95)
  • s': Next state after action a

Update Rule

Q(s, a) ← Q(s, a) + α · δ

δ = [r + γ · max Q(s', a')] - Q(s, a)
             a'

Where:

  • α ∈ (0, 1]: Learning rate (default: 0.1)
  • δ: Temporal difference error
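The update rule can be sketched in a few lines of tabular Python (a minimal illustration; the function and variable names here are ours, not necessarily those used in the repository):

```python
def q_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step: Q(s,a) += alpha * (TD error)."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions_next), default=0.0)
    delta = r + gamma * best_next - Q.get((s, a), 0.0)  # temporal difference error
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta
    return delta

Q = {}
delta = q_update(Q, "s0", 0, r=10, s_next="s1", actions_next=[0, 1])
# With an empty table: delta = 10 and Q[("s0", 0)] = 0.1 * 10 = 1.0
```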

Exploration Strategy

ε-Greedy with Exponential Decay:

ε(t) = max(ε_min, ε_0 · λ^t)

π(a|s) = {
  1 - ε + ε/|A|  if a = argmax Q(s, a)
  ε/|A|          otherwise
}

Parameters:

  • ε_0 = 1.0: Initial exploration rate
  • λ = 0.9995: Decay factor
  • ε_min = 0.05: Minimum exploration
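With these defaults the policy and its decay schedule look roughly like this (a sketch; the repository's own function names may differ):

```python
import random

def select_action(Q, s, actions, eps):
    """Epsilon-greedy: explore with probability eps, else act greedily."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def decay(eps, lam=0.9995, eps_min=0.05):
    """One multiplicative decay step: eps <- max(eps_min, eps * lam)."""
    return max(eps_min, eps * lam)

eps = 1.0
for _ in range(1000):
    eps = decay(eps)
# After 1000 episodes eps = 0.9995**1000, roughly 0.61
```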

2️⃣ Monte Carlo Tree Search (MCTS)

MCTS performs asymmetric tree expansion guided by the UCB1 selection policy.

Four Phases

Selection: Choose child with highest UCB1 score

UCB1(i) = Q̄_i + C_p · √(ln N_parent / N_i)

Where:

  • Q̄_i: Average reward of node i
  • N_i: Visit count of node i
  • C_p: Exploration constant (default: 1.414)

Expansion: Add new child node for unexplored action

Simulation: Random playout until terminal state

Backpropagation: Update visit counts and rewards

For each node in path:
  N ← N + 1
  W ← W + Δ

Where Δ = +1 for win, -1 for loss, 0 for draw.
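The selection phase above reduces to maximizing UCB1 over the children, with unvisited children tried first. A minimal sketch (the `(W, N)` child statistics are illustrative, not taken from the repository):

```python
import math

def ucb1(total_reward, visits, parent_visits, c=1.414):
    """UCB1 score; unvisited nodes get +inf so each child is expanded once."""
    if visits == 0:
        return math.inf
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Selection picks the child maximizing UCB1:
children = [(5.0, 10), (3.0, 4), (0.0, 0)]  # (W, N) pairs
parent_n = 14
best = max(range(len(children)), key=lambda i: ucb1(*children[i], parent_n))
# The unvisited child (index 2) is selected first
```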


3️⃣ Minimax with Alpha-Beta Pruning

Minimax computes the optimal move assuming perfect opponent play.

Minimax Value Function

V(s) = {
  Utility(s)           if s is terminal
  max V(child)         if player = MAX
  min V(child)         if player = MIN
}

Alpha-Beta Optimization

Prunes subtrees that cannot affect the final decision:

α: Best value for MAX along path
β: Best value for MIN along path

Prune if β ≤ α

Complexity Reduction:

  • Without pruning: O(b^d) nodes
  • With optimal pruning: O(b^(d/2)) nodes

Where b = branching factor, d = depth.
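The value function and pruning rule combine into one short recursion. Here is a self-contained sketch on a toy game tree (nested lists with integer utilities at the leaves), not the repository's board-specific search:

```python
import math

def alphabeta(node, alpha=-math.inf, beta=math.inf, maximizing=True):
    """Minimax with alpha-beta pruning on a toy tree (leaves are utilities)."""
    if isinstance(node, (int, float)):  # terminal: return Utility(s)
        return node
    value = -math.inf if maximizing else math.inf
    for child in node:
        v = alphabeta(child, alpha, beta, not maximizing)
        if maximizing:
            value = max(value, v)
            alpha = max(alpha, value)
        else:
            value = min(value, v)
            beta = min(beta, value)
        if beta <= alpha:  # prune: this subtree cannot affect the root decision
            break
    return value

# Textbook example: MAX to move over three MIN nodes; root value is 3.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
root_value = alphabeta(tree)  # → 3
```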


4️⃣ Heuristic Evaluation Function

For non-terminal states, we estimate position value:

H(s) = Σ w_i · f_i(s)
       i

Feature Weights:

| Feature | Weight | Description |
| --- | --- | --- |
| Meta 2-in-row | 500 | Near-winning meta position |
| Meta 1-piece | 100 | Potential meta line |
| Center board control | 200 | Strategic board dominance |
| Corner boards | 100 | Strong positional value |
| Small board 2-in-row | 10 | Local tactical threat |
| Small board 1-piece | 2 | Positional flexibility |

Total Evaluation:

Score(s, player) = H_player(s) - H_opponent(s)
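The weighted sum and the player-vs-opponent difference can be sketched directly from the table above (the feature extractor itself is hypothetical; only the weights come from the table):

```python
# Weights mirror the feature table; the feature counts are assumed inputs.
WEIGHTS = {
    "meta_two_in_row": 500,
    "meta_one_piece": 100,
    "center_board": 200,
    "corner_boards": 100,
    "small_two_in_row": 10,
    "small_one_piece": 2,
}

def heuristic(features):
    """H(s) = sum_i w_i * f_i(s) for one player's feature counts."""
    return sum(WEIGHTS[name] * count for name, count in features.items())

def score(player_feats, opponent_feats):
    """Score(s, player) = H_player(s) - H_opponent(s)."""
    return heuristic(player_feats) - heuristic(opponent_feats)

# Example: one meta 2-in-row plus center control vs. one small-board threat
s = score({"meta_two_in_row": 1, "center_board": 1}, {"small_two_in_row": 1})
# 500 + 200 - 10 = 690
```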

5️⃣ Experience Replay Buffer

Stores transitions (s, a, r, s') in a circular buffer (max 50k):

D = {(s_t, a_t, r_t, s_{t+1})}

Sample mini-batch ~ D for training

Benefits:

  • Breaks temporal correlations
  • Improves sample efficiency
  • Stabilizes learning
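A circular buffer with uniform sampling is a few lines of standard-library Python (a sketch under the stated 50k capacity; class and method names are ours):

```python
import random
from collections import deque

class ReplayBuffer:
    """Circular experience buffer; the oldest transitions are evicted past maxlen."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Uniform mini-batch; sampling breaks temporal correlations."""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push(f"s{i}", i, 0.0, f"s{i+1}")
# Only the 3 most recent transitions remain
```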

6️⃣ Reward Structure

R(s, a, s') = {
  +1000    if won game
  +10      if won small board
  0        if ongoing/draw small board
  -50      if lost game (outcome reward)
  -100     if invalid move
}

Discounted Return:

       ∞
G_t = Σ  γ^k · r_{t+k+1}
      k=0
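The discounted return is most easily computed right to left, folding the sum into a single backward pass (illustrative code, not the repository's):

```python
def discounted_return(rewards, gamma=0.95):
    """G_t = sum_k gamma^k * r_{t+k+1}, computed right-to-left."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A win after three quiet moves: rewards [0, 0, 0, 1000]
g = discounted_return([0, 0, 0, 1000])
# 0.95**3 * 1000 = 857.375
```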

⚡ Quick Start

Installation

# Clone repository
git clone https://github.com/Devanik21/SuPeReInFoRzE.git
cd SuPeReInFoRzE

# Install dependencies
pip install streamlit numpy matplotlib pandas scipy

# Launch application
streamlit run SuPeReInFoRzE.py

Requirements

streamlit >= 1.28.0
numpy >= 1.24.0
matplotlib >= 3.7.0
pandas >= 2.0.0
scipy >= 1.11.0

🎮 Usage

1️⃣ Training Agents

Basic Training:

# Configure in sidebar
Learning Rate (α): 0.1
Discount Factor (γ): 0.95
Episodes: 5000
Update Frequency: 100

# Start Training → Agents learn through self-play

Advanced Configuration:

# MCTS Settings
Simulations: 100
Exploration Constant: 1.414

# Minimax Depth
Agent 1: 2-4 (balanced)
Agent 2: 2-4 (matched)

2️⃣ Model Persistence

Save Trained Agents:

# Exports .zip containing:
# - agent1.json (Q-table, config, stats)
# - agent2.json (opponent data)
# - config.json (hyperparameters)
# - training_stats.json (performance history)

Load Pre-trained Models:

# Upload .zip file
# System automatically deserializes Q-tables
# Training history restored for visualization

3️⃣ Play Modes

AI vs AI Battle:

  • Watch trained agents compete
  • Visualize strategic decision-making
  • Board updates every 0.8 seconds

Human vs AI:

  • Select opponent (Agent 1 or 2)
  • Choose first move
  • System validates moves and enforces rules
  • Highlights active board constraints

🧠 Architecture

System Diagram

┌─────────────────────────────────────────────────┐
│          SuPeReInFoRzE System                   │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌──────────────┐         ┌─────────────────┐   │
│  │   UI Layer   │◄────────┤   Streamlit     │   │
│  │  (Aesthetic) │         │   Interface     │   │
│  └──────┬───────┘         └─────────────────┘   │
│         │                                       │
│  ┌──────▼─────────────────────────────────┐     │
│  │      Training Controller               │     │
│  │  • Self-play orchestration             │     │
│  │  • Experience management               │     │
│  │  • Statistics tracking                 │     │
│  └──────┬─────────────────────────────────┘     │
│         │                                       │
│  ┌──────▼─────────┐      ┌──────────────┐       │
│  │   Agent Core   │      │  Environment │       │
│  ├────────────────┤      ├──────────────┤       │
│  │ • Q-Learning   │◄────►│ Super TTT    │       │
│  │ • MCTS Engine  │      │ Rules Engine │       │
│  │ • Minimax      │      │ State Manager│       │
│  │ • Heuristics   │      └──────────────┘       │
│  └──────┬─────────┘                             │
│         │                                       │
│  ┌──────▼─────────────────────────────────┐     │
│  │      Persistence Layer                 │     │
│  │  • Q-table serialization               │     │
│  │  • ZIP compression                     │     │
│  │  • State reconstruction                │     │
│  └────────────────────────────────────────┘     │
│                                                 │
└─────────────────────────────────────────────────┘

Component Interaction

Training Loop:
  ┌───────┐
  │ Reset │
  └───┬───┘
      │
  ┌───▼────────────────────────────┐
  │ Agent 1 selects action         │
  │ (ε-greedy / MCTS / Minimax)    │
  └───┬────────────────────────────┘
      │
  ┌───▼────────────────────────────┐
  │ Environment executes move      │
  │ Updates state, checks win      │
  └───┬────────────────────────────┘
      │
  ┌───▼────────────────────────────┐
  │ Q-value update                 │
  │ Q(s,a) ← Q(s,a) + α·δ          │
  └───┬────────────────────────────┘
      │
  ┌───▼────────────────────────────┐
  │ Agent 2 turn (if not terminal) │
  └───┬────────────────────────────┘
      │
  ┌───▼────────────────────────────┐
  │ Outcome rewards distributed    │
  │ Statistics updated             │
  └───┬────────────────────────────┘
      │
      ├─► Continue if not done
      │
  ┌───▼────────────────────────────┐
  │ Epsilon decay                  │
  │ ε ← max(ε_min, ε · λ)          │
  └────────────────────────────────┘

📊 Performance Metrics

Training Analytics

The system tracks and visualizes:

Win Rate Evolution:

WR_i(t) = Wins_i(t) / Total_Games(t)

Epsilon Decay Curve:

ε(t) = max(ε_min, ε_0 · λ^t)

Q-Table Growth:

|Q(t)| = # unique (state, action) pairs

Sample Convergence (5000 episodes):

  • Agent 1 Win Rate: ~45%
  • Agent 2 Win Rate: ~42%
  • Draw Rate: ~13%
  • Final ε: ~0.05
  • Q-table Size: 50k-200k states

🔬 Advanced Features

Clifford Attractor Visualization

Opening decoration uses a strange attractor:

x_{n+1} = sin(a·y_n) + c·cos(a·x_n)
y_{n+1} = sin(b·x_n) + d·cos(b·y_n)

Parameters:
a = -1.4 + 0.1·sin(t_minute)
b = 1.6 + 0.1·cos(t_minute)
c = 1.0
d = 0.7

Properties:

  • Chaotic dynamics (sensitive to initial conditions)
  • Bounded attractors (compact visualization)
  • Time-variant parameters (subtle uniqueness per session)
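Iterating the map is enough to see the boundedness: since |sin| ≤ 1 and |cos| ≤ 1, every orbit satisfies |x| ≤ 1 + |c| and |y| ≤ 1 + |d|. A standalone sketch with a fixed starting point (the repository feeds the points to Matplotlib instead of asserting on them):

```python
import math

def clifford_points(n, a=-1.4, b=1.6, c=1.0, d=0.7):
    """Iterate the Clifford map and collect (x, y) points for plotting."""
    x, y = 0.1, 0.1
    pts = []
    for _ in range(n):
        x, y = (math.sin(a * y) + c * math.cos(a * x),
                math.sin(b * x) + d * math.cos(b * y))
        pts.append((x, y))
    return pts

pts = clifford_points(10_000)
# The orbit stays bounded: |x| <= 1 + |c| = 2.0, |y| <= 1 + |d| = 1.7
assert all(abs(x) <= 2.0 and abs(y) <= 1.7 + 1e-9 for x, y in pts)
```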

State Representation

Hashable State Tuple:

state = (
  small_boards_flat,  # 9 × (3×3) = 81 cells
  meta_board,         # 9 meta-positions
  active_board        # Constraint (0-8 or None)
)

Compression: states are flattened to compact strings for serialization efficiency.
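The key property is that the tuple is hashable, so it can index the Q-table directly. A minimal sketch of such an encoder (function name and cell encoding are illustrative):

```python
def encode_state(small_boards, meta_board, active_board):
    """Pack a position into a hashable tuple usable as a Q-table key."""
    flat = tuple(cell for board in small_boards for cell in board)  # 81 cells
    return (flat, tuple(meta_board), active_board)

# Empty position: 9 boards of 9 empty cells, empty meta-board, no constraint
empty = [[0] * 9 for _ in range(9)]
s = encode_state(empty, [0] * 9, None)
q_table = {s: 0.0}  # hashable, so it works directly as a dict key
```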


🎓 Theoretical Guarantees

Convergence Properties

Q-Learning Convergence (Watkins & Dayan, 1992):

Under conditions:

  1. All state-action pairs visited infinitely often
  2. Learning rate satisfies Robbins-Monro: Σα = ∞, Σα² < ∞
  3. Rewards are bounded

Then: Q(s,a) → Q*(s,a) as t → ∞

MCTS Convergence (Kocsis & Szepesvári, 2006):

As simulations n → ∞:

  • Selected action converges to minimax optimal
  • Bias of the root value estimate decays as O(log n / n)

🏆 Strategic Insights

Key Tactical Principles

  1. Center Board Dominance (+200 heuristic value)

    • Controls meta-board center
    • Maximum flexibility for forcing moves
  2. Corner Board Priority (+100 heuristic value)

    • Four paths to victory
    • Defensive stability
  3. Active Board Control (+50 tactical bonus)

    • Forcing opponent into disadvantageous positions
    • Limits opponent's strategic options
  4. Two-in-a-Row Meta Threats (+500 evaluation)

    • Immediate winning potential
    • Forces opponent defensive response

🐛 Known Limitations

  • State Space: Cannot fully explore 10^38 positions
  • MCTS Depth: Limited by computational budget
  • Perfect Play: Not guaranteed (heuristic-based)
  • Draw Bias: Agents may prefer forcing draws over risky wins

🔮 Future Enhancements

Roadmap

  • Deep Q-Networks (DQN): Neural network value approximation
  • AlphaZero-style: Self-play + neural MCTS
  • Parallel MCTS: Multi-threaded simulations
  • Opening Book: Database of optimal early-game sequences
  • Endgame Tablebase: Perfect play in reduced positions
  • Multi-Agent Training: Population-based evolution

Research Directions

Objective: Achieve superhuman performance

Approach:
  1. Neural state evaluation (CNN/Transformer)
  2. Policy network for action priors
  3. Self-play curriculum learning
  4. Meta-learning for strategy adaptation

Target: Win rate > 90% vs. human experts

📚 References

Foundational Papers

  1. Q-Learning: Watkins, C.J.C.H. (1989). Learning from Delayed Rewards; Watkins, C.J.C.H. & Dayan, P. (1992). Q-learning
  2. MCTS: Kocsis, L. & Szepesvári, C. (2006). Bandit Based Monte-Carlo Planning; Browne, C. et al. (2012). A Survey of Monte Carlo Tree Search Methods
  3. Alpha-Beta Pruning: Knuth, D.E. & Moore, R.W. (1975). An Analysis of Alpha-Beta Pruning
  4. Deep RL: Mnih, V. et al. (2015). Human-level control through deep reinforcement learning

Game Theory

  • Berlekamp, E.R., Conway, J.H., Guy, R.K. (2001). Winning Ways for Your Mathematical Plays
  • Silver, D. et al. (2017). Mastering the game of Go without human knowledge (AlphaGo Zero)

🤝 Contributing

Contributions welcome! Areas of interest:

  • Neural network integration
  • Performance optimization
  • Advanced evaluation functions
  • Tournament play features
  • Strategy analysis tools

👨‍💻 Author

Devanik Jain
AI Researcher & Mathematical Game Theory Enthusiast

GitHub LinkedIn Twitter


📜 License

MIT License - See LICENSE for details


🙏 Acknowledgments

  • Streamlit Team: For the exceptional Python app framework
  • NumPy Community: Powering high-performance numerical computing
  • RL Pioneers: Sutton, Barto, Silver, and the DeepMind team
  • Game Theory Giants: Von Neumann, Nash, Zermelo

⚡ Built with Mathematical Precision

"In the game of tic-tac-toe, the only winning move is not to play...
unless you're a reinforcement learning agent with 200k Q-states."

SuPeReInFoRzE | 2026 Edition

Where Mathematics Meets Strategy


📈 Quick Stats

Training Capacity: 100k+ episodes
State Space: ~10^38 theoretical positions
Q-Table Growth: 50k-200k states (typical)
MCTS Simulations: 0-1000 per move
Minimax Depth: 1-12 ply
Decision Time: <1s per move (optimized)

🎯 Getting Started Checklist

  • Install Python 3.8+
  • Clone repository
  • Install dependencies (pip install -r requirements.txt)
  • Launch app (streamlit run SuPeReInFoRzE.py)
  • Configure training parameters
  • Train agents (start with 5k episodes)
  • Save trained models
  • Challenge AI to a game!

Happy Learning! May your Q-values converge and your strategies dominate! 🎮🧠⚡
