High-Order Intelligence & Strategic Dominance in Super Tic-Tac-Toe
A mathematically rigorous implementation of advanced game-playing AI agents
Features • Mathematical Foundation • Quick Start • Usage • Architecture
Where Clifford Attractors meet Monte Carlo Tree Search
x(n+1) = sin(a·y(n)) + c·cos(a·x(n))
y(n+1) = sin(b·x(n)) + d·cos(b·y(n))
The attractor visualizes the chaotic beauty underlying strategic gameplay
SuPeReInFoRzE is a cutting-edge reinforcement learning framework for mastering Super Tic-Tac-Toe (also known as Ultimate Tic-Tac-Toe or Meta Tic-Tac-Toe). The system combines multiple AI paradigms to create agents capable of human-expert level play through:
- Temporal Difference Learning (Q-Learning with Experience Replay)
- Monte Carlo Tree Search (MCTS with UCB1 exploration)
- Minimax with Alpha-Beta Pruning (depth-limited adversarial search)
- Advanced Heuristic Evaluation (positional value estimation)
Super Tic-Tac-Toe is a complex variant where:
- The board consists of 9 smaller 3×3 tic-tac-toe boards arranged in a 3×3 meta-grid
- Winning a small board claims it on the meta-board
- Constraint: Your move determines which board your opponent must play in next
- Victory requires winning three small boards in a row on the meta-board
State Space Complexity: at most 3^81 ≈ 4.4 × 10^38 positions (before legality constraints)
| Feature | Description | Mathematical Basis |
|---|---|---|
| Q-Learning | Temporal difference value estimation | Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',a') − Q(s,a)] |
| MCTS Engine | Monte Carlo Tree Search with playouts | UCB1 = Q̄ + C_p·√(ln N / n) |
| Minimax Search | Adversarial game tree exploration | V(s) = max{min{V(children)}} |
| Experience Replay | 50k-transition memory buffer | Breaks temporal correlation |
| ε-Greedy Policy | Exploration–exploitation balance | ε(t) = max(ε_min, ε_0·λ^t) |
- Clifford Attractor Opening: Mathematical aesthetics with time-variant parameters
- Real-time Board Rendering: Matplotlib-powered visualization with meta-board overlays
- Training Analytics: Live performance metrics, epsilon decay curves, Q-table growth
- Interactive Gameplay: Human vs AI with move validation and strategic hints
- Ultra-Compact Serialization: Optimized Q-table storage (4-decimal precision)
- ZIP Archive Format: Complete agent state, configuration, and training history
- Cross-Session Compatibility: Load pre-trained agents and continue training
The agent learns through Q-Learning, a model-free temporal difference algorithm:
Q*(s, a) = E[r + γ · max_{a'} Q*(s', a') | s, a]
Where:
- Q*(s, a): Optimal action-value function
- r: Immediate reward
- γ ∈ [0, 1]: Discount factor (default: 0.95)
- s': Next state after action a
Q(s, a) ← Q(s, a) + α · δ
δ = [r + γ · max_{a'} Q(s', a')] − Q(s, a)
Where:
- α ∈ (0, 1]: Learning rate (default: 0.1)
- δ: Temporal difference error
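As a minimal sketch of the tabular update above (function and variable names here are illustrative, not the repository's actual API):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95  # learning rate and discount factor from the text

# Q-table maps (state, action) -> value; unseen pairs default to 0.0
Q = defaultdict(float)

def q_update(state, action, reward, next_state, next_actions):
    """One tabular Q-learning step: Q(s,a) <- Q(s,a) + alpha * delta."""
    # max over a' of Q(s', a'); 0.0 for terminal states (no legal actions)
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    delta = reward + GAMMA * best_next - Q[(state, action)]  # TD error
    Q[(state, action)] += ALPHA * delta
    return delta
```

With α = 0.1, a single update moves Q(s,a) one tenth of the way toward the bootstrapped target.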
ε-Greedy with Exponential Decay:
ε(t) = max(ε_min, ε_0 · λ^t)
π(a|s) = {
  1 − ε + ε/|A|   if a = argmax_{a'} Q(s, a')
  ε/|A|           otherwise
}
Parameters:
- ε_0 = 1.0: Initial exploration rate
- λ = 0.9995: Decay factor
- ε_min = 0.05: Minimum exploration
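A minimal sketch of this schedule and the resulting action selection (the `select_action` signature is hypothetical):

```python
import random

EPS_0, LAMBDA, EPS_MIN = 1.0, 0.9995, 0.05  # schedule parameters from the text

def epsilon(t):
    """Exponentially decayed exploration rate: eps(t) = max(eps_min, eps_0 * lambda^t)."""
    return max(EPS_MIN, EPS_0 * LAMBDA ** t)

def select_action(q_values, actions, t, rng=random):
    """Epsilon-greedy: explore uniformly with probability eps(t), else act greedily."""
    if rng.random() < epsilon(t):
        return rng.choice(actions)                            # explore
    return max(actions, key=lambda a: q_values.get(a, 0.0))   # exploit
```

At t = 0 every move explores; after a few thousand episodes the rate settles at the 0.05 floor.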
MCTS performs asymmetric tree expansion guided by the UCB1 selection policy.
Selection: Choose child with highest UCB1 score
UCB1(i) = Q̄_i + C_p · √(ln N_parent / N_i)
Where:
- Q̄_i: Average reward of node i
- N_i: Visit count of node i
- C_p: Exploration constant (default: 1.414 ≈ √2)
Expansion: Add new child node for unexplored action
Simulation: Random playout until terminal state
Backpropagation: Update visit counts and rewards
For each node in the path:
N ← N + 1
W ← W + Δ
Where Δ = +1 for a win, −1 for a loss, 0 for a draw.
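The selection step can be sketched as follows; nodes are reduced to (W, N) pairs, and the parent visit count is approximated by the sum of child visits (a simplifying assumption of this sketch):

```python
import math

def ucb1(total_reward, visits, parent_visits, c_p=1.414):
    """UCB1 score: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf   # unvisited children are expanded first
    return total_reward / visits + c_p * math.sqrt(math.log(parent_visits) / visits)

def select_child(children):
    """children: list of (total reward W, visit count N) pairs for one parent.
    Returns the index of the child with the highest UCB1 score."""
    parent_n = sum(n for _, n in children)
    return max(range(len(children)),
               key=lambda i: ucb1(children[i][0], children[i][1], parent_n))
```

Unvisited children score infinity, so every action is tried at least once before the bonus term starts trading off exploitation against exploration.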
Minimax computes the optimal move assuming perfect opponent play.
V(s) = {
  Utility(s)      if s is terminal
  max V(child)    if player = MAX
  min V(child)    if player = MIN
}
Prunes subtrees that cannot affect the final decision:
α: Best value found so far for MAX along the path
β: Best value found so far for MIN along the path
Prune if β ≤ α
Complexity Reduction:
- Without pruning: O(b^d) nodes
- With optimal move ordering: O(b^(d/2)) nodes
Where b = branching factor, d = search depth.
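A minimal sketch of depth-limited alpha-beta; the `children` and `utility` callbacks are hypothetical interfaces supplied by the caller, not the repository's actual API:

```python
import math

def alphabeta(state, depth, alpha, beta, maximizing, children, utility):
    """Depth-limited minimax with alpha-beta pruning.
    children(state) -> successor states; utility(state) -> heuristic score."""
    succ = children(state)
    if depth == 0 or not succ:
        return utility(state)
    if maximizing:
        value = -math.inf
        for child in succ:
            value = max(value, alphabeta(child, depth - 1, alpha, beta,
                                         False, children, utility))
            alpha = max(alpha, value)
            if beta <= alpha:       # MIN will never allow this branch
                break
        return value
    value = math.inf
    for child in succ:
        value = min(value, alphabeta(child, depth - 1, alpha, beta,
                                     True, children, utility))
        beta = min(beta, value)
        if beta <= alpha:           # MAX already has a better option elsewhere
            break
    return value
```

The root call uses the full window (−∞, +∞); the window narrows as the search descends, which is what enables the cutoffs.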
For non-terminal states, we estimate position value:
H(s) = Σ_i w_i · f_i(s)
Feature Weights:
| Feature | Weight | Description |
|---|---|---|
| Meta 2-in-row | 500 | Near-winning meta position |
| Meta 1-piece | 100 | Potential meta line |
| Center board control | 200 | Strategic board dominance |
| Corner boards | 100 | Strong positional value |
| Small board 2-in-row | 10 | Local tactical threat |
| Small board 1-piece | 2 | Positional flexibility |
Total Evaluation:
Score(s, player) = H_player(s) - H_opponent(s)
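The weighted sum above can be sketched as follows; the feature names and the idea of passing per-player feature counts are illustrative assumptions, not the repository's actual evaluation code:

```python
# Feature weights taken from the table above
WEIGHTS = {
    "meta_two_in_row": 500,
    "meta_one_piece": 100,
    "center_board": 200,
    "corner_boards": 100,
    "small_two_in_row": 10,
    "small_one_piece": 2,
}

def heuristic(features):
    """H(s) = sum_i w_i * f_i(s); features maps feature name -> count."""
    return sum(WEIGHTS[name] * count for name, count in features.items())

def score(player_features, opponent_features):
    """Score(s, player) = H_player(s) - H_opponent(s)."""
    return heuristic(player_features) - heuristic(opponent_features)
```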
Stores transitions (s, a, r, s') in a circular buffer (max 50k):
D = {(s_t, a_t, r_t, s_{t+1})}
Mini-batches are sampled uniformly at random from D for training.
Benefits:
- Breaks temporal correlations
- Improves sample efficiency
- Stabilizes learning
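Assuming a standard uniform-sampling design (class and method names are illustrative, not the repository's actual API), the buffer can be sketched as:

```python
import random
from collections import deque

class ReplayBuffer:
    """Circular buffer of (s, a, r, s') transitions, capped at 50k entries."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions evicted first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniform random mini-batch; breaks temporal correlation between samples."""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

`deque(maxlen=...)` gives the circular behavior for free: once the cap is reached, each `push` silently drops the oldest transition.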
R(s, a, s') = {
+1000 if won game
+10 if won small board
0 if the game is ongoing or a small board is drawn
-50 if lost game (outcome reward)
-100 if invalid move
}
Discounted Return:
G_t = Σ_{k=0}^{∞} γ^k · r_{t+k+1}
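The discounted return is usually computed backwards in a single pass, as in this minimal sketch:

```python
GAMMA = 0.95  # discount factor from the text

def discounted_return(rewards):
    """G_t = sum_k gamma^k * r_{t+k+1}, via the recursion G = r + gamma * G."""
    g = 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
    return g
```

For example, a small-board win (+10) followed by a game win (+1000) is worth 10 + 0.95 · 1000 = 960 from the earlier state.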
# Clone repository
git clone https://github.com/Devanik21/SuPeReInFoRzE.git
cd SuPeReInFoRzE
# Install dependencies
pip install streamlit numpy matplotlib pandas scipy
# Launch application
streamlit run SuPeReInFoRzE.py

Requirements:
streamlit >= 1.28.0
numpy >= 1.24.0
matplotlib >= 3.7.0
pandas >= 2.0.0
scipy >= 1.11.0
Basic Training:
# Configure in sidebar
Learning Rate (ฮฑ): 0.1
Discount Factor (ฮณ): 0.95
Episodes: 5000
Update Frequency: 100
# Start Training → agents learn through self-play

Advanced Configuration:
# MCTS Settings
Simulations: 100
Exploration Constant: 1.414
# Minimax Depth
Agent 1: 2-4 (balanced)
Agent 2: 2-4 (matched)

Save Trained Agents:
# Exports .zip containing:
# - agent1.json (Q-table, config, stats)
# - agent2.json (opponent data)
# - config.json (hyperparameters)
# - training_stats.json (performance history)

Load Pre-trained Models:
# Upload .zip file
# System automatically deserializes Q-tables
# Training history restored for visualization

AI vs AI Battle:
- Watch trained agents compete
- Visualize strategic decision-making
- Board updates every 0.8 seconds
Human vs AI:
- Select opponent (Agent 1 or 2)
- Choose first move
- System validates moves and enforces rules
- Highlights active board constraints
+---------------------------------------------------+
|                SuPeReInFoRzE System               |
+---------------------------------------------------+
|                                                   |
|   +---------------+        +------------------+   |
|   |   UI Layer    |<------>|    Streamlit     |   |
|   |  (Aesthetic)  |        |    Interface     |   |
|   +-------+-------+        +------------------+   |
|           |                                       |
|   +-------v--------------------------------+      |
|   |         Training Controller            |      |
|   |   * Self-play orchestration            |      |
|   |   * Experience management              |      |
|   |   * Statistics tracking                |      |
|   +-------+--------------------------------+      |
|           |                                       |
|   +-------v---------+       +----------------+    |
|   |   Agent Core    |       |  Environment   |    |
|   +-----------------+       +----------------+    |
|   |  * Q-Learning   |<----->|  Super TTT     |    |
|   |  * MCTS Engine  |       |  Rules Engine  |    |
|   |  * Minimax      |       |  State Manager |    |
|   |  * Heuristics   |       +----------------+    |
|   +-------+---------+                             |
|           |                                       |
|   +-------v--------------------------------+      |
|   |          Persistence Layer             |      |
|   |   * Q-table serialization              |      |
|   |   * ZIP compression                    |      |
|   |   * State reconstruction               |      |
|   +----------------------------------------+      |
|                                                   |
+---------------------------------------------------+
Training Loop:

+--------+
| Reset  |
+---+----+
    |
+---v----------------------------+
| Agent 1 selects action         |
| (ε-greedy / MCTS / Minimax)    |
+---+----------------------------+
    |
+---v----------------------------+
| Environment executes move      |
| Updates state, checks win      |
+---+----------------------------+
    |
+---v----------------------------+
| Q-value update                 |
| Q(s,a) ← Q(s,a) + α·δ          |
+---+----------------------------+
    |
+---v----------------------------+
| Agent 2 turn (if not terminal) |
+---+----------------------------+
    |
+---v----------------------------+
| Outcome rewards distributed    |
| Statistics updated             |
+---+----------------------------+
    |
    +--> Continue if not done
    |
+---v----------------------------+
| Epsilon decay                  |
| ε ← max(ε_min, ε · λ)          |
+--------------------------------+
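One self-play episode of this loop can be sketched as follows; the `env` and `agent` interfaces (`reset`, `step`, `select_action`, `update`, `decay_epsilon`) are hypothetical names for this sketch, not the repository's actual API:

```python
def train_episode(env, agent1, agent2):
    """Run one self-play episode following the flow above."""
    state = env.reset()
    agents = [agent1, agent2]
    turn = 0
    while True:
        agent = agents[turn % 2]
        action = agent.select_action(state)           # eps-greedy / MCTS / minimax
        next_state, reward, done = env.step(action)   # execute move, check win
        agent.update(state, action, reward, next_state)  # Q(s,a) += alpha * delta
        state = next_state
        if done:
            break
        turn += 1
    for agent in agents:
        agent.decay_epsilon()                         # eps <- max(eps_min, eps * lambda)
```

Epsilon decays once per episode (not per move), matching the final box of the flowchart.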
The system tracks and visualizes:
Win Rate Evolution:
WR_i(t) = Wins_i(t) / Total_Games(t)
Epsilon Decay Curve:
ε(t) = max(ε_min, ε_0 · λ^t)
Q-Table Growth:
|Q(t)| = # unique (state, action) pairs
Sample Convergence (5000 episodes):
- Agent 1 Win Rate: ~45%
- Agent 2 Win Rate: ~42%
- Draw Rate: ~13%
- Final ε: ~0.05
- Q-table Size: 50k-200k states
Opening decoration uses a strange attractor:
x_{n+1} = sin(a·y_n) + c·cos(a·x_n)
y_{n+1} = sin(b·x_n) + d·cos(b·y_n)
Parameters:
a = -1.4 + 0.1·sin(t_minute)
b = 1.6 + 0.1·cos(t_minute)
c = 1.0
d = 0.7

Properties:
- Chaotic dynamics (sensitive to initial conditions)
- Bounded attractors (compact visualization)
- Time-variant parameters (subtle uniqueness per session)
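A minimal sketch of generating the attractor's point cloud with the parameters above (the starting point and iteration count are arbitrary choices of this sketch):

```python
import math
import time

def clifford_points(n=10_000, a=None, b=None, c=1.0, d=0.7):
    """Iterate the Clifford map; a and b default to the time-variant values above."""
    t = time.localtime().tm_min
    a = -1.4 + 0.1 * math.sin(t) if a is None else a
    b = 1.6 + 0.1 * math.cos(t) if b is None else b
    x, y, pts = 0.1, 0.1, []
    for _ in range(n):
        # Both coordinates update simultaneously from the previous (x, y)
        x, y = math.sin(a * y) + c * math.cos(a * x), math.sin(b * x) + d * math.cos(b * y)
        pts.append((x, y))
    return pts
```

Since |sin| ≤ 1 and |cos| ≤ 1, the orbit is confined to |x| ≤ 1 + |c| and |y| ≤ 1 + |d|, which is what keeps the visualization compact.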
Hashable State Tuple:
state = (
    small_boards_flat,   # 9 × (3×3) = 81 cells
    meta_board,          # 9 meta-positions
    active_board         # Constraint (0-8 or None)
)

Compression: states are converted to strings for serialization efficiency.
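One way such a string encoding could look (the exact key format is an assumption of this sketch, not the repository's actual scheme):

```python
def encode_state(small_boards_flat, meta_board, active_board):
    """Compact string key for the Q-table: one character per cell plus constraint.
    Cell values are assumed to be small ints (e.g. 0 empty, 1/2 for the players)."""
    cells = "".join(str(c) for c in small_boards_flat)   # 81 chars
    meta = "".join(str(c) for c in meta_board)           # 9 chars
    tag = "-" if active_board is None else str(active_board)
    return f"{cells}|{meta}|{tag}"
```

A 93-character string is both hashable and cheap to write out as JSON, which suits the ZIP-based persistence layer.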
Q-Learning Convergence (Watkins & Dayan, 1992):
Under the conditions:
- All state–action pairs are visited infinitely often
- The learning rate satisfies the Robbins–Monro conditions: Σα = ∞, Σα² < ∞
- Rewards are bounded

Then Q(s,a) → Q*(s,a) as t → ∞.
MCTS Convergence (Kocsis & Szepesvรกri, 2006):
As the number of simulations n → ∞:
- The selected action converges to the minimax-optimal action
- Regret bound: O(log n / n)
1. Center Board Dominance (+200 heuristic value)
   - Controls the meta-board center
   - Maximum flexibility for forcing moves
2. Corner Board Priority (+100 heuristic value)
   - Four paths to victory
   - Defensive stability
3. Active Board Control (+50 tactical bonus)
   - Forces the opponent into disadvantageous positions
   - Limits the opponent's strategic options
4. Two-in-a-Row Meta Threats (+500 evaluation)
   - Immediate winning potential
   - Forces a defensive response from the opponent
- State Space: Cannot fully explore 10^38 positions
- MCTS Depth: Limited by computational budget
- Perfect Play: Not guaranteed (heuristic-based)
- Draw Bias: Agents may prefer forcing draws over risky wins
- Deep Q-Networks (DQN): Neural network value approximation
- AlphaZero-style: Self-play + neural MCTS
- Parallel MCTS: Multi-threaded simulations
- Opening Book: Database of optimal early-game sequences
- Endgame Tablebase: Perfect play in reduced positions
- Multi-Agent Training: Population-based evolution
Objective: Achieve superhuman performance
Approach:
1. Neural state evaluation (CNN/Transformer)
2. Policy network for action priors
3. Self-play curriculum learning
4. Meta-learning for strategy adaptation
Target: Win rate > 90% vs. human experts
- Q-Learning: Watkins, C.J.C.H. (1989). Learning from Delayed Rewards; Watkins, C.J.C.H. & Dayan, P. (1992). Q-learning
- MCTS: Browne, C. et al. (2012). A Survey of Monte Carlo Tree Search Methods; Kocsis, L. & Szepesvári, C. (2006). Bandit Based Monte-Carlo Planning
- Alpha-Beta Pruning: Knuth, D.E. & Moore, R.W. (1975). An Analysis of Alpha-Beta Pruning
- Deep RL: Mnih, V. et al. (2015). Human-level control through deep reinforcement learning
- Berlekamp, E.R., Conway, J.H., Guy, R.K. (2001). Winning Ways for Your Mathematical Plays
- Silver, D. et al. (2017). Mastering the game of Go without human knowledge (AlphaGo Zero)
Contributions welcome! Areas of interest:
- Neural network integration
- Performance optimization
- Advanced evaluation functions
- Tournament play features
- Strategy analysis tools
Devanik Jain
AI Researcher & Mathematical Game Theory Enthusiast
MIT License - See LICENSE for details
- Streamlit Team: For the exceptional Python app framework
- NumPy Community: Powering high-performance numerical computing
- RL Pioneers: Sutton, Barto, Silver, and the DeepMind team
- Game Theory Giants: Von Neumann, Nash, Zermelo
"In the game of tic-tac-toe, the only winning move is not to play...
unless you're a reinforcement learning agent with 200k Q-states."
SuPeReInFoRzE | 2026 Edition
Where Mathematics Meets Strategy
Training Capacity: 100k+ episodes
State Space: ~10^38 theoretical positions
Q-Table Growth: 50k-200k states (typical)
MCTS Simulations: 0-1000 per move
Minimax Depth: 1-12 ply
Decision Time: <1s per move (optimized)

- Install Python 3.8+
- Clone the repository
- Install dependencies (pip install -r requirements.txt)
- Launch the app (streamlit run SuPeReInFoRzE.py)
- Configure training parameters
- Train agents (start with 5k episodes)
- Save trained models
- Challenge the AI to a game!
Happy Learning! May your Q-values converge and your strategies dominate!