# Bridging the Gap Between Control Theory and Reinforcement Learning

*Why do Control Theorists and RL Researchers sometimes talk past each other about POMDPs?*
- Quick Start
- Motivation
- The Core Difference
- White-Box vs Black-Box & Model-Based vs Model-Free
- Traditional / White-Box View
- Reinforcement Learning View
- Comparison Table
- Implications for Cross-Domain Review
- Roadmap
- References
- Contributing
## Quick Start

We provide two code examples demonstrating both approaches. Try them out!
### Deep RL Example

```bash
# Navigate to the deep RL example
cd examples/deep_rl

# Install dependencies
pip install -r requirements.txt

# Run training
python train.py
```

What you'll see:
- A simple corridor environment where the agent must remember an initial hint
- DRQN (LSTM-based) agent learning to handle partial observability
- No explicit belief state computation - memory is learned
### Control Theory Example

```bash
# Navigate to the control theory example
cd examples/control_theory

# Install dependencies
pip install -r requirements.txt

# Run solver
python solve.py
```

What you'll see:
- Classic Tiger problem with known observation model (85% accuracy)
- Bayesian belief state updates using explicit probability models
- Influence Diagram / I-DID framework for decision making
| Deep RL Example | Control Theory Example |
|---|---|
| `hidden_state = lstm(obs, hidden_state)` | `belief = bayes_update(belief, obs, P(o\|s))` |
| Memory is learned | Belief is computed |
| `P(o\|s)` is unknown | `P(o\|s) = 0.85` is known |
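To make the right-hand column concrete, here is a minimal sketch of the Bayesian update the Tiger example relies on, assuming only two hidden states (tiger-left / tiger-right) and the known 85%-accurate "listen" observation described above. It is illustrative only, not the actual code in `examples/control_theory/`, and the helper name `listen_update` is ours:

```python
# Illustrative sketch (not the repo's exact code): one Bayesian belief update
# for the Tiger problem after a "listen" action, assuming the observation
# model is known: P(hear the tiger on the correct side) = 0.85.

P_CORRECT = 0.85  # known observation accuracy from the problem definition

def listen_update(belief_left: float, heard_left: bool) -> float:
    """Return P(tiger is behind the LEFT door) after hearing a growl."""
    # Likelihood of the observation under each hypothesis
    like_left = P_CORRECT if heard_left else 1.0 - P_CORRECT
    like_right = 1.0 - P_CORRECT if heard_left else P_CORRECT

    unnormalized_left = like_left * belief_left
    unnormalized_right = like_right * (1.0 - belief_left)
    return unnormalized_left / (unnormalized_left + unnormalized_right)

b = 0.5                                  # start uncertain: tiger equally likely on either side
b = listen_update(b, heard_left=True)    # one noisy "growl-left" observation
print(b)                                 # ~= 0.85
b = listen_update(b, heard_left=True)    # a second consistent observation
print(b)                                 # ~= 0.97
```

Two consistent growls already push the belief above 0.96 — this explicit probability bookkeeping is exactly what the Deep RL example never performs.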
## Motivation

I recently submitted a paper that used POMDPs in the Reinforcement Learning sense: treating partial observability as a problem of incomplete information and using memory-based methods to handle it.
The reviewer, clearly from a Control Theory background, rejected it with comments like:
"This is not a proper POMDP formulation. Where is your observation model $P(o|s)$? How do you compute the belief state?"
From my RL perspective, these comments felt unfair. In our community, it's completely standard to handle partial observability with LSTMs or frame stacking, without explicitly defining observation probabilities.
But after some reflection, I realized: neither side is wrong. We're just using the same term "POMDP" with fundamentally different assumptions.
The mathematical definition of a POMDP is the same in both fields. But the conceptual understanding is significantly different.
This mismatch causes real problems:
- **Unfair rejections** → RL papers rejected for "not being real POMDPs"
- **Mutual confusion** → Control Theory papers seem to assume "cheating" knowledge
- **Wasted debates** → Arguments about whether explicit belief states are required
This project aims to bridge this gap.
We hope researchers from different backgrounds can:
- Understand each other's assumptions and conventions
- Communicate more effectively across domains
- Review cross-domain work more fairly and charitably
## The Core Difference

| Control Theory View 🔬 | Reinforcement Learning View 🤖 |
|---|---|
| **"The sensor is lying."** The agent has a *known* model but receives noisy/incorrect observations. | **"The sensor is blind."** The agent has an *unknown* model but receives correct yet incomplete observations. |
## White-Box vs Black-Box & Model-Based vs Model-Free

Understanding these two pairs of concepts is crucial for cross-domain communication:
| System Transparency ↓ / Model Knowledge → | **KNOWN** (Model-Based) | **UNKNOWN** (Model-Free) |
|---|---|---|
| **WHITE-BOX** | Classical POMDP (Control Theory): know `T(s'\|s,a)`, know `O(o\|s,a)` | Hybrid approach |
| **BLACK-BOX** | World Models (learned model) | Deep RL (pure RL): learn π, no model |
| Concept | Definition | Example |
|---|---|---|
| White-Box | System internals are known and transparent. You can "open the box" and see how it works. | Sensor spec sheet says "5% error rate" |
| Black-Box | System internals are unknown/opaque. You can only observe inputs and outputs. | You don't know why the camera sometimes fails |
| Model-Based | Agent has access to (or learns) an explicit model of the dynamics and observations, e.g. `T(s'\|s,a)` and `O(o\|s,a)` | POMDP solver with known transition matrix |
| Model-Free | Agent learns a policy without explicitly modeling dynamics | DQN, PPO with frame stacking |
**White-Box + Model-Based → Classical POMDP Approach (Control Theory)**

- Transition model `T(s'|s,a)` is **given**
- Observation model `O(o|s,a)` is **given**
- Belief state is computed **exactly** via Bayes' rule
- Planning happens in belief space

**Black-Box + Model-Free → Deep RL Approach (Reinforcement Learning)**

- No model is given or learned explicitly
- Agent interacts with the environment as a **black box**
- Memory (RNN/LSTM) implicitly captures temporal dependencies
- Policy is learned end-to-end from experience

**Black-Box + Model-Based → World Models / Dreamer (Modern Hybrid Approach)**

- Model is **learned** from data (not given)
- Agent treats the real environment as a black box, but builds an internal "world model" for planning
- Examples: Dreamer, MuZero, World Models
| Control Theory Assumption | RL Assumption |
|---|---|
| White-Box is the default | Black-Box is the default |
| "Of course you know `P(o\|s)`" | "How could you possibly know `P(o\|s)`?" |
| Model-Based is natural | Model-Free is natural |
| "Planning requires a model" | "Just learn a policy directly" |
💡 This is why the same POMDP paper can seem "obviously correct" to one reviewer and "fundamentally flawed" to another!
| Scenario | Control Theory View 🔬 | RL View 🤖 |
|---|---|---|
| Setup | A robot navigates a room with a noisy distance sensor | |
| Model Knowledge | "The LIDAR has 2 cm Gaussian noise (from the spec sheet)": `O(o\|s) = N(true_distance, 0.02)` | "I don't know the sensor characteristics": `O(o\|s) = ???` |
| Approach | Build a belief state using a Kalman Filter / Particle Filter | Feed the observation history to an LSTM; learn the policy end-to-end |
| Reasoning | "I know the sensor lies by 2 cm on average, so I'll account for it mathematically" | "I'll learn from experience what observations mean" |
Both approaches are valid! They just start from different assumptions about what knowledge is available.
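As a sketch of what "account for it mathematically" means in the left-hand column, the snippet below performs a single Kalman-style measurement update using the known 2 cm (0.02 m) sensor noise from the spec sheet. It is an illustration under those assumptions, not code from this repository:

```python
# Illustrative sketch: 1-D Kalman-style measurement update with a KNOWN
# sensor noise model (sigma = 0.02 m, i.e. 2 cm, taken from the spec sheet).

SENSOR_STD = 0.02  # known from the manufacturer's data sheet

def measurement_update(prior_mean: float, prior_var: float, z: float,
                       sensor_var: float = SENSOR_STD ** 2):
    """Fuse a prior belief N(prior_mean, prior_var) with a noisy reading z."""
    k = prior_var / (prior_var + sensor_var)      # Kalman gain
    post_mean = prior_mean + k * (z - prior_mean)
    post_var = (1.0 - k) * prior_var
    return post_mean, post_var

# Prior: we think the wall is ~1.00 m away, but are fairly unsure (std 10 cm)
mean, var = measurement_update(prior_mean=1.00, prior_var=0.10 ** 2, z=1.07)
print(mean, var ** 0.5)   # posterior mean pulled toward 1.07, std shrinks to ~2 cm
```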
## Traditional / White-Box View

This perspective emerged from Control Theory and Operations Research (e.g., work by Åström, Kaelbling, and Cassandra).

In this context, researchers typically assume they have a known model of the world. This model explicitly includes an observation function $O(o|s,a)$: the probability of receiving observation $o$ when the system is in state $s$ (after action $a$).
The core problem is often that sensors are "noisy" or "unreliable."
> **Example:** A robot might be in Room A, but its sensor has a 10% chance of reporting Room B. The sensor is WRONG, not just incomplete.
Because the agent knows the "white box" (the exact failure rate of the sensor), it uses Bayesian updates to maintain a "Belief State": a probability distribution over all possible states.

$$b'(s') = \eta \, O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)$$

where:

- $b(s)$: current belief over states
- $b'(s')$: updated belief after taking action $a$ and receiving observation $o$
- $T(s'|s,a)$: known transition model
- $O(o|s',a)$: known observation model
- $\eta$: normalization constant
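A minimal sketch of this update for a discrete POMDP is shown below (illustrative only; the array layout and the name `belief_update` are our own conventions, not any particular library's API):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One Bayesian belief update: b'(s') = eta * O(o|s',a) * sum_s T(s'|s,a) b(s).

    b : (S,)         current belief over states
    T : (A, S, S)    T[a, s, s'] = P(s' | s, a)   (known transition model)
    O : (A, S, O)    O[a, s', o] = P(o | s', a)   (known observation model)
    """
    predicted = T[a].T @ b                      # sum_s T(s'|s,a) b(s), shape (S,)
    unnormalized = O[a, :, o] * predicted       # weight by observation likelihood
    return unnormalized / unnormalized.sum()    # eta = 1 / normalizer
```

Classical solvers then plan over the resulting belief space, for example: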
- Value Iteration over Belief Space
- Point-Based Value Iteration (PBVI)
- SARSOP
- POMCP (with explicit belief tracking)
## Reinforcement Learning View

This perspective is common in modern Deep RL research (e.g., Atari games, robotic manipulation).

**1. Focus on Hidden Information (Latent States)**
In many RL tasks, the "partial observability" isn't caused by a sensor being "wrong" (failure), but by the sensor seeing only a small part of the state.
> **Example:** In a first-person shooter game, you see what is in front of you PERFECTLY (no sensor failure), but you cannot see what is behind you.
RL agents usually don't start with a known model of how their sensors fail. Instead of using Bayesian math to handle "incorrect" data, they use Memory (RNNs, LSTMs, or Transformers) to "remember" previous observations.
```python
# Typical RL approach: stack frames or use recurrent networks
observation_history = [o_t, o_t_minus_1, o_t_minus_2, o_t_minus_3]  # the last few observations
action = policy(observation_history)  # or policy(lstm_hidden_state)
```

Historically, many RL benchmarks assumed that if you can see an object, the observation is 100% correct. The "uncertainty" comes from the absence of information, not the incorrectness of it.
- Frame Stacking
- Recurrent Policies (LSTM/GRU)
- Transformers with Memory
- World Models with Latent States
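As a concrete (purely illustrative) sketch of the recurrent-policy idea listed above, here is a minimal DRQN-style Q-network in PyTorch. It is not the code in `examples/deep_rl/`, and the class name `RecurrentQNet` is our own; the point is that the LSTM hidden state plays the role the belief state plays in the control-theory view, with no observation model $P(o|s)$ anywhere:

```python
# Minimal DRQN-style recurrent Q-network sketch (PyTorch), for illustration only.
# The LSTM hidden state is a *learned* summary of the observation history;
# no observation model P(o|s) is ever specified.
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, obs_dim) -- a window of past observations
        x = torch.relu(self.encoder(obs_seq))
        x, hidden_state = self.lstm(x, hidden_state)   # memory instead of belief
        return self.q_head(x), hidden_state            # Q-values per time step

# Acting: feed one observation at a time, carrying the hidden state forward
net = RecurrentQNet(obs_dim=4, n_actions=2)
h = None
obs = torch.zeros(1, 1, 4)                 # (batch=1, time=1, obs_dim)
q_values, h = net(obs, h)                  # h is the agent's "memory"
action = q_values[:, -1].argmax(dim=-1)    # greedy action from the latest step
```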
## Comparison Table

| Feature | Traditional / White-Box 🔬 | Reinforcement Learning 🤖 |
|---|---|---|
| System Transparency | White-Box (internals are known) | Black-Box (internals are unknown) |
| Learning Paradigm | Model-Based (model is given/assumed) | Model-Free (learn a policy directly) |
| Typical Cause of Partial Observability | Sensor failure / noise (the sensor is "lying") | Occlusion / limited range (the sensor is "blind") |
| Model Knowledge | Known: `T(s'\|s,a)`, `O(o\|s,a)` (from specs or domain knowledge) | Unknown (agent must learn from experience) |
| Main Tool | Belief states (Bayesian filtering / Kalman filter) | Memory / history (RNNs, LSTMs, Transformers) |
| Observation Correctness | Can be 100% wrong (sensor failure) | Usually correct but incomplete (limited field of view) |
| Key Question | "Given noisy sensors, what do I believe?" | "Given limited vision, what did I miss?" |
| Uncertainty Source | Sensor error distribution | Information incompleteness |
| Typical Algorithms | PBVI, SARSOP, POMCP | Frame Stacking, DRQN, R2D2 |
## Implications for Cross-Domain Review

⚠️ **Don't expect explicit observation models $P(o|s)$**
RL papers often assume observations are correct but incomplete. The challenge is learning what information is missing, not modeling sensor error rates.
⚠️ **Don't assume the agent is "cheating" by knowing the observation model**

In many control applications, sensor specifications (accuracy, noise characteristics) are provided by manufacturers. Knowing $P(o|s)$ is therefore often a realistic engineering assumption, not privileged information.
Recent work on "Robust RL" has begun to merge these two perspectives by testing RL agents in environments with intentional sensor noise. This represents a promising direction for unifying both worldviews.
**POMDP Framework (same mathematical definition):** the tuple $(S, A, T, R, \Omega, O, \gamma, b_0)$

| Control Theory View 🔬 | Reinforcement Learning View 🤖 |
|---|---|
| **White-Box**: system is transparent ("open the box, read the specs") | **Black-Box**: system is opaque ("can only observe inputs and outputs") |
| **Model-Based**: `T(s'\|s,a)` and `O(o\|s,a)` are given | **Model-Free**: `T(s'\|s,a)` and `O(o\|s,a)` are unknown |
| **Belief State**: Bayesian filtering, exact probability tracking | **Memory / History**: RNN / LSTM / Transformer, learned temporal representation |
| "The sensor is LYING": noise / incorrect data | "The sensor is BLIND": incomplete / missing data |

Both perspectives are merged in **modern hybrid approaches**:

- World Models (Dreamer, MuZero)
- Robust RL (sensor noise)
- Sim2Real (domain adaptation)
- Learned belief states
## Roadmap

This project is under active development. Here's what we're planning:

- Core documentation explaining the two perspectives
- Comparison tables and visual diagrams
- Cross-domain review guidelines
- Code Examples: Control Theory Approach (`examples/control_theory/`) ✅
  - Tiger problem with explicit observation model P(o|s)
  - Belief state computation with Bayesian filtering
  - Influence Diagram / I-DID solver
- Code Examples: Reinforcement Learning Approach (`examples/deep_rl/`) ✅
  - Memory Corridor environment (partial observability from limited view)
  - DRQN (Deep Recurrent Q-Network) with LSTM
  - Memory-based policy without explicit belief computation
- Side-by-Side Comparison
  - Same environment solved with both approaches
  - Performance and assumption comparison
- Interactive Demo
  - Web-based visualization of belief states vs. memory states
- Additional Resources
  - Curated reading list for cross-domain researchers
  - Common terminology mapping between fields

💡 Want to contribute? Check out the Contributing section!
## References

- Åström, K. J. (1965). Optimal control of Markov processes with incomplete state information.
- Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains.
- Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov processes over a finite horizon.
- Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. (Frame stacking approach)
- Hausknecht, M., & Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs.
- Ni, T., et al. (2022). Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs.
- Dulac-Arnold, G., et al. (2021). Challenges of real-world reinforcement learning.
- Kirk, R., et al. (2023). A Survey of Generalisation in Deep Reinforcement Learning.
## Contributing

We welcome contributions from both communities! Whether you're from:

- 🔬 Control Theory / Operations Research
- 🤖 Reinforcement Learning / Deep Learning
- Any other field working with POMDPs
Your perspective is valuable. Please feel free to:
- **Open an Issue** → Share your experience with cross-domain misunderstandings
- **Submit a PR** → Add examples, references, or clarifications
- **Start a Discussion** → Propose ways to bridge the gap
This project is licensed under the MIT License - see the LICENSE file for details.
Let's build bridges, not walls.

If this project helped you understand cross-domain differences, please give it a ⭐

Made with ❤️ for the research community