# Bridging the Gap Between Control Theory and Reinforcement Learning

*Why do Control Theorists and RL Researchers sometimes talk past each other about POMDPs?*
- Quick Start
- Motivation
- The Core Difference
- White-Box vs Black-Box & Model-Based vs Model-Free
- Traditional / White-Box View
- Reinforcement Learning View
- Comparison Table
- Implications for Cross-Domain Review
- Roadmap
- References
- Contributing
## Quick Start

We provide two code examples demonstrating both approaches. Try them out!
### Deep RL Example

```bash
# Navigate to the deep RL example
cd examples/deep_rl

# Install dependencies
pip install -r requirements.txt

# Run training
python train.py
```

What you'll see:
- A simple corridor environment where the agent must remember an initial hint
- DRQN (LSTM-based) agent learning to handle partial observability
- No explicit belief state computation - memory is learned
### Control Theory Example

```bash
# Navigate to the control theory example
cd examples/control_theory

# Install dependencies
pip install -r requirements.txt

# Run solver
python solve.py
```

What you'll see:
- Classic Tiger problem with known observation model (85% accuracy)
- Bayesian belief state updates using explicit probability models
- Influence Diagram / I-DID framework for decision making
| Deep RL Example | Control Theory Example |
|---|---|
| `hidden_state = lstm(obs, hidden_state)` | `belief = bayes_update(belief, obs, P(o\|s))` |
| Memory is learned | Belief is computed |
| `P(o\|s)` is unknown | `P(o\|s) = 0.85` is known |
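To make the right-hand column concrete, here is a minimal sketch of the Bayesian update the Tiger example relies on, assuming only two hidden states (tiger-left / tiger-right) and the known 85%-accurate "listen" observation described above. It is illustrative only, not the actual code in `examples/control_theory/`, and the helper name `listen_update` is ours:

```python
# Illustrative sketch (not the repo's exact code): one Bayesian belief update
# for the Tiger problem after a "listen" action, assuming the observation
# model is known: P(hear the tiger on the correct side) = 0.85.

P_CORRECT = 0.85  # known observation accuracy from the problem definition

def listen_update(belief_left: float, heard_left: bool) -> float:
    """Return P(tiger is behind the LEFT door) after hearing a growl."""
    # Likelihood of the observation under each hypothesis
    like_left = P_CORRECT if heard_left else 1.0 - P_CORRECT
    like_right = 1.0 - P_CORRECT if heard_left else P_CORRECT

    unnormalized_left = like_left * belief_left
    unnormalized_right = like_right * (1.0 - belief_left)
    return unnormalized_left / (unnormalized_left + unnormalized_right)

b = 0.5                                  # start uncertain: tiger equally likely on either side
b = listen_update(b, heard_left=True)    # one noisy "growl-left" observation
print(b)                                 # ~= 0.85
b = listen_update(b, heard_left=True)    # a second consistent observation
print(b)                                 # ~= 0.97
```

Two consistent growls already push the belief above 0.96 — this explicit probability bookkeeping is exactly what the Deep RL example never performs.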
## Motivation

I recently submitted a paper that used POMDPs in the Reinforcement Learning sense: treating partial observability as a problem of incomplete information and using memory-based methods to handle it.
The reviewer, clearly from a Control Theory background, rejected it with comments like:
"This is not a proper POMDP formulation. Where is your observation model $P(o|s)$? How do you compute the belief state?"
From my RL perspective, these comments felt unfair. In our community, it's completely standard to handle partial observability with LSTMs or frame stacking, without explicitly defining observation probabilities.
But after some reflection, I realized: neither side is wrong. We're just using the same term "POMDP" with fundamentally different assumptions.
The mathematical definition of a POMDP is the same in both fields. But the conceptual understanding is significantly different.
This mismatch causes real problems:
- **Unfair rejections** → RL papers rejected for "not being real POMDPs"
- **Mutual confusion** → Control Theory papers seem to assume "cheating" knowledge
- **Wasted debates** → Arguments about whether explicit belief states are required
This project aims to bridge this gap.
We hope researchers from different backgrounds can:
- Understand each other's assumptions and conventions
- Communicate more effectively across domains
- Review cross-domain work more fairly and charitably
## The Core Difference

| Control Theory View 🔬 | Reinforcement Learning View 🤖 |
|---|---|
| **"The sensor is lying."** The agent has a *known* model but receives noisy/incorrect observations. | **"The sensor is blind."** The agent has an *unknown* model but receives correct yet incomplete observations. |
## White-Box vs Black-Box & Model-Based vs Model-Free

Understanding these two pairs of concepts is crucial for cross-domain communication:
| System Transparency ↓ / Model Knowledge → | **KNOWN** (Model-Based) | **UNKNOWN** (Model-Free) |
|---|---|---|
| **WHITE-BOX** | Classical POMDP (Control Theory): know `T(s'\|s,a)`, know `O(o\|s,a)` | Hybrid approach |
| **BLACK-BOX** | World Models (learned model) | Deep RL (pure RL): learn π, no model |
| Concept | Definition | Example |
|---|---|---|
| White-Box | System internals are known and transparent. You can "open the box" and see how it works. | Sensor spec sheet says "5% error rate" |
| Black-Box | System internals are unknown/opaque. You can only observe inputs and outputs. | You don't know why the camera sometimes fails |
| Model-Based | Agent has access to (or learns) an explicit model of the dynamics and observations, e.g. `T(s'\|s,a)` and `O(o\|s,a)` | POMDP solver with known transition matrix |
| Model-Free | Agent learns a policy without explicitly modeling dynamics | DQN, PPO with frame stacking |
**White-Box + Model-Based → Classical POMDP Approach (Control Theory)**

- Transition model `T(s'|s,a)` is **given**
- Observation model `O(o|s,a)` is **given**
- Belief state is computed **exactly** via Bayes' rule
- Planning happens in belief space

**Black-Box + Model-Free → Deep RL Approach (Reinforcement Learning)**

- No model is given or learned explicitly
- Agent interacts with the environment as a **black box**
- Memory (RNN/LSTM) implicitly captures temporal dependencies
- Policy is learned end-to-end from experience

**Black-Box + Model-Based → World Models / Dreamer (Modern Hybrid Approach)**

- Model is **learned** from data (not given)
- Agent treats the real environment as a black box, but builds an internal "world model" for planning
- Examples: Dreamer, MuZero, World Models
| Control Theory Assumption | RL Assumption |
|---|---|
| White-Box is the default | Black-Box is the default |
| "Of course you know `P(o\|s)`" | "How could you possibly know `P(o\|s)`?" |
| Model-Based is natural | Model-Free is natural |
| "Planning requires a model" | "Just learn a policy directly" |
💡 This is why the same POMDP paper can seem "obviously correct" to one reviewer and "fundamentally flawed" to another!
| Scenario | Control Theory View 🔬 | RL View 🤖 |
|---|---|---|
| Setup | A robot navigates a room with a noisy distance sensor | |
| Model Knowledge | "The LIDAR has 2 cm Gaussian noise (from the spec sheet)": `O(o\|s) = N(true_distance, 0.02)` | "I don't know the sensor characteristics": `O(o\|s) = ???` |
| Approach | Build a belief state using a Kalman Filter / Particle Filter | Feed the observation history to an LSTM; learn the policy end-to-end |
| Reasoning | "I know the sensor lies by 2 cm on average, so I'll account for it mathematically" | "I'll learn from experience what observations mean" |
Both approaches are valid! They just start from different assumptions about what knowledge is available.
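As a sketch of what "account for it mathematically" means in the left-hand column, the snippet below performs a single Kalman-style measurement update using the known 2 cm (0.02 m) sensor noise from the spec sheet. It is an illustration under those assumptions, not code from this repository:

```python
# Illustrative sketch: 1-D Kalman-style measurement update with a KNOWN
# sensor noise model (sigma = 0.02 m, i.e. 2 cm, taken from the spec sheet).

SENSOR_STD = 0.02  # known from the manufacturer's data sheet

def measurement_update(prior_mean: float, prior_var: float, z: float,
                       sensor_var: float = SENSOR_STD ** 2):
    """Fuse a prior belief N(prior_mean, prior_var) with a noisy reading z."""
    k = prior_var / (prior_var + sensor_var)      # Kalman gain
    post_mean = prior_mean + k * (z - prior_mean)
    post_var = (1.0 - k) * prior_var
    return post_mean, post_var

# Prior: we think the wall is ~1.00 m away, but are fairly unsure (std 10 cm)
mean, var = measurement_update(prior_mean=1.00, prior_var=0.10 ** 2, z=1.07)
print(mean, var ** 0.5)   # posterior mean pulled toward 1.07, std shrinks to ~2 cm
```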
## Traditional / White-Box View

This perspective emerged from Control Theory and Operations Research (e.g., work by Åström, Kaelbling, and Cassandra).

In this context, researchers typically assume they have a known model of the world. This model explicitly includes an observation function $O(o|s,a)$: the probability of receiving observation $o$ when the system is in state $s$ (after action $a$).
The core problem is often that sensors are "noisy" or "unreliable."
> **Example:** A robot might be in Room A, but its sensor has a 10% chance of reporting Room B. The sensor is WRONG, not just incomplete.
Because the agent knows the "white box" (the exact failure rate of the sensor), it uses Bayesian updates to maintain a "Belief State": a probability distribution over all possible states.

$$b'(s') = \eta \, O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)$$

where:

- $b(s)$: current belief over states
- $b'(s')$: updated belief after taking action $a$ and receiving observation $o$
- $T(s'|s,a)$: known transition model
- $O(o|s',a)$: known observation model
- $\eta$: normalization constant
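A minimal sketch of this update for a discrete POMDP is shown below (illustrative only; the array layout and the name `belief_update` are our own conventions, not any particular library's API):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One Bayesian belief update: b'(s') = eta * O(o|s',a) * sum_s T(s'|s,a) b(s).

    b : (S,)         current belief over states
    T : (A, S, S)    T[a, s, s'] = P(s' | s, a)   (known transition model)
    O : (A, S, O)    O[a, s', o] = P(o | s', a)   (known observation model)
    """
    predicted = T[a].T @ b                      # sum_s T(s'|s,a) b(s), shape (S,)
    unnormalized = O[a, :, o] * predicted       # weight by observation likelihood
    return unnormalized / unnormalized.sum()    # eta = 1 / normalizer
```

Classical solvers then plan over the resulting belief space, for example: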
- Value Iteration over Belief Space
- Point-Based Value Iteration (PBVI)
- SARSOP
- POMCP (with explicit belief tracking)
## Reinforcement Learning View

This perspective is common in modern Deep RL research (e.g., Atari games, robotic manipulation).

**1. Focus on Hidden Information (Latent States)**
In many RL tasks, the "partial observability" isn't caused by a sensor being "wrong" (failure), but by the sensor seeing only a small part of the state.
> **Example:** In a first-person shooter game, you see what is in front of you PERFECTLY (no sensor failure), but you cannot see what is behind you.
RL agents usually don't start with a known model of how their sensors fail. Instead of using Bayesian math to handle "incorrect" data, they use Memory (RNNs, LSTMs, or Transformers) to "remember" previous observations.
```python
# Typical RL approach: stack frames or use recurrent networks
observation_history = [o_t, o_t_minus_1, o_t_minus_2, o_t_minus_3]  # the last few observations
action = policy(observation_history)  # or policy(lstm_hidden_state)
```

Historically, many RL benchmarks assumed that if you can see an object, the observation is 100% correct. The "uncertainty" comes from the absence of information, not the incorrectness of it.
- Frame Stacking
- Recurrent Policies (LSTM/GRU)
- Transformers with Memory
- World Models with Latent States
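As a concrete (purely illustrative) sketch of the recurrent-policy idea listed above, here is a minimal DRQN-style Q-network in PyTorch. It is not the code in `examples/deep_rl/`, and the class name `RecurrentQNet` is our own; the point is that the LSTM hidden state plays the role the belief state plays in the control-theory view, with no observation model $P(o|s)$ anywhere:

```python
# Minimal DRQN-style recurrent Q-network sketch (PyTorch), for illustration only.
# The LSTM hidden state is a *learned* summary of the observation history;
# no observation model P(o|s) is ever specified.
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, obs_dim) -- a window of past observations
        x = torch.relu(self.encoder(obs_seq))
        x, hidden_state = self.lstm(x, hidden_state)   # memory instead of belief
        return self.q_head(x), hidden_state            # Q-values per time step

# Acting: feed one observation at a time, carrying the hidden state forward
net = RecurrentQNet(obs_dim=4, n_actions=2)
h = None
obs = torch.zeros(1, 1, 4)                 # (batch=1, time=1, obs_dim)
q_values, h = net(obs, h)                  # h is the agent's "memory"
action = q_values[:, -1].argmax(dim=-1)    # greedy action from the latest step
```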
## Comparison Table

| Feature | Traditional / White-Box 🔬 | Reinforcement Learning 🤖 |
|---|---|---|
| System Transparency | White-Box (internals are known) | Black-Box (internals are unknown) |
| Learning Paradigm | Model-Based (model is given/assumed) | Model-Free (learn a policy directly) |
| Typical Cause of Partial Observability | Sensor failure / noise (the sensor is "lying") | Occlusion / limited range (the sensor is "blind") |
| Model Knowledge | Known: `T(s'\|s,a)`, `O(o\|s,a)` (from specs or domain knowledge) | Unknown (agent must learn from experience) |
| Main Tool | Belief states (Bayesian filtering / Kalman filter) | Memory / history (RNNs, LSTMs, Transformers) |
| Observation Correctness | Can be 100% wrong (sensor failure) | Usually correct but incomplete (limited field of view) |
| Key Question | "Given noisy sensors, what do I believe?" | "Given limited vision, what did I miss?" |
| Uncertainty Source | Sensor error distribution | Information incompleteness |
| Typical Algorithms | PBVI, SARSOP, POMCP | Frame Stacking, DRQN, R2D2 |
## Implications for Cross-Domain Review

⚠️ **Don't expect explicit observation models $P(o|s)$**
RL papers often assume observations are correct but incomplete. The challenge is learning what information is missing, not modeling sensor error rates.
⚠️ **Don't assume the agent is "cheating" by knowing the observation model**

In many control applications, sensor specifications (accuracy, noise characteristics) are provided by manufacturers. Knowing $P(o|s)$ is therefore often a realistic engineering assumption, not privileged information.
Recent work on "Robust RL" has begun to merge these two perspectives by testing RL agents in environments with intentional sensor noise. This represents a promising direction for unifying both worldviews.
**POMDP Framework (same mathematical definition):** the tuple $(S, A, T, R, \Omega, O, \gamma, b_0)$

| Control Theory View 🔬 | Reinforcement Learning View 🤖 |
|---|---|
| **White-Box**: system is transparent ("open the box, read the specs") | **Black-Box**: system is opaque ("can only observe inputs and outputs") |
| **Model-Based**: `T(s'\|s,a)` and `O(o\|s,a)` are given | **Model-Free**: `T(s'\|s,a)` and `O(o\|s,a)` are unknown |
| **Belief State**: Bayesian filtering, exact probability tracking | **Memory / History**: RNN / LSTM / Transformer, learned temporal representation |
| "The sensor is LYING": noise / incorrect data | "The sensor is BLIND": incomplete / missing data |

Both perspectives are merged in **modern hybrid approaches**:

- World Models (Dreamer, MuZero)
- Robust RL (sensor noise)
- Sim2Real (domain adaptation)
- Learned belief states
## Roadmap

This project is under active development. Here's what we're planning:

- Core documentation explaining the two perspectives
- Comparison tables and visual diagrams
- Cross-domain review guidelines
- Code Examples: Control Theory Approach (`examples/control_theory/`) ✅
  - Tiger problem with explicit observation model P(o|s)
  - Belief state computation with Bayesian filtering
  - Influence Diagram / I-DID solver
- Code Examples: Reinforcement Learning Approach (`examples/deep_rl/`) ✅
  - Memory Corridor environment (partial observability from limited view)
  - DRQN (Deep Recurrent Q-Network) with LSTM
  - Memory-based policy without explicit belief computation
- Side-by-Side Comparison
  - Same environment solved with both approaches
  - Performance and assumption comparison
- Interactive Demo
  - Web-based visualization of belief states vs. memory states
- Additional Resources
  - Curated reading list for cross-domain researchers
  - Common terminology mapping between fields

💡 Want to contribute? Check out the Contributing section!
## References

- Åström, K. J. (1965). Optimal control of Markov processes with incomplete state information.
- Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains.
- Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov processes over a finite horizon.
- Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. (Frame stacking approach)
- Hausknecht, M., & Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs.
- Ni, T., et al. (2022). Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs.
- Dulac-Arnold, G., et al. (2021). Challenges of real-world reinforcement learning.
- Kirk, R., et al. (2023). A Survey of Generalisation in Deep Reinforcement Learning.
## Contributing

We welcome contributions from both communities! Whether you're from:

- 🔬 Control Theory / Operations Research
- 🤖 Reinforcement Learning / Deep Learning
- Any other field working with POMDPs
Your perspective is valuable. Please feel free to:
- **Open an Issue** → Share your experience with cross-domain misunderstandings
- **Submit a PR** → Add examples, references, or clarifications
- **Start a Discussion** → Propose ways to bridge the gap
This project is licensed under the MIT License - see the LICENSE file for details.
Let's build bridges, not walls.

If this project helped you understand cross-domain differences, please give it a ⭐

Made with ❤️ for the research community