
🎯 POMDP: Two Worlds, One Framework

Bridging the Gap Between Control Theory and Reinforcement Learning


Why do Control Theorists and RL Researchers sometimes talk past each other about POMDPs?



🚀 Quick Start

We provide two code examples demonstrating both approaches. Try them out!

Example 1: Deep RL Approach (Memory Corridor + DRQN)

# Navigate to the deep RL example
cd examples/deep_rl

# Install dependencies
pip install -r requirements.txt

# Run training
python train.py

What you'll see:

  • A simple corridor environment where the agent must remember an initial hint
  • DRQN (LSTM-based) agent learning to handle partial observability
  • No explicit belief state computation - memory is learned
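
For orientation, here is a minimal sketch of the kind of recurrent Q-network DRQN uses (assuming PyTorch; the class and variable names are illustrative, not the repo's actual code):

import torch.nn as nn

class RecurrentQNet(nn.Module):
    """LSTM-based Q-network: learned memory stands in for an explicit belief state."""
    def __init__(self, obs_dim, n_actions, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)
        self.q_head = nn.Linear(hidden_size, n_actions)

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, obs_dim); hidden_state carries memory across steps
        features, hidden_state = self.lstm(obs_seq, hidden_state)
        return self.q_head(features), hidden_state   # Q-values: (batch, time, n_actions)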

Example 2: Control Theory Approach (Tiger Problem + I-DID)

# Navigate to the control theory example
cd examples/control_theory

# Install dependencies
pip install -r requirements.txt

# Run solver
python solve.py

What you'll see:

  • Classic Tiger problem with known observation model (85% accuracy)
  • Bayesian belief state updates using explicit probability models
  • Influence Diagram / I-DID framework for decision making
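
As a taste of what the solver computes: starting from a uniform belief over {tiger-left, tiger-right} and hearing a growl on the left with the stated 85% observation accuracy (assuming the usual Tiger convention that listening does not move the tiger), the Bayesian update gives

$$b'(\text{left}) = \frac{0.85 \times 0.5}{0.85 \times 0.5 + 0.15 \times 0.5} = 0.85$$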

📊 Key Difference in Action

| Deep RL Example | Control Theory Example |
| --- | --- |
| `hidden_state = lstm(obs, hidden_state)` | `belief = bayes_update(belief, obs, P(o\|s))` |
| Memory is learned | Belief is computed |
| $P(o \mid s)$ is unknown | $P(o \mid s) = 0.85$ is known |

🎯 Motivation

📖 The Story Behind This Project

I recently submitted a paper that used POMDPs in the Reinforcement Learning sense: treating partial observability as a problem of incomplete information and using memory-based methods to handle it.

The reviewer, clearly from a Control Theory background, rejected it with comments like:

"This is not a proper POMDP formulation. Where is your observation model $P(o|s)$? How do you compute the belief state?"

From my RL perspective, these comments felt unfair. In our community, it's completely standard to handle partial observability with LSTMs or frame stacking, without explicitly defining observation probabilities.

But after some reflection, I realized: neither side is wrong. We're just using the same term "POMDP" with fundamentally different assumptions.

🎯 Why This Matters

The mathematical definition of a POMDP is the same in both fields. But the conceptual understanding is significantly different.
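
For reference, that shared definition is the tuple

$$\langle S, A, T, R, \Omega, O, \gamma, b_0 \rangle$$

where $S$ is the state set, $A$ the actions, $T(s'|s,a)$ the transition model, $R$ the reward function, $\Omega$ the observation set, $O(o|s',a)$ the observation model, $\gamma$ the discount factor, and $b_0$ the initial belief.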

This mismatch causes real problems:

  • ๐Ÿ“ Unfair rejections โ€” RL papers rejected for "not being real POMDPs"
  • ๐Ÿค” Mutual confusion โ€” Control Theory papers seem to assume "cheating" knowledge
  • ๐Ÿ’ฌ Wasted debates โ€” Arguments about whether explicit belief states are required

This project aims to bridge this gap.

We hope researchers from different backgrounds can:

  1. Understand each other's assumptions and conventions
  2. Communicate more effectively across domains
  3. Review cross-domain work more fairly and charitably

🔑 The Core Difference

🔬 Control Theory

"The sensor is lying"

The agent has a known model but receives noisy/incorrect observations.

🤖 Reinforcement Learning

"The sensor is blind"

The agent has an unknown model but receives observations that are correct yet incomplete.


📦 White-Box vs Black-Box & Model-Based vs Model-Free

Understanding these two pairs of concepts is crucial for cross-domain communication:

🎯 The Two Dimensions

The rows give system transparency, the columns give model knowledge:

| | Model KNOWN (Model-Based) | Model UNKNOWN (Model-Free) |
| --- | --- | --- |
| WHITE BOX | ✅ Classical POMDP (Control Theory): know $T(s' \mid s,a)$, know $O(o \mid s,a)$ | Hybrid approach |
| BLACK BOX | World Models (learned model) | ✅ Deep RL (pure RL): learn $\pi$, no model |

📋 Concept Definitions

| Concept | Definition | Example |
| --- | --- | --- |
| White-Box | System internals are known and transparent. You can "open the box" and see how it works. | Sensor spec sheet says "5% error rate" |
| Black-Box | System internals are unknown/opaque. You can only observe inputs and outputs. | You don't know why the camera sometimes fails |
| Model-Based | Agent has access to (or learns) an explicit model: $T(s' \mid s,a)$ and/or $O(o \mid s,a)$ | POMDP solver with known transition matrix |
| Model-Free | Agent learns a policy $\pi(a \mid o)$ directly, without explicitly modeling dynamics | DQN, PPO with frame stacking |

🔗 How They Relate

  • White-Box + Model-Based = Classical POMDP Approach (Control Theory)

    • Transition model $T(s'|s,a)$ is GIVEN
    • Observation model $O(o|s,a)$ is GIVEN
    • Belief state is computed EXACTLY via Bayes' rule
    • Planning happens in belief space
  • Black-Box + Model-Free = Deep RL Approach (Reinforcement Learning)

    • No model is given or learned explicitly
    • Agent interacts with the environment as a BLACK BOX
    • Memory (RNN/LSTM) implicitly captures temporal dependencies
    • Policy is learned end-to-end from experience
  • Black-Box + Model-Based = World Models / Dreamer (Modern Hybrid Approach)

    • Model is LEARNED from data (not given)
    • Agent treats the real environment as a black box
    • But builds an internal "world model" for planning
    • Examples: Dreamer, MuZero, World Models

⚡ Key Insight: The Source of Confusion

| Control Theory Assumption | RL Assumption |
| --- | --- |
| White-Box is the default | Black-Box is the default |
| "Of course you know $P(o \mid s)$, it's in the sensor datasheet!" | "How could you possibly know $P(o \mid s)$ without learning it?" |
| Model-Based is natural | Model-Free is natural |
| "Planning requires a model" | "Just learn a policy directly" |

💡 This is why the same POMDP paper can seem "obviously correct" to one reviewer and "fundamentally flawed" to another!

🎮 Concrete Example: Robot Navigation

| Scenario | Control Theory View 🔬 | RL View 🤖 |
| --- | --- | --- |
| Setup | A robot navigates a room with a noisy distance sensor | (same scenario) |
| Model Knowledge | ✅ "The LIDAR has 2cm Gaussian noise (from spec sheet)": `O(o\|s) = N(true_distance, 0.02)` | ❓ "I don't know the sensor characteristics": `O(o\|s) = ???` |
| Approach | Build belief state using Kalman Filter / Particle Filter | Feed observation history to LSTM, learn policy end-to-end |
| Reasoning | "I know the sensor lies 2cm on average, so I'll account for it mathematically" | "I'll learn from experience what observations mean" |

Both approaches are valid! They just start from different assumptions about what knowledge is available.
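
To make the contrast concrete, here is a minimal sketch in plain Python (variable names and numbers are illustrative, not code from this repo): a scalar Kalman measurement update that exploits the known 2cm noise, next to the RL-style alternative of simply appending raw readings to the history a learned policy consumes.

# --- Control theory style: scalar Kalman measurement update ---
R = 0.02 ** 2                          # KNOWN sensor noise variance (2cm std, from the spec sheet)

def kalman_update(x_est, p_est, z):
    """Fuse one noisy distance reading z into the belief (mean x_est, variance p_est)."""
    k = p_est / (p_est + R)            # Kalman gain: how much to trust the new reading
    return x_est + k * (z - x_est), (1.0 - k) * p_est

# --- RL style: no sensor model, just remember raw readings ---
def rl_step(history, z, policy):
    """Append the reading and let a learned policy (e.g. an LSTM) pick the action."""
    history.append(z)
    return policy(history)

x, p = 1.0, 1.0                        # prior belief: distance roughly 1 m, very uncertain
x, p = kalman_update(x, p, z=1.03)     # posterior sharpens using the known noise model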


🔬 Traditional / White-Box View

This perspective emerged from Control Theory and Operations Research (e.g., work by Åström, Kaelbling, Cassandra)

Key Characteristics

1. The "Sensor Failure" Rule

In this context, researchers typically assume they have a known model of the world. This model explicitly includes an observation function $P(o|s)$.

2. Focus on Noise

The core problem is often that sensors are "noisy" or "unreliable."

๐Ÿ“ Example: A robot might be in Room A, but its sensor has a 10% chance of 
   reporting Room B. The sensor is WRONG, not just incomplete.

3. Belief States

Because the agent knows the "white box" (the exact failure rate of the sensor), it uses Bayesian updates to maintain a "Belief State"โ€”a probability distribution over all possible states.

$$b'(s') = \eta \cdot O(o|s',a) \sum_{s} T(s'|s,a) b(s)$$

Where:

  • $b(s)$: current belief over states
  • $T(s'|s,a)$: known transition model
  • $O(o|s',a)$: known observation model
  • $\eta$: normalization constant
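
A minimal NumPy sketch of this update (illustrative only; it assumes $T$ is stored as an array indexed [s, a, s'] and $O$ as [s', a, o]):

import numpy as np

def belief_update(b, a, o, T, O):
    """b'(s') = eta * O(o|s',a) * sum_s T(s'|s,a) * b(s)"""
    predicted = T[:, a, :].T @ b              # prediction: sum_s T(s'|s,a) b(s)
    unnormalized = O[:, a, o] * predicted     # correction: weight by O(o|s',a)
    return unnormalized / unnormalized.sum()  # eta: renormalize to a probability distribution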

4. Typical Algorithms

  • Value Iteration over Belief Space
  • Point-Based Value Iteration (PBVI)
  • SARSOP
  • POMCP (with explicit belief tracking)

🤖 Reinforcement Learning View

This perspective is common in modern Deep RL research (e.g., Atari games, robotic manipulation)

Key Characteristics

1. Focus on Hidden Information (Latent States)

In many RL tasks, the "partial observability" isn't caused by a sensor being "wrong" (failure), but by the sensor seeing only a small part of the state.

🎮 Example: In a first-person shooter game, you see what is in front of you 
   PERFECTLY (no sensor failure), but you cannot see what is behind you.

2. Black-Box Approach

RL agents usually don't start with a known model of how their sensors fail. Instead of using Bayesian math to handle "incorrect" data, they use Memory (RNNs, LSTMs, or Transformers) to "remember" previous observations.

# Typical RL approach: stack recent frames or use a recurrent network
observation_history = [o_t, o_t_minus_1, o_t_minus_2, o_t_minus_3]   # last 4 observations
action = policy(observation_history)  # or: action, hidden = policy(o_t, hidden)
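
A runnable version of the frame-stacking variant, for concreteness (a sketch only; `policy` and the observation format are placeholders):

from collections import deque
import numpy as np

STACK_SIZE = 4
frames = deque(maxlen=STACK_SIZE)      # oldest observation is dropped automatically

def act(obs, policy):
    frames.append(obs)
    # At episode start, pad with copies of the earliest frame until the stack is full
    padded = [frames[0]] * (STACK_SIZE - len(frames)) + list(frames)
    return policy(np.concatenate(padded))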

3. Assumption of Accuracy

Historically, many RL benchmarks assumed that if you can see an object, the observation is 100% correct. The "uncertainty" comes from the absence of information, not the incorrectness of it.

4. Typical Algorithms

  • Frame Stacking
  • Recurrent Policies (LSTM/GRU)
  • Transformers with Memory
  • World Models with Latent States

📊 Comparison Table

| Feature | Traditional / White-Box 🔬 | Reinforcement Learning 🤖 |
| --- | --- | --- |
| System Transparency | White-Box (internals are known) | Black-Box (internals are unknown) |
| Learning Paradigm | Model-Based (model is given/assumed) | Model-Free (learn policy directly) |
| Typical Cause of PO | Sensor failure / noise (the sensor is "lying") | Occlusion / limited range (the sensor is "blind") |
| Model Knowledge | Known: $T(s' \mid s,a)$, $O(o \mid s,a)$ (from specs or domain knowledge) | Unknown (agent must learn from experience) |
| Main Tool | Belief states (Bayesian filtering / Kalman filter) | Memory / history (RNNs, LSTMs, Transformers) |
| Observation Correctness | Can be 100% wrong (sensor failure) | Usually correct but incomplete (limited field of view) |
| Key Question | "Given noisy sensors, what do I believe?" | "Given limited vision, what did I miss?" |
| Uncertainty Source | Sensor error distribution | Information incompleteness |
| Typical Algorithms | PBVI, SARSOP, POMCP | Frame stacking, DRQN, R2D2 |

โš–๏ธ Implications for Cross-Domain Review

For Control Theory Reviewers Evaluating RL Papers

โš ๏ธ Don't expect explicit observation models $P(o|s)$

RL papers often assume observations are correct but incomplete. The challenge is learning what information is missing, not modeling sensor error rates.

For RL Reviewers Evaluating Control Theory Papers

โš ๏ธ Don't assume the agent is "cheating" by knowing the observation model

In many control applications, sensor specifications (accuracy, noise characteristics) are provided by manufacturers. Knowing $P(o|s)$ is a reasonable assumption.

Common Ground: Robust RL

Recent work on "Robust RL" has begun to merge these two perspectives by testing RL agents in environments with intentional sensor noise. This represents a promising direction for unifying both worldviews.
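
As one concrete illustration of that direction (a sketch assuming the Gymnasium API; the wrapper name and noise level are made up for this example), sensor noise can be injected so that a "blind" RL agent also faces a "lying" sensor:

import gymnasium as gym
import numpy as np

class NoisySensorWrapper(gym.ObservationWrapper):
    """Add Gaussian noise to every observation the agent receives."""
    def __init__(self, env, sigma=0.05):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        return obs + np.random.normal(0.0, self.sigma, size=np.shape(obs))

# Example usage:
# env = NoisySensorWrapper(gym.make("Pendulum-v1"), sigma=0.05)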


🌉 Bridging the Gap

POMDP Framework (Same Mathematical Definition)
Tuple: $(S, A, T, R, \Omega, O, \gamma, b_0)$

| | Control Theory View | Reinforcement Learning View |
| --- | --- | --- |
| System transparency | 🔬 WHITE-BOX: system is transparent ("open the box, read the specs") | 📦 BLACK-BOX: system is opaque ("can only observe I/O") |
| Model knowledge | 📐 MODEL-BASED: $T(s' \mid s,a)$ and $O(o \mid s,a)$ are given | 🎯 MODEL-FREE: $T(s' \mid s,a)$ and $O(o \mid s,a)$ are unknown |
| Handling uncertainty | 🧠 BELIEF STATE: Bayesian filtering, exact probability tracking | 💾 MEMORY/HISTORY: RNN / LSTM / Transformer, learned temporal representation |
| Cause of uncertainty | ❓ "Sensor is LYING": noise / incorrect data | ❓ "Sensor is BLIND": incomplete / missing data |

Both views flow into 🔄 MODERN HYBRID APPROACHES:

  • World Models (Dreamer, MuZero)
  • Robust RL (sensor noise)
  • Sim2Real (domain adaptation)
  • Learned Belief States

Both perspectives merged!

๐Ÿ—บ๏ธ Roadmap

This project is under active development. Here's what we're planning:

✅ Completed

  • Core documentation explaining the two perspectives

  • Comparison tables and visual diagrams

  • Cross-domain review guidelines

  • Code Examples: Control Theory Approach → examples/control_theory/

    • Tiger problem with explicit observation model P(o|s)
    • Belief state computation with Bayesian filtering
    • Influence Diagram / I-DID solver
  • Code Examples: Reinforcement Learning Approach → examples/deep_rl/

    • Memory Corridor environment (partial observability from limited view)
    • DRQN (Deep Recurrent Q-Network) with LSTM
    • Memory-based policy without explicit belief computation

📋 Planned

  • Side-by-Side Comparison

    • Same environment solved with both approaches
    • Performance and assumption comparison
  • Interactive Demo

    • Web-based visualization of belief states vs memory states
  • Additional Resources

    • Curated reading list for cross-domain researchers
    • Common terminology mapping between fields

💡 Want to contribute? Check out the Contributing section!


📚 References

Control Theory / Operations Research

  1. Åström, K. J. (1965). Optimal control of Markov processes with incomplete state information.
  2. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains.
  3. Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov processes over a finite horizon.

Reinforcement Learning

  1. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. (Frame stacking approach)
  2. Hausknecht, M., & Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs.
  3. Ni, T., et al. (2022). Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs.

Bridging Work

  1. Dulac-Arnold, G., et al. (2021). Challenges of real-world reinforcement learning.
  2. Kirk, R., et al. (2023). A Survey of Generalisation in Deep Reinforcement Learning.

๐Ÿค Contributing

We welcome contributions from both communities! Whether you're from:

  • 🔬 Control Theory / Operations Research
  • 🤖 Reinforcement Learning / Deep Learning
  • 🎓 Any other field working with POMDPs

Your perspective is valuable. Please feel free to:

  1. Open an Issue: share your experience with cross-domain misunderstandings
  2. Submit a PR: add examples, references, or clarifications
  3. Start a Discussion: propose ways to bridge the gap

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Let's build bridges, not walls. 🌉

If this project helped you understand cross-domain differences, please give it a ⭐

Made with ❤️ for the research community
