This handbook teaches from the fundamentals of diffusion, through diffusion policy and diffusion transformers, to flow matching in pi0.
Each repository section contains one or more markdown files, the original papers, and some example code. The markdown files explain the contents of the papers more clearly and concisely. Feel free to skip around, and also add your own notes/papers/code you think are worth including through a PR! Start at Section 1 if you are already familiar with regular diffusion.
Section 0: Fundamentals
0.1: Variational Generative Inference
0.2: Denoising Diffusion Probabilistic Models (DDPMs)
Section 1: Diffusion Policy
1.2: Components of Diffusion Policy
Section 2: Diffusion Transformer
2.1: Diffusion Transformer (DiT)
2.2: Ingredients of Robotic Diffusion Transformers
2.3: RDT-1B: Diffusion Foundation Model for Bimanual Manipulation
Section 3: Flow Matching
3.1: Continuous Normalizing Flows
3.3: Conditional Flow Matching
3.4: Diffusion and Optimal Transport Flows
This Colab Notebook has the elements set up for training a Diffusion Transformer Policy on any data distribution, including a simple generated one, and then visualizing how the learned Diffusion Transformer trajectories form, from randomly distributed noise particles to a stable aligned trajectory.
It uses this forked DiT repository.
In Denoising Diffusion Probabilistic Models (DDPMs), output generation is modeled as a denoising process.
- Starts from $x^K$ sampled from Gaussian noise
- K iterations of denoising produce intermediate outputs with decreasing noise $x^k, \dots, x^0$ until the desired noise-free output $x^0$ is formed
- Denoising follows the equation $x^{k-1} = \alpha \left( x^k - \gamma \epsilon_\theta(x^k, k) + \mathcal{N}(0, \sigma^2 I) \right)$, where $\epsilon_\theta$ is the noise-prediction network and $\alpha$, $\gamma$, $\sigma$ are functions of the iteration step $k$ (the noise schedule)
- For your intuition, this is the same as a single noisy gradient step: $x' = x - \gamma \nabla E(x)$
- Goal: minimize the KL divergence between the data distribution $p(x^0)$ and the samples of the DDPM $q(x^0)$, with the loss for each diffusion step being the mean-squared error between the predicted and true noise: $\mathcal{L} = MSE\big(\epsilon^k, \epsilon_\theta(x^0 + \epsilon^k, k)\big)$
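As a concrete sketch of this training objective before we move to actions (a toy example, not code from the repository), here is a minimal PyTorch step. It assumes a generic noise-prediction network `eps_model(x_noisy, k)` and a precomputed cumulative noise-schedule tensor `alpha_bar`, and uses the standard DDPM forward-noising parameterization $\sqrt{\bar\alpha_k}\,x^0 + \sqrt{1-\bar\alpha_k}\,\epsilon$ (one specific choice of "noise with appropriate variance for step $k$"):

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alpha_bar):
    """One DDPM training step: MSE between the true noise and the predicted noise.

    eps_model : network predicting noise from (noisy sample, step index)
    x0        : batch of clean samples, shape (B, ...)
    alpha_bar : 1-D tensor of cumulative noise-schedule products, length K
    """
    B = x0.shape[0]
    k = torch.randint(0, len(alpha_bar), (B,), device=x0.device)  # random diffusion step per sample
    eps = torch.randn_like(x0)                                    # true noise
    ab = alpha_bar[k].view(B, *([1] * (x0.dim() - 1)))            # reshape for broadcasting over x0
    x_noisy = ab.sqrt() * x0 + (1 - ab).sqrt() * eps              # forward-noised sample x^k
    return F.mse_loss(eps_model(x_noisy, k), eps)                 # predicted vs. true noise
```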
- Now, the output $x$ represents robot actions instead of the typical image
- The denoising process is conditioned on the input observation $O_t$
- Action Diffusion approximates the conditional distribution $p(A|O)$ instead of the joint distribution $p(A, O)$
  - Thus, there is no need to infer future states/observations, which speeds up the process
- Our new denoising step is $A_t^{k-1} = \alpha \left( A_t^k - \gamma \epsilon_\theta(O_t, A_t^k, k) + \mathcal{N}(0, \sigma^2 I) \right)$, which is similarly the scaled previous action minus predicted noise, plus noise for non-deterministic action output
- Our new loss is the mean-squared difference between the predicted noise and the true noise added to the actions: $\mathcal{L} = MSE\big(\epsilon^k, \epsilon_\theta(O_t, A_t^0 + \epsilon^k, k)\big)$
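A minimal sketch of the conditional denoising loop above (again a toy example; `eps_model`, `alpha`, `gamma`, and `sigma` are assumed to come from a standard DDPM noise scheduler rather than from the repository's code):

```python
import torch

@torch.no_grad()
def sample_actions(eps_model, obs, action_shape, alpha, gamma, sigma, K):
    """Run K conditional denoising steps to turn Gaussian noise into an action sequence."""
    a = torch.randn(action_shape)                              # A_t^K ~ N(0, I)
    for k in reversed(range(K)):
        eps = eps_model(a, obs, k)                             # predicted noise, conditioned on O_t
        noise = sigma[k] * torch.randn_like(a) if k > 0 else 0.0
        a = alpha[k] * (a - gamma[k] * eps + noise)            # scaled (action - predicted noise + noise)
    return a                                                   # A_t^0, the denoised action sequence
```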
Like the Vision Transformer (ViT), the Diffusion Transformer (DiT) operates on a sequence of patches
- Patchify: converts spatial input into sequence of T tokens of dim d with linear embedding
- Apply standard ViT sine-cosine frequency-based positional embeddings to all input tokens
- Tokens processed by transformer block sequence
- Conditioning can optionally be added
- In-context conditioning: append vector embeddings of conditioning as two additional tokens in input sequence (after final block, remove conditioning tokens from sequence) (very little overhead)
- Cross-attention block: concatenate the conditioning embeddings and have the input tokens cross-attend to the conditioning tokens (roughly 15% overhead)
- Adaptive LayerNorm (adaLN) Block: scale and shift values determined by functions of conditioning (regressed from sum of conditioning embedding vectors), then applied to layer output (least overhead, same function to all tokens)
- adaLN-Zero Block: initialize the residual block as the identity function by zero-initializing the scale factor in the block (significantly outperforms adaLN; see the sketch at the end of this section)
- Transformer Decoder (maps image tokens to a noise prediction and a diagonal covariance prediction)
- standard linear decoder (apply final layer norm, linearly decode each token into 2 x same dim as spatial input)
- We predict covariance because different patches have different uncertainty/information levels
- Latent Diffusion Model: First, we learn autoencoder encoding of x into latent space of zs, then learn Diffusion Model in space zs, then generate with diffusion model and learned decoder
- Intuitively, this is a better Diffusion Model since it operates in a vastly lower latent dim than data dim
- Use standard VAE model from stable diffusion as encoder/decoder and train DiT in latent space
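To make the adaLN-Zero idea above concrete, here is a minimal sketch of a single modulated transformer block (a simplification for illustration, not the official DiT code): the shift/scale/gate parameters are regressed from the conditioning embedding `c`, and zero-initializing that regression makes each residual branch start as the identity.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Simplified DiT-style block: attention + MLP, modulated by a conditioning vector c."""
    def __init__(self, dim, n_heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        # Regress 6 modulation vectors (shift/scale/gate for attention and MLP) from c.
        self.ada = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)   # zero-init => gates start at 0,
        nn.init.zeros_(self.ada.bias)     # so each residual branch starts as the identity

    def forward(self, x, c):              # x: (B, T, dim) tokens, c: (B, dim) conditioning
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x
```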
1.2B-param language-conditioned bimanual manipulation with a vision foundation model, fine-tuned on a self-created dataset, 0-shot generalizable to new objects/scenes, and 1-5 demo learning of new skills!
Physically Interpretable Unified Action Space: same action representation for different robots to preserve universal semantics of actions for cross-embodiment knowledge transfer
The inputs at each timestep $t$ are:

- $\ell_t$ : language instruction
- $(X_{t-T_{img}+1}, \dots, X_t)$ : sequence of past RGB images of size $T_{img}$
- $z_t$ : the low-dimensional robot proprioception (action $a_t$ is some subset of this)
- $c$ : control frequency
Not enough hardware-specific data, so pre-train on multi-robot data using a unified action space for any robot hardware, then fine-tune on the specific robot hardware.
For model architecture, we need expressiveness for multi-modal action distribution and scalability for generalization.
- Encoding inputs (probabilistic masking to prevent overreliance on one modality)
- Proprioception / action chunk / control frequency (all low-dimensional) encoded with an MLP with Fourier features to capture high-frequency changes (learned)
- Images to compact representations with the image-text-aligned pre-trained vision encoder SigLIP (weights fixed)
- Language to embeddings with pre-trained Transformer language model T5-XXL (weights fixed)
- Diffusion Transformer (DiT) backbone modified:
- QKNorm: to avoid gradient instability in attention from dramatically changing values and different joint ranges of robot proprioception data
- normalizes the query and key matrices before computing attention scores
- RMSNorm: root-mean-square normalization, which divides each element by the square root of the mean of the squared elements (see the sketch after this list)
- instead of LayerNorm, which usually subtracts mean then normalizes by stdev then scales and shifts
- no centering operation (only normalizing) so no token/attention shift
- MLP Decoder: for nonlinear robot actions, replace final linear decoder with nonlinear MLP decoder
- Alternating Condition Injection: use cross-attention to accommodate the varied-length image text conditions
- As opposed to the typical class-label condition, which is compressed into a single token and injected via Adaptive LayerNorm (the class-label embeddings are inputs to a function that generates the scale and shift parameters for each layer)
- inject images and text at alternating layers, since there are usually far more image tokens than text tokens and they would overshadow the text tokens if injected simultaneously
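A minimal sketch of the two normalization tweaks above (illustrative only, not RDT-1B's actual implementation): RMSNorm rescales without centering, and QK-Norm normalizes the query and key matrices before the attention scores are computed.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale by the root-mean-square of the elements; no centering, unlike LayerNorm."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.weight * x / rms

def qk_norm_attention(q, k, v, q_norm: RMSNorm, k_norm: RMSNorm):
    """QK-Norm: normalize queries and keys before computing attention scores."""
    q, k = q_norm(q), k_norm(k)                              # keeps score magnitudes bounded
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5    # scaled dot-product scores
    return scores.softmax(dim=-1) @ v
```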
A Continuous Normalizing Flow (CNF) is a generative model for arbitrary probability paths (superset of paths modeled by Diffusion processes)
The goal of Flow Matching is to train CNFs by learning the vector fields that generate fixed conditional probability paths
- When applied to Diffusion paths, flow matching is a robust/stable training method
- Flow Matching can also use non-Diffusion probability paths like Optimal Transport
Start with pure noise, then follow the flow defined by a vector field to transform it into a sample from the target distribution.
Goal: In flow matching, we don’t have the vector field CNF model. We only have data points we know are desirable, and we want to create a vector field which will naturally flow any random point to those desirable points!
Essentially, we just need to learn the vector field! If we learn this, then we can sample any arbitrary point and follow the flow defined by the vector field to the desired probability distribution.
We have data from the desirable ending probability distribution $p_1$.
Given the target distribution $p_1$, the Flow Matching objective is $\mathcal{L}_{FM}(\theta) = \mathbb{E}_{t,\, p_t(x)} \left\| v_t(x; \theta) - u_t(x) \right\|^2$,
or the expected difference between our neural network predicting the vector field and the actual vector field.
At zero loss, the learned CNF model generates the target probability path $p_t$ (and therefore $p_1$).
The problem is that we do not know the true vector field $u_t$ or the marginal probability path $p_t$ in closed form.
However, we can use Conditional Flow Matching to approximate the vector field using sampling.
The Conditional Flow Matching objective is $\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t,\, q(x_1),\, p_t(x|x_1)} \left\| v_t(x; \theta) - u_t(x|x_1) \right\|^2$,
or the expected difference between our vector field and the per-sample conditional vector field, which we can estimate by sampling.
Essentially, we sample a data point $x_1$, sample $x$ from the conditional path $p_t(x|x_1)$, and regress $v_t(x)$ onto the known conditional field $u_t(x|x_1)$; this objective has the same gradients as the intractable Flow Matching objective.
So, at last, we have our final Flow Matching Method:
- Have a collection of data samples $x_1$, which are our desired instances sampled from some true desirable distribution $p_1$
- Sample from some standard normal $p_0$ (a bunch of random points weighted according to the normal distribution)
- Sample $x$ from $p_0$ and $x_1$ from $p_1$; now $p_t(x|x_1)$ is some path which interpolates between $x$ and $x_1$ (which WE DEFINE however we want, as long as it is continuous, so it is easy to sample from!)
- Sample a point along that path and compare our current neural-net vector field $v_t(x)$ with $u_t(x|x_1)$ at that point (we design $u_t(x|x_1)$; it can be as simple as $(x_1 - x)$, which always pushes points towards $x_1$, so we always know exactly what it is)
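Putting the recipe above into code, here is a minimal sketch of one Conditional Flow Matching training step (a toy example, not from the repository), using the straight-line conditional path $x_t = (1 - t)x_0 + t x_1$, for which the conditional target velocity is simply $x_1 - x_0$:

```python
import torch
import torch.nn.functional as F

def cfm_loss(v_model, x1):
    """One CFM training step on a batch of data samples x1 ~ p_1.

    v_model : network v_t(x) taking (x, t) and returning a vector field value
    x1      : batch of desired samples, shape (B, D)
    """
    x0 = torch.randn_like(x1)                           # samples from p_0 = N(0, I)
    t = torch.rand(x1.shape[0], 1, device=x1.device)    # t ~ U[0, 1]
    xt = (1 - t) * x0 + t * x1                          # point on the straight-line path p_t(x | x_1)
    target = x1 - x0                                    # conditional vector field u_t(x | x_1) for this path
    return F.mse_loss(v_model(xt, t), target)
```

After training, sampling is just integrating `v_model` from $t = 0$ to $t = 1$ starting from Gaussian noise, e.g. with a handful of Euler steps.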
The Conditional Flow Matching loss works with any conditional probability path. However, the best choice for each is:
- $p_t(x|x_1)$ : a Gaussian at each timestep whose mean $\mu_t(x_1)$ moves from $0$ to $x_1$ and whose stdev $\sigma_t(x_1)$ shrinks from $1$ to $\sigma_{\text{min}}$ (the final stdev around $x_1$ for $p_1$), so: $p_t(x|x_1) = \mathcal{N}\big(x \mid \mu_t(x_1),\, \sigma_t(x_1)^2 I\big)$
- $u_t(x|x_1)$ : a simple vector field which pushes points toward the means along the path $p_t$ and accounts for the shrinking variance: $u_t(x|x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}\big(x - \mu_t(x_1)\big) + \mu_t'(x_1)$
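As a sanity check of these formulas, here is a small sketch that instantiates them with the optimal-transport path from the flow matching paper, where $\mu_t(x_1) = t x_1$ and $\sigma_t(x_1) = 1 - (1 - \sigma_{\min}) t$ (the helper names and the $\sigma_{\min}$ value are my own choices):

```python
import torch

SIGMA_MIN = 1e-2  # assumed final stdev around x1

def ot_path_sample(x1, t):
    """Sample x ~ p_t(x | x_1) = N(mu_t(x1), sigma_t(x1)^2 I) for the OT path."""
    mu = t * x1                                  # mean moves linearly from 0 to x1
    sigma = 1 - (1 - SIGMA_MIN) * t              # stdev shrinks linearly from 1 to sigma_min
    return mu + sigma * torch.randn_like(x1)

def ot_target_field(x, x1, t):
    """u_t(x | x_1) = (x_1 - (1 - sigma_min) x) / (1 - (1 - sigma_min) t) for the OT path."""
    return (x1 - (1 - SIGMA_MIN) * x) / (1 - (1 - SIGMA_MIN) * t)
```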
Diffusion is a subset of flow matching. How do we get the Denoising Diffusion process from flow matching?
Variance Preserving: noise is added while preserving the total variance (modern diffusion with $\alpha$ scaling)
For Variance Preserving Diffusion, we choose $\mu_t(x_1) = \alpha_{1-t}\, x_1$ and $\sigma_t(x_1) = \sqrt{1 - \alpha_{1-t}^2}$, where $\alpha_t$ comes from the diffusion noise schedule.
We are combining the Diffusion conditional vector field with the Flow Matching objective, which they claim is better than the score matching objective. They argue that Diffusion technically never exactly reaches the true data points (it only converges to the data distribution in the limit), while the flow matching path arrives at the data at $t = 1$.
Essentially, it gives us much better theoretical guarantees!
A generalist robot foundation model consisting of an “action expert” which uses conditional flow matching to augment a pretrained Vision-Language Model (VLM).
We want to learn $p(A_t | o_t)$: the distribution over future action chunks $A_t = [a_t, a_{t+1}, \dots, a_{t+H-1}]$ given the current observation $o_t$ (images, language instruction, and proprioception).
As the VLM we use the open-source 3B-param VLM PaliGemma, and we train a 300M-param action expert initialized from scratch.
First, images are passed through vision transformers and then, together with the language tokens, through the pretrained 3B VLM. This output, along with proprioception and noise, is passed through the denoising action expert to output a sequence of future actions.
The architecture is also inspired by Transfusion, which trains a single transformer using multiple objectives. Unlike Transfusion, pi0 uses a separate set of weights for the robotics-specific tokens, giving two experts:
- VLM: for image and text inputs
- Action expert: robotics specific inputs/outputs such as proprioception and actions
The action expert uses a bidirectional attention mask so all action tokens attend to each other.
In training, language tokens are supervised by the standard cross-entropy loss, while the action tokens are supervised by the Conditional Flow Matching loss with a linear Gaussian probability path: $L^\tau(\theta) = \mathbb{E}_{p(A_t|o_t),\, q(A_t^\tau|A_t)} \left\| v_\theta(A_t^\tau, o_t) - u(A_t^\tau | A_t) \right\|^2$
Essentially, we want to minimize the expected difference between predicted VF and actual VF over actions conditioned on the current obs.
- First, we sample random noise $\epsilon \sim \mathcal{N}(0, I)$
- Then, we compute noisy actions $A_t^{\tau} = \tau A_t + (1 - \tau)\epsilon$
- Finally, we train the network output $v_\theta(A_t^\tau, o_t)$ to match the denoising vector field $u(A_t^\tau | A_t) = \epsilon - A_t$
At inference time, we generate actions by integrating the learned vector field from $\tau = 0$ to $\tau = 1$, starting from random noise $A_t^0 \sim \mathcal{N}(0, I)$ and taking a small number of forward Euler integration steps.
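A minimal sketch of both the training target and the inference-time integration (a toy example assuming a hypothetical action-expert network `v_model(noisy_actions, obs)`; this follows the equations above, not the actual pi0 implementation):

```python
import torch
import torch.nn.functional as F

def action_flow_matching_loss(v_model, actions, obs):
    """Conditional flow matching loss on an action chunk, conditioned on the observation.

    actions : action chunk A_t, shape (B, horizon, action_dim)
    """
    eps = torch.randn_like(actions)                                    # epsilon ~ N(0, I)
    tau = torch.rand(actions.shape[0], 1, 1, device=actions.device)    # flow time tau in [0, 1]
    noisy = tau * actions + (1 - tau) * eps                            # A_t^tau = tau*A_t + (1 - tau)*eps
    target = eps - actions                                             # u(A_t^tau | A_t) = eps - A_t
    return F.mse_loss(v_model(noisy, obs), target)

@torch.no_grad()
def integrate_actions(v_model, obs, action_shape, n_steps=10):
    """Integrate from tau = 0 (pure noise) to tau = 1 (actions) with forward Euler steps."""
    a = torch.randn(action_shape)                                      # A_t^0 ~ N(0, I)
    delta = 1.0 / n_steps
    for _ in range(n_steps):
        # With the path A_t^tau = tau*A_t + (1 - tau)*eps, the velocity toward the data
        # is A_t - eps = -u, so each Euler step moves against the learned denoising field.
        a = a - delta * v_model(a, obs)
    return a                                                           # approximately A_t^1, the action chunk
```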
We've gone from simple diffusion to conditional flow matching for robotic foundation models in pi0.
The above was a highly compressed version of what's in the repository. I encourage you to check it out starting at the beginning, as well as the original papers linked in each folder.