This directory contains a minimal working prototype that trains a reinforcement-learning policy entirely inside a pretrained UVA video-and-action world model, then evaluates it (and optionally finetunes the model) in a real, physics-based PushT environment.
```bash
# Activate the UVA conda env or create a new one
mamba env create -f conda_environment.yml   # if you have not installed UVA yet
conda activate uva

# Install extra requirements for this folder
pip install -r World-model/requirements.txt
```
A GPU with CUDA-enabled PyTorch is strongly recommended.
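A quick way to confirm that PyTorch sees a CUDA device:

```python
# Verify the PyTorch install and CUDA availability.
import torch

print(torch.__version__, torch.cuda.is_available())
```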
Train a PPO policy inside the UVA world model:

```bash
python World-model/train_policy.py \
    --checkpoint checkpoints/pusht.ckpt \
    --timesteps 1000000 \
    --logdir runs/ppo_uva
```

- `--checkpoint`: path to the pretrained UVA PushT checkpoint.
- `--timesteps`: total training timesteps (adjust as needed).
- `--logdir`: output directory for TensorBoard logs and policy files.
- The script creates a `UVAWorldModelEnv`, which serves as the simulator.
- Stable-Baselines3 PPO is used for training; logs are written to TensorBoard.
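A minimal sketch of that wiring, assuming `UVAWorldModelEnv` follows the standard Gym API and takes the checkpoint path as a constructor argument (the actual arguments live in `uva_world_env.py` and `train_policy.py`):

```python
# Simplified sketch of the training setup; see train_policy.py for the real argument parsing.
from stable_baselines3 import PPO

from uva_world_env import UVAWorldModelEnv  # the UVA model wrapped as a Gym env

# Assumed constructor signature; check uva_world_env.py for the actual one.
env = UVAWorldModelEnv(checkpoint="checkpoints/pusht.ckpt")

# All PPO rollouts happen inside the learned world model, not the real simulator.
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="runs/ppo_uva")
model.learn(total_timesteps=1_000_000)
model.save("runs/ppo_uva/trained_policy")
```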
Evaluate the trained policy in the real PushT environment:

```bash
python World-model/eval_policy_real.py \
    --policy runs/ppo_uva/trained_policy.zip \
    --episodes 20
```
Returns for each episode and the mean score are printed.
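Under the hood the evaluation loop is roughly the following sketch, assuming `RealPushTEnv` exposes the classic Gym `reset()`/`step()` interface (with `step` returning a 4-tuple); the script itself handles the exact observation format:

```python
# Sketch of eval_policy_real.py: roll out the trained policy in the physics-based env.
import numpy as np
from stable_baselines3 import PPO

from real_pusht_env import RealPushTEnv  # physics-based PushT environment

policy = PPO.load("runs/ppo_uva/trained_policy.zip")
env = RealPushTEnv()

returns = []
for episode in range(20):
    obs = env.reset()
    done, ep_return = False, 0.0
    while not done:
        action, _ = policy.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        ep_return += reward
    returns.append(ep_return)
    print(f"Episode {episode}: return {ep_return:.3f}")

print(f"Mean return over {len(returns)} episodes: {np.mean(returns):.3f}")
```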
Collect real-environment data and finetune the UVA model:

```bash
python World-model/finetune_uva.py \
    --checkpoint checkpoints/pusht.ckpt \
    --episodes 50 \
    --epochs 5 \
    --save finetuned_uva.ckpt
```

- `--episodes`: number of episodes of data to collect.
- `--epochs`: finetuning epochs on the collected data.
- `--save`: output path for the finetuned checkpoint.
The script currently collects random-policy trajectories to demonstrate the pipeline. Swap in your PPO policy to gather higher-quality data and finetune.
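Swapping in the trained policy mostly means replacing the random action sampling with `policy.predict`; a sketch under the same Gym-API assumption as above:

```python
# Sketch: collect finetuning trajectories with the trained PPO policy instead of random actions.
from stable_baselines3 import PPO

from real_pusht_env import RealPushTEnv

policy = PPO.load("runs/ppo_uva/trained_policy.zip")
env = RealPushTEnv()

trajectories = []
for _ in range(50):  # corresponds to --episodes
    obs = env.reset()
    episode, done = [], False
    while not done:
        action, _ = policy.predict(obs, deterministic=False)  # was: env.action_space.sample()
        next_obs, reward, done, info = env.step(action)
        episode.append((obs, action, reward, next_obs))
        obs = next_obs
    trajectories.append(episode)

# Hand `trajectories` to the finetuning loop in finetune_uva.py.
```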
| File | Purpose |
|---|---|
| `uva_world_env.py` | Wraps the UVA model as a Gym simulator (see the sketch after the table). |
| `train_policy.py` | PPO training script using the world model. |
| `real_pusht_env.py` | Physics-based PushT implementation for evaluation. |
| `eval_policy_real.py` | Runs a trained policy in the real env. |
| `finetune_uva.py` | Prototype for online finetuning of UVA with new data. |
| `requirements.txt` | Additional Python dependencies. |
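To make the first row concrete: the wrapper exposes the usual Gym interface while the transitions come from the pretrained model rather than a physics engine. The sketch below is purely illustrative; the class name, spaces, and model methods are hypothetical stand-ins for whatever `uva_world_env.py` actually implements.

```python
# Illustrative shape of a world-model Gym wrapper (not the real uva_world_env.py).
import gym
import numpy as np


class WorldModelEnvSketch(gym.Env):
    """Gym-style env whose dynamics come from a learned model instead of a simulator."""

    def __init__(self, model):
        self.model = model  # pretrained UVA model (hypothetical handle)
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(5,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
        self._obs = None

    def reset(self):
        self._obs = self.model.sample_initial_observation()  # hypothetical method
        return self._obs

    def step(self, action):
        # The model predicts the next observation, reward, and termination from (obs, action).
        self._obs, reward, done = self.model.predict_step(self._obs, action)  # hypothetical method
        return self._obs, reward, done, {}
```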
- Replace the random data collection in `finetune_uva.py` with policy rollouts.
- Move to real-robot datasets by swapping `RealPushTEnv` for your robotics interface.
- Experiment with larger UVA checkpoints or tasks beyond PushT.
See `World-model/README.md` for instructions on joint PPO + world-model finetuning, which mirrors the online-learning pipeline popularised by DayDreamer but starts from a pretrained UVA model instead of training the world model from scratch.
The joint pipeline now integrates a perfect replay buffer:
- Keeps the most recent 200 real-environment episodes.
- Samples random horizon-length chunks every gradient step.
- Ensures balanced, memory-bounded training data for continuous model updating.
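A minimal sketch of such a buffer, assuming episodes are stored as dicts of per-step arrays (the class and field names are illustrative, not the ones used in the scripts):

```python
# Sketch of a bounded episode replay buffer with random horizon-length chunk sampling.
import random
from collections import deque

import numpy as np


class EpisodeReplayBuffer:
    def __init__(self, capacity=200, horizon=16):
        self.episodes = deque(maxlen=capacity)  # drops the oldest episode once capacity is hit
        self.horizon = horizon

    def add_episode(self, episode):
        """episode: dict of per-step arrays, e.g. {"obs": [T, ...], "action": [T, ...]}."""
        self.episodes.append(episode)

    def sample_chunk(self):
        """Return a random horizon-length slice from a random stored episode."""
        episode = random.choice(self.episodes)
        length = len(episode["obs"])
        # Episodes shorter than the horizon simply yield shorter chunks.
        start = np.random.randint(0, max(1, length - self.horizon + 1))
        return {key: value[start:start + self.horizon] for key, value in episode.items()}
```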