We recommend using uv to manage the required dependencies. Please refer to the uv documentation for installation instructions.
To run this workflow, first clone this repository. Inside the repository, run:

```shell
uv sync
```

This will create a virtual environment in the `.venv` folder with all the required dependencies. To activate the virtual environment, run:

```shell
source .venv/bin/activate
```

Summary:
- Combines video encoder, temporal attention, spatial transformer, and decoder
- Encodes video into spatio-temporal patches
- Aggregates temporal information per spatial patch
- Mixes spatial features across patches
- Decodes back to original spatial resolution
Detailed process:
The model takes daily SST (or similar) data in video format, x ∈ ℝ^{B × 1 × T × H × W}, together with a daily_mask indicating missing pixels and a land_mask_patch indicating land regions in the output.
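For concreteness, the inputs above can be sketched as NumPy arrays. The shapes follow the B × 1 × T × H × W convention from the text; the concrete dimension values below are illustrative assumptions, not taken from the code:

```python
import numpy as np

# Example dimensions (assumed): batch, days per month, height, width
B, T, H, W = 2, 31, 64, 64

x = np.random.rand(B, 1, T, H, W).astype(np.float32)  # daily SST video
daily_mask = np.random.rand(B, 1, T, H, W) > 0.1      # True where a pixel is observed
land_mask_patch = np.zeros((H, W), dtype=bool)        # True over land in the output

# Missing pixels are typically zeroed (or otherwise filled) before encoding
x = np.where(daily_mask, x, 0.0)

print(x.shape)  # (2, 1, 31, 64, 64)
```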
- Patch embedding:
`X (VideoEncoder)---------> X_patch`
- Temporal aggregation:
Temporal attention summarizes daily patches into a monthly token per spatial location:
`X_patch (TemporalAttentionAggregator)---------> X_temp_agg`
- Add spatial encoding + spatial transformer:
Spatial transformer mixes information across all spatial patches:
`X_temp_agg + PE (SpatialTransformer)---------> X_mixed`
- Decode to original resolution:
Decoder upsamples tokens to full-resolution map, optionally masking land areas:
`X_mixed (MonthlyConvDecoder)---------> Output`
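The shape transformations through the four stages above can be traced end to end with NumPy stand-ins. This is only a sketch of the data flow: the random projection, temporal mean, and residual mixing below are placeholders for the actual VideoEncoder, TemporalAttentionAggregator, SpatialTransformer, and MonthlyConvDecoder modules, and the patch size P and embedding dimension D are assumed values:

```python
import numpy as np

B, T, H, W = 2, 31, 64, 64   # batch, days, height, width (example values)
P, D = 8, 16                 # patch size and embedding dim (assumptions)
Hp, Wp = H // P, W // P      # spatial patch grid

x = np.random.rand(B, 1, T, H, W).astype(np.float32)

# 1. Patch embedding (VideoEncoder stand-in): cut each frame into P x P
#    patches and project every patch to a D-dim token.
patches = x.reshape(B, T, Hp, P, Wp, P).transpose(0, 1, 2, 4, 3, 5)
patches = patches.reshape(B, T, Hp * Wp, P * P)
W_embed = np.random.rand(P * P, D).astype(np.float32)
x_patch = patches @ W_embed                  # (B, T, Hp*Wp, D)

# 2. Temporal aggregation stand-in: mean over days in place of attention,
#    yielding one monthly token per spatial patch.
x_temp_agg = x_patch.mean(axis=1)            # (B, Hp*Wp, D)

# 3. Spatial mixing stand-in: add a positional encoding, then mix
#    information across all patches (global mean as a residual).
pe = np.random.rand(Hp * Wp, D).astype(np.float32)
x_pe = x_temp_agg + pe
x_mixed = x_pe + x_pe.mean(axis=1, keepdims=True)   # (B, Hp*Wp, D)

# 4. Decoding stand-in (MonthlyConvDecoder): project each token back to
#    P*P pixels and unfold the patch grid to the full (H, W) map.
W_dec = np.random.rand(D, P * P).astype(np.float32)
out = (x_mixed @ W_dec).reshape(B, Hp, Wp, P, P).transpose(0, 1, 3, 2, 4)
out = out.reshape(B, 1, H, W)                # monthly output map

print(out.shape)  # (2, 1, 64, 64)
```

The land mask from the inputs would be applied at step 4, e.g. by zeroing `out` wherever land_mask_patch is True.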