Conversation

@javak87
Contributor

@javak87 javak87 commented Dec 29, 2025

Description

The issue described in #1399 suggests there’s a bottleneck when transferring data from CPU to GPU. Experiments show that the batch is not pinned even though the DataLoader is configured with pin_memory=True. Therefore, manual pinning is necessary to improve performance.
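
For illustration, a minimal sketch of the idea, assuming a generic batch of CPU tensors (the names below are placeholders, not the actual WeatherGenerator batch structure): the batch is assembled first, its tensors are pinned explicitly, and the host-to-device copy is then issued with non_blocking=True so it can overlap with compute.

import torch

def pin_and_transfer(batch: list[torch.Tensor], device: torch.device) -> list[torch.Tensor]:
    # Tensor.pin_memory() returns a pinned copy; that copy must be the one that
    # is transferred, otherwise the pinning has no effect.
    pinned = [t.pin_memory() if t.device.type == "cpu" else t for t in batch]
    # non_blocking=True only overlaps the copy with compute when the source is pinned.
    return [t.to(device, non_blocking=True) for t in pinned]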

Here are the profiling results after manual pinning on JWB:

[profiling screenshot: pinned_mem]

Here are the profiling results after manual pinning on Santis:

[profiling screenshot: pinned_mem]

Before manual pinning, the data transfer throughput was ~6 GB/s on JWB and 248 MB/s on Santis (!!!). As shown in the profiling, throughput on JWB increased to 17 GB/s, and on Santis it jumped to 360 GB/s. Achieving 360 GB/s is close to what we expect from the CPU–GPU NVLink on Santis. The maximum theoretical throughput on Santis is 450 GB/s.

To verify the manual pinning behavior, the code was also run with:

../WeatherGenerator-private/hpc/launch-slurm.py --time 180 --nodes=1

Here are the training time results:

| run_id   | HPC    | PR                                                   | Ingested Samples per GPU |
|----------|--------|------------------------------------------------------|--------------------------|
| kb9uki4x | Santis | develop (1 node, 180 mins)                           | 9366                     |
| lptxb12a | Santis | javad/dev/manual-mem-pinning-1399 (1 node, 180 mins) | 10320                    |

The performance check above was run against the develop branch as of 23 Dec. 2025.

As shown, Santis improved by ~10% in ingested samples per GPU (with the host-to-device transfer rate increasing from 248 MB/s to 360 GB/s), while there was no noticeable change on JWB.

Issue Number

Closes #1399

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@clessig
Collaborator

clessig commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

@javak87
Contributor Author

javak87 commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.

You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

@clessig
Collaborator

clessig commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.

You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)
. Then the batch is completed but it's still running in parallel.

@javak87
Contributor Author

javak87 commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

CPU RAM consumed (without pinning) for the first 10 batches:

Batch 0: RAM = 1694.7 MB
Batch 1: RAM = 3297.4 MB
Batch 2: RAM = 3371.8 MB
Batch 3: RAM = 3371.8 MB
Batch 4: RAM = 3374.2 MB
Batch 5: RAM = 3376.3 MB
Batch 6: RAM = 3376.3 MB
Batch 7: RAM = 3377.8 MB
Batch 8: RAM = 3379.9 MB
Batch 9: RAM = 3382.1 MB
Batch 10: RAM = 3383.6 MB

CPU RAM consumed (with pinning) for the first 10 batches:

Batch 0: RAM = 1093.4 MB
Batch 1: RAM = 3435.7 MB
Batch 2: RAM = 3441.2 MB
Batch 3: RAM = 3441.2 MB
Batch 4: RAM = 3443.7 MB
Batch 5: RAM = 3443.7 MB
Batch 6: RAM = 3445.7 MB
Batch 7: RAM = 3447.2 MB
Batch 8: RAM = 3449.6 MB
Batch 9: RAM = 3452.6 MB
Batch 10: RAM = 3454.5 MB

Pinning increases the CPU RAM usage by approximately 80 MB.
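
(For reference, a minimal sketch of how such per-batch numbers can be collected with psutil; this is illustrative, not necessarily the exact measurement code used here, and data_loader is a placeholder.)

import os

import psutil

proc = psutil.Process(os.getpid())
for batch_idx, batch in enumerate(data_loader):
    rss_mb = proc.memory_info().rss / 1024**2  # resident set size of this process
    print(f"Batch {batch_idx}: RAM = {rss_mb:.1f} MB")
    if batch_idx == 10:
        break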

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

Done.

@clessig
Collaborator

clessig commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

Done.

And does this change the performance behaviour? (Can you also lint the code.)

@tjhunter
Collaborator

Thanks @javak87 for the investigation. It is not entirely surprising, since the tensors sent to the GPU are heavily fragmented due to all the transforms; pinning forces assembling them first in a single memory-aligned page.

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

Done.

And does this change the performance behaviour? (Can you also lint the code.)

Regarding linting the code, I ran ruff, but because I need to import torch in packages/common/src/weathergen/common/io.py, I'm getting the following error:

 WARN /home/runner/work/WeatherGenerator/WeatherGenerator/packages/common/pyproject.toml: Extra keys found in config: ignores
ERROR Could not find import of `torch` [import-error]
  --> packages/common/src/weathergen/common/io.py:19:8
   |
19 | import torch
   |        ^^^^^
   |
  Looked in these locations (from config in `/home/runner/work/WeatherGenerator/WeatherGenerator/packages/common/pyproject.toml`):

I think packages/common/pyproject.toml should be changed.

@tjhunter do you have any idea how to solve this import error?

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

I made a mistake and ran a different branch. If I add pinning after this line in multi_stream_data_sampler.py, I get an error:

batch = self._get_batch(idx, forecast_dt)


The issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call .pin_memory() in the worker process (inside __iter__), it attempts to access CUDA, but the CUDA context hasn’t been properly initialized in that worker. Therefore, pinning needs to be done in the main process, where CUDA is already initialized correctly.
I think the initial setup was correct.
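
A small sketch of the constraint (torch.utils.data.get_worker_info() returns None in the main process and a WorkerInfo object inside a DataLoader worker; the helper below is hypothetical):

import torch
from torch.utils.data import get_worker_info

def maybe_pin(t: torch.Tensor) -> torch.Tensor:
    # Inside a forked DataLoader worker, pin_memory() would touch the CUDA
    # context inherited from the parent, which fails after fork (as described
    # above), so pinning is only attempted in the main process.
    if get_worker_info() is None and t.device.type == "cpu":
        return t.pin_memory()  # returns a pinned copy of the tensor
    return t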

Collaborator

@tjhunter tjhunter left a comment

@javak87 this looks very helpful, and you can make your life much easier. All you need is to traverse the data structures to trigger a side effect (we are not dealing with async optimizations yet). Also, the Protocol concept in Python is exactly for that purpose.

Define the protocol and the traversal function in a pin.py module:

from typing import Protocol, runtime_checkable
import torch
from weathergen.common.io import IOReaderData

@runtime_checkable
class Pinnable(Protocol):
    """
    Protocol that allows the pytorch content of this data structure 
    to be pinned to the memory of the current accelerator.

    This extends the pin_memory() capability of a torch Tensor 
    to other classes.

    It is blocking.
    """
    def pin_memory(self): ...


def pin_object(obj: Pinnable | torch.Tensor | IOReaderData | list | dict | None):
    if obj is None:
        return
    elif isinstance(obj, torch.Tensor) and obj.numel() > 0:
        # Tensor.pin_memory() returns a pinned copy, so rebind the tensor's
        # storage to the copy; otherwise the original stays pageable.
        obj.data = obj.pin_memory()
    elif isinstance(obj, Pinnable):
        obj.pin_memory()
    elif isinstance(obj, IOReaderData):
        # Special case for that class because it is in common
        # Should not be the case, it is a numpy array
        pin_object(obj.coords)
        ...
    elif isinstance(obj, list):
        # Assume the list is a list of potentially pinnable objects and traverse it.
        for e in obj:
            pin_object(e)
    elif isinstance(obj, dict):
        # Assume the values are pinnable.
        for e in obj.values():
            pin_object(e)

and then the changes in each class are very tiny:

from weathergen.datasets.pin import Pinnable, pin_object
...
class Sample(Pinnable):
...
    def pin_memory(self):
        pin_object(self.streams_data)
        pin_object(self.meta_info)

No need to do more checks for attributes etc., this is all done for you by the protocol.
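
For concreteness, a small usage sketch in the main process, right before the host-to-device copy (assuming a CUDA-capable node and the pin.py module above; the batch contents are placeholders, not the real WeatherGenerator batch):

import torch

from weathergen.datasets.pin import pin_object

# Placeholder batch; the real batch is a nested structure of Samples and tensors.
batch = {"source": torch.randn(8, 1024), "target": torch.randn(8, 256)}

pin_object(batch)  # traverses the dict and pins each tensor in place
assert all(t.is_pinned() for t in batch.values())

# With pinned source memory, non_blocking=True lets the copy overlap with compute.
gpu_batch = {k: t.to("cuda", non_blocking=True) for k, t in batch.items()}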

@tjhunter
Collaborator

Also, using a protocol has the advantage of clearly documenting all the classes which deal with memory pinning.

@tjhunter
Collaborator

The issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call .pin_memory() in the worker process (inside iter), it attempts to access CUDA, but the CUDA context hasn’t been properly initialized in that worker. Therefore, pinning needs to be done in the main process, where CUDA is already initialized correctly.
I think the initial setup was correct.

Interesting, that sounds like a bug on the torch side (at least regarding pinning the CPU memory; I can imagine that they take shortcuts and mix CPU and GPU logic).

@clessig
Collaborator

clessig commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

I made a mistake and ran a different branch. If I add pinning after this line in multi_stream_data_sampler.py, I get an error:

batch = self._get_batch(idx, forecast_dt)

The issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call .pin_memory() in the worker process (inside __iter__), it attempts to access CUDA, but the CUDA context hasn’t been properly initialized in that worker. Therefore, pinning needs to be done in the main process, where CUDA is already initialized correctly. I think the initial setup was correct.

Yes, that's what I would have expected (since CUDA doesn't work in the parallel processes), but it was worth a try. But then we should still do it as early as possible in Trainer.train() and Trainer.validate().

torch should at least generate a warning.

@javak87
Contributor Author

javak87 commented Dec 30, 2025

@clessig
Since pinning is now done manually and the DataLoader's pin_memory=True isn't working with the current setup, I think it's better to remove that flag.

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Also, using a protocol has the advantage of clearly documenting all the classes which deal with memory pinning.

Thanks for your suggestion — it’s pretty convenient.

@sophie-xhonneux
Contributor

DINOv2 hangs with FSDP2 and your memory pinning, but I don't know why.

Everything else worked in my testing (integration tests, JEPA, Physical modelling with and without FSDP2).

@clessig
Collaborator

clessig commented Jan 6, 2026

DINOv2 hangs with FSDP2 and your memory pinning, but I don't know why.

Everything else worked in my testing (integration tests, JEPA, Physical modelling with and without FSDP2).

Do you know where it is hanging? Is there any log? Eventually it should time out and point you to the location where it hangs.

@sophie-xhonneux
Contributor

I waited for over 10 minutes and got no error

@javak87
Contributor Author

javak87 commented Jan 6, 2026

I waited for over 10 minutes and got no error

Could you share the configuration you’re using to run the code?
