Conversation

@javak87
Contributor

@javak87 javak87 commented Dec 29, 2025

Description

The issue described in #1399 suggests there’s a bottleneck when transferring data from CPU to GPU. Experiments show that the batch is not pinned even though the DataLoader is configured with pin_memory=True. Therefore, manual pinning is necessary to improve performance.
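
For illustration, a minimal sketch of the idea, assuming a generic batch of CPU tensors (the names below are placeholders, not the actual WeatherGenerator batch structure): the batch is assembled first, its tensors are pinned explicitly, and the host-to-device copy is then issued with non_blocking=True so it can overlap with compute.

import torch

def pin_and_transfer(batch: list[torch.Tensor], device: torch.device) -> list[torch.Tensor]:
    # Tensor.pin_memory() returns a pinned copy; that copy must be the one that
    # is transferred, otherwise the pinning has no effect.
    pinned = [t.pin_memory() if t.device.type == "cpu" else t for t in batch]
    # non_blocking=True only overlaps the copy with compute when the source is pinned.
    return [t.to(device, non_blocking=True) for t in pinned]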

Here are the profiling results after manual pinning on JWB:

[profiling screenshot: pinned_mem]

Here are the profiling results after manual pinning on Santis:

[profiling screenshot: pinned_mem]

Before manual pinning, the data transfer throughput was ~6 GB/s on JWB and 248 MB/s on Santis (!!!). As shown in the profiling, throughput on JWB increased to 17 GB/s, and on Santis it jumped to 360 GB/s. Achieving 360 GB/s is close to what we expect from the CPU–GPU NVLink on Santis. The maximum theoretical throughput on Santis is 450 GB/s.

To verify the manual pinning behavior, the code was also run with:

../WeatherGenerator-private/hpc/launch-slurm.py --time 180 --nodes=1

Here are the training time results:

| run_id   | HPC    | PR                                                   | Ingested Samples per GPU |
|----------|--------|------------------------------------------------------|--------------------------|
| kb9uki4x | Santis | develop (1 node, 180 mins)                           | 9366                     |
| lptxb12a | Santis | javad/dev/manual-mem-pinning-1399 (1 node, 180 mins) | 10320                    |

The performance check above was run against the develop branch as of 23 Dec. 2025.

As shown, Santis improved by ~10% in ingested samples per GPU (with the host-to-device transfer rate increasing from 248 MB/s to 360 GB/s), while there was no noticeable change on JWB.

Issue Number

Closes #1399

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@clessig
Collaborator

clessig commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

@javak87
Contributor Author

javak87 commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.

You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

@clessig
Collaborator

clessig commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.

You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)
. Then the batch is completed but it's still running in parallel.

@javak87
Contributor Author

javak87 commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

CPU RAM consumed (without pinning) for the first 10 batches:

Batch 0: RAM = 1694.7 MB
Batch 1: RAM = 3297.4 MB
Batch 2: RAM = 3371.8 MB
Batch 3: RAM = 3371.8 MB
Batch 4: RAM = 3374.2 MB
Batch 5: RAM = 3376.3 MB
Batch 6: RAM = 3376.3 MB
Batch 7: RAM = 3377.8 MB
Batch 8: RAM = 3379.9 MB
Batch 9: RAM = 3382.1 MB
Batch 10: RAM = 3383.6 MB

CPU RAM consumed (with pinning) for the first 10 batches:

Batch 0: RAM = 1093.4 MB
Batch 1: RAM = 3435.7 MB
Batch 2: RAM = 3441.2 MB
Batch 3: RAM = 3441.2 MB
Batch 4: RAM = 3443.7 MB
Batch 5: RAM = 3443.7 MB
Batch 6: RAM = 3445.7 MB
Batch 7: RAM = 3447.2 MB
Batch 8: RAM = 3449.6 MB
Batch 9: RAM = 3452.6 MB
Batch 10: RAM = 3454.5 MB

Pinning increases the CPU RAM usage by approximately 80 MB.
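
(For reference, a minimal sketch of how such per-batch numbers can be collected with psutil; this is illustrative, not necessarily the exact measurement code used here, and data_loader is a placeholder.)

import os

import psutil

proc = psutil.Process(os.getpid())
for batch_idx, batch in enumerate(data_loader):
    rss_mb = proc.memory_info().rss / 1024**2  # resident set size of this process
    print(f"Batch {batch_idx}: RAM = {rss_mb:.1f} MB")
    if batch_idx == 10:
        break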

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

Done.

@clessig
Collaborator

clessig commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

Done.

And does this change the performance behaviour? (Can you also lint the code.)

@tjhunter
Collaborator

Thanks @javak87 for the investigation. It is not entirely surprising, since the tensors sent to the GPU are heavily fragmented due to all the transforms; pinning forces assembling them first in a single memory-aligned page.

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

Done.

And does this change the performance behaviour? (Can you also lint the code.)

Regarding linting the code, I ran ruff, but because I need to import torch in packages/common/src/weathergen/common/io.py, I'm getting the following error:

 WARN /home/runner/work/WeatherGenerator/WeatherGenerator/packages/common/pyproject.toml: Extra keys found in config: ignores
ERROR Could not find import of `torch` [import-error]
  --> packages/common/src/weathergen/common/io.py:19:8
   |
19 | import torch
   |        ^^^^^
   |
  Looked in these locations (from config in `/home/runner/work/WeatherGenerator/WeatherGenerator/packages/common/pyproject.toml`):

I think packages/common/pyproject.toml should be changed.

@tjhunter do you have any idea how to solve this import error?

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

I made a mistake and ran a different branch. If I add pinning after this line in multi_stream_data_sampler.py, I get an error:

batch = self._get_batch(idx, forecast_dt)


The issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call .pin_memory() in the worker process (inside __iter__), it attempts to access CUDA, but the CUDA context hasn’t been properly initialized in that worker. Therefore, pinning needs to be done in the main process, where CUDA is already initialized correctly.
I think the initial setup was correct.
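
A small sketch of the constraint (torch.utils.data.get_worker_info() returns None in the main process and a WorkerInfo object inside a DataLoader worker; the helper below is hypothetical):

import torch
from torch.utils.data import get_worker_info

def maybe_pin(t: torch.Tensor) -> torch.Tensor:
    # Inside a forked DataLoader worker, pin_memory() would touch the CUDA
    # context inherited from the parent, which fails after fork (as described
    # above), so pinning is only attempted in the main process.
    if get_worker_info() is None and t.device.type == "cpu":
        return t.pin_memory()  # returns a pinned copy of the tensor
    return t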

Collaborator

@tjhunter tjhunter left a comment

@javak87 this looks very helpful, and you can make your life much easier. All you need is to traverse the data structures to trigger a side effect (we are not dealing with async optimizations yet). Also, the Protocol concept in Python is exactly for that purpose.

Define the protocol and the traversal function in a pin.py module:

from typing import Protocol, runtime_checkable
import torch
from weathergen.common.io import IOReaderData

@runtime_checkable
class Pinnable(Protocol):
    """
    Protocol that allows the pytorch content of this data structure 
    to be pinned to the memory of the current accelerator.

    This extends the pin_memory() capability of a torch Tensor 
    to other classes.

    It is blocking.
    """
    def pin_memory(self): ...


def pin_object(obj: Pinnable | torch.Tensor | IOReaderData | list | dict | None):
    if obj is None:
        return
    elif isinstance(obj, torch.Tensor) and obj.numel() > 0:
        # Tensor.pin_memory() returns a pinned copy, so rebind the tensor's
        # storage to the copy; otherwise the original stays pageable.
        obj.data = obj.pin_memory()
    elif isinstance(obj, Pinnable):
        obj.pin_memory()
    elif isinstance(obj, IOReaderData):
        # Special case for that class because it is in common
        # Should not be the case, it is a numpy array
        pin_object(obj.coords)
        ...
    elif isinstance(obj, list):
        # Assume the list is a list of potentially pinnable objects and traverse it.
        for e in obj:
            pin_object(e)
    elif isinstance(obj, dict):
        # Assume the values are pinnable.
        for e in obj.values():
            pin_object(e)

and then the changes in each class are very tiny:

from weathergen.datasets.pin import Pinnable, pin_object
...
class Sample(Pinnable):
...
    def pin_memory(self):
        pin_object(self.streams_data)
        pin_object(self.meta_info)

No need to do more checks for attributes etc., this is all done for you by the protocol.
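
For concreteness, a small usage sketch in the main process, right before the host-to-device copy (assuming a CUDA-capable node and the pin.py module above; the batch contents are placeholders, not the real WeatherGenerator batch):

import torch

from weathergen.datasets.pin import pin_object

# Placeholder batch; the real batch is a nested structure of Samples and tensors.
batch = {"source": torch.randn(8, 1024), "target": torch.randn(8, 256)}

pin_object(batch)  # traverses the dict and pins each tensor in place
assert all(t.is_pinned() for t in batch.values())

# With pinned source memory, non_blocking=True lets the copy overlap with compute.
gpu_batch = {k: t.to("cuda", non_blocking=True) for k, t in batch.items()}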

@tjhunter
Collaborator

Also, using a protocol has the advantage of clearly documenting all the classes which deal with memory pinning.

@tjhunter
Collaborator

The issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call .pin_memory() in the worker process (inside iter), it attempts to access CUDA, but the CUDA context hasn’t been properly initialized in that worker. Therefore, pinning needs to be done in the main process, where CUDA is already initialized correctly.
I think the initial setup was correct.

Interesting, that sounds like a bug on the torch side (at least regarding pinning the CPU memory; I can imagine that they take shortcuts and mix CPU and GPU logic).

@clessig
Collaborator

clessig commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

I made a mistake and ran a different branch. If I add pinning after this line in multi_stream_data_sampler.py, I get an error:

batch = self._get_batch(idx, forecast_dt)

The issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call .pin_memory() in the worker process (inside __iter__), it attempts to access CUDA, but the CUDA context hasn’t been properly initialized in that worker. Therefore, pinning needs to be done in the main process, where CUDA is already initialized correctly. I think the initial setup was correct.

Yes, that's what I would have expected (since CUDA doesn't work in the parallel processes), but it was worth a try. But then we should still do it as early as possible in Trainer.train() and Trainer.validate().

torch should at least generate a warning.

@javak87
Contributor Author

javak87 commented Dec 30, 2025

@clessig
Since pinning is now done manually and the DataLoader's pin_memory=True isn't working with the current setup, I think it's better to remove that flag.

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Also, using a protocol has the advantage of clearly documenting all the classes which deal with memory pinning.

Thanks for your suggestion — it’s pretty convenient.

@sophie-xhonneux
Contributor

DINOv2 hangs with FSDP2 and your memory pinning, but I don't know why.

Everything else worked in my testing (integration tests, JEPA, Physical modelling with and without FSDP2).

@clessig
Collaborator

clessig commented Jan 6, 2026

DINOv2 hangs with FSDP2 and your memory pinning, but I don't know why.

Everything else worked in my testing (integration tests, JEPA, Physical modelling with and without FSDP2).

Do you know where it is hanging? Is there any log? Eventually it should time out and point you to the location where it hangs.

@sophie-xhonneux
Contributor

I waited for over 10 minutes and got no error

@javak87
Contributor Author

javak87 commented Jan 6, 2026

I waited for over 10 minutes and got no error

Could you share the configuration you’re using to run the code?
