Manual memory pinning #1533
base: develop
Conversation
Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler, when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there, but maybe to pinned memory)? Also, one problem with pinned memory is that it reduces the available CPU RAM. Did you observe reduced available CPU RAM?
Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It's recommended to first assemble the full batch and then pin the batch afterward. You're right that this increases the CPU RAM footprint, but I haven't measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don't think it will have a significant impact.
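A minimal sketch of the two options being discussed (variable names are illustrative and not the sampler's actual code; it assumes a CUDA-capable machine, since `Tensor.pin_memory()` needs an accelerator). Note that `pin_memory()` returns a pinned copy rather than pinning in place:

```python
import torch

# Illustrative fragments standing in for the per-stream tensors produced by
# the transforms; the real MultiStreamDataSampler assembles richer structures.
fragments = [torch.randn(1024, 256) for _ in range(8)]

# Pinning each fragment separately creates many small page-locked allocations.
pinned_fragments = [t.pin_memory() for t in fragments]

# Assembling the full batch first and pinning once gives a single contiguous
# page-locked buffer, which is what the comment above suggests.
batch = torch.cat(fragments, dim=0).pin_memory()

if torch.cuda.is_available():
    # Pinned memory allows the host-to-device copy to be asynchronous.
    batch_gpu = batch.to("cuda", non_blocking=True)
```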
Can you try to pin here:
[Screenshot: CPU RAM consumed (without pinning) for the first 10 batches]
[Screenshot: CPU RAM consumed (with pinning) for the first 10 batches]
Pinning increases the CPU RAM by approximately 80 MB.
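For reference, one way such per-batch RAM numbers can be obtained (a sketch with a synthetic batch on a CUDA-capable node, not the measurement script actually used here) is to read the process RSS via psutil around the pinning call:

```python
import psutil
import torch

proc = psutil.Process()

def rss_mb() -> float:
    # Resident set size of the current process, in MB.
    return proc.memory_info().rss / 1e6

batch = torch.randn(64, 1024, 1024)   # synthetic batch, roughly 256 MB of float32

before = rss_mb()
pinned = batch.pin_memory()           # allocates a page-locked copy on the host
after = rss_mb()

print(f"RSS before pinning: {before:.1f} MB, after pinning: {after:.1f} MB")
```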
Done.
And does this change the performance behaviour? (Can you also lint the code?)
Thanks @javak87 for the investigation. It is not entirely surprising, since the tensors sent to the GPU are heavily fragmented due to all the transforms. Pinning forces assembling them first in a single memory-aligned page.
Regarding running ruff on the code, I did it, but I'm hitting an import error. @tjhunter, do you have any idea how to solve it?
I made a mistake and ran a different branch. If I add pinning after this line in [...], the issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call [...]
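A small, self-contained illustration of the mechanism torch itself offers here (not this PR's code): with `pin_memory=True`, the DataLoader pins the collated batch in a thread of the main process, so the forked workers never need to touch CUDA. On a CPU-only machine torch only warns and skips pinning.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in dataset; the real sampler returns much richer structures."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return {"x": torch.randn(16, 32)}

if __name__ == "__main__":
    # With pin_memory=True, the DataLoader pins the collated batch in a
    # dedicated thread of the *main* process, after the (possibly forked)
    # worker processes have produced it, so the workers never touch CUDA.
    loader = DataLoader(ToyDataset(), batch_size=4, num_workers=2, pin_memory=True)

    batch = next(iter(loader))
    # On a machine with a GPU this prints True for plain tensor/dict batches;
    # custom batch classes are only pinned if they define pin_memory()
    # themselves, which is what the protocol proposed below formalizes.
    print(batch["x"].is_pinned())
```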
@javak87 this looks very helpful, and you can make your life much easier. All you need is to traverse the data structures to trigger a side effect (we are not dealing with async optimizations yet). Also, the Protocol concept in Python is exactly for that purpose.
Define the protocol and the traversal function in a pin.py module:
```python
from typing import Protocol, runtime_checkable

import torch

from weathergen.common.io import IOReaderData


@runtime_checkable
class Pinnable(Protocol):
    """
    Protocol that allows the pytorch content of this data structure
    to be pinned to the memory of the current accelerator.
    This extends the pin_memory() capability of a torch Tensor
    to other classes.
    It is blocking.
    """

    def pin_memory(self): ...


def pin_object(obj: Pinnable | torch.Tensor | IOReaderData | list | dict | None):
    if obj is None:
        return
    elif isinstance(obj, torch.Tensor) and obj.numel() > 0:
        obj.pin_memory()
    elif isinstance(obj, Pinnable):
        obj.pin_memory()
    elif isinstance(obj, IOReaderData):
        # Special case for that class because it is in common.
        # Should not be the case, it is a numpy array.
        pin_object(obj.coords)
        ...
    elif isinstance(obj, list):
        # Assume the list is a list of potentially pinnable objects and traverse it.
        for e in obj:
            pin_object(e)
    elif isinstance(obj, dict):
        # Assume the values are pinnable.
        for e in obj.values():
            pin_object(e)
```

and then the changes in each class are very tiny:

```python
from weathergen.datasets.pin import Pinnable, pin_object

...

class Sample(Pinnable):
    ...

    def pin_memory(self):
        pin_object(self.streams_data)
        pin_object(self.meta_info)
```

No need to do more checks for attributes etc., this is all done for you by the protocol.
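For what it's worth, a tiny illustration of why no registration or attribute checks are needed (`MetaInfo` is a made-up class, and the import assumes the pin.py module sketched above lives at weathergen/datasets/pin.py): `runtime_checkable` turns `isinstance()` into a structural check for the presence of `pin_memory()`.

```python
from weathergen.datasets.pin import Pinnable, pin_object  # module proposed above

class MetaInfo:
    """Made-up class: never declared as Pinnable anywhere."""

    def __init__(self, tensor=None):
        self.tensor = tensor

    def pin_memory(self):
        # Delegate to the shared traversal helper.
        pin_object(self.tensor)

# Structural check: MetaInfo counts as Pinnable simply because it defines
# pin_memory(); no inheritance or registration is required.
assert isinstance(MetaInfo(), Pinnable)
```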
Also, using a protocol has the advantage of clearly documenting all the classes which deal with memory pinning.
Interesting, that sounds like a bug on the torch side (at least regarding pinning the CPU memory; I can imagine that they take shortcuts and mix CPU and GPU logic).
Yes, it's what I would have expected (since CUDA doesn't work in the parallel processes), but it was worth a try. But then we should still do it as early as possible in Trainer.train() and Trainer.validate(). torch should at least generate a warning.
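A paraphrase of where that early pinning call would sit (this is not the actual Trainer code; the import assumes the pin.py module proposed above):

```python
from weathergen.datasets.pin import pin_object  # module proposed above (assumption)

def train_epoch(model, dataloader, device="cuda"):
    # Sketch of a training loop: pin the batch in the main process right after
    # the DataLoader hands it over, before any further processing, so later
    # host-to-device copies can be issued with non_blocking=True.
    model.train()
    for batch in dataloader:
        pin_object(batch)
        # ... move batch to `device`, forward, backward, optimizer step ...
```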
@clessig
Thanks for your suggestion; it's pretty convenient.
DINOv2 hangs with FSDP2 and your memory pinning, but I don't know why. Everything else worked in my testing (integration tests, JEPA, physical modelling with and without FSDP2).
Do you know where it is hanging? Is there any log? Eventually it should time out and point you to the location where it hangs. |
I waited for over 10 minutes and got no error.
Could you share the configuration you’re using to run the code? |
Description
The issue described in #1399 suggests there's a bottleneck when transferring data from CPU to GPU. Experiments show that the batch is not pinned even though the DataLoader is configured with pin_memory=True. Therefore, manual pinning is necessary to improve performance.

Here are the profiling results after manual pinning on JWB:
Here are the profiling results after manual pinning on Santis:
Before manual pinning, the data transfer throughput was ~6 GB/s on JWB and 248 MB/s on Santis (!!!). As shown in the profiling, throughput on JWB increased to 17 GB/s, and on Santis it jumped to 360 GB/s. Achieving 360 GB/s is close to what we expect from the CPU–GPU NVLink on Santis. The maximum theoretical throughput on Santis is 450 GB/s.

To verify the manual pinning behavior, the code was also run with:

../WeatherGenerator-private/hpc/launch-slurm.py --time 180 --nodes=1

Here are the training time results:
The above performance check was run against the develop branch as of 23 Dec. 2026.
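As a side note, whether a host tensor is actually pinned, and the resulting copy bandwidth, can be checked with a few lines of torch (illustrative only; this is not the profiler used for the numbers above):

```python
import time
import torch

if torch.cuda.is_available():               # pinning needs a CUDA-capable node
    x = torch.randn(1024, 1024, 256)        # ~1 GB of float32 on the host
    x_pinned = x.pin_memory()               # returns a pinned copy
    print(x.is_pinned(), x_pinned.is_pinned())  # False True

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    x_pinned.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    print(f"host-to-device throughput: {x_pinned.nbytes / dt / 1e9:.1f} GB/s")
```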
As shown, Santis performance improved by ~10% (with throughput increasing from 248 MB/s to 360 GB/s), while there was no noticeable change on JWB.

Issue Number
Closes #1399
Is this PR a draft? Mark it as draft.
Checklist before asking for review
- ./scripts/actions.sh lint
- ./scripts/actions.sh unit-test
- ./scripts/actions.sh integration-test
- launch-slurm.py --time 60