
--tracer.raymarch-type voxel uses too much VRAM, which triggers OutOfMemoryError #193

Open
@barikata1984

Description

While investigating #192, I noticed that --tracer.raymarch-type voxel triggers a torch.cuda.OutOfMemoryError, as shown below:

... (earlier traceback lines omitted)
  File "/home/atsushi/workspace/wisp211/wisp/tracers/packed_rf_tracer.py", line 130, in trace
    hit_ray_d = rays.dirs.index_select(0, ridx)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.15 GiB (GPU 0; 11.69 GiB total capacity; 10.22 GiB already allocated; 133.44 MiB free; 10.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
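
As an aside, I know the allocator hint from the error message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation. A minimal sketch of what I mean (the 512 MiB split size is just an example value, not something from the wisp docs):

    import os

    # PYTORCH_CUDA_ALLOC_CONF is read by PyTorch's caching allocator; set it
    # before the first CUDA allocation (safest: before importing torch).
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"  # example value

    import torch  # imported after setting the env var on purpose

That would only help with fragmentation, though, and here the memory appears to be genuinely exhausted.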
❯ nvidia-smi
Sat Jun 29 01:30:32 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti     Off |   00000000:01:00.0  On |                  N/A |
|  0%   40C    P8             14W /  285W |     848MiB /  12282MiB |     41%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1750      G   /usr/lib/xorg/Xorg                            416MiB |
|    0   N/A  N/A      1943    C+G   ...libexec/gnome-remote-desktop-daemon        195MiB |
|    0   N/A  N/A      1995      G   /usr/bin/gnome-shell                           98MiB |
|    0   N/A  N/A      5488      G   ...57,262144 --variations-seed-version        109MiB |
|    0   N/A  N/A      8436      G   /app/bin/wezterm-gui                            9MiB |
+-----------------------------------------------------------------------------------------+

As you can see, the tracer tries to allocate 4.15 GiB while 10.22 GiB are already allocated. I observed similar results regardless of whether the interactive app is loaded. At first I thought other applications were simply using a lot of VRAM, so I checked by running nvidia-smi immediately after trying to train a NeRF. As shown above, however, less than 1 GiB is in use. My assumption is that the NeRF app sequentially allocates large chunks of VRAM and fails at some point. Does anybody know a potential cause of this issue?
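
If it helps to narrow this down, a rough way to check the sequential-allocation theory would be to log allocator statistics just before the failing index_select in packed_rf_tracer.py (around line 130 in the traceback above). A minimal helper sketch; the placement and tag name are only illustrative:

    import torch

    def log_cuda_mem(tag: str) -> None:
        # Report how much VRAM the PyTorch caching allocator currently holds,
        # to see how close each trace() call gets to the OOM point.
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

    # e.g. call log_cuda_mem("before index_select") right before
    #   hit_ray_d = rays.dirs.index_select(0, ridx)
    # in wisp/tracers/packed_rf_tracer.py

Printing ridx.numel() at the same point would also show how many packed samples the voxel raymarcher produces, which is presumably what drives the 4.15 GiB request.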

Thanks in advance!
