system crash - two 4090 with ExLlamav2_HF #6451

fschiro · 2024-10-09T15:05:56Z

Describe the bug

When running text-generation-web-ui, my system shuts off instantly, as if the power had been cut. The log files are empty around the time of the crash. It is reproducible: roughly one out of every ten messages sent through the text-generation-web-ui API crashes the system.

I have two 4090s running a model with ExLlamav2_HF, with max_seq_len 16000 in autosplit mode.

Monitoring nvitop before the crash, I see that each GPU usually draws around 200 W during inference, but sometimes jumps to around 424 W.
GPU memory is nearly maxed out: 23/23.99 GiB on GPU 1 and 21.8/23.99 GiB on GPU 2.

My power supply should be able to handle this load, as it is rated for 1600 W.
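As a rough check on that reasoning, the sustained power budget can be sketched as below. The GPU figure and PSU rating come from the readings above; the CPU and rest-of-system figures are assumptions, not measurements, and the function name is purely illustrative. Note that this only covers the averaged readings a monitoring tool reports. Very short power spikes would not show up in them, so a comfortable result here does not rule out a power problem.

```python
# Figures taken from the report above.
GPU_PEAK_W = 424      # highest per-GPU draw seen in nvitop
GPU_COUNT = 2
PSU_RATED_W = 1600

# Assumptions, not measurements: adjust for the actual system.
CPU_ASSUMED_W = 230   # assumed worst-case package power for the Ryzen 9 7950X
REST_ASSUMED_W = 150  # guess for motherboard, RAM, drives, fans


def sustained_load_w(gpu_peak=GPU_PEAK_W, gpu_count=GPU_COUNT,
                     cpu=CPU_ASSUMED_W, rest=REST_ASSUMED_W):
    """Sum the sustained draw of all components, in watts."""
    return gpu_peak * gpu_count + cpu + rest


load_w = sustained_load_w()
headroom_w = PSU_RATED_W - load_w
```

Under these assumptions `load_w` is 1228 W and `headroom_w` is 372 W, so the sustained load alone fits within the PSU rating.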

Since my logs are not capturing anything, does anyone have any ideas? I was thinking I could run some live monitoring to a log file and maybe catch something at the time of the crash that does not show up in the system logs. For example, I could run a command like this to log NVIDIA info:

sudo nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1 > nvidia-smi.log
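One possible weakness of redirecting output to a file is that the last lines may still be sitting in a buffer when the machine loses power. A minimal sketch of an alternative, assuming Python 3 and `nvidia-smi` on the PATH, polls once per interval and forces each sample to disk with `os.fsync`. The field list is a suggestion that adds `power.draw` and `power.limit` to a subset of the fields in the command above; the function names and the log file name are hypothetical.

```python
import os
import subprocess
import time

# Suggested field list (an assumption, not the original command's list):
# adds power.draw and power.limit to catch power behaviour before a crash.
FIELDS = ["timestamp", "pci.bus_id", "pstate", "power.draw", "power.limit",
          "temperature.gpu", "utilization.gpu", "memory.used"]


def build_command(fields=FIELDS):
    """Build the nvidia-smi call that prints one CSV line per GPU."""
    return ["nvidia-smi",
            "--query-gpu=" + ",".join(fields),
            "--format=csv,noheader,nounits"]


def append_durably(path, text):
    """Append text and fsync it, so it is more likely to survive a power cut."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())


def run_logger(path="gpu-monitor.log", interval=1.0):
    """Poll nvidia-smi forever, writing each sample to disk immediately."""
    while True:
        result = subprocess.run(build_command(), capture_output=True,
                                text=True, check=True)
        append_durably(path, result.stdout)
        time.sleep(interval)
```

Calling `run_logger()` starts the loop; stop it with Ctrl+C. After a crash, the last lines of `gpu-monitor.log` should show the final samples taken before the shutdown.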

Could anyone recommend some monitoring software to help me narrow down the problem? I'm stumped right now, and I'm thinking of trying a backup battery in case it is a power issue.

Is there an existing issue for this?

I have searched the existing issues

Reproduction

Load a model in text-generation-web-ui using ExLlamav2_HF, with max_seq_len 16000 in autosplit mode. About one in ten messages will cause a crash.

Screenshot

No response

Logs

No errors in /var/log at the time of the crash.

System Info

neofetch
            .-/+oossssoo+/-.               frank@fs01 
        `:+ssssssssssssssssss+:`           ---------- 
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 22.04.4 LTS x86_64 
    .ossssssssssssssssssdMMMNysssso.       Host: MS-7D70 1.0 
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 6.8.0-45-generic 
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 48 mins 
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 2557 (dpkg), 7 (flatpak), 15 (snap) 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: bash 5.1.16 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Resolution: 3840x2160 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   DE: GNOME 42.9 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   WM: Mutter 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   WM Theme: Adwaita 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Theme: Yaru-dark [GTK2/3] 
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/    Icons: Yaru [GTK2/3] 
  +sssssssssdmydMMMMMMMMddddyssssssss+     Terminal: terminator 
   /ssssssssssshdmNNNNmyNMMMMhssssss/      CPU: AMD Ryzen 9 7950X (32) @ 5.881GHz 
    .ossssssssssssssssssdMMMNysssso.       GPU: AMD ATI 17:00.0 Device 164e 
      -+sssssssssssssssssyyyssss+-         GPU: NVIDIA 01:00.0 NVIDIA Corporation Device 2684 
        `:+ssssssssssssssssss+:`           GPU: NVIDIA 03:00.0 NVIDIA Corporation Device 2684 
            .-/+oossssoo+/-.               Memory: 6142MiB / 95694MiB 


```bash
inxi -Fxxxrz
System:
  Kernel: 6.8.0-45-generic x86_64 bits: 64 compiler: N/A Desktop: GNOME 42.9
    tk: GTK 3.24.33 wm: gnome-shell dm: GDM3 42.0
    Distro: Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Machine:
  Type: Desktop Mobo: Micro-Star model: MPG X670E CARBON WIFI (MS-7D70)
    v: 1.0 serial: <superuser required> UEFI: American Megatrends LLC. v: 1.80
    date: 08/10/2023
CPU:
  Info: 16-core model: AMD Ryzen 9 7950X bits: 64 type: MT MCP smt: enabled
    arch: Zen 3 rev: 2 cache: L1: 1024 KiB L2: 16 MiB L3: 64 MiB
  Speed (MHz): avg: 2146 high: 5499 min/max: 400/5881 cores: 1: 400 2: 4130
    3: 4532 4: 400 5: 400 6: 4859 7: 400 8: 400 9: 4352 10: 400 11: 5369
    12: 400 13: 5374 14: 400 15: 4708 16: 3780 17: 5499 18: 400 19: 400
    20: 400 21: 400 22: 400 23: 400 24: 400 25: 4467 26: 400 27: 400 28: 4927
    29: 400 30: 4553 31: 4537 32: 400 bogomips: 288009
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
  Device-1: NVIDIA vendor: Micro-Star MSI driver: nvidia v: 555.42.06 pcie:
    speed: 16 GT/s lanes: 8 ports: active: none off: HDMI-A-2
    empty: DP-4,DP-5,DP-6 bus-ID: 01:00.0 chip-ID: 10de:2684 class-ID: 0300
  Device-2: NVIDIA vendor: Micro-Star MSI driver: nvidia v: 555.42.06 pcie:
    speed: 2.5 GT/s lanes: 8 ports: active: none
    empty: DP-7, DP-8, DP-9, HDMI-A-3 bus-ID: 03:00.0 chip-ID: 10de:2684
    class-ID: 0300
  Device-3: AMD vendor: Micro-Star MSI driver: amdgpu v: kernel pcie:
    speed: 16 GT/s lanes: 16 ports: active: none
    empty: DP-1, DP-2, DP-3, HDMI-A-1, Writeback-1 bus-ID: 17:00.0
    chip-ID: 1002:164e class-ID: 0300
  Display: x11 server: X.Org v: 1.21.1.4 compositor: gnome-shell driver: X:
    loaded: amdgpu,ati,modesetting,nouveau,nvidia,radeon unloaded: fbdev,vesa
    gpu: nvidia,nvidia,amdgpu display-ID: :1 screens: 1
  Screen-1: 0 s-res: 3840x2160 s-dpi: 96 s-size: 1016x572mm (40.0x22.5")
    s-diag: 1166mm (45.9")
  Monitor-1: HDMI-0 res: 3840x2160 hz: 60 dpi: 140
    size: 697x392mm (27.4x15.4") diag: 800mm (31.5")
  OpenGL: renderer: NVIDIA GeForce RTX 4090/PCIe/SSE2
    v: 4.6.0 NVIDIA 555.42.06 direct render: Yes

Sensors:
  System Temperatures: cpu: 40.0 C mobo: N/A
  Fan Speeds (RPM): N/A
  GPU: device: nvidia screen: :1.0 temp: 39 C fan: 0% device: amdgpu
    temp: 51.0 C
```

fschiro added the bug label Oct 9, 2024
