system crash - two 4090 with ExLlamav2_HF #6451

Open
1 task done
fschiro opened this issue Oct 9, 2024 · 0 comments
Labels
bug Something isn't working

fschiro commented Oct 9, 2024

Describe the bug

When running text-generation-webui, my system shuts off instantly, as if the power had been cut. The log files contain nothing around the time of the crash. The crash is reproducible: roughly one in ten messages sent through the text-generation-webui API brings the system down.

I am running a model across two RTX 4090s with the ExLlamav2_HF loader, max_seq_len set to 16000, in autosplit mode.

Monitoring nvitop before the crash, I see my GPU power is usually around 200 watts for each during inference but sometimes it jumps to around 424 watts each.
GPU memory is nearly full: 23/23.99 GiB on GPU 1 and 21.8/23.99 GiB on GPU 2.

My power supply should be able to handle this load, as it is rated at 1600W.
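As a rough sanity check, the figures above can be put into a steady-state power budget. This is only a sketch: `CPU_W` and `OTHER_W` are assumed values, not measurements from this machine, and tools such as nvitop report readings averaged over their polling interval, so very short spikes above 424 W would not appear in these numbers.

```bash
# Steady-state power budget from the figures in this report.
PSU_W=1600        # rated PSU capacity (from the report)
GPU_PEAK_W=424    # highest per-GPU draw seen in nvitop (from the report)
GPU_COUNT=2
CPU_W=230         # ASSUMPTION: Ryzen 9 7950X socket power limit (PPT)
OTHER_W=100       # ASSUMPTION: motherboard, RAM, drives, fans

power_budget() {
    total=$((GPU_PEAK_W * GPU_COUNT + CPU_W + OTHER_W))
    echo "estimated peak: ${total} W, headroom: $((PSU_W - total)) W"
}

power_budget
```

Under these assumptions the sustained load leaves headroom, which would point away from a simple capacity shortfall but does not rule out short transients or a cabling problem.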

Since my logs capture nothing, does anyone have any ideas? I was thinking I could write live monitoring data to a log file and maybe catch something at the moment of the crash that does not show up in the system logs. For example, I could run a command like this to log NVIDIA info:

sudo nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1 > nvidia-smi.log
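One caveat with redirecting straight to a file is that the last samples before an abrupt power-off may still sit in the page cache and never reach the disk. A possible variant, sketched below, forces each sample to disk and adds power and clock fields to the query. The function name `log_gpu_samples` and the `SAMPLE_CMD` override are hypothetical names introduced for this illustration; the field list is a suggestion, and `sync FILE` assumes GNU coreutils 8.24 or newer.

```bash
# Fields to query. power.draw, power.limit and clocks.sm are added to the
# original list because the symptom looks power related.
GPU_FIELDS="timestamp,index,power.draw,power.limit,temperature.gpu,clocks.sm,pstate,utilization.gpu,memory.used"

# SAMPLE_CMD (hypothetical name) can be overridden to try the loop without a GPU.
SAMPLE_CMD="${SAMPLE_CMD:-nvidia-smi --query-gpu=$GPU_FIELDS --format=csv,noheader}"

# log_gpu_samples LOGFILE [COUNT] [INTERVAL_SECONDS]
# COUNT=0 (the default) means run until interrupted.
log_gpu_samples() {
    local logfile="$1"
    local count="${2:-0}"
    local interval="${3:-1}"
    local i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        # Intentionally unquoted so the command string splits into words.
        $SAMPLE_CMD >> "$logfile"
        # Flush this file to disk so the sample survives a sudden power loss.
        sync "$logfile"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Calling `log_gpu_samples nvidia-smi.log` in a separate terminal before sending messages should leave the final pre-crash samples on disk, at the cost of one forced disk write per sample.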

Could anyone recommend monitoring software to help me narrow down the problem? I'm stumped right now and am considering a battery backup in case it is a power issue.

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

Load a model in text-generation-webui with the ExLlamav2_HF loader, max_seq_len 16000, in autosplit mode. Roughly one in ten messages causes a crash.

Screenshot

No response

Logs

No errors in /var/log at the time of the crash.

System Info

neofetch
            .-/+oossssoo+/-.               frank@fs01 
        `:+ssssssssssssssssss+:`           ---------- 
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 22.04.4 LTS x86_64 
    .ossssssssssssssssssdMMMNysssso.       Host: MS-7D70 1.0 
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 6.8.0-45-generic 
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 48 mins 
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 2557 (dpkg), 7 (flatpak), 15 (snap) 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: bash 5.1.16 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Resolution: 3840x2160 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   DE: GNOME 42.9 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   WM: Mutter 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   WM Theme: Adwaita 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Theme: Yaru-dark [GTK2/3] 
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/    Icons: Yaru [GTK2/3] 
  +sssssssssdmydMMMMMMMMddddyssssssss+     Terminal: terminator 
   /ssssssssssshdmNNNNmyNMMMMhssssss/      CPU: AMD Ryzen 9 7950X (32) @ 5.881GHz 
    .ossssssssssssssssssdMMMNysssso.       GPU: AMD ATI 17:00.0 Device 164e 
      -+sssssssssssssssssyyyssss+-         GPU: NVIDIA 01:00.0 NVIDIA Corporation Device 2684 
        `:+ssssssssssssssssss+:`           GPU: NVIDIA 03:00.0 NVIDIA Corporation Device 2684 
            .-/+oossssoo+/-.               Memory: 6142MiB / 95694MiB 


```bash
inxi -Fxxxrz
System:
  Kernel: 6.8.0-45-generic x86_64 bits: 64 compiler: N/A Desktop: GNOME 42.9
    tk: GTK 3.24.33 wm: gnome-shell dm: GDM3 42.0
    Distro: Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Machine:
  Type: Desktop Mobo: Micro-Star model: MPG X670E CARBON WIFI (MS-7D70)
    v: 1.0 serial: <superuser required> UEFI: American Megatrends LLC. v: 1.80
    date: 08/10/2023
CPU:
  Info: 16-core model: AMD Ryzen 9 7950X bits: 64 type: MT MCP smt: enabled
    arch: Zen 3 rev: 2 cache: L1: 1024 KiB L2: 16 MiB L3: 64 MiB
  Speed (MHz): avg: 2146 high: 5499 min/max: 400/5881 cores: 1: 400 2: 4130
    3: 4532 4: 400 5: 400 6: 4859 7: 400 8: 400 9: 4352 10: 400 11: 5369
    12: 400 13: 5374 14: 400 15: 4708 16: 3780 17: 5499 18: 400 19: 400
    20: 400 21: 400 22: 400 23: 400 24: 400 25: 4467 26: 400 27: 400 28: 4927
    29: 400 30: 4553 31: 4537 32: 400 bogomips: 288009
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
  Device-1: NVIDIA vendor: Micro-Star MSI driver: nvidia v: 555.42.06 pcie:
    speed: 16 GT/s lanes: 8 ports: active: none off: HDMI-A-2
    empty: DP-4,DP-5,DP-6 bus-ID: 01:00.0 chip-ID: 10de:2684 class-ID: 0300
  Device-2: NVIDIA vendor: Micro-Star MSI driver: nvidia v: 555.42.06 pcie:
    speed: 2.5 GT/s lanes: 8 ports: active: none
    empty: DP-7, DP-8, DP-9, HDMI-A-3 bus-ID: 03:00.0 chip-ID: 10de:2684
    class-ID: 0300
  Device-3: AMD vendor: Micro-Star MSI driver: amdgpu v: kernel pcie:
    speed: 16 GT/s lanes: 16 ports: active: none
    empty: DP-1, DP-2, DP-3, HDMI-A-1, Writeback-1 bus-ID: 17:00.0
    chip-ID: 1002:164e class-ID: 0300
  Display: x11 server: X.Org v: 1.21.1.4 compositor: gnome-shell driver: X:
    loaded: amdgpu,ati,modesetting,nouveau,nvidia,radeon unloaded: fbdev,vesa
    gpu: nvidia,nvidia,amdgpu display-ID: :1 screens: 1
  Screen-1: 0 s-res: 3840x2160 s-dpi: 96 s-size: 1016x572mm (40.0x22.5")
    s-diag: 1166mm (45.9")
  Monitor-1: HDMI-0 res: 3840x2160 hz: 60 dpi: 140
    size: 697x392mm (27.4x15.4") diag: 800mm (31.5")
  OpenGL: renderer: NVIDIA GeForce RTX 4090/PCIe/SSE2
    v: 4.6.0 NVIDIA 555.42.06 direct render: Yes

Sensors:
  System Temperatures: cpu: 40.0 C mobo: N/A
  Fan Speeds (RPM): N/A
  GPU: device: nvidia screen: :1.0 temp: 39 C fan: 0% device: amdgpu
    temp: 51.0 C
```