Skip to content

Inferencing not working with P2P in latest version. #3968

Open
@j4ys0n

Description

@j4ys0n

LocalAI version:

localai/localai:latest-gpu-nvidia-cuda-12
LocalAI version: v2.22.1 (015835d)

Environment, CPU architecture, OS, and Version:

Linux localai3 6.8.12-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) x86_64 GNU/Linux
(Proxmox LXC, Debian. AMD EPYC 7302P (16 cores allocated)/64GB RAM

Describe the bug

When testing distributed inferencing, i select a model (qwen 2.5 14b), send a chat message, the model loads on both instances (main and worker) and then the model does not respond and the model unloads on the worker. (watching with nvitop)

To Reproduce

description above should reproduce, i tried a few times.

Expected behavior

model should not unload & chat should complete

Logs

worker logs

{"level":"INFO","time":"2024-10-26T05:07:23.924Z","caller":"discovery/dht.go:115","message":" Bootstrapping DHT"}
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 2000 Ada Generation, compute capability 8.9, VMM: yes
Starting RPC server on 127.0.0.1:46609, backend memory: 16380 MB
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed

Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed

main logs

5:25AM INF Success ip=my.ip.address latency="960.876µs" method=POST status=200 url=/v1/chat/completions
5:25AM INF Trying to load the model 'qwen2.5-14b-instruct' with the backend '[llama-cpp llama-ggml llama-cpp-fallback rwkv stablediffusion whisper piper huggingface bert-embeddings /build/backend/python/rerankers/run.sh /build/backend/python/diffusers/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/mamba/run.sh /build/backend/python/openvoice/run.sh /build/backend/python/coqui/run.sh /build/backend/python/bark/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/transformers/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/vllm/run.sh]'
5:25AM INF [llama-cpp] Attempting to load
5:25AM INF Loading model 'qwen2.5-14b-instruct' with backend llama-cpp
5:25AM INF [llama-cpp-grpc] attempting to load with GRPC variant
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Success ip=127.0.0.1 latency="35.55µs" method=GET status=200 url=/readyz
5:26AM INF Node localai-oYURMqpWCR is offline, deleting
Error accepting:  accept tcp 127.0.0.1:35625: use of closed network connection

Additional context

this worked in the last version, though i'm not sure what that was at this point (~2 weeks ago)
model loads and works fine without the worker.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions