
[User] nonsense responses with q2_k llama in Termux when using GPU #1909

Closed
4 tasks done
ghost opened this issue Jun 17, 2023 · 27 comments

@ghost commented Jun 17, 2023

./main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin --color -c 2048 --keep -1 -t 3 -b 7 -i -ins -ngl 1 runs, but produces nonsense responses. To clarify, without -ngl it works as expected.

LD_LIBRARY_PATH=/vendor/lib64 ./main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin --color -c 2048 --keep -1 -t 3 -b 7 -i -ins -ngl 1
main: build = 0 (unknown)
main: seed  = 1686999485
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM)'
ggml_opencl: device FP16 support: true
llama.cpp: loading model from /data/data/com.termux/files/home/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required  = 4383.18 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 1 repeating layers to GPU
llama_model_load_internal: offloaded 1/35 layers to GPU
llama_model_load_internal: total VRAM used: 81 MB
...................................................................................................
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 3 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 7, n_predict = -1, n_keep = 2


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> hi Samantha
Amts inheritanceтарасла inwonolasSEEhalbźyn osccius dolensaieri corsoistetximdebugliminputdebuglia Savjönlin corso background 
inheritanceieriieri Sav Encyclopediaiste Хро invånare octIABPush Lit oscattanleb Albattan Nation Cur Podpoisattan SE smdebugérezpois dressyen Savunneldebugassets Alb 
Albattanźnit хиattandy Mann Overflowirection podlectee curveLENGarusuenpgfein Хроertenistepois oscź�
>

I tested open-llama-7B-open-instruct.ggmlv3.q2_K and had the same result.

Environment and Context

Here's clinfo (native OpenCL):

LD_LIBRARY_PATH=/vendor/lib64 clinfo           

Number of platforms                               1
  Platform Name                                   QUALCOMM Snapdragon(TM)
  Platform Vendor                                 QUALCOMM
  Platform Version                                OpenCL 2.0 QUALCOMM build: commit #3dad7f8ed7 changeid #I593c16c433 Date: 10/01/21 Fri Local Branch:  Remote Branch: refs/tags/AU_LINUX_ANDROID_LA.UM.9.1.R1.11.00.00.604.073
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             

  Platform Name                                   QUALCOMM Snapdragon(TM)
Number of devices                                 1
  Device Name                                     QUALCOMM Adreno(TM)
  Device Vendor                                   QUALCOMM
  Device Vendor ID                                0x5143
  Device Version                                  OpenCL 2.0 Adreno(TM) 640
  Driver Version                                  OpenCL 2.0 QUALCOMM build: commit #3dad7f8ed7 changeid #I593c16c433 Date: 10/01/21 Fri Local Branch:  Remote Branch: refs/tags/AU_LINUX_ANDROID_LA.UM.9.1.R1.11.00.00.604.073 Compiler E031.37.12.01
  Device OpenCL C Version                         OpenCL C 2.0 Adreno(TM) 640
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               2
  Max clock frequency                             1MHz
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             1024
  Preferred work group size multiple (kernel)     128
  Preferred / native vector sizes
    char                                                 1 / 1
    short                                                1 / 1
    int                                                  1 / 1
    long                                                 1 / 0
    half                                                 1 / 1        (cl_khr_fp16)
    float                                                1 / 1
    double                                               0 / 0        (n/a)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    64, Little-Endian
  Global memory size                              3911952384 (3.643GiB)
  Error Correction support                        No
  Max memory allocation                           977988096 (932.7MiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   No
    Atomics                                       Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Page size (QCOM)                                4096 bytes
  External memory padding (QCOM)                  0 bytes
  Preferred alignment for atomics
    SVM                                           128 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Max size for global variable                    65536 (64KiB)
  Preferred total size of global vars             1048576 (1024KiB)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        131072 (128KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   64 bytes
    Pitch alignment for 2D image buffers          64 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             16384x16384x2048 pixels
    Max number of read image args                 128
    Max number of write image args                64
    Max number of read/write image args           64
  Max number of pipe args                         16
  Max active pipe reservations                    7680
  Max pipe packet size                            1024
  Local memory type                               Local
  Local memory size                               32768 (32KiB)
  Max number of constant args                     8
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     1024
  Queue properties (on host)
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Queue properties (on device)
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                655376 (640KiB)
    Max size                                      655376 (640KiB)
  Max queues on device                            1
  Max events on device                            1024
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_3d_image_writes cl_img_egl_image cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_egl_event cl_khr_egl_image cl_khr_fp16 cl_khr_gl_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_image2d_from_buffer cl_khr_mipmap_image cl_khr_srgb_image_writes cl_khr_subgroups cl_qcom_create_buffer_from_image cl_qcom_ext_host_ptr cl_qcom_ion_host_ptr cl_qcom_perf_hint cl_qcom_other_image cl_qcom_subgroup_shuffle cl_qcom_vector_image_ops cl_qcom_extract_image_plane cl_qcom_android_native_buffer_host_ptr cl_qcom_protected_context cl_qcom_priority_hint cl_qcom_compressed_yuv_image_read cl_qcom_compressed_image cl_qcom_ext_host_ptr_iocoherent cl_qcom_accelerated_image_ops cl_qcom_ml_ops

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [P0]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 QUALCOMM Snapdragon(TM)
    Device Name                                   QUALCOMM Adreno(TM)
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 QUALCOMM Snapdragon(TM)
    Device Name                                   QUALCOMM Adreno(TM)
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 QUALCOMM Snapdragon(TM)
    Device Name                                   QUALCOMM Adreno(TM)

lscpu:

Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: Qualcomm
Model name: Kryo-4XX-Silver
Model: 14
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 0xd
CPU(s) scaling MHz: 62%
CPU max MHz: 1785.6000
CPU min MHz: 300.0000
BogoMIPS: 38.40
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Model name: Kryo-4XX-Gold
Model: 14
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 2
Stepping: 0xd
CPU(s) scaling MHz: 71%
CPU max MHz: 2841.6001
CPU min MHz: 710.4000
BogoMIPS: 38.40
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Vulnerable
Spec store bypass: Vulnerable
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Mitigation; Branch predictor hardening
Srbds: Not affected
Tsx async abort: Not affected

uname -a

Linux localhost 4.14.190-23725627-abG975WVLS8IWD1 #2 SMP PREEMPT Mon Apr 10 18:16:39 KST 2023 aarch64 Android

  • SDK versions:
Python 3.11.4                                     
GNU Make 4.4.1                                    
cmake version 3.26.4
clang version 16.0.6
Target: aarch64-unknown-linux-android24
Thread model: posix
InstalledDir: /data/data/com.termux/files/usr/bin

Steps to Reproduce

  1. Build llama.cpp with CLBlast enabled (see the sketch below)
  2. Load a q2_k model with the -ngl # parameter
  3. Query the model
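
For reference, a minimal sketch of steps 1-2 in Termux; the package names and the CMake flag are assumptions based on the CLBlast build path of that era, not commands taken from this thread:

# Termux toolchain and CLBlast (package names assumed)
pkg install clang cmake git clblast
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && mkdir build && cd build
# enable the OpenCL/CLBlast backend
cmake .. -DLLAMA_CLBLAST=ON
cmake --build . --config Release
# step 2: load a q2_k model with one layer offloaded, via the vendor OpenCL driver
LD_LIBRARY_PATH=/vendor/lib64 ./bin/main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin -ngl 1 -i -ins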

Thank you!

@mirek190

I had such a problem.

1 - check that you have the newest build
2 - if yes, then check that your model is not corrupted

I had those 2 problems.

@ghost (author) commented Jun 17, 2023

I had such a problem.

1 - check that you have the newest build 2 - if yes, then check that your model is not corrupted

I had those 2 problems.

Yes, I'm on [86c7571], and I tested with 2 different q2_k models.

@mirek190

Have you tried a q4 model as a test?

@daboe01 (contributor) commented Jun 17, 2023

Could it be that -ngl 1 works only on Apple Silicon systems? You seem to be on a Linux machine.

@ghost (author) commented Jun 17, 2023

Have you tried a q4 model as a test?

Thanks for your response. I tested it now with open-llama-7B-open-instruct.ggmlv3.q4_0 and it's functional, working as expected.

The issue is with q2_k models specifically.

@ghost (author) commented Jun 17, 2023

Could it be that -ngl 1 works only on Apple Silicon systems? You seem to be on a Linux machine.

The -ngl parameter functions with OpenCL through CLBlast even on my device: Android with Termux.

@mirek190

Seems q_K models are not fully supported on ARM (Linux?) devices...

@ghost (author) commented Jun 17, 2023

Seems q_K models are not fully supported on ARM (Linux?) devices...

I'm downloading a q3_K_S model now, but I can't test until later tonight, so I'll let you know how it goes.

It's an Android device with Termux.
Edit: the q3_K_S model is functional (no gibberish):

LD_LIBRARY_PATH=/vendor/lib64 ./main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q3_K_S.bin --color -c 2048 --keep -1 -t 3 -b 7 -i -ins -ngl 1
main: build = 0 (unknown)
main: seed  = 1687028632
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM)'
ggml_opencl: device FP16 support: true
llama.cpp: loading model from /data/data/com.termux/files/home/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q3_K_S.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 11 (mostly Q3_K - Small)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required  = 4471.30 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 1 repeating layers to GPU
llama_model_load_internal: offloaded 1/35 layers to GPU
llama_model_load_internal: total VRAM used: 83 MB
...................................................................................................
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 3 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 7, n_predict = -1, n_keep = 2


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> hi, hows it?
Hello! I'm doing well and eager to learn more about your day. What can I help you with today?

@mirek190

Are you sure your q2_k model is not broken? :P
If not, then something is wrong with the ARM build and q2_k.
Anyway, q2_k is useless and shouldn't be used... too much lobotomy.

@ghost (author) commented Jun 18, 2023

Are you sure your q2_k model is not broken? :P If not, then something is wrong with the ARM build and q2_k.

The q2_k models work with -ngl 0 (disabled), so yes, I'm sure the .bin files for Samantha & Open Llama are not corrupt.

Anyway, q2_k is useless and shouldn't be used... too much lobotomy.

I say sir_x4

@ikawrakow (contributor) commented Jun 18, 2023

I cannot reproduce it on a PC using OpenCL.

Here is what I get, looks perfectly reasonable:

llama_init_from_file: kv self size  = 1024,00 MB

system_info: n_threads = 3 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 2048, n_batch = 7, n_predict = -1, n_keep = 2


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 
> hi Samantha
Hello! I'm happy to be here for you, ready to support and guide you through any situation.
> Tell me what you know about Hacker News
Hacker News is a popular news website that covers tech-related topics such as startup news, hacking, and coding projects. It features articles written by both professionals and amateurs in the field, providing an opportunity for open conversations between people who share a common interest in technology and related subjects.

Btw, using -ngl 1 will load a single layer on the GPU. If the model fits completely in VRAM, it is better to use -ngl 100.
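
For example, a sketch reusing the command from the report above with only -ngl changed:

LD_LIBRARY_PATH=/vendor/lib64 ./main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin --color -c 2048 --keep -1 -t 3 -b 7 -i -ins -ngl 100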

@ghost (author) commented Jun 18, 2023

I can reproduce it on a PC using OpenCL.

Here is what I get, looks perfectly reasonable:

[... same log and chat transcript as in the comment above ...]

Hi,

I don't see a reproduction in your message. Are you saying you're able to produce the nonsense with a q2_k model on PC?

Btw, using -ngl 1 will load a single layer on the GPU. If the model fits completely in VRAM, it is better to use -ngl 100.

Increasing -ngl # slows inference: #1718

@ikawrakow (contributor)

Sorry, typo. I meant "cannot", not "can".

@ghost (author) commented Jun 18, 2023

Sorry, typo. I meant "cannot", not "can".

Thanks for clarifying. I'm thinking it may be an ARM-device-specific issue, like mirek190 mentioned.

Even with CLBlast the error is gone if I don't offload layers.

Yes, q2_k functions normally through CLBlast without offload.

@ghost (author) commented Jun 19, 2023

Small update, same results:

Built ba4e85a with CLBlast, using open-llama-13b-q2_K:

~/c/build> cd bin
u0_a1282@localhost ~/c/b/bin> LD_LIBRARY_PATH=/vendor/lib64 ./main -m ~/llama.cpp/models/open-llama-13b-q2_K.bin --color -c 2048 --keep -1 -t 3 -b 7 -i -ins -ngl 1
main: build = 0 (unknown)
main: seed  = 1687193815
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM)'
ggml_opencl: device FP16 support: true
llama.cpp: loading model from /data/data/com.termux/files/home/llama.cpp/models/open-llama-13b-q2_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required  = 7097.25 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 1 repeating layers to GPU
llama_model_load_internal: offloaded 1/43 layers to GPU
llama_model_load_internal: total VRAM used: 127 MB
....................................................................................................
llama_init_from_file: kv self size  = 1600.00 MB

system_info: n_threads = 3 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 7, n_predict = -1, n_keep = 2


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> Hi. What's a fun thing to do at the beach?
 Loopijarowsefore Ring conduct under privilege loop Loop rare foreannonanu rare Foreunders encitiroeitles scale Cro currencyfore

llama_print_timings:        load time = 16974.21 ms
llama_print_timings:      sample time =    59.59 ms /    25 runs   (    2.38 ms per token,   419.51 tokens per second)
llama_print_timings: prompt eval time = 74631.28 ms /    35 tokens ( 2132.32 ms per token,     0.47 tokens per second)
llama_print_timings:        eval time = 191464.64 ms /    24 runs   ( 7977.69 ms per token,     0.13 tokens per second)
llama_print_timings:       total time = 298827.60 ms

I didn't expect a change, but wanted to provide additional information. Per the results, even 13b q2_k produces nonsense.

Thank you.

@aseok commented Jun 23, 2023

Same issue. Using Termux on SM8250 (Snapdragon 870) with 8 GB memory, built from the latest commit on the master branch; I get gibberish output with offloading (-ngl 1 to 35) with the llama-7b.ggmlv3.q2_K.bin model.

ghost changed the title from "[User] nonsense responses q2_k" to "User] nonsense responses with q2_k llama in Termux when using GPU" on Jun 24, 2023
@ghost (author) commented Jun 24, 2023

@JackJollimore you fit a 13b model on an 8gb phone? (7gb used?) Is there any custom ROM usable to free the extra 2gb the Android system is using? Such that you can repurpose them to run like cheap low-power SBC servers?

Thanks for your response. My device has 8GB of RAM, but there's also 8GB of virtual RAM in the settings. Edit: to clarify, it's stock Android, no root.

When loading more than 8GB of RAM it's quite slow, but yes, it functions.

The 13B q2_k max RAM is 8.01 GB vs. 13B q4_0 at 9.82 GB, which is significant when it comes to inference speed for a model that size on a device like mine.

ghost changed the title from "User] nonsense responses with q2_k llama in Termux when using GPU" to "[User] nonsense responses with q2_k llama in Termux when using GPU" on Jun 24, 2023
@SlyEcho (collaborator) commented Jul 4, 2023

I noticed the same, actually; I was using the Orange Pi 5B, which ships with a custom Android and vendor OpenCL.

@aseok commented Jul 5, 2023 via email

@ghost (author) commented Jul 5, 2023

Solved by recent pull.

I pulled today. Here's my result with -ins:

> ./main -m ~/open-llama-7b-q2_K.bin -i -ins -ngl 1
...
> please list 5 movies.

 licensed|unlicensed|
------------|----|
 1|1|
 2|0|
 3|0|
 4|1|
 5|0|
 6|0|
 7|0|
 8|1|
 9|1|
 10|0|
 11|1|
 12|0|
 13

with --prompt:

> ./main -m ~/open-llama-7b-q2_K.bin -i -ngl 1 -p "Please list 5 movies."
...
> Please list 5 movies.aden is a gambler!
Aden is a gambler!
Aden is a gambler! is a list by zebra_69 on Listal.
No users voted For the Love of Mike
zebra_69
41 items...
The list contains 1 items. No items are shared with this list.
© 2005-2013 listal.com All rights reserved. Contact Us Privacy policy About Us
This page has been served 0 d since Wed Mar 7 2

Edit: the Samantha model really highlights the error:

./main -m ~/samantha-1.1-llama-7b.ggmlv3.q2_K.bin -i -ins -ngl 1
...
> Hi Samantha
 Хромой теплёгой ветерок ставит мечты.
> Good day Samantha, please list some movies.
 package com.opengamma.util.function;

import java.io.Serializable;

/**
 * Utility class containing common mathematical operations as static methods for convenience.
 * <p>
 * The goal is to provide a simple, easy-to-use and efficient interface for mathematical operations,
 * target
>

Edit 2: I've noticed that a prompt template significantly improves the quality of the response from q2_k models (a template sketch follows the example below). Here's ./server with Samantha:

./server -m ~/samantha-1.1-llama-7b.ggmlv3.q2_K.bin -ngl 1
...
User: Hello Samantha. Please list some movies.

Samantha: Хродингу! Hi there! I'd be happy to share a few films with you. Here are a few popular choices that have stood the test of time:
1. "The Godfather" (1972)
2. "Pulp Fiction" (1994)
3. "The Shawshank Redemption" (1994)

The model starts with garble consistently, but it's definitely improved since posting.
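
For reference, a minimal sketch of feeding such a template to ./main through a prompt file; the template wording and the file name are hypothetical, not the exact template used with ./server above:

# samantha.txt -- hypothetical User:/Samantha: chat template
cat > samantha.txt << 'EOF'
You are Samantha, a helpful companion AI.

User: Hello Samantha.
Samantha:
EOF
# -f loads the prompt file; -r hands control back at the next "User:" turn
LD_LIBRARY_PATH=/vendor/lib64 ./main -m ~/samantha-1.1-llama-7b.ggmlv3.q2_K.bin -ngl 1 -f samantha.txt -i -r "User:"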

@ghost (author) commented Jul 22, 2023

#2133 shows that perplexity with GPU offload on Android is bugged.
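
For context, a sketch of how that is typically measured with the separate perplexity binary (the test-set path is an assumption):

LD_LIBRARY_PATH=/vendor/lib64 ./perplexity -m ~/samantha-1.1-llama-7b.ggmlv3.q2_K.bin -f wikitext-2-raw/wiki.test.raw -ngl 1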

@alain40 commented Nov 17, 2023

More on this: latest build, Snapdragon 8 Gen 2, Termux.

  • Qn_K models are garbled with -ngl > 0, but work fine with -ngl = 0
  • Qn_0 legacy format models work fine with -ngl > 0

Separately, GPU offloading, when it works, decreases performance. Probably a memory bandwidth issue.

@Nick-infinity

Hello, were you able to fix this issue?

@gustrd (contributor) commented Dec 7, 2023

More on this: recent koboldcpp build, Snapdragon 8 Gen 1, Termux.

Any quant is garbled with GGUF models, k-quant or not, offloaded layers or not.
GGML models work okay.

Tried with Mistral-7B GGUF and Marx-3B GGML.

The problem occurs with CuBLAS or OpenBLAS, no difference.

@Nick-infinity

More on this: recent koboldcpp build, Snapdragon 8 Gen 1, Termux.

Any quant is garbled with GGUF models, k-quant or not, offloaded layers or not. GGML models work okay.

Tried with Mistral-7B GGUF and Marx-3B GGML.

The problem occurs with CuBLAS or OpenBLAS, no difference.

Do you see performance degradation in terms of speed on the 8 Gen 1 GPU, as compared to running the model on the CPU?

@gustrd (contributor) commented Dec 7, 2023

[... quoting the exchange above ...]

Do you see performance degradation in terms of speed on the 8 Gen 1 GPU, as compared to running the model on the CPU?

In my tests the prompt processing is way faster, but the token generation is indeed slower. I'm just using it to process the prompt.

github-actions bot added the stale label on Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
