Description
Current Behavior:
- Built llama.cpp with the SYCL backend for Windows by following the instructions in README-sycl.md.
- The build completes successfully, and the model conversion and everything else work fine.
- Running main fails with a GGML_ASSERT error. While debugging, I found that when get_device_index_by_id is called it returns -1, and the assertion GGML_ASSERT(res>=0); then fails because res is -1. My device number is 5, as you can see in the logs.
- @airMeng @NeoZhangJianyu cc'ing you here. I tried all the workarounds for known issues in README-sycl.md, but none of them helped.
C:\Users\intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=5 && build\bin\main.exe -m %LLAMA2%\ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --no-mmap -ngl 33 --ignore-eos
Log start
main: build = 2153 (0d417712)
main: built with IntelLLVM 2024.0.2 for
main: seed = 1708016072
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16: no
ggml_init_sycl: SYCL_USE_XMX: yes
found 6 SYCL devices:
Device 0: Intel(R) UHD Graphics 770, compute capability 1.3,
max compute_units 32, max work group size 512, max sub group size 32, global mem size 3093630976
Device 1: Intel(R) FPGA Emulation Device, compute capability 1.2,
max compute_units 32, max work group size 67108864, max sub group size 64, global mem size 3839483904
Device 2: 13th Gen Intel(R) Core(TM) i9-13900K, compute capability 3.0,
max compute_units 32, max work group size 8192, max sub group size 64, global mem size 3839483904
Device 3: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0,
max compute_units 512, max work group size 1024, max sub group size 32, global mem size 3819835392
Device 4: Intel(R) UHD Graphics 770, compute capability 3.0,
max compute_units 32, max work group size 512, max sub group size 32, global mem size 3093630976
Device 5: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3,
max compute_units 512, max work group size 1024, max sub group size 32, global mem size 3819835392
Using device 5 (Intel(R) Arc(TM) A770 Graphics) as main device
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\intel\.cache\huggingface\hub\models--meta-llama--Llama-2-7b-chat-hf\snapshots\c1b0db933684edbfe29a06fa47eb19cc48025e93\ggml-model-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 19: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 20: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
GGML_ASSERT: C:/Users/intel/Desktop/aahouzi/llama.cpp/ggml-sycl.cpp:9364: res>=0
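To make the symptom concrete, here is a minimal sketch of the failure mode as I understand it. This is not the actual code in ggml-sycl.cpp: `find_device_index`, `require_device_index`, and the `selected_ids` list are hypothetical names, and I am only assuming that the real lookup searches some list of selected devices and returns -1 when the requested id is not in it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the lookup described above: map a user-facing
// device id to its position in the list of devices the backend selected.
// Returns -1 when the id is not in that list.
int find_device_index(const std::vector<int> & selected_ids, int id) {
    for (int i = 0; i < static_cast<int>(selected_ids.size()); ++i) {
        if (selected_ids[i] == id) {
            return i;
        }
    }
    return -1;
}

// Mirrors the shape of the failing check: abort when the lookup fails.
int require_device_index(const std::vector<int> & selected_ids, int id) {
    const int res = find_device_index(selected_ids, id);
    assert(res >= 0); // same condition as GGML_ASSERT(res>=0)
    return res;
}
```

Under that assumption, calling `require_device_index` with id 5 would abort whenever 5 is missing from the selected list, even though the device is printed in the "found 6 SYCL devices" output.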
Steps To Reproduce:
Same steps as in README-sycl.md, followed by the command shown above.
Environment:
- OS: Win11
- HW: Intel ARC A770 dGPU