
main: build = 1336 (9ca79d5) - loading mistral-7b-openorca.Q8_0.gguf - llama crashes after the first prompt "hello" - Windows build - worked some time ago (~30 builds earlier?) #3516

Closed
@mirek190

Description

main.exe --model models\new3\mistral-7b-openorca.Q8_0.gguf --mlock --color --threads 16 --keep -1 --batch_size 512 --n_predict -1 --top_k 40 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 32768 --interactive --instruct --reverse-prompt "<|im_end|>" -ngl 48 --simple-io  --in-prefix "<|im_start|>user " --in-suffix "<|im_end|> " -p "<|im_start|>system You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!<|im_end|> "
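
For context, the flags above reproduce the ChatML-style template that OpenOrca expects: -p supplies the system message once, --in-prefix and --in-suffix wrap every interactive input, and --reverse-prompt returns control to the user when <|im_end|> is generated. A minimal sketch of how those pieces compose into the text that gets tokenized each turn (the variable names and the exact newline placement are illustrative, not taken from main.cpp):

#include <iostream>
#include <string>

int main() {
    // -p: system message, sent once at startup
    const std::string system_prompt =
        "<|im_start|>system You are MistralOrca, a large language model trained by "
        "Alignment Lab AI. Write out your reasoning step-by-step to be sure you get "
        "the right answers!<|im_end|> ";
    // --in-prefix / --in-suffix: wrapped around every interactive input
    const std::string in_prefix = "<|im_start|>user ";
    const std::string in_suffix = "<|im_end|> ";

    const std::string user_input = "hello"; // what was typed at the '>' prompt below

    // Each turn is appended to the running context and tokenized;
    // generation stops when the reverse prompt "<|im_end|>" appears in the output.
    std::cout << system_prompt + in_prefix + user_input + in_suffix << "\n";
    return 0;
}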

Log start
main: build = 1336 (9ca79d5)
main: built with MSVC 19.35.32217.1 for x64
main: seed  = 1696634023
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data

llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q8_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name     = open-orca_mistral-7b-openorca
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<dummy32000>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  132.91 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 7205.84 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 4096.00 MB
llama_new_context_with_model: kv self size  = 4096.00 MB
llama_new_context_with_model: compute buffer total size = 2141.88 MB
llama_new_context_with_model: VRAM scratch buffer: 2136.00 MB
llama_new_context_with_model: total VRAM used: 13437.84 MB (model: 7205.84 MB, context: 6232.00 MB)
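(For reference, the total above is just the sum of what was printed: the 4096.00 MB KV cache plus the 2136.00 MB scratch buffer make up the 6232.00 MB "context" figure, and adding the 7205.84 MB of offloaded weights gives 13437.84 MB, which fits comfortably inside the 24 GB of an RTX 3090, so this does not look like a plain out-of-memory failure.)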

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '<|im_end|>'
Reverse prompt: '### Instruction:

'
Input prefix: '<|im_start|>user '
Input suffix: '<|im_end|> '
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.900000, typical_p = 1.000000, temp = 0.960000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 32768, n_batch = 512, n_predict = -1, n_keep = 54


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 <|im_start|>system You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!<|im_end|>
> <|im_start|>user hello
<|im_end|>  Hello! I'm MistralOrca, a large language model developed by Alignment Lab AI. I'm here to help you with any questions or tasks you may have.GGML_ASSERT: D:\a\llama.cpp\llama.cpp\llama.cpp:8203: false
PS E:\LLAMA\llama.cpp>   
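
The immediate return to the PowerShell prompt is expected once that assert fires: GGML_ASSERT is a hard abort. Roughly, the macro in ggml.h behaves like the sketch below (paraphrased, not copied from this exact build); it prints the file, line, and failed expression, here llama.cpp:8203 with the literal condition false, and then terminates the process:

#include <cstdio>
#include <cstdlib>

// Approximate shape of GGML_ASSERT (a sketch, not the verbatim macro from build 1336):
// print the location and the failed expression, then abort the whole process.
#define GGML_ASSERT(x)                                                                \
    do {                                                                              \
        if (!(x)) {                                                                   \
            std::fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
            std::abort();                                                             \
        }                                                                             \
    } while (0)

int main() {
    GGML_ASSERT(false); // prints "GGML_ASSERT: <file>:<line>: false" and aborts, as in the log above
}

So the model loads and generates normally; the process dies when some code path at llama.cpp:8203 in commit 9ca79d5 reaches a branch the code marks as unreachable with GGML_ASSERT(false). Which path that is would have to be checked against that build.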
