Conversation

@agatheblues force-pushed the task/ex-127-investigate-if-the-tokens-generated-are-over-ollama-s-context-window-size branch from 4a042a6 to 475bb34 on September 11, 2025 at 11:53
  qwen_25_32b_model_config =
    insert_idempotently(%Exmeralda.LLM.ModelConfig{
-     id: "eff70662-1576-491d-a1ef-1d025772e637",
+     id: "eff70662-1576-491d-a1ef-1d025772e638",
@agatheblues (author) commented on this change:

the same ID was used for another ModelConfig 🐛
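
For future seed entries, a fresh id can be generated instead of hand-picking one, which avoids this kind of collision. A minimal sketch, assuming the project uses Ecto (the schema module above suggests it does):

    # Returns a random version-4 UUID string, usable as a new ModelConfig id.
    Ecto.UUID.generate()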

@agatheblues force-pushed the task/ex-127-investigate-if-the-tokens-generated-are-over-ollama-s-context-window-size branch from 35d39a7 to 58f72a8 on September 11, 2025 at 13:17
@hannahbit commented:

I tried it out and somehow this truncates the assistant responses for me 🤔

@hannahbit force-pushed the task/ex-127-investigate-if-the-tokens-generated-are-over-ollama-s-context-window-size branch from 58f72a8 to 2148466 on September 12, 2025 at 10:19
@hannahbit commented:

I tried it out and somehow this truncates the assistant responses for me 🤔

Setting `num_predict` fixed it: LangChain doc
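
A minimal sketch of what that looks like with the Elixir LangChain API (the model name and option values here are assumptions, not the project's actual config):

    # num_predict caps how many tokens Ollama may generate per response; the
    # default of 128 (visible in the chain dump further down) cuts longer
    # answers off early. -1 lets the model generate until it stops on its own.
    llm =
      LangChain.ChatModels.ChatOllamaAI.new!(%{
        model: "qwen2.5:32b",
        num_ctx: 2048,
        num_predict: -1
      })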

https://www.youtube.com/watch?v=nAu9BrklguQ

@hannahbit commented Sep 12, 2025:

Somehow the LangChain update breaks something with gpt-oss: LangChain returns a success result, but the assistant message has nil content.
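
For reference, roughly how such a chain is assembled and run with the Elixir LangChain API (a sketch only; the options and the truncated prompts are copied from the chain dump below):

    {:ok, chain} =
      %{
        llm:
          LangChain.ChatModels.ChatOllamaAI.new!(%{
            model: "gpt-oss:latest",
            num_ctx: 2048,
            num_predict: 128,
            stream: true
          })
      }
      |> LangChain.Chains.LLMChain.new!()
      |> LangChain.Chains.LLMChain.add_messages([
        LangChain.Message.new_system!("You are an experienced Elixir library author. ..."),
        LangChain.Message.new_user!("You are given a piece of technical documentation. ...")
      ])
      |> LangChain.Chains.LLMChain.run()

    # The run reports success (role: :assistant, status: :complete),
    # but chain.last_message.content is nil: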

{:ok,
 %LangChain.Chains.LLMChain{
   llm: %LangChain.ChatModels.ChatOllamaAI{
     endpoint: "http://localhost:11434/api/chat",
     keep_alive: "5m",
     mirostat: 0,
     mirostat_eta: 0.1,
     mirostat_tau: 5.0,
     model: "gpt-oss:latest",
     num_ctx: 2048,
     num_gqa: nil,
     num_gpu: nil,
     num_predict: 128,
     num_thread: nil,
     receive_timeout: 300000,
     repeat_last_n: 64,
     repeat_penalty: 1.1,
     seed: 0,
     stop: nil,
     stream: true,
     temperature: 0.8,
     tfs_z: 1.0,
     top_k: 40,
     top_p: 0.9,
     callbacks: []
   },
   verbose: false,
   verbose_deltas: false,
   tools: [],
   _tool_map: %{},
   messages: [
     %LangChain.Message{
       content: "You are an experienced Elixir library author. For the FAQ of your library, you need questions\nusers could have. For each piece of markdown, come up with a question that is answered by\nthe piece of markdown.\n\n",
       processed_content: nil,
       index: nil,
       status: :complete,
       role: :system,
       name: nil,
       tool_calls: [],
       tool_results: nil,
       metadata: nil
     },
     %LangChain.Message{
       content: "You are given a piece of technical documentation.\n\n...",
       processed_content: nil,
       index: nil,
       status: :complete,
       role: :user,
       name: nil,
       tool_calls: [],
       tool_results: nil,
       metadata: nil
     },
     %LangChain.Message{
       content: nil,
       processed_content: nil,
       index: nil,
       status: :complete,
       role: :assistant,
       name: nil,
       tool_calls: [],
       tool_results: nil,
       metadata: nil
     }
   ],
   custom_context: nil,
   message_processors: [],
   max_retry_count: 3,
   current_failure_count: 0,
   delta: nil,
   last_message: %LangChain.Message{
     content: nil,
     processed_content: nil,
     index: nil,
     status: :complete,
     role: :assistant,
     name: nil,
     tool_calls: [],
     tool_results: nil,
     metadata: nil
   },
   exchanged_messages: [
     %LangChain.Message{
       content: nil,
       processed_content: nil,
       index: nil,
       status: :complete,
       role: :assistant,
       name: nil,
       tool_calls: [],
       tool_results: nil,
       metadata: nil
     }
   ],
   needs_response: false,
   callbacks: [%{}]
 }}

Further investigations:

Ollama logs:
print_info: arch             = jina-bert-v2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 768
print_info: n_layer          = 12
print_info: n_head           = 12
print_info: n_head_kv        = 12
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 768
print_info: n_embd_v_gqa     = 768
print_info: f_norm_eps       = 1.0e-12
print_info: f_norm_rms_eps   = 0.0e+00
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 8.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 3072
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 0
print_info: pooling type     = 1
print_info: rope type        = -1
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: model type       = 137M
print_info: model params     = 160.28 M
print_info: general.name     = jina-embeddings-v2-base-code
print_info: vocab type       = BPE
print_info: n_vocab          = 61056
print_info: n_merges         = 60795
print_info: BOS token        = 0 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 3 '<unk>'
print_info: SEP token        = 2 '</s>'
print_info: PAD token        = 1 '<pad>'
print_info: MASK token       = 4 '<mask>'
print_info: LF token         = 203 'Ċ'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 512
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors:   CPU_Mapped model buffer size =    89.45 MiB
load_tensors: Metal_Mapped model buffer size =   216.53 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 0
llama_context: flash_attn    = 0
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = false
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = true
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 77309.41 MB
llama_context:        CPU  output buffer size =     0.24 MiB
time=2025-09-12T15:58:55.258+02:00 level=INFO source=server.go:1288 msg="llama runner started in 0.31 seconds"
time=2025-09-12T15:58:55.258+02:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-09-12T15:58:55.258+02:00 level=INFO source=server.go:1250 msg="waiting for llama runner to start responding"
time=2025-09-12T15:58:55.258+02:00 level=INFO source=server.go:1288 msg="llama runner started in 0.31 seconds"
init: embeddings required but some input tokens were not marked as outputs -> overriding
output_reserve: reallocating output buffer from size 0.24 MiB to 2.12 MiB
[GIN] 2025/09/12 - 15:58:55 | 200 |  460.290208ms |       127.0.0.1 | POST     "/api/embed"
time=2025-09-12T15:58:55.391+02:00 level=WARN source=types.go:654 msg="invalid option provided" option=mirostat
time=2025-09-12T15:58:55.391+02:00 level=WARN source=types.go:654 msg="invalid option provided" option=num_gqa
time=2025-09-12T15:58:55.391+02:00 level=WARN source=types.go:654 msg="invalid option provided" option=tfs_z
time=2025-09-12T15:58:55.391+02:00 level=WARN source=types.go:654 msg="invalid option provided" option=mirostat_eta
time=2025-09-12T15:58:55.391+02:00 level=WARN source=types.go:654 msg="invalid option provided" option=mirostat_tau
time=2025-09-12T15:58:55.454+02:00 level=INFO source=sched.go:540 msg="updated VRAM based on existing loaded models" gpu=0 library=metal total="72.0 GiB" available="71.1 GiB"
time=2025-09-12T15:58:55.526+02:00 level=INFO source=server.go:199 msg="model wants flash attention"
time=2025-09-12T15:58:55.526+02:00 level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-09-12T15:58:55.526+02:00 level=WARN source=server.go:224 msg="kv cache type not supported by model" type=""
time=2025-09-12T15:58:55.527+02:00 level=INFO source=server.go:398 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/hannah/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 51860"
time=2025-09-12T15:58:55.529+02:00 level=INFO source=server.go:503 msg="system memory" total="96.0 GiB" free="72.4 GiB" free_swap="0 B"
time=2025-09-12T15:58:55.529+02:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/Users/hannah/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 library=metal parallel=1 required="13.1 GiB" gpus=1
time=2025-09-12T15:58:55.529+02:00 level=INFO source=server.go:543 msg=offload library=metal layers.requested=-1 layers.model=25 layers.offload=25 layers.split=[25] memory.available="[71.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="13.1 GiB" memory.required.partial="13.1 GiB" memory.required.kv="300.0 MiB" memory.required.allocations="[13.1 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="122.0 MiB" memory.graph.partial="122.0 MiB"
time=2025-09-12T15:58:55.538+02:00 level=INFO source=runner.go:1251 msg="starting ollama engine"
time=2025-09-12T15:58:55.538+02:00 level=INFO source=runner.go:1286 msg="Server listening on 127.0.0.1:51860"
time=2025-09-12T15:58:55.541+02:00 level=INFO source=runner.go:1170 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:8 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-12T15:58:55.573+02:00 level=INFO source=ggml.go:131 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
- Starting Ollama without flash attention (`OLLAMA_FLASH_ATTENTION=0 ollama start`) makes the chat work again for `gpt-oss`, but the responses are truncated early (the flash-attention problem seems to be an Ollama bug: https://github.com/ollama/ollama/issues/12113). Not sure though why it only happens with the newer LangChain version.
