Conversation

@agatheblues force-pushed the task/ex-127-investigate-if-the-tokens-generated-are-over-ollama-s-context-window-size branch from 4a042a6 to 475bb34 on September 11, 2025 at 11:53
  qwen_25_32b_model_config =
    insert_idempotently(%Exmeralda.LLM.ModelConfig{
-     id: "eff70662-1576-491d-a1ef-1d025772e637",
+     id: "eff70662-1576-491d-a1ef-1d025772e638",
@agatheblues (author) commented on this change:

the same ID was used for another ModelConfig 🐛
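
For future seed entries, a fresh id can be generated instead of hand-picking one, which avoids this kind of collision. A minimal sketch, assuming the project uses Ecto (the schema module above suggests it does):

    # Returns a random version-4 UUID string, usable as a new ModelConfig id.
    Ecto.UUID.generate()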

@agatheblues force-pushed the task/ex-127-investigate-if-the-tokens-generated-are-over-ollama-s-context-window-size branch from 35d39a7 to 58f72a8 on September 11, 2025 at 13:17
@hannahbit commented:

I tried it out and somehow this truncates the assistant responses for me 🤔

@hannahbit force-pushed the task/ex-127-investigate-if-the-tokens-generated-are-over-ollama-s-context-window-size branch from 58f72a8 to 2148466 on September 12, 2025 at 10:19
@hannahbit commented:

I tried it out and somehow this truncates the assistant responses for me 🤔

Setting `num_predict` fixed it: LangChain doc
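
A minimal sketch of what that looks like with the Elixir LangChain API (the model name and option values here are assumptions, not the project's actual config):

    # num_predict caps how many tokens Ollama may generate per response; the
    # default of 128 (visible in the chain dump further down) cuts longer
    # answers off early. -1 lets the model generate until it stops on its own.
    llm =
      LangChain.ChatModels.ChatOllamaAI.new!(%{
        model: "qwen2.5:32b",
        num_ctx: 2048,
        num_predict: -1
      })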

https://www.youtube.com/watch?v=nAu9BrklguQ

@hannahbit commented Sep 12, 2025:

Somehow the LangChain update breaks something with gpt-oss: LangChain returns a success result, but the assistant message has nil content.
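
For reference, roughly how such a chain is assembled and run with the Elixir LangChain API (a sketch only; the options and the truncated prompts are copied from the chain dump below):

    {:ok, chain} =
      %{
        llm:
          LangChain.ChatModels.ChatOllamaAI.new!(%{
            model: "gpt-oss:latest",
            num_ctx: 2048,
            num_predict: 128,
            stream: true
          })
      }
      |> LangChain.Chains.LLMChain.new!()
      |> LangChain.Chains.LLMChain.add_messages([
        LangChain.Message.new_system!("You are an experienced Elixir library author. ..."),
        LangChain.Message.new_user!("You are given a piece of technical documentation. ...")
      ])
      |> LangChain.Chains.LLMChain.run()

    # The run reports success (role: :assistant, status: :complete),
    # but chain.last_message.content is nil: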

{:ok,
 %LangChain.Chains.LLMChain{
   llm: %LangChain.ChatModels.ChatOllamaAI{
     endpoint: "http://localhost:11434/api/chat",
     keep_alive: "5m",
     mirostat: 0,
     mirostat_eta: 0.1,
     mirostat_tau: 5.0,
     model: "gpt-oss:latest",
     num_ctx: 2048,
     num_gqa: nil,
     num_gpu: nil,
     num_predict: 128,
     num_thread: nil,
     receive_timeout: 300000,
     repeat_last_n: 64,
     repeat_penalty: 1.1,
     seed: 0,
     stop: nil,
     stream: true,
     temperature: 0.8,
     tfs_z: 1.0,
     top_k: 40,
     top_p: 0.9,
     callbacks: []
   },
   verbose: false,
   verbose_deltas: false,
   tools: [],
   _tool_map: %{},
   messages: [
     %LangChain.Message{
       content: "You are an experienced Elixir library author. For the FAQ of your library, you need questions\nusers could have. For each piece of markdown, come up with a question that is answered by\nthe piece of markdown.\n\n",
       processed_content: nil,
       index: nil,
       status: :complete,
       role: :system,
       name: nil,
       tool_calls: [],
       tool_results: nil,
       metadata: nil
     },
     %LangChain.Message{
       content: "You are given a piece of technical documentation.\n\n...",
       processed_content: nil,
       index: nil,
       status: :complete,
       role: :user,
       name: nil,
       tool_calls: [],
       tool_results: nil,
       metadata: nil
     },
     %LangChain.Message{
       content: nil,
       processed_content: nil,
       index: nil,
       status: :complete,
       role: :assistant,
       name: nil,
       tool_calls: [],
       tool_results: nil,
       metadata: nil
     }
   ],
   custom_context: nil,
   message_processors: [],
   max_retry_count: 3,
   current_failure_count: 0,
   delta: nil,
   last_message: %LangChain.Message{
     content: nil,
     processed_content: nil,
     index: nil,
     status: :complete,
     role: :assistant,
     name: nil,
     tool_calls: [],
     tool_results: nil,
     metadata: nil
   },
   exchanged_messages: [
     %LangChain.Message{
       content: nil,
       processed_content: nil,
       index: nil,
       status: :complete,
       role: :assistant,
       name: nil,
       tool_calls: [],
       tool_results: nil,
       metadata: nil
     }
   ],
   needs_response: false,
   callbacks: [%{}]
 }}

Further investigations:

Ollama logs:
print_info: arch             = jina-bert-v2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 768
print_info: n_layer          = 12
print_info: n_head           = 12
print_info: n_head_kv        = 12
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 768
print_info: n_embd_v_gqa     = 768
print_info: f_norm_eps       = 1.0e-12
print_info: f_norm_rms_eps   = 0.0e+00
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 8.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 3072
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 0
print_info: pooling type     = 1
print_info: rope type        = -1
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: model type       = 137M
print_info: model params     = 160.28 M
print_info: general.name     = jina-embeddings-v2-base-code
print_info: vocab type       = BPE
print_info: n_vocab          = 61056
print_info: n_merges         = 60795
print_info: BOS token        = 0 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 3 '<unk>'
print_info: SEP token        = 2 '</s>'
print_info: PAD token        = 1 '<pad>'
print_info: MASK token       = 4 '<mask>'
print_info: LF token         = 203 'Ċ'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 512
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors:   CPU_Mapped model buffer size =    89.45 MiB
load_tensors: Metal_Mapped model buffer size =   216.53 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 0
llama_context: flash_attn    = 0
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = false
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = true
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 77309.41 MB
llama_context:        CPU  output buffer size =     0.24 MiB
time=2025-09-12T15:58:55.258+02:00 level=INFO source=server.go:1288 msg="llama runner started in 0.31 seconds"
time=2025-09-12T15:58:55.258+02:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-09-12T15:58:55.258+02:00 level=INFO source=server.go:1250 msg="waiting for llama runner to start responding"
time=2025-09-12T15:58:55.258+02:00 level=INFO source=server.go:1288 msg="llama runner started in 0.31 seconds"
init: embeddings required but some input tokens were not marked as outputs -> overriding
output_reserve: reallocating output buffer from size 0.24 MiB to 2.12 MiB
[GIN] 2025/09/12 - 15:58:55 | 200 |  460.290208ms |       127.0.0.1 | POST     "/api/embed"
time=2025-09-12T15:58:55.391+02:00 level=WARN source=types.go:654 msg="invalid option provided" option=mirostat
time=2025-09-12T15:58:55.391+02:00 level=WARN source=types.go:654 msg="invalid option provided" option=num_gqa
time=2025-09-12T15:58:55.391+02:00 level=WARN source=types.go:654 msg="invalid option provided" option=tfs_z
time=2025-09-12T15:58:55.391+02:00 level=WARN source=types.go:654 msg="invalid option provided" option=mirostat_eta
time=2025-09-12T15:58:55.391+02:00 level=WARN source=types.go:654 msg="invalid option provided" option=mirostat_tau
time=2025-09-12T15:58:55.454+02:00 level=INFO source=sched.go:540 msg="updated VRAM based on existing loaded models" gpu=0 library=metal total="72.0 GiB" available="71.1 GiB"
time=2025-09-12T15:58:55.526+02:00 level=INFO source=server.go:199 msg="model wants flash attention"
time=2025-09-12T15:58:55.526+02:00 level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-09-12T15:58:55.526+02:00 level=WARN source=server.go:224 msg="kv cache type not supported by model" type=""
time=2025-09-12T15:58:55.527+02:00 level=INFO source=server.go:398 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/hannah/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 51860"
time=2025-09-12T15:58:55.529+02:00 level=INFO source=server.go:503 msg="system memory" total="96.0 GiB" free="72.4 GiB" free_swap="0 B"
time=2025-09-12T15:58:55.529+02:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/Users/hannah/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 library=metal parallel=1 required="13.1 GiB" gpus=1
time=2025-09-12T15:58:55.529+02:00 level=INFO source=server.go:543 msg=offload library=metal layers.requested=-1 layers.model=25 layers.offload=25 layers.split=[25] memory.available="[71.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="13.1 GiB" memory.required.partial="13.1 GiB" memory.required.kv="300.0 MiB" memory.required.allocations="[13.1 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="122.0 MiB" memory.graph.partial="122.0 MiB"
time=2025-09-12T15:58:55.538+02:00 level=INFO source=runner.go:1251 msg="starting ollama engine"
time=2025-09-12T15:58:55.538+02:00 level=INFO source=runner.go:1286 msg="Server listening on 127.0.0.1:51860"
time=2025-09-12T15:58:55.541+02:00 level=INFO source=runner.go:1170 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:8 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-12T15:58:55.573+02:00 level=INFO source=ggml.go:131 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
- Starting Ollama without flash attention (`OLLAMA_FLASH_ATTENTION=0 ollama start`) makes the chat work again for `gpt-oss`, but the responses are truncated early (the flash-attention problem seems to be an Ollama bug: https://github.com/ollama/ollama/issues/12113). Not sure though why it only happens with the newer LangChain version.
