
Conversation

elseml
Member

@elseml elseml commented Sep 17, 2025

Summary

TL;DR: LLM code generation for BayesFlow can be sloppy. This PR introduces automatically generated context files that constrain LLM code generation.

Full summary: This PR introduces a pipeline for automatically generating LLM context markdown files for BayesFlow. The goal is to make LLM assistance more accurate and user-friendly by providing lightweight, up-to-date repository snapshots (aka some very basic RAG) based on gitingest. Users can download a single context file to provide their LLM with the latest BayesFlow state.
This addition targets the broad pip-install user base (power users can simply point an IDE-integrated LLM at their local repository clone, but standard users may have neither a local clone nor an IDE-integrated LLM). As discussed previously, this is a WIP PR for openly testing and discussing whether this addition lowers the barrier to entry for BayesFlow or adds confusion. See llm_context/README.md for further information.
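
For readers who have not used gitingest: assuming its Python API (a single ingest() call that returns a summary, a directory tree, and the concatenated file contents), the core ingestion step would look roughly like this; the repository URL is just for illustration:

```python
# Sketch of the underlying gitingest call; assumes gitingest's ingest()
# API, which returns (summary, tree, content) for a repo path or URL.
from gitingest import ingest  # pip install gitingest

summary, tree, content = ingest("https://github.com/bayesflow-org/bayesflow")
print(summary)   # rough size/token statistics of the ingested repo
# `content` is the flattened text that ends up in the context files.
```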

Key Additions

  • llm_context/ folder added
    • Contains the build script (build_llm_context.py), requirements, and generated context files (a minimal sketch of the build logic follows this list).
  • Automatic context file generation
    • Produces two files per release:
      • llm_context_full-<TAG>.md → full project snapshot: README + examples + source code (bayesflow/) (~250k tokens).
      • llm_context_compact-<TAG>.md → compact snapshot: README + examples (~45k tokens).
  • Notebook handling
    • Example notebooks (examples/*.ipynb) are automatically converted to Markdown before context file generation.
  • Automatic cleanup
    • Old bayesflow-context-* files in llm_context/ are removed before generating new ones.
  • GitHub Action integration (currently experimental, not thoroughly tested yet)
    • On each release, the workflow builds fresh context files, substitutes the current release tag for the <TAG> placeholder, and uploads the files to the corresponding GitHub release.
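
To make the pipeline concrete, here is a minimal sketch of what a build script like build_llm_context.py might do. It is a sketch under stated assumptions: the real script uses gitingest, which plain file concatenation stands in for here, and any paths, names, and helpers beyond those given in this description are hypothetical.

```python
"""Minimal sketch of the context-file build pipeline (hypothetical names)."""
from pathlib import Path

from nbconvert import MarkdownExporter  # pip install nbconvert

REPO = Path(".")
OUT = REPO / "llm_context"
TAG = "<TAG>"  # the release workflow substitutes the actual tag and
               # uploads the results, e.g. via: gh release upload <tag> llm_context/*.md


def clean_old_context_files() -> None:
    """Remove previously generated context files before rebuilding."""
    for old in OUT.glob("bayesflow-context-*"):  # pattern per the PR description
        old.unlink()


def notebooks_as_markdown() -> str:
    """Convert examples/*.ipynb to Markdown and concatenate them."""
    exporter = MarkdownExporter()
    parts = []
    for nb in sorted((REPO / "examples").glob("*.ipynb")):
        if nb.name == "From_BayesFlow_1.1_to_2.0.ipynb":
            continue  # BF1 code skewed LLM output, so it is excluded (see below)
        body, _resources = exporter.from_filename(str(nb))
        parts.append(f"# Notebook: {nb.name}\n\n{body}")
    return "\n\n".join(parts)


def source_snapshot() -> str:
    """Concatenate the bayesflow/ source tree (stand-in for gitingest)."""
    parts = []
    for py in sorted((REPO / "bayesflow").rglob("*.py")):
        parts.append(f"# File: {py}\n\n{py.read_text()}")
    return "\n\n".join(parts)


def main() -> None:
    clean_old_context_files()
    readme = (REPO / "README.md").read_text()
    examples = notebooks_as_markdown()

    compact = f"{readme}\n\n{examples}"                 # README + examples (~45k tokens)
    full = f"{compact}\n\n{source_snapshot()}"          # + source code (~250k tokens)

    OUT.mkdir(exist_ok=True)
    (OUT / f"llm_context_compact-{TAG}.md").write_text(compact)
    (OUT / f"llm_context_full-{TAG}.md").write_text(full)


if __name__ == "__main__":
    main()
```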

Open Questions

  • Most importantly: is it useful? It would be great if you could try out both context files and share your experiences here. Feedback is extra valuable if you have a pro subscription for your LLM of choice, since that lets us take the latest LLM capabilities into account and run more tests.
  • Compact vs. full context files: Do we need two files or can we go with a single one for the most straightforward user experience?
  • (Implementation: not polished yet; I suggest first discussing whether we want to move forward with this at all.)

Downloads for Pre-Generated Context Files

llm_context_full_dev.md
llm_context_compact_dev.md

(Tagging some people I remember discussing this with @niels-leif-bracher @han-ol @stefanradev93 @vpratz @marvinschmitt @paul-buerkner)

@elseml elseml added the feature and draft labels Sep 17, 2025
@elseml
Member Author

elseml commented Sep 17, 2025

I am currently leaning towards going with the compact file only:

  1. It increased the likelihood of generating working BayesFlow code in my preliminary testing (ChatGPT free version);
  2. it focuses on the files with the highest information density for actual BayesFlow usage (i.e., the tutorials);
  3. and it provides the LLM with concrete, focused guidance while still leaving leeway to leverage the increasing agentic/search abilities of LLMs (i.e., looking up BayesFlow source code when needed).


codecov bot commented Sep 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

@vpratz
Collaborator

vpratz commented Sep 17, 2025

Thanks for opening the discussion. As I already indicated in the previous discussion, I'm skeptical that we do ourselves a favor with this addition and am currently opposed to merging it. Besides the general concerns surrounding LLMs, I think our users are better served by working with the tutorials and documentation. If we include files like the above, we suggest to our users that using LLMs is a viable strategy for writing code with BayesFlow, and that this is a workflow we are willing to support. Given the widely reported accuracy problems of LLMs, also for tasks like summarization, I'm skeptical that the output is sufficiently accurate to be useful for anyone who is not already familiar with the code.

Also, the tutorials convey important information on how to use ABI and highlight potential issues, so using an LLM as a shortcut might backfire when users no longer encounter this kind of information.

@elseml Could you provide a few non-cherrypicked example interactions of how a user would use an LLM with this setup, assuming they have little prior knowledge of BayesFlow, and the results they would obtain? If possible, please use tasks that are not part of the documentation. This would help me judge what kind of code we would expect to see.

@elseml
Member Author

elseml commented Sep 19, 2025

Hi Valentin, thanks for bringing in your concerns! I fully agree about the reliability problems of LLMs. However, I am pretty convinced that a substantial share of BF users already use LLM assistance and that LLM-assisted coding will continue to increase. Given these assumptions, the feature is for me mainly about equipping BF users with tools for mitigating the damage of hallucinations (i.e., by giving LLMs a better grounding in the BF code). This is a bit like companies hosting their own internal LLM instances to prevent data leakage, since employees already use LLMs for their tasks anyway.

But yeah, this WIP PR is meant as an open testing ground, so we might very well conclude after more testing that, even when provided with context, current LLMs are not accurate enough to be of any use for BF users. If we decide to proceed with this feature, I would support adding a prominent disclaimer highlighting these concerns and emphasizing that LLM assistance is supplementary to, rather than a replacement for, consulting the tutorials.

Concerning example interactions: here are two trials for the frequent use case of evidence accumulation modeling (only testing the compact context file, since ChatGPT's upload limits are pretty strict):

  • Without context
  • With compact (= tutorials only) context file

Some observations:

  • Without context, ChatGPT derailed pretty quickly from the typical BayesFlow workflow, mixing up lots of concepts and, in particular, not being aware of the latest BF2 API despite looking up the documentation (this might also be a problem for the compact context with tutorials only, where the LLM is instructed to look up the source code if needed).
  • With the compact context provided, adherence to the BF2 API improved a lot despite less CoT reasoning. Still, there were some errors (e.g., sometimes suggesting TimeSeriesNetwork for exchangeable observations; see the sketch after this list). The generated DDM simulator placeholder is quite peculiar, but I think that is okay since it is an explicit placeholder and we are interested in the BF part here. After simply copy-pasting two TypeError tracebacks, the code was in a runnable state up to the pandas DataFrame creation at the end. Of course, this run might just have been a lucky exception.
  • I also tested combining a pyddm simulator with BayesFlow, where using the pyddm API already failed in all conditions. Not totally surprising, since the LLM is not given pyddm context in any condition.
  • (Also, the BF1 code in From_BayesFlow_1.1_to_2.0.ipynb previously had a pretty big effect on LLM code generation, so it is now excluded from the context files. I will update the context files provided here accordingly.)
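
To illustrate the TimeSeriesNetwork mix-up mentioned in the second point above: assuming the BF2 networks module, the distinction the LLM kept missing is roughly the following (a sketch, not a vetted recommendation):

```python
import bayesflow as bf

# Exchangeable (i.i.d.) observations have no meaningful ordering, so the
# summary network should be permutation-invariant (e.g., a deep set).
summary_network = bf.networks.DeepSet()

# TimeSeriesNetwork assumes ordered sequences and is the wrong choice here:
# summary_network = bf.networks.TimeSeriesNetwork()
```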

I think more constrained applications (e.g., debugging or asking specialized questions about existing code, as in the forum) might be better test cases than full code generation from scratch. I will test that as well when upload limits allow. As stated above, I would also be very interested in which other use cases people come up with.

