Description
- CI Status badge
- Use a structured output library? e.g. Instructor (see the sketch after this list)
- Contextual retrieval: https://www.anthropic.com/news/contextual-retrieval
- Have small prompts that do one thing, and only one thing, well. Instead of a single catch-all prompt, split it into separate prompts that are simple, focused, and easy to understand, so that each prompt can be evaluated separately.
- RAG evaluation: MRR, NDCG (see the metric sketch after this list).
- RAG information density: if two documents are equally relevant, prefer the one that is more concise and has fewer extraneous details.
- Multistep workflow: Include reflection/CoT prompting (small tasks)
- Increase output diversity beyond just raising temperature. E.g. when the user asks for a solution to XX problem, keep a list of recent responses and tell the LLM: "do not suggest any responses from the following:" (sketched after this list).
- Prompt caching: e.g. common functions. Use features like autocomplete, spelling correction, and suggested queries to normalize user input and increase the cache hit rate (normalization sketch after this list).
- Simple assertion-based unit tests (pytest sketch after this list).
- Intent Classification: https://rasa.com/docs/rasa/next/llms/llm-intent/
- What does each tool abbreviation in OR mean?
- What are the supported public PDKs?
- What are the supported OSes?
- What are the social media links?
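
A minimal sketch of the Instructor idea above, assuming the OpenAI backend; the model name and the `ToolInfo` schema are placeholders, not a settled design:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class ToolInfo(BaseModel):
    abbreviation: str
    full_name: str

# Patch the OpenAI client so responses are parsed into the Pydantic model.
client = instructor.from_openai(OpenAI())

info = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    response_model=ToolInfo,
    messages=[{"role": "user", "content": "What does the abbreviation CTS stand for?"}],
)
print(info.abbreviation, "->", info.full_name)
```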
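For the RAG evaluation bullet, a small self-contained sketch of MRR and NDCG@k over per-query relevance judgments:

```python
import math

def mrr(ranked_relevance: list[list[int]]) -> float:
    """Mean Reciprocal Rank over queries; each inner list is binary relevance by rank."""
    total = 0.0
    for rels in ranked_relevance:
        for i, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_relevance)

def ndcg(rels: list[int], k: int) -> float:
    """NDCG@k for one query; rels are graded relevance scores in retrieved order."""
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

print(mrr([[0, 1, 0], [1, 0, 0]]))  # (0.5 + 1.0) / 2 = 0.75
print(ndcg([3, 2, 0, 1], k=4))
```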
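For the output-diversity bullet, one way to keep a rolling exclusion list; the in-memory `deque`, window size, and prompt wording are assumptions:

```python
from collections import deque

# Rolling window of recent answers (hypothetical in-memory store).
recent_responses: deque[str] = deque(maxlen=5)

def build_prompt(question: str) -> str:
    prompt = f"Suggest a solution to the following problem:\n{question}\n"
    if recent_responses:
        blocked = "\n".join(f"- {r}" for r in recent_responses)
        prompt += f"\nDo not suggest any responses from the following:\n{blocked}\n"
    return prompt

# After each call, remember the answer so the next call avoids repeating it:
# recent_responses.append(answer)
```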
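For the prompt-caching bullet, a toy normalization layer in front of the cache; `call_llm` is a placeholder for the real model call:

```python
import re

def normalize_query(raw: str) -> str:
    """Canonicalize user input so near-duplicate queries hit the same cache key.
    Spelling correction / suggested queries would slot in here as well."""
    q = raw.strip().lower()
    q = re.sub(r"\s+", " ", q)  # collapse whitespace
    q = q.rstrip("?!. ")        # drop trailing punctuation
    return q

def call_llm(query: str) -> str:
    return f"answer to: {query}"  # placeholder for the real model call

cache: dict[str, str] = {}

def answer(raw_query: str) -> str:
    key = normalize_query(raw_query)
    if key in cache:
        return cache[key]  # cache hit: no LLM call
    result = call_llm(key)
    cache[key] = result
    return result
```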
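For the assertion-based unit tests bullet, a pytest-style sketch; `myapp.llm.ask` is a hypothetical wrapper around the deployed prompt:

```python
# test_prompts.py -- run with `pytest`.
from myapp.llm import ask  # hypothetical helper returning the raw model output

def test_supported_oses_mentions_linux():
    out = ask("What are the supported OSes?")
    assert "Linux" in out

def test_answer_is_not_empty():
    out = ask("What does the abbreviation CTS mean?")
    assert out.strip()

def test_no_markdown_fences_in_plain_answer():
    out = ask("List the supported public PDKs.")
    assert "```" not in out
```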
Evals
- Synthetic dataset generation: https://docs.confident-ai.com/docs/synthesizer-introduction#save-your-synthetic-dataset
- Pairwise evaluation. How is this different from normal (pointwise) scoring? There may be several responses (from different LLMs) that receive the same G-Eval score; use pairwise evaluation to force a winner (example prompt after this list).
- Needle-in-a-haystack (NIAH) evals (sketched after this list).
- If evals are reference-free, you can use them as a guardrail: don't show the output if it scores too low. E.g. summarization evals, where all you need is the input prompt (no reference summary required); a gating sketch follows this list.
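
For the pairwise-evaluation bullet, a possible judge prompt; `judge_llm` is a placeholder for the judge model call:

```python
PAIRWISE_PROMPT = """You are comparing two candidate answers to the same question.

Question:
{question}

Answer A:
{answer_a}

Answer B:
{answer_b}

Which answer is better? Reply with exactly "A" or "B". A tie is not allowed."""

def judge_llm(prompt: str) -> str:
    return "A"  # placeholder for the real judge model call

def pick_winner(question: str, a: str, b: str) -> str:
    verdict = judge_llm(PAIRWISE_PROMPT.format(question=question, answer_a=a, answer_b=b))
    return a if verdict.strip().upper().startswith("A") else b
```

Running each pair twice with the A/B order swapped helps control for position bias.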
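For the NIAH bullet, a sketch that plants a needle at a controllable depth; the filler text and token are made up:

```python
def build_niah_case(haystack: list[str], needle: str, depth: float) -> tuple[str, str]:
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end) of the context."""
    pos = int(depth * len(haystack))
    sentences = haystack[:pos] + [needle] + haystack[pos:]
    question = "What is the secret token mentioned in the text above?"
    return " ".join(sentences), question

filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The secret token is OPENROAD-42."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    context, question = build_niah_case(filler, needle, depth)
    # send context + question to the model and check the answer contains "OPENROAD-42"
```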
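For the reference-free guardrail bullet, a gating sketch; `summarize`, `reference_free_eval`, and the 0.6 threshold are all placeholders:

```python
GUARDRAIL_THRESHOLD = 0.6  # assumed cutoff; tune on a hold-out set

def summarize(document: str) -> str:
    return document[:100]  # placeholder for the real summarization call

def reference_free_eval(document: str, summary: str) -> float:
    return 1.0  # placeholder judge returning a score in [0, 1]

def guarded_summarize(document: str) -> str | None:
    summary = summarize(document)
    score = reference_free_eval(document, summary)
    if score < GUARDRAIL_THRESHOLD:
        return None  # suppress low-scoring output instead of showing it
    return summary
```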
Guardrails
- Use Gemini guardrails to identify harmful/offensive output and PII (a simple stand-in screen is sketched after this list).
- Factual inconsistency guardrail (link).
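
The Gemini guardrail itself is a provider feature; as a crude illustration of the PII half only, a regex stand-in (the patterns are assumptions and far from exhaustive):

```python
import re

# Crude regex-based PII screen as a stand-in; a production system would use a
# dedicated safety/moderation service instead.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),
}

def contains_pii(text: str) -> bool:
    return any(p.search(text) for p in PII_PATTERNS.values())

assert contains_pii("mail me at alice@example.com")
assert not contains_pii("the chip has 7 metal layers")
```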
Production
- Development-prod skew: measure skew between LLM input/output pairs, e.g. length of inputs/outputs and adherence to specific formatting requirements. For more advanced drift detection, consider clustering embeddings to detect semantic drift (i.e. users discussing topics not seen before); a sketch follows this list.
- Hold-out datasets for evals; these must be reflective of real user interactions.
- Always log outputs. Store these in a separate DB.
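
For the drift-detection bullet, a centroid-distance sketch over embeddings; the alert threshold and the embedding model are left open:

```python
import numpy as np

def centroid(embeddings: np.ndarray) -> np.ndarray:
    return embeddings.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_score(dev_emb: np.ndarray, prod_emb: np.ndarray) -> float:
    """1 - cosine similarity between the dev-set centroid and the
    recent-production centroid; higher means prod traffic is drifting."""
    return 1.0 - cosine(centroid(dev_emb), centroid(prod_emb))

# dev_emb / prod_emb are (n, d) arrays from any embedding model.
# Alert when drift_score() exceeds a threshold tuned on historical windows;
# clustering prod_emb (e.g. k-means) can surface *which* new topics appeared.
```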
Data flywheel
- Links to the automated feedback loop: bad examples can be used to train hallucination classifiers, and relevance annotations can be used to train a relevance reward model (https://arxiv.org/abs/2009.01325).