Description
- CI Status badge
- Use a structured output library? e.g. Instructor (see the sketch after this list)
- Contextual retrieval: https://www.anthropic.com/news/contextual-retrieval
- Have small prompts that do one thing, and only one thing, well. Instead of a single catch-all prompt, split it into separate prompts that are simple, focused, and easy to understand, so that each prompt can be evaluated separately.
- RAG evaluation: MRR, NDCG (see the metric sketch after this list).
- RAG information density: if two documents are equally relevant, prefer the one that is more concise and has fewer extraneous details.
- Multistep workflow: Include reflection/CoT prompting (small tasks)
- Increase output diversity beyond just raising temperature. E.g. when the user asks for a solution to XX problem, keep a list of recent responses and tell the LLM: "do not suggest any responses from the following:" (sketched after this list).
- Prompt caching: e.g. common functions. Use features like autocomplete, spelling correction, and suggested queries to normalize user input and increase the cache hit rate (normalization sketch after this list).
- Simple assertion-based unit tests (pytest sketch after this list).
- Intent Classification: https://rasa.com/docs/rasa/next/llms/llm-intent/
- What does each tool abbreviation in OR mean?
- What are the supported public PDKs?
- What are the supported OSes?
- What are the social media links?
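
A minimal sketch of the Instructor idea above, assuming the OpenAI backend; the model name and the `ToolInfo` schema are placeholders, not a settled design:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class ToolInfo(BaseModel):
    abbreviation: str
    full_name: str

# Patch the OpenAI client so responses are parsed into the Pydantic model.
client = instructor.from_openai(OpenAI())

info = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    response_model=ToolInfo,
    messages=[{"role": "user", "content": "What does the abbreviation CTS stand for?"}],
)
print(info.abbreviation, "->", info.full_name)
```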
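For the RAG evaluation bullet, a small self-contained sketch of MRR and NDCG@k over per-query relevance judgments:

```python
import math

def mrr(ranked_relevance: list[list[int]]) -> float:
    """Mean Reciprocal Rank over queries; each inner list is binary relevance by rank."""
    total = 0.0
    for rels in ranked_relevance:
        for i, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_relevance)

def ndcg(rels: list[int], k: int) -> float:
    """NDCG@k for one query; rels are graded relevance scores in retrieved order."""
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

print(mrr([[0, 1, 0], [1, 0, 0]]))  # (0.5 + 1.0) / 2 = 0.75
print(ndcg([3, 2, 0, 1], k=4))
```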
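For the output-diversity bullet, one way to keep a rolling exclusion list; the in-memory `deque`, window size, and prompt wording are assumptions:

```python
from collections import deque

# Rolling window of recent answers (hypothetical in-memory store).
recent_responses: deque[str] = deque(maxlen=5)

def build_prompt(question: str) -> str:
    prompt = f"Suggest a solution to the following problem:\n{question}\n"
    if recent_responses:
        blocked = "\n".join(f"- {r}" for r in recent_responses)
        prompt += f"\nDo not suggest any responses from the following:\n{blocked}\n"
    return prompt

# After each call, remember the answer so the next call avoids repeating it:
# recent_responses.append(answer)
```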
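For the prompt-caching bullet, a toy normalization layer in front of the cache; `call_llm` is a placeholder for the real model call:

```python
import re

def normalize_query(raw: str) -> str:
    """Canonicalize user input so near-duplicate queries hit the same cache key.
    Spelling correction / suggested queries would slot in here as well."""
    q = raw.strip().lower()
    q = re.sub(r"\s+", " ", q)  # collapse whitespace
    q = q.rstrip("?!. ")        # drop trailing punctuation
    return q

def call_llm(query: str) -> str:
    return f"answer to: {query}"  # placeholder for the real model call

cache: dict[str, str] = {}

def answer(raw_query: str) -> str:
    key = normalize_query(raw_query)
    if key in cache:
        return cache[key]  # cache hit: no LLM call
    result = call_llm(key)
    cache[key] = result
    return result
```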
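For the assertion-based unit tests bullet, a pytest-style sketch; `myapp.llm.ask` is a hypothetical wrapper around the deployed prompt:

```python
# test_prompts.py -- run with `pytest`.
from myapp.llm import ask  # hypothetical helper returning the raw model output

def test_supported_oses_mentions_linux():
    out = ask("What are the supported OSes?")
    assert "Linux" in out

def test_answer_is_not_empty():
    out = ask("What does the abbreviation CTS mean?")
    assert out.strip()

def test_no_markdown_fences_in_plain_answer():
    out = ask("List the supported public PDKs.")
    assert "```" not in out
```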
Evals
- Synthetic dataset generation: https://docs.confident-ai.com/docs/synthesizer-introduction#save-your-synthetic-dataset
- Pairwise evaluation. How is this different from normal (pointwise) scoring? There may be several responses (from different LLMs) that receive the same G-Eval score; use pairwise evaluation to force a winner (example prompt after this list).
- Needle-in-a-haystack (NIAH) evals (sketched after this list).
- If evals are reference-free, you can use them as a guardrail: don't show the output if it scores too low. E.g. summarization evals, where all you need is the input prompt (no reference summary required); a gating sketch follows this list.
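
For the pairwise-evaluation bullet, a possible judge prompt; `judge_llm` is a placeholder for the judge model call:

```python
PAIRWISE_PROMPT = """You are comparing two candidate answers to the same question.

Question:
{question}

Answer A:
{answer_a}

Answer B:
{answer_b}

Which answer is better? Reply with exactly "A" or "B". A tie is not allowed."""

def judge_llm(prompt: str) -> str:
    return "A"  # placeholder for the real judge model call

def pick_winner(question: str, a: str, b: str) -> str:
    verdict = judge_llm(PAIRWISE_PROMPT.format(question=question, answer_a=a, answer_b=b))
    return a if verdict.strip().upper().startswith("A") else b
```

Running each pair twice with the A/B order swapped helps control for position bias.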
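For the NIAH bullet, a sketch that plants a needle at a controllable depth; the filler text and token are made up:

```python
def build_niah_case(haystack: list[str], needle: str, depth: float) -> tuple[str, str]:
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end) of the context."""
    pos = int(depth * len(haystack))
    sentences = haystack[:pos] + [needle] + haystack[pos:]
    question = "What is the secret token mentioned in the text above?"
    return " ".join(sentences), question

filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The secret token is OPENROAD-42."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    context, question = build_niah_case(filler, needle, depth)
    # send context + question to the model and check the answer contains "OPENROAD-42"
```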
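For the reference-free guardrail bullet, a gating sketch; `summarize`, `reference_free_eval`, and the 0.6 threshold are all placeholders:

```python
GUARDRAIL_THRESHOLD = 0.6  # assumed cutoff; tune on a hold-out set

def summarize(document: str) -> str:
    return document[:100]  # placeholder for the real summarization call

def reference_free_eval(document: str, summary: str) -> float:
    return 1.0  # placeholder judge returning a score in [0, 1]

def guarded_summarize(document: str) -> str | None:
    summary = summarize(document)
    score = reference_free_eval(document, summary)
    if score < GUARDRAIL_THRESHOLD:
        return None  # suppress low-scoring output instead of showing it
    return summary
```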
Guardrails
- Use Gemini guardrails to identify harmful/offensive output and PII (a simple stand-in screen is sketched after this list).
- Factual inconsistency guardrail (link).
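
The Gemini guardrail itself is a provider feature; as a crude illustration of the PII half only, a regex stand-in (the patterns are assumptions and far from exhaustive):

```python
import re

# Crude regex-based PII screen as a stand-in; a production system would use a
# dedicated safety/moderation service instead.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),
}

def contains_pii(text: str) -> bool:
    return any(p.search(text) for p in PII_PATTERNS.values())

assert contains_pii("mail me at alice@example.com")
assert not contains_pii("the chip has 7 metal layers")
```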
Production
- Development-prod skew: measure skew between LLM input/output pairs, e.g. length of inputs/outputs and adherence to specific formatting requirements. For more advanced drift detection, consider clustering embeddings to detect semantic drift (i.e. users discussing topics not seen before); a sketch follows this list.
- Hold-out datasets for evals; these must be reflective of real user interactions.
- Always log outputs. Store these in a separate DB.
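
For the drift-detection bullet, a centroid-distance sketch over embeddings; the alert threshold and the embedding model are left open:

```python
import numpy as np

def centroid(embeddings: np.ndarray) -> np.ndarray:
    return embeddings.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_score(dev_emb: np.ndarray, prod_emb: np.ndarray) -> float:
    """1 - cosine similarity between the dev-set centroid and the
    recent-production centroid; higher means prod traffic is drifting."""
    return 1.0 - cosine(centroid(dev_emb), centroid(prod_emb))

# dev_emb / prod_emb are (n, d) arrays from any embedding model.
# Alert when drift_score() exceeds a threshold tuned on historical windows;
# clustering prod_emb (e.g. k-means) can surface *which* new topics appeared.
```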
Data flywheel
- Links to the automated feedback loop: bad examples can be used to train hallucination classifiers, and relevance annotations can be used to train a relevance reward model (https://arxiv.org/abs/2009.01325).