TODO collections #50

@luarss

Description

  • CI Status badge
  • Use a structured output library? e.g. Instructor (see the sketch below this list).
  • Contextual retrieval: https://www.anthropic.com/news/contextual-retrieval
  • Have small prompts that do one thing, and only one thing, well. E.g. instead of a single catch-all prompt, split it into separate prompts that are simple, focused, and easy to understand, so each prompt can be evaluated separately.
  • RAG evaluation: MRR, NDCG.
  • RAG information density: if two documents are equally relevant, prefer the one that is more concise and has fewer erroneous details.
  • Multistep workflow: include reflection/CoT prompting (small tasks).
  • Increase output diversity beyond just raising temperature. E.g. when the user asks for a solution to problem XX, keep a list of recent responses and tell the LLM, "do not suggest any responses from the following:" (see the diversity sketch after this list).
  • Prompt caching: e.g. common functions. Use features like autocomplete, spelling correction, and suggested queries to normalize user input and increase the cache hit rate.
  • Simple assertion-based unit tests (example after this list).
  • Intent Classification: https://rasa.com/docs/rasa/next/llms/llm-intent/
- What does each tool abbreviation in OR mean?
- What are the supported public PDKs?
- What are the supported OSes?
- What are the social media links?
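
For the structured-output item above, a minimal sketch of what Instructor usage could look like, assuming an OpenAI-compatible backend; the model name and the `Answer` schema are placeholders, not project decisions:

```python
# Sketch: structured output via Instructor (assumes an OpenAI-compatible backend).
# The model name and the Answer schema are placeholders.
import instructor
from openai import OpenAI
from pydantic import BaseModel


class Answer(BaseModel):
    tool: str          # e.g. which OR tool the question is about
    answer: str
    confidence: float  # 0.0 - 1.0, self-reported by the model


client = instructor.from_openai(OpenAI())

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Answer,  # Instructor validates (and retries) against this schema
    messages=[{"role": "user", "content": "What does RSZ stand for?"}],
)
print(resp.tool, resp.confidence)
```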
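
For the output-diversity item, one way the "exclude recent responses" idea could be sketched; the prompt wording and the `call_llm` hook are illustrative only:

```python
# Sketch: increase output diversity by telling the model what it suggested recently.
# `call_llm` is a placeholder for whatever client the project actually uses.
from collections import deque

recent_responses: deque[str] = deque(maxlen=5)  # keep only the last few answers


def diverse_answer(question: str, call_llm) -> str:
    avoid = "\n".join(f"- {r}" for r in recent_responses)
    prompt = question
    if avoid:
        prompt += "\n\nDo not suggest any responses from the following list:\n" + avoid
    answer = call_llm(prompt)
    recent_responses.append(answer)
    return answer
```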
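
And a tiny assertion-based unit test in the spirit of the item above (pytest style; the stubbed answer and the checked properties are examples only):

```python
# Sketch: simple assertion-based checks on LLM output (pytest style).
# The generation call is stubbed here so the test is self-contained.
def fake_generate_answer(question: str) -> str:
    # Stand-in for the real generation pipeline.
    return "This is a placeholder answer with a docs link: https://example.com/docs"


def test_answer_is_well_formed():
    answer = fake_generate_answer("What are the supported OSes?")
    assert answer.strip(), "answer must not be empty"
    assert len(answer) < 2000, "answer should stay reasonably short"
    assert "http" in answer, "answer should cite a documentation link"
```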

Evals

  • https://docs.confident-ai.com/docs/synthesizer-introduction#save-your-synthetic-dataset
  • Pairwise evaluation. How is this different from normal (pointwise) scoring? Several responses (e.g. from different LLMs) may be rated with the same score on G-Eval; use pairwise evaluation to force a winner (see the judge-prompt sketch after this list).
  • Needle-in-a-haystack (NIAH) evals
  • If evals are reference-free, you can use them as a guardrail (do not show the output if it scores too low). E.g. summarization evals, where all you need is the input prompt (no summarization "reference" required). See the gating sketch after this list.
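
A possible shape for a pairwise judge prompt that forces a winner, per the item above; the judge model and wording are placeholders:

```python
# Sketch: pairwise evaluation prompt that forces the judge to pick a winner.
from openai import OpenAI

PAIRWISE_PROMPT = """You are judging two answers to the same question.

Question:
{question}

Answer A:
{answer_a}

Answer B:
{answer_b}

Which answer is better overall (accuracy, relevance, conciseness)?
Reply with exactly one letter: A or B."""


def pairwise_winner(question: str, answer_a: str, answer_b: str) -> str:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": PAIRWISE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b
            ),
        }],
    )
    return resp.choices[0].message.content.strip()  # expected "A" or "B"
```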
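
And a sketch of using a reference-free eval as a guardrail gate; `reference_free_score` stands in for whatever metric ends up being used (e.g. a G-Eval-style judge returning 0-1), and the threshold is arbitrary:

```python
# Sketch: gate the response on a reference-free eval score.
# `reference_free_score` is a placeholder judge; THRESHOLD is illustrative.
FALLBACK = "Sorry, I couldn't produce a reliable answer. Please rephrase the question."
THRESHOLD = 0.6


def guarded_output(prompt: str, output: str, reference_free_score) -> str:
    score = reference_free_score(prompt, output)  # needs only prompt + output
    return output if score >= THRESHOLD else FALLBACK
```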

Guardrails

  • Use Gemini guardrails to identify harmful/offensive output and PII.
  • factual inconsistency guardrail link

Production

  • Development-prod skew: measure skew between dev and prod LLM input/output pairs, e.g. input/output lengths and specific formatting requirements. For advanced drift detection, consider clustering embeddings to detect semantic drift (i.e. users discussing topics not seen before). See the sketch after this list.
  • Hold-out datasets for evals must be reflective of real user interactions.
  • Always log outputs; store them in a separate DB.
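
One simple starting point for the skew measurement above: compare dev vs prod length distributions with a two-sample KS test; embedding clustering for semantic drift would be a separate, heavier check (scipy assumed, threshold illustrative):

```python
# Sketch: detect dev-prod skew on input/output lengths with a two-sample KS test.
from scipy.stats import ks_2samp


def length_skew(dev_texts: list[str], prod_texts: list[str], alpha: float = 0.01) -> bool:
    """Return True if the length distributions differ significantly."""
    dev_lens = [len(t) for t in dev_texts]
    prod_lens = [len(t) for t in prod_texts]
    stat, p_value = ks_2samp(dev_lens, prod_lens)
    return p_value < alpha
```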

Data flywheel

References
