Here is a list of the 75+ LLM evaluation methods, GitHub repos, tools, and blogs I could find (as of Nov 2023). The order is random.
- Blog by Lilian Weng
- EdinburghNLP repo with lots and lots of material on evaluation
- Rajiv Shah's Repo on LLM Evaluation
- Harpreet's Repo using LangChain to Evaluate Models: Session 7
- RAGAS
- Giskard - Test LLMs
- Auto Evaluator
- ReLM
- TruLens
- Guardrails
- NeMo Guardrails
- DeepEval
- PromptFoo - Prompt Evals
- Thumb - Prompt Testing
- Prompt Injection Protection
- PromptBench
- Fact Checker
- LangTest
- Evaluation Harness
- Outlines
- Lakera (Not fully open sourced)
- SmartLLMChain
- LLMCheckerChain
- LLMInformationExtraction Notebook
- Chain of Thought Prompting - Material 1
- Chain of Thought Prompting - Material 2
- Tree of Thought Prompting by Princeton NLP
- Tree of Thought Prompting Material 2
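The chain-of-thought entries above are prompting techniques rather than tools, so a tiny sketch may help. The prompt wording below is illustrative, not taken from the linked papers; the helper names are my own:

```python
# Minimal sketch of zero-shot and few-shot chain-of-thought (CoT) prompting.
# Prompt phrasing is illustrative; only the overall pattern follows the papers.

def zero_shot_cot(question: str) -> str:
    """Append the classic zero-shot CoT trigger phrase to a question."""
    return f"Q: {question}\nA: Let's think step by step."

def few_shot_cot(question: str, exemplars: list[tuple[str, str]]) -> str:
    """Prepend worked examples whose answers spell out intermediate reasoning."""
    parts = [f"Q: {q}\nA: {reasoning}" for q, reasoning in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplars = [
    ("If I have 3 apples and buy 2 more, how many do I have?",
     "Start with 3 apples. Buying 2 more gives 3 + 2 = 5. The answer is 5."),
]
prompt = few_shot_cot("A train has 4 cars with 10 seats each. How many seats?",
                      exemplars)
print(prompt)
```

Tree-of-thought generalizes this by branching into several candidate reasoning steps and searching over them instead of committing to one linear chain.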
- LLM - Eval Survey
- LLM Eval Comprehensive survey paper (111 pages 🙂)
- Verify CoT
- LLM-Augmenter
- LangChain Different Criteria
- Check your facts and try again
- Researching and Revising What Language Models Say
- Fact-Checking Complex Claims with Program-Guided Reasoning
- Repo + Paper -> SAC³: Reliable Hallucination Detection in Black-Box Language Models via Semantic-aware Cross-check Consistency
- Hallucination detection: Robustly discerning reliable answers in Large Language Models
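The two hallucination-detection entries above both build on consistency checking: sample several answers to the same question and flag low mutual agreement. A minimal sketch of that idea, where the Jaccard similarity is a stand-in for a real semantic-equivalence check (e.g., an NLI model) and the threshold is an assumption:

```python
# Consistency-based hallucination detection sketch: low agreement across
# sampled answers suggests the model may be fabricating.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Toy lexical similarity; a real system would use semantic comparison."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def consistency_score(answers: list[str]) -> float:
    """Mean pairwise similarity across sampled answers (1.0 = fully consistent)."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def likely_hallucinated(answers: list[str], threshold: float = 0.5) -> bool:
    return consistency_score(answers) < threshold

samples = ["Paris is the capital of France.",
           "Paris is the capital of France."]
print(consistency_score(samples))  # 1.0
```

Cross-check variants (as in SAC³) additionally perturb the question or swap in a second model before comparing answers, which catches consistently wrong answers that self-sampling alone would miss.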
- Victor Dibia
- Evaluation, Measurements and Some Solutions
- Kellton - Techniques
- FLARE
- Seminar by Galileo and DeepLearning.AI
- Galileo Blog on framework to detect and reduce hallucinations
- Fixing Hallucinations
- Chain of Verification for Detecting Hallucinations
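Chain-of-Verification (CoVe) is a four-step loop: draft an answer, plan verification questions, answer them independently, then revise. A hedged sketch of that control flow, where `llm` is a placeholder for any text-completion function and the prompts are illustrative, not the paper's exact wording:

```python
# Sketch of the Chain-of-Verification (CoVe) loop. The `llm` callable and all
# prompt strings are assumptions used to show the structure of the method.
from typing import Callable

def chain_of_verification(question: str, llm: Callable[[str], str]) -> str:
    draft = llm(f"Answer the question.\nQ: {question}")
    plan = llm(f"List verification questions that fact-check this answer:\n{draft}")
    # Answer each verification question in isolation from the draft.
    checks = [llm(f"Answer concisely: {q}") for q in plan.splitlines() if q.strip()]
    return llm(
        "Revise the draft so it is consistent with the verification answers.\n"
        f"Draft: {draft}\nVerification answers: {checks}\nQ: {question}"
    )

# Toy stand-in LLM so the sketch runs end to end without an API.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("List verification"):
        return "Is the claim supported?"
    if prompt.startswith("Revise"):
        return "revised answer"
    return "draft answer"

print(chain_of_verification("Who wrote Hamlet?", fake_llm))  # revised answer
```

Answering the verification questions without showing the model its own draft is the key design choice: it keeps the checks from simply repeating the original hallucination.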
- MLflow Blog
- DeepChecks (Paid + BETA)
- LlamaIndex + TruLens
- Scale
- Medium: Testing Large Language Models Like We Test Software
- V7 Blog
- Microsoft Blog
- LLM Eval