Salesforce Enterprise Deep Research
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
A comprehensive review of code-domain benchmarks in LLM research.
A comprehensive guide to LLM evaluation methods, designed to help identify the most suitable techniques for various use cases, promote best practices in LLM assessment, and critically assess the effectiveness of these methods.
Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter updates.
A benchmark for prompt injection detection systems.
AlgoTune is a NeurIPS 2025 benchmark made up of 154 math, physics, and computer science problems. The goal is to write code that solves each problem faster than existing implementations.
A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks
[CVPR 2025] Program synthesis for 3D spatial reasoning
LLM-KG-Bench is a Framework and task collection for automated benchmarking of Large Language Models (LLMs) on Knowledge Graph (KG) related tasks.
A dynamic forecasting benchmark for LLMs
(NeurIPS 2025) Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
Take your LLM to the optometrist.
[MM 2025] A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
[AAAI 2025] ORQA is a new QA benchmark designed to assess the reasoning capabilities of LLMs in a specialized technical domain of Operations Research. The benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when presented with complex optimization modeling tasks.
BizFinBench.v2: A Unified Offline–Online Bilingual Benchmark for Expert-Level Financial Capability Evaluation of LLMs
RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects - IEEE LAD'24
Test your local LLMs on the AIME problems
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.