[ICCV 2025] AdsQA: Towards Advertisement Video Understanding. arXiv: https://arxiv.org/abs/2509.08621
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
Testing how well LLMs can solve jigsaw puzzles
🚀 A modern, production-ready refactor of the LoCoMo long-term memory benchmark.
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Benchmark LLMs Spatial Reasoning with Head-to-Head Bananagrams
Yes, LLMs just regurgitate the same jokes from the internet over and over again. But some are slightly funnier than others.
Claude Code skill that pits Claude, ChatGPT, and Gemini against each other, then has them blindly cross-judge one another's answers
"Is it better to run a tiny model (2B-4B) at high precision (FP16/INT8), or a large model (8B+) at low precision (INT4)?" This benchmark framework lets developers scientifically choose the best model for resource-constrained environments (consumer GPUs, laptops, edge devices) by measuring the trade-off between speed and intelligence.
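The core measurement behind a trade-off question like this is simple enough to sketch. Below is a minimal, hypothetical harness (the `benchmark` helper, its `generate_fn` callable, and the toy dataset are illustrative assumptions, not this repo's API) that times a generation function and scores its answers, yielding the speed and intelligence numbers being traded off:

```python
# Hypothetical sketch: time one model configuration and score its answers.
import time

def benchmark(generate_fn, dataset):
    """Return (tokens_per_second, accuracy) for one model configuration."""
    total_tokens, total_time, correct = 0, 0.0, 0
    for prompt, expected in dataset:
        start = time.perf_counter()
        output = generate_fn(prompt)           # e.g. a 2B FP16 or 8B INT4 runtime
        total_time += time.perf_counter() - start
        total_tokens += len(output.split())    # crude whitespace token proxy
        correct += int(expected.lower() in output.lower())
    return total_tokens / max(total_time, 1e-9), correct / len(dataset)

# Trivial stand-in "model" so the harness runs end to end:
dataset = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
tps, acc = benchmark(lambda p: "4" if "2 + 2" in p else "Paris", dataset)
print(f"{tps:.1f} tok/s, accuracy {acc:.0%}")
```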
The hundred-eyed watcher for your LLM providers. Monitor uptime, TTFT, TPS, and latency across OpenAI, Anthropic, Azure, Bedrock, Ollama, LM Studio, and 100+ providers through a single dashboard. Benchmark, compare, and get alerts — all self-hosted.
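For reference, TTFT and TPS can be measured from any streaming chat endpoint. The sketch below assumes the OpenAI Python SDK's streaming interface; the whitespace-based token count and the commented-out model name are placeholders, not this dashboard's internals:

```python
# Sketch: measure time-to-first-token (TTFT) and tokens-per-second (TPS)
# from a streaming chat completion, assuming the OpenAI Python SDK.
import time
from openai import OpenAI

def measure(model: str, prompt: str):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    start = time.perf_counter()
    first_token_at, text = None, []
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # first streamed token
            text.append(chunk.choices[0].delta.content)
    elapsed = time.perf_counter() - start
    ttft = (first_token_at or start) - start
    tps = len("".join(text).split()) / max(elapsed - ttft, 1e-9)
    return ttft, tps

# ttft, tps = measure("gpt-4o-mini", "Explain TTFT in one sentence.")
```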
GateBench is a challenging benchmark for Vision Language Models (VLMs) that tests visual reasoning by requiring models to extract boolean algebra expressions from logic gate circuit diagrams.
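Scoring such a task typically reduces to checking logical equivalence between the predicted expression and the ground truth. A minimal sketch, assuming Python-syntax boolean expressions (an illustrative choice, not necessarily GateBench's format), is brute-force truth-table comparison:

```python
# Sketch: check whether a predicted boolean expression matches the ground
# truth by enumerating the full truth table over the circuit inputs.
from itertools import product

def equivalent(pred: str, truth: str, variables: tuple[str, ...]) -> bool:
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if eval(pred, {}, env) != eval(truth, {}, env):
            return False
    return True

# De Morgan's law: NAND(A, B) == OR(NOT A, NOT B)
print(equivalent("not (A and B)", "(not A) or (not B)", ("A", "B")))  # True
```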
Compare how vision models reason about images — not just their accuracy scores
UrduReason-Eval: A comprehensive evaluation dataset with 800 Urdu reasoning problems across 6 categories (arithmetic, logical deduction, temporal, comparative, and causal reasoning) for assessing reasoning capabilities in Urdu language models.
Automatically collects Bilibili "hardcore member" quiz data and generates an LLM evaluation dataset.
Gemma3 RAG benchmark system for Japanese river/dam/erosion control technical standards.
Systematic benchmark comparing Claude Haiku 4.5 vs MiniMax M2.1 on agentic coding tasks. Includes full audit trails, LLM-as-judge evaluation, and path divergence analysis.
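As a rough illustration of the LLM-as-judge pattern named here (the rubric wording and JSON verdict schema are assumptions, not this benchmark's prompts), a blind comparison can be as simple as labeling the candidates A and B and asking for a structured verdict:

```python
# Sketch: blind, rubric-based LLM-as-judge prompt plus verdict parsing.
import json

RUBRIC = "Judge correctness, code quality, and faithfulness to the task."

def judge_prompt(task: str, patch_a: str, patch_b: str) -> str:
    # Candidates are labeled A/B only, so the judge never sees model names.
    return (
        f"Task:\n{task}\n\nCandidate A:\n{patch_a}\n\nCandidate B:\n{patch_b}\n\n"
        f"{RUBRIC}\nReply with JSON: {{\"winner\": \"A\"|\"B\"|\"tie\", \"reason\": \"...\"}}"
    )

def parse_verdict(reply: str) -> dict:
    verdict = json.loads(reply)
    assert verdict["winner"] in {"A", "B", "tie"}
    return verdict

print(judge_prompt("Fix the off-by-one bug", "patch A ...", "patch B ..."))
```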
WordleBench — Deterministic AI Wordle benchmark. Compare 34+ LLMs (GPT-5, Claude 4.5, Gemini, Grok, Llama) head-to-head on accuracy, speed, and cost across 50 standardized words.
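Wordle feedback itself is deterministic and easy to reproduce; a benchmark like this presumably relies on an oracle along these lines (the function name and two-pass duplicate handling below are illustrative, not taken from the repo):

```python
# Sketch: deterministic Wordle feedback with correct duplicate-letter handling.
from collections import Counter

def score_guess(guess: str, answer: str) -> str:
    """Return per-letter feedback: G (green), Y (yellow), or - (gray)."""
    feedback = ["-"] * len(guess)
    remaining = Counter()

    # First pass: mark exact matches, count the unmatched answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
        else:
            remaining[a] += 1

    # Second pass: mark misplaced letters, consuming remaining counts
    # so duplicate letters are not over-credited.
    for i, g in enumerate(guess):
        if feedback[i] == "-" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1

    return "".join(feedback)

assert score_guess("crane", "caper") == "GYY-Y"
```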
🔍 Evaluate AI models' ability to detect ambiguity and manage uncertainty with the ERR-EVAL benchmark for reliable epistemic reasoning.