Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.
Frontier models playing the board game Diplomacy.
An agent benchmark with tasks in a simulated software company.
Ranking LLMs on agentic tasks
🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems.
GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities
This repository contains the results and code for the MLPerf™ Storage v2.0 benchmark.
MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments and tool use. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek, Mistral AI, xAI, Alibaba), custom tasks in YAML, and HTML/CSV reports.
PlayBench is a platform that evaluates AI models by having them compete in various games and creative tasks. Unlike traditional benchmarks that focus on text generation quality or factual knowledge, PlayBench tests models on skills like strategic thinking, pattern recognition, and creative problem-solving.
TrustyAI's LMEval provider for Llama Stack
Open-source benchmark for real-world AI performance
NQMP is a tiny, deterministic LLM benchmark focused on logical sensitivity to small prompt flips.
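The prompt-flip idea lends itself to a small illustration. Below is a minimal sketch of such a consistency check; the `ask_model` stub stands in for a real LLM call, and the whole harness is a hypothetical example, not NQMP's actual code:

```python
# Minimal sketch of a prompt-flip consistency check (hypothetical,
# not NQMP's actual harness). Idea: ask the same yes/no question
# twice, once logically negated, and check that the answers disagree.

def ask_model(prompt: str) -> str:
    """Stand-in for a real LLM call; swap in your provider's API."""
    # Deterministic dummy so the sketch runs as-is.
    return "no" if "not" in prompt else "yes"

FLIP_PAIRS = [
    # (original prompt, logically flipped prompt)
    ("Is 7 greater than 3?", "Is 7 not greater than 3?"),
    ("Is every square a rectangle?", "Is some square not a rectangle?"),
]

def consistent(a: str, b: str) -> bool:
    """Flipped prompts should get opposite yes/no answers."""
    return {a.strip().lower(), b.strip().lower()} == {"yes", "no"}

score = sum(consistent(ask_model(p), ask_model(q)) for p, q in FLIP_PAIRS)
print(f"consistent on {score}/{len(FLIP_PAIRS)} flip pairs")
```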
A living ecosystem for diffusion models - Automatically tracking, benchmarking, and organizing foundation diffusion models and their fine-tuned variants. Features real-time updates, unified evaluation, and comprehensive resources.
Challenge your AI's algorithmic thinking with GTA Benchmark! Reverse-engineer transformations from input-output pairs. Join now!
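To make the task style concrete, here is a minimal sketch of what a guess-the-algorithm round might look like; the pair format and the `guessed_rule` function are illustrative assumptions, not GTA Benchmark's actual schema:

```python
# Illustrative guess-the-algorithm round (hypothetical task format,
# not GTA Benchmark's actual schema). The solver sees a few
# input -> output pairs and must propose a rule that generalizes.

examples = [(2, 5), (3, 7), (10, 21)]  # visible pairs: y = 2x + 1
holdout = (6, 13)                      # hidden pair for verification

def guessed_rule(x: int) -> int:
    """A candidate transformation a model might propose."""
    return 2 * x + 1

fits = all(guessed_rule(x) == y for x, y in examples)
generalizes = guessed_rule(holdout[0]) == holdout[1]
print(f"fits examples: {fits}, generalizes: {generalizes}")
```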