An MNIST-like fashion product database. Benchmark 👇
OpenMMLab Pose Estimation Toolbox and Benchmark.
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Benchmarks of approximate nearest neighbor libraries in Python
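The core quantity these ANN benchmarks report is recall against an exact search at a given query speed. A minimal, stdlib-only sketch of that measurement follows; the brute-force search is the ground truth, and the random-subset "index" is a hypothetical stand-in for a real approximate-nearest-neighbor library:

```python
import math
import random

def brute_force_knn(query, points, k):
    """Exact k nearest neighbors by exhaustive search (the ground truth)."""
    return sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true neighbors that the approximate index returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

random.seed(0)
points = [[random.random() for _ in range(8)] for _ in range(200)]
query = [random.random() for _ in range(8)]

exact = brute_force_knn(query, points, k=10)
# Stand-in for an approximate index: search only a random subset of the points.
candidates = random.sample(range(len(points)), 100)
approx = sorted(candidates, key=lambda i: math.dist(query, points[i]))[:10]

print(f"recall@10 = {recall_at_k(approx, exact):.2f}")
```

Real benchmark suites sweep each library's index parameters and plot recall against queries per second; this sketch only shows the metric itself.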
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
SWE-bench: Can Language Models Resolve Real-world GitHub Issues?
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus, and leaderboard
Python package for the evaluation of odometry and SLAM
A series of large language models developed by Baichuan Intelligent Technology
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
⚡FlashRAG: A Python Toolkit for Efficient RAG Research (WWW2025 Resource)
MTEB: Massive Text Embedding Benchmark
A full-stack AI Red Teaming platform securing AI ecosystems via AI Infra scan, MCP scan, Agent skills scan, and LLM jailbreak evaluation.
A 13B large language model developed by Baichuan Intelligent Technology
A unified evaluation framework for large language models
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
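A standard metric reported by such IR benchmarks is nDCG@k, which compares the discounted gain of a system's ranking against the ideal ranking. A short stdlib-only sketch, with hypothetical relevance judgments (qrels) and a hypothetical ranking:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_doc_ids, qrels, k):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0

# Hypothetical graded judgments for one query: doc id -> relevance grade.
qrels = {"d1": 3, "d2": 2, "d5": 1}
ranking = ["d2", "d1", "d9", "d5"]  # hypothetical system output

print(f"nDCG@4 = {ndcg_at_k(ranking, qrels, 4):.3f}")
```

Averaging this per-query score over a dataset's query set gives the single number that leaderboards report.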
A machine learning toolkit for log parsing [ICSE'19, DSN'16]