
EVAL SYS

Evaluation Systems Organization

EVAL SYS is a living, open-source community for tracking and advancing the agentic capabilities of models. We will be releasing benchmarks, datasets, toolchains, and models to push the field forward. The project was initiated by LobeHub, and we would love to collaborate with research labs, MCP servers, independent contributors, and more.

Join us, contribute, or reach out!


MCPMark: Stress-Testing Comprehensive MCP Use

An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).

MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.
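As a rough illustration of what "auto-resume for failures" and "unified metrics" can mean in a benchmark runner, here is a minimal, generic sketch. It does not use MCPMark's actual API: `run_with_resume`, `success_rate`, and the JSON state file are hypothetical names chosen for this example. It also assumes that a task which raises an error counts as an infrastructure failure to be retried on the next run, while a task that returns a verdict (pass or fail) is recorded and skipped afterwards.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict


def run_with_resume(tasks: Dict[str, Callable[[], bool]], state_path: Path) -> Dict[str, bool]:
    """Run tasks, skipping any whose verdict is already stored in state_path.

    Hypothetical helper for illustration; not part of MCPMark.
    """
    results: Dict[str, bool] = {}
    if state_path.exists():
        # Reload verdicts saved by an earlier (possibly interrupted) run.
        results = json.loads(state_path.read_text())

    for name, task in tasks.items():
        if name in results:
            continue  # already has a verdict, do not re-run
        try:
            verdict = bool(task())
        except Exception:
            # Assumed policy: errors are not recorded, so the next run retries the task.
            continue
        results[name] = verdict
        # Persist after every task so an interruption loses at most one result.
        state_path.write_text(json.dumps(results))
    return results


def success_rate(results: Dict[str, bool]) -> float:
    """Fraction of recorded tasks that passed (0.0 when nothing is recorded)."""
    return sum(results.values()) / len(results) if results else 0.0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "state.json"
        demo_tasks = {"task_a": lambda: True, "task_b": lambda: False}
        outcome = run_with_resume(demo_tasks, state)
        print(outcome, success_rate(outcome))
```

Writing the state file after each task keeps the resume logic simple: a second invocation with the same `state_path` only runs tasks that have no recorded verdict yet.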


Pinned

  1. mcpmark (Public)

    MCP Servers are shaping the future of software. MCPMark is a comprehensive, stress-testing benchmark designed to evaluate model and agent capabilities in real-world MCP use.

    Python · 118 stars · 5 forks

  2. mcpmark-experiments (Public)

    Collection of evaluation results for MCPMark

    1 star
