OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

Python 4,920 523 Updated Mar 13, 2025

noahshinn / reflexion

[NeurIPS 2023] Reflexion: Language Agents with Verbal Reinforcement Learning

Python 2,618 254 Updated Jan 14, 2025

zou-group / textgrad

TextGrad: Automatic "Differentiation" via Text -- using large language models to backpropagate textual gradients.

Python 2,132 184 Updated Mar 13, 2025

CS-BAOYAN / CSSummerCamp2023

Python 1,714 187 Updated Aug 21, 2023

evalplus / evalplus

Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024

Python 1,397 141 Updated Jan 6, 2025

OpenBMB / UltraEval

[ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models.

Python 233 21 Updated Oct 30, 2024

kaistAI / FLASK

[ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

Python 214 18 Updated Dec 24, 2023

MoreAgentsIsAllYouNeed / AgentForest

We present the first systematic study on the scaling property of raw agents instantiated by LLMs. We find that performance scales with the increase in the number of agents, using the simple(st) way…

Python 112 13 Updated Oct 8, 2024

SakanaAI / CycleQD

CycleQD is a framework for parameter space model merging.

Python 34 3 Updated Feb 1, 2025

moyi moyi-qwq

Stars

QwenLM / Qwen

Jack-Cherish / PythonPark

EleutherAI / lm-evaluation-harness

confident-ai / deepeval

open-compass / opencompass