[OpenEval][TZ] Add LLM leaderboard for Roblox Studio Assistant in OpenEval open-sour… #2

tzhangR · 2025-10-15T22:23:44Z

Create a leaderboard for various LLM models used in Roblox Studio Assistant, including their pass rates and safety metrics. The data is pulled from https://docs.google.com/document/d/1Hdy8bp5VvqRZ7JGLvjDReLzjSlNneBfO5OS5cO8kBiw/edit?tab=t.0

kayyar-roblox · 2025-10-16T02:47:26Z

[Important] Perhaps we should not include the safety score in the public eval, and instead mention that in practice we suggest prompt tuning these models and deploying them with safety filters, or something similar.
Perhaps add Claude 4.5 Haiku, released today?
Perhaps mention that in practice we see far deeper agentic trajectories, and that this will be an area of focus for improvement in the future?
Perhaps make the winner of each category in the leaderboard bold?
GLM 4.6 is missing the explanation rate with tools.
Maybe add cost too? SWE-bench has a cost field.
For GLM / Qwen, perhaps also mention the inference stack and versions?

tzhangR marked this pull request as ready for review October 15, 2025 22:29

tzhangR requested review from erayturkel, kayyar-roblox and msun-rblx October 15, 2025 22:31

msun-rblx approved these changes Oct 15, 2025
