@tzhangR tzhangR commented Oct 15, 2025

Create a leaderboard for the LLMs used in Roblox Studio Assistant, including their pass rates and safety metrics. The data is pulled from https://docs.google.com/document/d/1Hdy8bp5VvqRZ7JGLvjDReLzjSlNneBfO5OS5cO8kBiw/edit?tab=t.0


…ce repo

This file contains a leaderboard for the LLMs used in Roblox Studio Assistant, including their pass rates and safety metrics. The data is pulled from https://docs.google.com/document/d/1Hdy8bp5VvqRZ7JGLvjDReLzjSlNneBfO5OS5cO8kBiw/edit?tab=t.0
@tzhangR tzhangR marked this pull request as ready for review October 15, 2025 22:29
@kayyar-roblox
Collaborator

  1. [Important] Perhaps we should leave the safety score out of the public eval, and instead mention that in practice we suggest prompt-tuning these models and deploying them with safety filters, or something similar.
  2. Perhaps add Claude 4.5 Haiku, released today?
  3. Perhaps mention that in practice we see far deeper agentic trajectories, and that this will be an area of focus for improvement in the future?
  4. Perhaps make the winner of each category in the leaderboard bold?
  5. GLM 4.6 is missing the explanation rate with tools.
  6. Maybe add cost too? SWE-bench has a cost field.
  7. For GLM / Qwen, perhaps also mention the inference stack and versions?