v0.5.5

@yifanmai released this 04 Apr 16:07
· 13 commits to main since this release
338b943

Breaking Changes

  • Optimum Intel OpenVINO is no longer supported (#3153)

Scenarios

  • HELM Capabilities scenarios
  • IBM enterprise scenarios
    • Add Gold Commodity News scenario (#3065)
    • Add Legal Contract Summarization scenario (#3131)
    • Add CaseHOLD scenario (#3164)
    • Add SUMO Web Claims Summarization scenario (#3112)
    • Add CTI-to-MITRE scenario (#3249)
    • Add Legal Opinion Sentiment Classification scenario (#3286)
    • Add Financial Phrasebank scenario (#3302)
    • Add ConvFinQACalc (#3453)
    • Add ECHR Judgment Classification scenario (#3311)
  • Vision-language model scenarios
    • Fix Image2struct v1.0.1 (#3061)
    • Fix "science & technology" subset of MMSTAR (#3107)
    • Fix R/B channel switch in skin tone calculation (#2589)
  • Speech / audio model scenarios
    • Add AudioMNIST scenario (#3093)
    • Add CoVoST-2: Speech Machine Translation (#3106)
    • Add Vocal Sound scenario (#3130)
    • Add Multilingual Librispeech (#3130, #3423)
    • Add AudioCaps scenario (#3137)
    • Add IEMOCAP Audio scenario (#3139)
    • Add MELD Audio scenario (#3142)
    • Add FLEURS scenario (#3130, #3151, #3287, #3299)
    • Add Casual Conversation V2 audio scenario (#3158)
    • Add Common_Voice_15 and RSB audio scenarios (#3147)
    • Add Audio PAIRS audio scenario (#3149)
    • Add VoxCeleb2Scenario for audio identification (#3179)
    • Add AIR-Bench chat and foundation audio scenarios (#3189, #3362, #3486)
    • Add MuTox Scenario (#3343)
    • Add MUStARDScenario for sarcasm detection (#3345)
    • Add AMI, LibriSpeech audio scenarios (#3355)
    • Add VoiceJailbreakAttacksScenario (#3361)
    • Add GPT4 evaluator for open-ended audio scenarios (#3417)
    • Add LibriSpeech and FLEURS gender fairness audio scenarios (#3418)
    • Add PARADE audio scenario (#3424)
  • Misc
    • Add AutoBencher Capabilities scenario (#3178)
    • Add AutoBencher Safety Scenario (#3165)
    • Add ENEM Challenge Scenario (#3185)
    • Add BigCodeBench (#3186, #3310)
    • Add experimental CzechBankQA scenario (#3222, #3240)
    • Add TweetSentBR Scenario (#3219)
    • Add HarmBench GCG-T (#3035)
    • Add MMLU and Winogrande human-translated into 11 African languages (#3237, #3256)
    • Pin revision in many invocations of Hugging Face load_dataset() (#3124)
    • Add BIRD SQL scenario (#3292, #3312)
    • Add Spider 1.0 scenario (#3300, #3371)
    • Add helpdesk call summarization scenario (#3303, #3387, #3388)
    • Add IMDB_PTBR Scenario (#3284)
    • Add OAB Exams Scenario (#3329)
    • Disable test_math_scenario due to DMCA takedown (#3315)
    • Add InfiniteBenchSum scenario (#3409, #3476)
    • Add HotPotQA and SQuAD scenarios from RULER (#3411, #3476)
    • Allow using alternate annotator models for AIR-Bench 2024 (#3468, #3488)
    • Set trust_remote_code for TyDiQA and BANKING77 (#3473, #3477)
  • MedHELM scenarios
    • Fix LiveQA scenario (#3244)
    • Add MedHallu scenario (#3483)
    • Add MIMIC-IV-BHC scenario (#3459)
    • Add all conditions in CLEAR scenario (#3466)
    • Add judges to many MedHELM scenarios (#3484)
    • Add or update 31 scenarios for MedHELM v1.0.0 (#3038, #3403, #3492, #3493):
      ACI-Bench,
      ADHD-Behavior,
      ADHD-MedEffects,
      BMT-Status,
      CDI-QA,
      CLEAR-AD,
      ClinicReferral,
      DischargeMe,
      EHRSHOT,
      EHRSQL,
      ENT-Referral,
      HeadQA,
      HospiceReferral,
      MedAlign,
      Medbullets,
      MedCalc-Bench,
      MedConfInfo,
      MedDialog,
      Medec,
      MedicationQA,
      MEDIQA,
      MentalHealth,
      MIMIC-IV Billing Code,
      MIMIC-RRS,
      MTSamples,
      MTSamples Procedures,
      N2C2-CT Matching,
      NoteExtract,
      PatientInstruct,
      PubMedQA,
      RaceBias

Models

  • Add Mistral and Llama models on AWS Bedrock (#3034, #3092, #3095)
  • Add Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002 models (#3032, #3085)
  • Add Qwen2.5 Instruct Turbo models on Together AI (#3063)
  • Add Anthropic Claude 3.5 Sonnet (20241022) models (#3082)
  • Add Mistral Pixtral (2409) (#3073)
  • Add GPT-4o Audio Preview (2024-10-01) model (#3091)
  • Add Qwen-Audio-Chat and Qwen2-Audio-Instruct (#3104, #3298, #3474)
  • Allow setting device for Hugging Face models (#3109)
  • Add Mistral Small and Ministral models (#3110)
  • Add Llama-Omni-8B (#3119)
  • Treat missing AI21 message content as empty string (#3123)
  • Add stop sequence support to MistralClient (#3120)
  • Deprecate OpenAI legacy completions API (#3144)
  • Add Grok Beta model (#3145)
  • Add Diva Llama model (#3148)
  • Remove OpenVINO support (#3153)
  • Add support for IBM Granite models on Hugging Face (#3166, #3261)
  • Add Claude 3.5 Haiku model (#3171)
  • Add Pixtral Large and Mistral Large (2411) models (#3177)
  • Add Upstage Solar models (#3181, #3198)
  • Add Llama 3.1 Nemotron Instruct (70B) model on Together AI (#3172)
  • Add NECTEC model (#3197)
  • Add Llama 3.3 model (#3202)
  • Add Maritaca AI model (Sabiá 7B) (#3185)
  • Add gemini-2.0-flash-exp model (#3210)
  • Add Qwen 2 VLM (#3247)
  • Add Amazon Nova models (#3251, #3252, #3263, #3264, #3408, #3442)
  • Add DeepSeek v3 model (#3253)
  • Simplify credential management for Bedrock client (#3255)
  • Add Llama 3.1 Instruct on Vertex AI (#3278)
  • Add Phi 3.5 models (#3306)
  • Add Mistral Small 3 model (#3308)
  • Add QwQ model on Together AI (#3307)
  • Add DeepSeek-R1 model (#3305)
  • Add o3-mini model (#3304)
  • Handle content filtering from Azure OpenAI (#3319, #3321, #3327)
  • Add DeepSeek R1 Distill Llama 8B and DeepSeek Code Instruct 6.7B (#3332)
  • Add a version of DeepSeek R1 that hides thinking tokens from output (#3335, #3485)
  • Add OpenAI audio models (#3346)
  • Add Claude 3.7 model (#3366)
  • Add GPT-4.5 model (#3382)
  • Add gemini-2.0-flash-thinking-exp-01-21 model (#3410)
  • Add SEA-HELM leaderboard and SEA-LIONv3 models (#3347)
  • Add OpenAITranscriptionThenCompletionClient (#3416)
  • Add request response format JSON schema support (#3415)
  • Make Azure OpenAI deployment name configurable (#3421)
  • Use Anthropic tokenizer from Hugging Face (#3467)
  • Add Palmyra Fin model (#3475)
  • Add Mistral Small 3.1 model (#3478)
  • Add support for Phi 3.5 through Azure (#3489)
  • Add IBM Granite models hosted on IBM WatsonX (#3441)
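
For readers unfamiliar with what "hides thinking tokens from output" (#3335, #3485) means in practice, here is a minimal sketch. It assumes the reasoning is wrapped in <think>...</think> tags, as DeepSeek R1 emits it. The function name strip_thinking is hypothetical and this is not HELM's actual implementation, which may differ.

```python
import re

# Assumption: reasoning is delimited by <think>...</think> tags.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks and return only the answer text.

    Hypothetical helper for illustration; not part of HELM's API.
    """
    without_blocks = _THINK_BLOCK.sub("", text)
    # If generation was cut off mid-reasoning, the closing tag is missing;
    # drop everything from the unterminated opening tag onward.
    unterminated = without_blocks.find("<think>")
    if unterminated != -1:
        without_blocks = without_blocks[:unterminated]
    return without_blocks.strip()


if __name__ == "__main__":
    raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
    print(strip_thinking(raw))
```

Stripping the reasoning before scoring keeps exact-match style metrics from comparing the reference against the model's scratch work.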

Frontend

  • Add web player for audio objects (#3098)
  • Add a badge indicating if the release is latest or stale (#3116, #3126)
  • Change title for HELM leaderboards (#3117)
  • Add functionality for linking directly to instances in Predictions page (#3121)
  • Improve leaderboard frontend navigation (#3330)
  • Display messages in instances and requests in frontend (#3336, #3341)
  • Add latest to the frontend version dropdown (#3338)
  • Allow overriding Vite base URL with VITE_HELM_FRONTEND_BASE environment variable (#3426, #3428)
  • Load project_metadata.json from the website rather than GitHub (#3427)

Framework

  • Add extra_data field to Instance (#3094)
  • Fix mean aggregation for tables (#3127, #3309)
  • Add encryption for stored data for GPQA (#3216, #3242)
  • Allow running recipes from the Unitxt catalog (#3267)
  • Add support to redact model outputs (#3301)
  • Allow processing output before metrics for reasoning models (#3333)
  • Display instances' extra_data in frontend (#3340)
  • Allow arguments to be passed into annotators (#3487)
  • Automatically apply the model=all run expander (#3491)
  • Add support for adaptive sampling based on Reliable and Efficient Amortized Model-based Evaluation (Truong et al., 2025) (#3397)

Metrics

  • Flip LPIPS so that '1' is better (#3055)
  • Fix incorrect handling of labels in ClassificationMetric (#3289)
  • Fix ASR WER and MER metrics (#3296)
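
As background for the ASR metrics fix (#3296), the sketch below computes word error rate (WER) and match error rate (MER) from a word-level alignment using their standard definitions. It is an illustration only, not HELM's implementation; the names align, wer, and mer are hypothetical. When several minimum-edit alignments exist, this sketch assumes a tie-break in favour of more hits, which other tools may not share.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    hits: int
    substitutions: int
    deletions: int
    insertions: int


def _key(a: Alignment):
    # Fewest edits first; among ties, prefer the alignment with more hits.
    return (a.substitutions + a.deletions + a.insertions, -a.hits)


def align(reference: List[str], hypothesis: List[str]) -> Alignment:
    """Minimum-edit word alignment via dynamic programming."""
    # prev[j]: best alignment of reference[:i-1] with hypothesis[:j]
    prev = [Alignment(0, 0, 0, j) for j in range(len(hypothesis) + 1)]
    for i, ref_word in enumerate(reference, start=1):
        curr = [Alignment(0, 0, i, 0)]  # hypothesis empty: i deletions
        for j, hyp_word in enumerate(hypothesis, start=1):
            diag = prev[j - 1]
            if ref_word == hyp_word:
                match = diag._replace(hits=diag.hits + 1)
            else:
                match = diag._replace(substitutions=diag.substitutions + 1)
            delete = prev[j]._replace(deletions=prev[j].deletions + 1)
            insert = curr[j - 1]._replace(insertions=curr[j - 1].insertions + 1)
            curr.append(min(match, delete, insert, key=_key))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, where N = H + S + D is the reference length."""
    a = align(reference.split(), hypothesis.split())
    n = a.hits + a.substitutions + a.deletions
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return (a.substitutions + a.deletions + a.insertions) / n


def mer(reference: str, hypothesis: str) -> float:
    """MER = (S + D + I) / (H + S + D + I)."""
    a = align(reference.split(), hypothesis.split())
    total = a.hits + a.substitutions + a.deletions + a.insertions
    if total == 0:
        raise ValueError("reference and hypothesis are both empty")
    return (a.substitutions + a.deletions + a.insertions) / total
```

Unlike WER, MER includes insertions in the denominator, so it stays within [0, 1] even when the hypothesis is much longer than the reference.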

Misc

  • Change GCS download docs to use gcloud storage instead of gsutil (#3083)
  • Fix minor bug in punkt installation logic (#3111)
  • Improvements to audio utilities (#3128)
  • New arg for quasi-exact match (#3257)
  • Use uv for Update Requirements GitHub Action (#3444, #3452)
  • Resolve static files using importlib_resources for crfm-proxy-server (#3460, #3461)

Contributors

Thank you to the following contributors for your work on this HELM release!