Integrate Psychometric-Based Question Validity Tools into HELM (Issue #3645) by yuhengtu · Pull Request #3669 · stanford-crfm/helm

yuhengtu · 2025-06-14T11:33:13Z

We add a new boolean flag, --validity-check, to helm-summarize. When it is set, we load the four precomputed validity metric values from HuggingFace and write them into display_prediction.json, so that they can be displayed on the HELM website. The script that computes the four validity metrics is scripts/validity_check.py.
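As a rough illustration of the "write them into display_prediction.json" step, here is a minimal sketch. It is not the code from this pull request: the function name attach_validity_metrics, the "validity_metrics" field, the record layout, and the metric names are all hypothetical, and an in-memory dict stands in for the values loaded from HuggingFace.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def attach_validity_metrics(
    predictions: List[Dict[str, Any]],
    metrics_by_instance_id: Dict[str, Dict[str, float]],
) -> List[Dict[str, Any]]:
    """Return copies of the records, adding a hypothetical "validity_metrics" field when available."""
    annotated = []
    for prediction in predictions:
        record = dict(prediction)  # shallow copy so the input list is not mutated
        metrics = metrics_by_instance_id.get(record["instance_id"])
        if metrics is not None:
            record["validity_metrics"] = metrics
        annotated.append(record)
    return annotated


# Stand-in for the precomputed values (placeholder metric names and numbers).
precomputed = {"id1": {"metric_a": 0.5, "metric_b": 0.25}}
predictions = [
    {"instance_id": "id1", "predicted_text": "B"},
    {"instance_id": "id2", "predicted_text": "A"},
]

annotated = attach_validity_metrics(predictions, precomputed)

# Write the annotated records to a temporary file and read them back.
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = os.path.join(tmp_dir, "display_prediction.json")
    with open(output_path, "w") as f:
        json.dump(annotated, f, indent=2)
    with open(output_path) as f:
        reloaded = json.load(f)
```

Records without a precomputed entry are written unchanged, so a run that has no validity metrics would produce the same output as before.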

yifanmai · 2025-06-23T20:56:53Z

        help="EXPERIMENTAL: Full class name of the Summarizer class to use. If unset, uses the default Summarizer.",
    )
+    parser.add_argument(
+        "--validity-check",


I would prefer this to be --psychometric-validity-check because "validity" is a vague concept (it could be data completeness validation, or data schema validation, or other kinds of validation).
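For reference, a boolean flag with the suggested name could be declared roughly as follows. This is a sketch, not the actual helm-summarize parser: build_parser and the help text are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Simplified stand-in for the real helm-summarize argument parser.
    parser = argparse.ArgumentParser(prog="helm-summarize")
    parser.add_argument(
        "--psychometric-validity-check",
        action="store_true",  # the flag defaults to False when omitted
        help="EXPERIMENTAL: Include precomputed psychometric validity metrics in the display JSON.",
    )
    return parser


parser = build_parser()
enabled = parser.parse_args(["--psychometric-validity-check"])
disabled = parser.parse_args([])

# argparse converts the dashes in the flag name to underscores in the attribute name.
print(enabled.psychometric_validity_check, disabled.psychometric_validity_check)
```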

yifanmai · 2025-06-23T20:58:08Z

    def write_run_display_json(self, skip_completed: bool) -> None:
        def process(run: Run) -> None:
-            write_run_display_json(run.run_path, run.run_spec, self.schema, skip_completed)
+            write_run_display_json(run.run_path, run.run_spec, self.schema, self.validity_check, skip_completed)


self.validity_check should be the last argument.
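The ordering matters because both values are bool, so a swapped positional call type-checks and runs but silently exchanges the two flags. The sketch below uses a simplified stand-in for write_run_display_json (placeholder types, and it returns the received values instead of writing files) to show the effect and a keyword-argument call that avoids it.

```python
from typing import Any, Dict


def write_run_display_json(
    run_path: str,
    run_spec: Any,
    schema: Any,
    skip_completed: bool,
    validity_check: bool = False,
) -> Dict[str, bool]:
    # Stand-in body: report which value each parameter received.
    return {"skip_completed": skip_completed, "validity_check": validity_check}


validity_check = True
skip_completed = False

# Call order from the diff above: validity_check lands in the skip_completed slot.
swapped = write_run_display_json("runs/example", None, None, validity_check, skip_completed)

# Passing the new argument last, by keyword, keeps each value in its own parameter.
correct = write_run_display_json(
    "runs/example", None, None, skip_completed, validity_check=validity_check
)

print(swapped)
print(correct)
```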

yifanmai · 2025-06-23T20:58:23Z

        verbose: bool,
        num_threads: int,
        allow_unknown_models: bool,
+        validity_check: bool,


Change this to psychometrics_validity_check or something that identifies the paper.

Also, set the default value to False to fix these errors:

src/helm/benchmark/presentation/torr_robustness_summarizer.py:36: error: Missing positional argument "validity_check" in call to "__init__" of "Summarizer" [call-arg]
src/helm/benchmark/presentation/test_summarize.py:13: error: Missing positional argument "validity_check" in call to "Summarizer" [call-arg]
src/helm/benchmark/presentation/test_summarize.py:31: error: Missing positional argument "validity_check" in call to "Summarizer" [call-arg]
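A default value resolves these errors because existing callers that do not pass the new argument remain valid. The sketch below uses a trimmed-down, hypothetical Summarizer (the real constructor takes more parameters) and a hypothetical subclass that calls the parent constructor without the new argument.

```python
class Summarizer:
    def __init__(
        self,
        verbose: bool,
        num_threads: int,
        allow_unknown_models: bool,
        validity_check: bool = False,  # default keeps existing call sites valid
    ) -> None:
        self.verbose = verbose
        self.num_threads = num_threads
        self.allow_unknown_models = allow_unknown_models
        self.validity_check = validity_check


class ExistingSubclassSummarizer(Summarizer):
    """Hypothetical subclass written before the new parameter existed."""

    def __init__(self) -> None:
        # No validity_check here; without the default this call would be rejected.
        super().__init__(verbose=False, num_threads=1, allow_unknown_models=True)


old_style = ExistingSubclassSummarizer()
new_style = Summarizer(verbose=False, num_threads=1, allow_unknown_models=True, validity_check=True)

print(old_style.validity_check, new_style.validity_check)
```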

yifanmai · 2025-06-23T20:58:35Z

 @htrack(None)
-def write_run_display_json(run_path: str, run_spec: RunSpec, schema: Schema, skip_completed: bool) -> None:
+def write_run_display_json(
+    run_path: str, run_spec: RunSpec, schema: Schema, skip_completed: bool, validity_check: bool = False


Change validity_check to psychometrics_validity_check or something that identifies the paper.

yifanmai · 2025-06-23T23:23:05Z

This fixes #3645.

yifanmai · 2025-07-15T20:50:27Z

This pull request is still causing the type checker to fail. If you'd like to merge, please resolve the type checking issues and update this pull request.

yifanmai · 2025-08-15T18:03:13Z

Hi, it's been a month since the last update; are you still working on this?

Yuheng Tu added 2 commits June 13, 2025 05:20

tetrachoric

6761481

4 validity metrics

bc8653c

yifanmai requested changes Jun 19, 2025

second commit

07b5b11

yifanmai approved these changes Jun 21, 2025

Comment thread scripts/validity_check.py Outdated

Comment thread scripts/validity_check.py

fix linter

ee072a8

yifanmai requested changes Jun 23, 2025

yuhengtu wants to merge 4 commits into stanford-crfm:main from yuhengtu:validity_check
