Integrate Psychometric-Based Question Validity Tools into HELM (Issue #3645) #3669

Open
yuhengtu wants to merge 4 commits into stanford-crfm:main from yuhengtu:validity_check
Conversation

Contributor

@yuhengtu yuhengtu commented Jun 14, 2025

This PR adds a new boolean flag, --validity-check, to helm-summarize. When the flag is set, helm-summarize loads the four precomputed validity metric values from Hugging Face and writes them into display_prediction.json, so that the values can be displayed on the HELM website. The script that computes the four validity metrics is scripts/validity_check.py.

        help="EXPERIMENTAL: Full class name of the Summarizer class to use. If unset, uses the default Summarizer.",
    )
    parser.add_argument(
        "--validity-check",
Collaborator

I would prefer this to be --psychometric-validity-check because "validity" is a vague concept (it could be data completeness validation, or data schema validation, or other kinds of validation).
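
For illustration, the renamed flag could be declared roughly as follows. This is a minimal sketch, not the actual helm-summarize parser: `build_parser`, the `store_true` action, and the help text are assumptions made for the example; only the flag name comes from the suggestion above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical, trimmed-down parser (not the real helm-summarize parser)."""
    parser = argparse.ArgumentParser(prog="helm-summarize")
    parser.add_argument(
        "--psychometric-validity-check",
        action="store_true",  # assumption: a plain on/off switch that defaults to False
        help="Illustrative help text: include precomputed psychometric validity metrics.",
    )
    return parser


# argparse maps dashes in the flag name to underscores in the attribute name.
args = build_parser().parse_args(["--psychometric-validity-check"])
print(args.psychometric_validity_check)
```

With `store_true`, existing invocations that omit the flag keep their current behavior.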

     def write_run_display_json(self, skip_completed: bool) -> None:
         def process(run: Run) -> None:
-            write_run_display_json(run.run_path, run.run_spec, self.schema, skip_completed)
+            write_run_display_json(run.run_path, run.run_spec, self.schema, self.validity_check, skip_completed)
Collaborator

self.validity_check should be the last argument.
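
The ordering matters because both flags are plain bools, so a type checker cannot tell them apart when they are passed positionally. The sketch below uses a hypothetical, reduced stand-in for write_run_display_json (it drops run_spec and schema and returns a dict purely so the effect is visible) to show the silent swap, and how keyword arguments would avoid it.

```python
from typing import Dict


def write_run_display_json(
    run_path: str,
    skip_completed: bool,
    validity_check: bool = False,
) -> Dict[str, object]:
    """Hypothetical stand-in: echoes its arguments instead of writing JSON files."""
    return {
        "run_path": run_path,
        "skip_completed": skip_completed,
        "validity_check": validity_check,
    }


# Intent: validity_check=True, skip_completed=False.
# Passing them positionally in the wrong order swaps the two values silently.
swapped = write_run_display_json("runs/example", True, False)

# Passing the flags by keyword makes the call independent of parameter order.
explicit = write_run_display_json("runs/example", skip_completed=False, validity_check=True)

print(swapped)
print(explicit)
```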

         verbose: bool,
         num_threads: int,
         allow_unknown_models: bool,
+        validity_check: bool,
Collaborator

Change this to psychometrics_validity_check or something that identifies the paper.

Also, set the default value to False to fix these errors:


src/helm/benchmark/presentation/torr_robustness_summarizer.py:36: error: Missing positional argument "validity_check" in call to "__init__" of "Summarizer"  [call-arg]
src/helm/benchmark/presentation/test_summarize.py:13: error: Missing positional argument "validity_check" in call to "Summarizer"  [call-arg]
src/helm/benchmark/presentation/test_summarize.py:31: error: Missing positional argument "validity_check" in call to "Summarizer"  [call-arg]
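
As a rough sketch of why the default helps: a required parameter added to a constructor breaks every existing caller, whereas a defaulted one does not. The classes below are hypothetical, heavily reduced stand-ins (the real Summarizer takes more arguments), and the parameter keeps the name validity_check from the diff rather than the suggested rename.

```python
class Summarizer:
    """Hypothetical, reduced stand-in for HELM's Summarizer."""

    def __init__(
        self,
        verbose: bool,
        num_threads: int,
        allow_unknown_models: bool,
        validity_check: bool = False,  # the default keeps existing call sites valid
    ) -> None:
        self.verbose = verbose
        self.num_threads = num_threads
        self.allow_unknown_models = allow_unknown_models
        self.validity_check = validity_check


class LegacySubclassSummarizer(Summarizer):
    """Stands in for a subclass written before the new parameter existed."""

    def __init__(self, verbose: bool, num_threads: int, allow_unknown_models: bool) -> None:
        # This call does not mention validity_check; it only type-checks and runs
        # because the base-class parameter has a default.
        super().__init__(verbose, num_threads, allow_unknown_models)


legacy = LegacySubclassSummarizer(verbose=False, num_threads=1, allow_unknown_models=True)
opted_in = Summarizer(verbose=False, num_threads=1, allow_unknown_models=True, validity_check=True)
print(legacy.validity_check, opted_in.validity_check)
```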

 @htrack(None)
-def write_run_display_json(run_path: str, run_spec: RunSpec, schema: Schema, skip_completed: bool) -> None:
+def write_run_display_json(
+    run_path: str, run_spec: RunSpec, schema: Schema, skip_completed: bool, validity_check: bool = False
Collaborator

Change validity_check to psychometrics_validity_check or something that identifies the paper.

@yifanmai
Collaborator

This fixes #3645.

@yifanmai
Collaborator

This pull request is still causing the type checker to fail. If you'd like to merge, please resolve the type checking issues and update this pull request.

@yifanmai
Collaborator

Hi, it's been a month since the last update; are you still working on this?
