psyonp/accidental_vulnerability

Welcome to Accidental Vulnerability!

  1. Preliminaries:
  • Follow HarmBench's installation instructions to set up the adversarial evaluation methods used in this project.
  • Set up a compute cluster with the appropriate environment variables and GPU resources.
  2. Folder and File Descriptions:
  • slurm: This folder contains scripts that can be copied to your cluster to generate test cases and run adversarial evaluations. Each completed evaluation produces a binary classification for each of 80 individual test cases. Integrate these results into the Subset-ASR tables in subset_asr.py using the prompt classifications from Appendix C of our paper.

    • s1.sh: HarmBench (Step 1) SLURM script
    • s1_5.sh: HarmBench (Step 1.5) SLURM script
    • s2.sh: HarmBench (Step 2) SLURM script
    • s3.sh: HarmBench (Step 3) SLURM script
  • correlational_analysis_visualizations: This folder contains the code that computes correlations for our measured attack success rates (ASRs) and produces the subset-ASR visualizations and figures.

    • correlation_plot.py: This file generates a visualization of all correlations with ASR.
    • correlation_spearman.py: This file correlates our HarmBench-obtained ASRs with the dataset metrics saved by individual_metric_calculations.py.
    • individual_metric_calculations.py: This file loads a dataset and saves the mean of each potential factor to a file for use by correlation_spearman.py.
    • subset_asr.py: This file uses our classification of HarmBench-obtained subset-specific ASRs to generate heatmaps for further visualization.
    • top6_correlation_visualizations.py: This file provides visualizations of the top 6 correlated metrics across all datasets.
  • finetuning: This folder contains the sample script that loads a dataset and fine-tunes models using the hyperparameters mentioned in our paper.

    • sample_script.py: Fine-tuning training script with hyperparameters
  • horizontal_evaluations: This folder contains the code for horizontal evaluations (newly added).
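
As a rough illustration of how the pieces described above fit together, the sketch below turns per-case binary classifications into subset-specific ASRs and then computes a Spearman rank correlation between ASRs and a dataset metric. It is not the code in subset_asr.py or correlation_spearman.py. The function names, the subset labels, and the toy inputs are all hypothetical, and the pure-Python Spearman routine is a stand-in for whatever library call the repository actually uses.

```python
from collections import defaultdict
from statistics import mean


def subset_asr(labels, subsets):
    """Return ASR (percent) per subset from 0/1 labels and matching subset names."""
    if len(labels) != len(subsets):
        raise ValueError("labels and subsets must have the same length")
    grouped = defaultdict(list)
    for label, subset in zip(labels, subsets):
        grouped[subset].append(label)
    return {name: 100.0 * sum(vals) / len(vals) for name, vals in grouped.items()}


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy input: 8 cases instead of 80, with made-up subset names.
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0]
toy_subsets = ["subset_a"] * 4 + ["subset_b"] * 4
toy_asr = subset_asr(toy_labels, toy_subsets)

# Hypothetical per-dataset values: one ASR and one metric mean per dataset.
toy_rho = spearman([10.0, 20.0, 30.0, 40.0], [0.1, 0.4, 0.5, 0.9])
```

In a real run, the labels would come from a completed HarmBench evaluation, the subset names from the Appendix C prompt classifications, and the metric values from the file written by individual_metric_calculations.py.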
