We provide evaluation code for the three benchmarks we reported.
- hhh (Static HHH eval)
- truthful (TruthfulQA-MC1)
- vicuna (Vicuna Eval with GPT-4 evaluation)
Please use the same environment setup as for training!

```bash
source almost_train/bin/activate
```
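All of the commands below read the model path from `$MODEL_NAME_OR_PATH`. A minimal sketch of setting it once, assuming a local checkpoint directory (the path is only a placeholder):

```bash
# Placeholder path: point this at your trained checkpoint directory
# or a Hugging Face model ID before running the commands below.
export MODEL_NAME_OR_PATH=/path/to/your/checkpoint
```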
Static HHH eval:

```bash
python run_evaluation.py \
    --model_name_or_path $MODEL_NAME_OR_PATH \
    --benchmark_name hhh
```

TruthfulQA-MC1:

```bash
python run_evaluation.py \
    --model_name_or_path $MODEL_NAME_OR_PATH \
    --benchmark_name truthful
```

Vicuna Eval with GPT-4 evaluation:

We used the legacy version of the Vicuna Evaluation provided in FastChat (v.0.2.1).
You can also compare the models using the latest version.
If you want to reproduce our results, please follow the instructions below.
Please note that the results are not fully reproducible because GPT-4 responses are stochastic.
```bash
python run_evaluation.py \
    --model_name_or_path $MODEL_NAME_OR_PATH \
    --benchmark_name vicuna \
    --baseline_model_name $BASELINE_MODEL_NAME
```
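Since the judging is done by GPT-4, this step presumably needs access to the OpenAI API; the exact environment variable `run_evaluation.py` reads is not documented here, so the snippet below is only an assumption. `$BASELINE_MODEL_NAME` is the model you compare against, and the value shown is a placeholder:

```bash
# Assumption: the GPT-4 judge is called through the OpenAI API, so an API key
# must be available in the environment. Check run_evaluation.py for the exact
# variable it expects.
export OPENAI_API_KEY=sk-...  # your own key

# Placeholder: path or Hugging Face ID of the baseline model to compare against.
export BASELINE_MODEL_NAME=/path/to/baseline/checkpoint
```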