ReferenceInstancesPredictability

Code for the paper "100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances".

How to reproduce results

Install dependencies

Most of the dependencies can be installed by running:

pip install -r requirements.txt

To run the IRT part, two additional packages are needed:

  • install pyro: pip3 install pyro-ppl==1.8.6
  • install my fork of py-irt: pip3 install git+https://github.com/LoryPack/py-irt.git (the fork is needed because the py-irt library required a small adaptation; it is based on https://github.com/felipemaiapolo/py-irt, which caused some dependency issues).
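Before running the IRT notebooks, it may help to confirm that both packages are visible to the interpreter. The sketch below is not part of the repository; it assumes that pyro-ppl is imported as `pyro` and that py-irt is importable as `py_irt`, so adjust the names if your installation differs.

```python
import importlib.util

# Import names are assumptions: pyro-ppl is imported as `pyro`,
# and py-irt is assumed to be importable as `py_irt`.
IRT_MODULES = ["pyro", "py_irt"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(IRT_MODULES)
    if missing:
        print("Missing IRT dependencies:", ", ".join(missing))
    else:
        print("All IRT dependencies found.")
```

The check only looks up the modules without importing them, so it stays fast and has no side effects.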

Run the experiments

Steps:

  1. Get the raw data:
    • KindsOfReasoning: download this file from this repo and decompress it into a folder results/kindsofreasoning in the root of this repository.
    • HELM-Lite: download the HELM data by running the download_lite.ipynb notebook in experiments/download_helm. This downloads all the necessary files into results/helm_lite_v1.0.0. Note that the download is large (3.6GB) and takes a long time.
  2. Compute the embeddings by running the two scripts experiments/0_run_openai_embeddings_all_kindsofreasoning.py and experiments/0_run_openai_embeddings_all_helm.py, which create two new folders in results where the computed embeddings are stored. Both scripts require an OpenAI API key to be set in .env. Computing the embeddings is somewhat slow but cheap; the resulting files are too large to be stored easily on GitHub, so they must be recomputed locally.
  3. Run the notebooks in the experiments folder. They create two subfolders (results and fig) where the result files and figures are stored.
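The prerequisites from steps 1 and 2 can be checked with a small preflight script before launching the embedding scripts. This is a hypothetical helper, not part of the repository: the folder names come from the steps above, while the variable name `OPENAI_API_KEY` and the simple KEY=VALUE layout of .env are assumptions.

```python
import os
from pathlib import Path

# Folder names are taken from the steps above. The variable name
# OPENAI_API_KEY is an assumption: the README only says the key goes in .env.
DATA_DIRS = ["results/kindsofreasoning", "results/helm_lite_v1.0.0"]
KEY_NAME = "OPENAI_API_KEY"


def read_env_file(path):
    """Parse simple KEY=VALUE lines from a .env file, skipping comments and blanks."""
    values = {}
    p = Path(path)
    if not p.is_file():
        return values
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def preflight(root="."):
    """Return a list of human-readable problems; an empty list means ready to run."""
    root = Path(root)
    problems = [
        f"missing data folder: {d}" for d in DATA_DIRS if not (root / d).is_dir()
    ]
    env = read_env_file(root / ".env")
    if not (env.get(KEY_NAME) or os.environ.get(KEY_NAME)):
        problems.append(f"{KEY_NAME} not set in .env or the environment")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

Run from the repository root, the script prints one line per missing prerequisite and prints nothing when everything is in place.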

Citation

If you use our code, please cite our paper using the following:

@misc{pacchiardi2024100instancesneedpredicting,
      title={100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances}, 
      author={Lorenzo Pacchiardi and Lucy G. Cheke and José Hernández-Orallo},
      year={2024},
      eprint={2409.03563},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.03563}, 
}

Credits