This project is designed to evaluate different models on the HotpotQA dataset. It uses a multi-threaded approach to evaluate models concurrently, providing a comprehensive and efficient evaluation process.
The dataset for evaluation is HotpotQA, a dataset with 113k Wikipedia-based question-answer pairs.
We evaluate on the validation split of the dataset, which contains 7,405 question-answer pairs, using the following models:
- Mixtral 8x7B
- Command R
- Meta Llama 70B
- Meta Llama 13B
- GPT3.5-Turbo
The project includes a `Judge` class for GPT-as-a-judge evaluation. It scores model generations using `gpt-4-0125-preview`.
The HotpotQA dataset is loaded and split into rows, with each row being evaluated concurrently by a worker thread.
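A minimal sketch of this worker pattern, using Python's `concurrent.futures`; the `evaluate_row` function and row shape here are hypothetical stand-ins for the real per-row model call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate_row(row):
    # Placeholder: in the real project this would call a model API
    # and return the generated answer for one question.
    return {"id": row["id"], "answer": row["question"].upper()}

def evaluate_dataset(rows, max_workers=4):
    """Evaluate every row concurrently and collect the results."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(evaluate_row, row) for row in rows]
        for future in as_completed(futures):
            results.append(future.result())
    return results

rows = [{"id": i, "question": f"q{i}"} for i in range(3)]
print(len(evaluate_dataset(rows)))  # 3
```

Results arrive in completion order, not submission order, so each result carries its row `id`.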
For Replicate and OpenAI models, the system prompt is:

```
Step 1: Analyze context for answering questions.
Step 2: Decide context is relevant with question or not relevant with question.
Step 3: If any topic about question mentioned in context, use that information for question.
Step 4: If context has not mention on question, ignore that context I give you and use your self knowledge.
Step 5: Answer the question.
```
For Cohere, there is no system prompt.
For all models, the query prompt is as follows:

```
'''
{context}
'''
**Question**: {question}
```
For Cohere, context is not included in the query prompt. Instead, it is passed as a separate parameter.
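The difference can be sketched as below. The function names and the `documents` field shape are illustrative assumptions, loosely following Cohere's chat API, not the project's actual code:

```python
QUERY_TEMPLATE = "'''\n{context}\n'''\n**Question**: {question}"

def build_openai_query(context, question):
    # OpenAI/Replicate: context is embedded directly in the prompt.
    return QUERY_TEMPLATE.format(context=context, question=question)

def build_cohere_request(context, question):
    # Cohere: context is passed as a separate parameter (e.g. the
    # `documents` argument of the chat endpoint), not inlined.
    return {
        "message": f"**Question**: {question}",
        "documents": [{"text": context}],
    }
```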
We first evaluated vanilla (non-RAG) model responses, scored by the judge LLM. We then retrieved context for each question using Dria and computed similarity scores; questions scoring below the threshold were excluded from the evaluation set.
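A minimal sketch of this filtering step, assuming cosine similarity between embedding vectors and a hypothetical precomputed `similarity` field per question:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def filter_by_similarity(questions, threshold=0.8):
    """Keep only questions whose retrieved context scored at or above the threshold."""
    return [q for q in questions if q["similarity"] >= threshold]
```

The threshold value and embedding source are evaluation choices, not fixed by the dataset.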
For the included questions:
- Context is retrieved from the local Wikipedia index using Dria.
- The number of articles retrieved from Dria is 1.
- The context is split into smaller chunks.
- The two chunks with the highest number of shared keywords are selected.
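The chunk-selection step can be sketched as follows, assuming (as an illustration, not the project's actual tokenization) that keywords are naive lowercase word overlaps with the question:

```python
def select_chunks(question, chunks, top_k=2):
    """Pick the chunks sharing the most keywords with the question."""
    q_words = set(question.lower().split())
    return sorted(
        chunks,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )[:top_k]

chunks = ["hamlet is a play", "the weather is nice", "shakespeare wrote hamlet"]
print(select_chunks("who wrote hamlet", chunks))
# → ['shakespeare wrote hamlet', 'hamlet is a play']
```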
The context is then used to evaluate the RAG model.
The responses generated by the RAG model and the vanilla model are then evaluated by the GPT-4 judge to determine whether they align with the correct answer.
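The judge interaction can be sketched as two pure helpers. The prompt wording and function names are illustrative assumptions; the real `Judge` class sends its prompt to `gpt-4-0125-preview` through the OpenAI API and may phrase things differently:

```python
def build_judge_prompt(question, reference, generation):
    """Build the text sent to the judge model (wording is illustrative)."""
    return (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {generation}\n"
        "Does the model answer align with the reference answer? "
        "Reply with YES or NO."
    )

def parse_verdict(judge_reply):
    """Interpret the judge model's reply as a pass/fail verdict."""
    return judge_reply.strip().upper().startswith("YES")
```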
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
This project also requires the Dria CLI. You can install it by following the instructions on its GitHub page.
After installing the Dria CLI, fetch the Wikipedia index with Dria and serve it locally by running the following commands in your terminal:

```shell
dria fetch uaBIB4kh7gYh6vSNL7V2eygfbyRu9vGZ_nJ6jKVn_x8 # Transaction/Contract ID of Wikipedia
dria serve uaBIB4kh7gYh6vSNL7V2eygfbyRu9vGZ_nJ6jKVn_x8
```
To install the necessary dependencies, run the following command in your terminal:

```shell
pip install -r requirements.txt
```
Using the project requires access to the following APIs and services:
- Cohere API: Required for Cohere model evaluations.
- OpenAI API: Required for any evaluation.
- Replicate API: Required for any evaluation.
You will need to obtain API keys for these services and set them as environment variables in your terminal. The environment variables are as follows:
- COHERE_API_KEY
- OPENAI_API_KEY
- REPLICATE_API_KEY
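For example, on Linux or macOS the keys can be set like this (the values shown are placeholders for your own keys):

```shell
export COHERE_API_KEY="your-cohere-key"
export OPENAI_API_KEY="your-openai-key"
export REPLICATE_API_KEY="your-replicate-key"
```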
To run the main script, use the following command:

```shell
python main.py --max_worker <max_worker> --output_dir <output_dir> --dataset_slice <dataset_slice>
```
Replace <max_worker>, <output_dir>, and <dataset_slice> with your desired values.
- <max_worker>: The maximum number of worker threads for concurrent model evaluation.
- <output_dir>: The directory where the evaluation results will be saved.
- <dataset_slice>: The percentage of the HotpotQA dataset to be used for evaluation.
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.