This project demonstrates the creation of a Question-Answer (QA) bot using a fine-tuned GPT-2 model. The bot is capable of answering questions based on a custom dataset. The code showcases several important skills relevant for a Data Scientist position, including data handling, model fine-tuning, and practical application development.
- Data Handling and Preprocessing: Loading and preparing datasets (e.g., SQuAD dataset).
- Model Fine-Tuning: Fine-tuning a pre-trained language model (DistilGPT-2) on a custom dataset.
- Use of Transformer Models: Utilizing advanced NLP models from the transformers library.
- Logging and Monitoring: Setting up logging to track the training process.
- Model Evaluation: Evaluating the model's performance by generating text before and after training.
- Practical Application: Implementing a real-world application (QA bot).
After running the script, you can interact with the QA bot by asking questions. For example:
Question: "What is the capital of France?"
Answer 1: "The capital of France is Paris."
Answer 2: "Paris is the capital of France."
Answer 3: "France's capital city is Paris."
- Install environment
- Get the data set
- Convert it to .txt format
- Optional: Customize your slurm script
- Run
- Python 3.8 or higher
- pip for package management
Follow the steps below to set up and install the dependencies for the QA bot using a virtual environment and pip.
Make the script executable by running the following command in your terminal:
chmod +x qa_env_pip.yaml
Run the script to create the virtual environment and install the required packages:
./qa_env_pip.yaml
After the setup is complete, activate the virtual environment using the following command:
source gpt2_finetuning_env/bin/activate
You should see the versions of the installed packages printed out without any errors.
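If you want to confirm the installation by hand, a minimal sketch along these lines prints the installed versions from inside the activated environment. The package list (torch, transformers, datasets) is an assumption and may differ from what the setup script actually installs, so adjust it as needed.

```python
from importlib import metadata

# Assumed package list; adjust it to match what the setup script installs.
PACKAGES = ["torch", "transformers", "datasets"]


def installed_versions(names):
    """Return {package: version string, or None if the package is missing}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per package so missing ones are easy to spot.
for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED suggests the setup step did not complete.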
Follow the steps below to set up and install the dependencies for the QA bot.
First, ensure you have conda installed on your machine. You can download and install Anaconda from here.
Open a terminal or command prompt and create a new conda environment using the provided .yaml file:
conda env create -f environment.yaml
This will create a new environment named qa_bot_env and install all the necessary dependencies.
Activate the newly created conda environment:
conda activate qa_bot_env
Copy this line and run it in your bash terminal:
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
and this one:
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
Run the script download_dataset.py. It will create a directory squad and download .json datasets there.
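For orientation, the download step could look roughly like the sketch below. It is not the actual content of download_dataset.py: the function names and the skip-if-present behavior are assumptions, and only the URLs and the squad directory name come from this README.

```python
import os
import urllib.request

# URLs taken from the wget commands above.
SQUAD_URLS = [
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json",
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json",
]


def target_path(url, out_dir="squad"):
    """Map a dataset URL to its local path inside out_dir."""
    return os.path.join(out_dir, url.rsplit("/", 1)[-1])


def download_squad(urls=SQUAD_URLS, out_dir="squad",
                   fetch=urllib.request.urlretrieve):
    """Download each file into out_dir, skipping files that already exist."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for url in urls:
        path = target_path(url, out_dir)
        if not os.path.exists(path):
            fetch(url, path)  # hypothetical hook: swap in another fetcher if needed
        paths.append(path)
    return paths
```

Calling download_squad() with no arguments fetches both files into ./squad and returns their local paths.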
Run the script prepare_squad_dataset.py to convert the downloaded .json datasets to .txt format.
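To illustrate what such a conversion involves, here is a minimal sketch that flattens a SQuAD v1.1 file into one question-answer pair per line. The "Question: ... Answer: ..." text layout and the function names are assumptions for illustration; the real prepare_squad_dataset.py may use a different format.

```python
import json


def squad_to_lines(squad):
    """Yield one 'Question: ... Answer: ...' line per answered question."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if not qa["answers"]:
                    continue  # skip questions without a reference answer
                question = qa["question"].strip()
                answer = qa["answers"][0]["text"].strip()  # first reference answer
                yield f"Question: {question} Answer: {answer}"


def convert_file(json_path, txt_path):
    """Read a SQuAD v1.1 JSON file and write the flattened text file."""
    with open(json_path, encoding="utf-8") as f:
        squad = json.load(f)
    with open(txt_path, "w", encoding="utf-8") as f:
        for line in squad_to_lines(squad):
            f.write(line + "\n")
```

For example, convert_file("squad/train-v1.1.json", "squad/train.txt") would write the training pairs to a text file; the output file name here is hypothetical.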
Activate your virtual environment as discussed in section Environment Installation Instructions and run the code:
python QA_bot.py
qa_env_pip/bin/python3.10 QA_bot.py
Set all necessary variables in your run.bash script (see more here) and run:
sbatch run.bash