This repository accompanies a project exploring the use of LLMs, such as ChatGPT, to create an AI Teaching Assistant.
This includes all elements needed for a pipeline for that task:
- Data generation
- Semantic search
- Full pipeline
The requirements for running this project are listed in requirements.txt. Most files also need an OpenAI API key. Create a .env file that defines the API key before running the scripts.
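As a minimal sketch of how a script might read the key from the .env file, the helper below parses simple `KEY=VALUE` lines using only the standard library. The function name `load_env_file` is hypothetical, and the variable name `OPENAI_API_KEY` is an assumption: check the scripts for the name they actually expect.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file (hypothetical helper).

    Returns the parsed values as a dict and copies them into os.environ
    without overwriting variables that are already set.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop optional surrounding quotes around the value.
        value = value.strip().strip('"').strip("'")
        values[key] = value
        os.environ.setdefault(key, value)
    return values
```

Under these assumptions, a .env file containing the single line `OPENAI_API_KEY=<your key>` would make the key available through `os.environ["OPENAI_API_KEY"]` after calling `load_env_file()`.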
This repository contains multiple scripts, Jupyter Notebooks and other files.
We encourage you to start with the script final pipeline, which covers the entire final pipeline from input query to output answer.
The Data_generation folder contains scripts for splitting PDF documents into paragraphs. It also contains the folder df_pickle, which holds raw datasets that can be loaded into other scripts for further processing. The final raw dataset is final_02450_emb.pkl, which contains questions, paragraphs and their associated embeddings.
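A raw dataset could be loaded into another script roughly as follows. This is a sketch only: the helper name `load_pickled_dataset` is hypothetical, and the example path in the note below assumes the file sits inside df_pickle.

```python
import pickle
from pathlib import Path


def load_pickled_dataset(path):
    """Load a pickled object from disk (hypothetical helper).

    Only unpickle files you trust: unpickling can execute arbitrary code.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

For example, `load_pickled_dataset("Data_generation/df_pickle/final_02450_emb.pkl")` would return the stored object if the file is at that location. If the file holds a pandas DataFrame, pandas must be installed for unpickling to succeed.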
The Notebook Similarity network contains code for training Semantic search models.
It also includes training plots and ROC curves for the models.
The three datasets used for training and testing are created at the beginning of the notebook. They are very large (around 20GB) and are therefore not included in the repository, but they are generated when the notebook is run.
These models are trained and evaluated:
- The three ANNs
- The Weighted Cosine Similarity
The weights and structure of the Oversampled ANN and the Weighted CS are saved in the folder ANN.
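To illustrate the idea behind a weighted cosine similarity, the sketch below applies one common formulation, in which each embedding dimension gets a non-negative weight in both the dot product and the norms. This is an assumption for illustration: the notebook's exact formulation and the weights saved in the folder ANN may differ, and the function name is hypothetical.

```python
import math
from typing import Sequence


def weighted_cosine_similarity(
    a: Sequence[float], b: Sequence[float], w: Sequence[float]
) -> float:
    """Cosine similarity with per-dimension, non-negative weights w.

    sim = sum(w_i * a_i * b_i)
          / (sqrt(sum(w_i * a_i^2)) * sqrt(sum(w_i * b_i^2)))
    With all weights equal to 1 this reduces to plain cosine similarity.
    """
    if not (len(a) == len(b) == len(w)):
        raise ValueError("a, b and w must have the same length")
    dot = sum(wi * ai * bi for ai, bi, wi in zip(a, b, w))
    norm_a = math.sqrt(sum(wi * ai * ai for ai, wi in zip(a, w)))
    norm_b = math.sqrt(sum(wi * bi * bi for bi, wi in zip(b, w)))
    # Define the similarity as 0 when either weighted norm vanishes.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In a semantic search setting, a query embedding would be scored against each paragraph embedding with such a function, and the highest-scoring paragraphs would be passed on as context.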
Furthermore, the files AB test, AB test relevant context and AB test similarity score were used to conduct the A/B tests on the pipeline.
The AB test file is the final A/B test between the pipeline and native ChatGPT.
Running the script Interface starts an interactive interface for using the pipeline. Note that this script requires a .env file with an OpenAI API key and the folder ANN to be present. It also needs a model from HuggingFace, which is downloaded automatically when the script is run.