Democracy Chatbot is a project that extracts structured data from unstructured meeting protocols and builds a knowledge graph for efficient retrieval and querying. It uses a Large Language Model (LLM) to extract metadata from PDF files obtained by scraping the website of the city of Nykarleby. The extracted data is then converted into a knowledge graph, enabling quick access to the information. The project also includes a chatbot app that lets users interact with the extracted data.
To set up the project, follow the steps below:
- Create a new conda environment and activate it by running the following commands:
conda create -n llm-data-extraction python=3.11
conda activate llm-data-extraction
- Clone the project repository by executing the following command:
git clone https://github.com/NoviaIntSysGroup/llm-data-extraction.git
- Navigate to the project directory and install the required packages by running the following commands:
cd llm-data-extraction
pip install -e .
- Register or log in to Neo4j Aura and create a free Neo4j instance. Save the login credentials, as they are shown only once. Then start the Neo4j Aura instance.
- Create a secrets.env file in the config folder and add the following environment variables:
OPENAI_API_KEY="<your-openai-api-key>"
COHERE_API_KEY="<your-cohere-api-key>"
NEO4J_URI="<neo4j-uri>"
NEO4J_USERNAME="<neo4j-username>"
NEO4J_PASSWORD="<neo4j-password>"
There is an example file in the config folder called secret_example.env. You can copy the contents of this file and replace the placeholders with your own values.
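Before starting the pipeline or the app, it can help to confirm that every setting is actually available. The following is a minimal sketch, not part of the project: it assumes the variables from secrets.env have already been loaded into the process environment (how the project loads them is not described here), and the helper name `missing_settings` is hypothetical.

```python
import os

# Variable names are taken from the secrets.env listing above.
REQUIRED_SETTINGS = (
    "OPENAI_API_KEY",
    "COHERE_API_KEY",
    "NEO4J_URI",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
)


def missing_settings(env):
    """Return the required names that are absent, empty, or still placeholders."""
    missing = []
    for name in REQUIRED_SETTINGS:
        value = env.get(name, "").strip()
        # A value such as "<neo4j-uri>" means the example file was copied unchanged.
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_settings(os.environ)
    if problems:
        print("Missing or placeholder settings:", ", ".join(problems))
    else:
        print("All required settings are present.")
```

Passing the environment in as an argument keeps the check easy to try with a plain dictionary.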
To run the data extraction pipeline, perform the following steps:
- Navigate to the project directory:
cd llm-data-extraction
- Open the notebooks/data_pipeline.ipynb file.
- Execute the notebook to run the data extraction pipeline. This will scrape the website, download the PDFs, convert them to HTML, extract the data with an LLM, and convert the extracted data into a knowledge graph.
To run the chatbot app, perform the following steps:
- Navigate to the chatbot directory inside src:
cd llm-data-extraction/src/chatbot
- Run the Streamlit app:
streamlit run app.py
- Open the app in your browser at the URL shown in the terminal.
Note: Before running the chatbot app, ensure that there is an already populated knowledge graph in Neo4j. If there is no existing knowledge graph, please run the data extraction pipeline first.
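One way to check this precondition is to count the nodes in the database before launching the app. The sketch below is an illustration under assumptions, not code from the project: it uses the official `neo4j` Python driver, reads the connection settings from the environment variables named in secrets.env, and the helper name `graph_is_populated` is hypothetical.

```python
import os


def graph_is_populated(driver):
    """Return True if the database behind `driver` holds at least one node."""
    with driver.session() as session:
        record = session.run("MATCH (n) RETURN count(n) AS node_count").single()
        return record["node_count"] > 0


if __name__ == "__main__":
    # Imported here so the helper above can be tried without the driver installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(
        os.environ["NEO4J_URI"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
    )
    try:
        if graph_is_populated(driver):
            print("Knowledge graph found.")
        else:
            print("Graph is empty: run the data extraction pipeline first.")
    finally:
        driver.close()
```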
The project directory contains the following files and folders:
- notebooks/: Contains the notebooks for converting unstructured data to structured data and for checking the accuracy of the extracted data.
- src/: Contains the source code of the project.
- data/protocols: PDFs and HTML files downloaded by the scripts are stored here. Created when running the data extraction pipeline.
- data/llm/prompts/: Contains the prompts used for the LLMs.
- data/llm/schema/: Contains the schema for the JSON data.
- data/scraping/: Contains the scraped data from the website.
- data/temp/: Temporary files and outputs from the LLMs are stored here. Created when running the data extraction pipeline.
- assets/: Contains the images used in the project.
- config/: Contains the configuration files used in the project.
This figure outlines the workflow for converting unstructured data from meeting protocols into structured data suitable for creating a knowledge graph. The idea is to use a Large Language Model (LLM) to extract the necessary information from the meeting protocols, and then convert the extracted data into a knowledge graph so that data can be retrieved and queried quickly and reliably.
Note: The steps that work well right now are marked with ✅.
- ✅ Scrape Website: The initial step involves scraping the website of the city of Nykarleby to gather the required data.
- ✅ Download PDFs: After scraping, we have metadata and download links for the protocols, which are then downloaded for further processing.
- ✅ Convert to HTML: The PDFs are converted into HTML format (instead of plain text). The HTML preserves the layout information of the PDFs, which is useful for extracting the data.
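The scraping step boils down to finding the protocol download links on a page. The sketch below shows one way to do that with the Python standard library only. It is an assumption-laden illustration, not the project's scraper: the real page structure is not described here, and the sample markup, the example.org URL, and the names `PdfLinkCollector` and `find_pdf_links` are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of all <a href="..."> links that point to PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_urls.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return the PDF links found in `html`, resolved against `base_url`."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_urls


if __name__ == "__main__":
    sample = '<a href="/files/protocol_1.pdf">Protocol</a> <a href="/about">About</a>'
    for url in find_pdf_links(sample, "https://example.org/meetings/"):
        print(url)
```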
The structure of the meetings is as follows:
- ✅ Extract Meeting Metadata with LLM: Utilize a Large Language Model to extract metadata from the meetings documented in the HTML files. The JSON schema and prompt can be found in the llm folder.
- ⭕ Extract Agenda with LLM: Further extract the agenda from the meeting data using the LLM. The JSON schema and prompt can be found in the llm folder.
- ✅ Convert to JSON: The extracted data is then converted from a DataFrame (flat format) into JSON (hierarchical format). DataFrames are useful for quick filtering and manipulation of the data, whereas the JSON format is useful for creating a knowledge graph.
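The flat-to-hierarchical conversion can be sketched as a simple grouping step. The example below works on a list of row dictionaries, which is the shape that pandas' `DataFrame.to_dict("records")` produces. The column names (`meeting_id`, `meeting_date`, `item_number`, `item_title`) and the sample values are hypothetical and are not the project's actual schema.

```python
import json


def rows_to_nested(rows):
    """Group flat agenda-item rows under their parent meeting."""
    meetings = {}
    for row in rows:
        # Create the meeting entry the first time its id is seen.
        meeting = meetings.setdefault(
            row["meeting_id"],
            {
                "meeting_id": row["meeting_id"],
                "meeting_date": row["meeting_date"],
                "items": [],
            },
        )
        meeting["items"].append(
            {"number": row["item_number"], "title": row["item_title"]}
        )
    return list(meetings.values())


if __name__ == "__main__":
    flat_rows = [
        {"meeting_id": "m1", "meeting_date": "2023-01-10", "item_number": 1, "item_title": "Opening"},
        {"meeting_id": "m1", "meeting_date": "2023-01-10", "item_number": 2, "item_title": "Budget"},
        {"meeting_id": "m2", "meeting_date": "2023-02-14", "item_number": 1, "item_title": "Opening"},
    ]
    print(json.dumps(rows_to_nested(flat_rows), indent=2))
```

Each meeting appears once in the output with its items nested beneath it, which maps naturally onto nodes and relationships in a graph.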
KG Schema
- ✅ Cypher Script for JSON to KG: Convert the JSON formatted data into a knowledge graph using a Cypher script.
- ✅ User Query to Cypher with LLM: Convert the user query into a Cypher query using an LLM.
- ✅ Retrieved Relevant Data: Running the generated Cypher query retrieves the relevant data from the knowledge graph.
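Loading hierarchical JSON into the graph can be done with a single parameterized Cypher statement. The sketch below is an illustration only and does not reproduce the project's Cypher script: the node labels `Meeting` and `AgendaItem`, the relationship `HAS_ITEM`, the property names, and the helper `load_meetings` are hypothetical. It assumes the official `neo4j` Python driver.

```python
import os

# UNWIND iterates over the list parameter; MERGE makes the load safe to re-run.
LOAD_QUERY = """
UNWIND $meetings AS m
MERGE (meeting:Meeting {meeting_id: m.meeting_id})
SET meeting.meeting_date = m.meeting_date
WITH meeting, m
UNWIND m.items AS i
MERGE (item:AgendaItem {meeting_id: m.meeting_id, number: i.number})
SET item.title = i.title
MERGE (meeting)-[:HAS_ITEM]->(item)
"""


def load_meetings(driver, meetings):
    """Send the nested meeting data to Neo4j as one query parameter."""
    with driver.session() as session:
        session.run(LOAD_QUERY, {"meetings": meetings}).consume()


if __name__ == "__main__":
    # Imported here so the helper above can be tried without the driver installed.
    from neo4j import GraphDatabase

    sample = [
        {
            "meeting_id": "m1",
            "meeting_date": "2023-01-10",
            "items": [{"number": 1, "title": "Opening"}],
        }
    ]
    driver = GraphDatabase.driver(
        os.environ["NEO4J_URI"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
    )
    try:
        load_meetings(driver, sample)
    finally:
        driver.close()
```

Passing the data as a parameter, rather than formatting it into the query text, avoids quoting problems with titles that contain special characters.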
This workflow transforms unstructured data into structured knowledge that is easily accessible and queryable by end-users.
- ✅ Relevant Data + User Query: The relevant data and the user query are then sent to the LLM.
- ✅ LLM Answer Based on Data: The LLM then generates an answer to the user query based on the relevant data. The LLM can also dynamically write code to visualize the data, for example as a graph or a timeline.
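Combining the retrieved data with the user query amounts to assembling a prompt. The following sketch shows one possible way to do it. The prompt wording, the record fields, and the function name `build_answer_prompt` are hypothetical and are not taken from the project's prompts; the call to the LLM itself is left out.

```python
import json


def build_answer_prompt(question, records):
    """Combine the user's question with the records retrieved from the graph."""
    # ensure_ascii=False keeps non-ASCII characters in names and titles readable.
    context = json.dumps(records, ensure_ascii=False, indent=2)
    return (
        "Answer the question using only the data below. "
        "If the data does not contain the answer, say so.\n\n"
        f"Data:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    retrieved = [{"meeting_date": "2023-01-10", "title": "Budget"}]
    print(build_answer_prompt("When was the budget discussed?", retrieved))
```

Instructing the model to rely only on the supplied data is a common way to keep answers grounded in what the knowledge graph returned.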
- Test whether newer reasoning models such as o1 can improve the unstructured-to-structured data conversion and the retrieval.
- Add post-processing logic to remove/fix inconsistencies in the knowledge graph.
- Improve the configuration management.