Index and query your documents with LlamaIndex. This technique is called Retrieval-Augmented Generation (RAG).
Put your PDF documents into the ./data/documents directory, then index them:
python3 ./llamaindex_indexing.py -u files
The index data will be written to ./data/indexes/index.json.
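For reference, the core of this indexing step can be sketched as below, assuming the llama-index 0.8.x API and the directory layout above. This is a minimal sketch, not the script's actual code; note that persist() writes several JSON files into the directory, so the exact output layout (index.json) may differ.

# Minimal sketch of the PDF indexing step (llama-index 0.8.x API, assumed).
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Load every document found under ./data/documents
documents = SimpleDirectoryReader("./data/documents").load_data()

# Build a vector index (embeddings are created via the OpenAI API)
index = VectorStoreIndex.from_documents(documents)

# Persist the index to disk; persist() writes several JSON files,
# while the actual script stores its index as index.json.
index.storage_context.persist(persist_dir="./data/indexes")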
If you want to index RSS feeds, put their URLs into ./data/rss_url.json as shown below.
{
"urls": [
{
"url": "https://www.formula1.com/content/fom-website/en/latest/all.xml"
}
]
}
And/or, if you want to index website URLs, put them into ./data/article_url.json as shown below. Each URL is expanded into a JSON file when you execute the command described below (a sketch of this step follows the example).
{
"urls": [
{
"url": "https://www.formula1.com/en/latest/article.pirelli-to-continue-as-formula-1s-exclusive-tyre-supplier-until-2027.7xJIxJyMe84N3p7k4iIMjK.html"
}
]
}
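The sketch below shows one way the URL expansion could work: read the two JSON files, fetch the content, and wrap it in LlamaIndex Document objects before indexing. This is an assumption about the script's behavior, not its actual code; it uses only the standard library plus llama-index 0.8.x, while the real script may rely on a dedicated RSS or web page reader.

# Hypothetical sketch of expanding rss_url.json / article_url.json
# into LlamaIndex documents (the real script may differ).
import json
import urllib.request
import xml.etree.ElementTree as ET

from llama_index import Document, VectorStoreIndex


def load_urls(path: str) -> list[str]:
    # Read a {"urls": [{"url": ...}, ...]} file and return the URL strings.
    with open(path, encoding="utf-8") as f:
        return [entry["url"] for entry in json.load(f)["urls"]]


def rss_documents(feed_url: str) -> list[Document]:
    # Fetch an RSS 2.0 feed and turn each <item> into a Document.
    with urllib.request.urlopen(feed_url) as res:
        root = ET.fromstring(res.read())
    docs = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        docs.append(Document(text=f"{title}\n{description}"))
    return docs


def article_documents(page_url: str) -> list[Document]:
    # Fetch a web page and store its raw HTML as a Document.
    with urllib.request.urlopen(page_url) as res:
        html = res.read().decode("utf-8", errors="ignore")
    return [Document(text=html)]


documents = []
for url in load_urls("./data/rss_url.json"):
    documents.extend(rss_documents(url))
for url in load_urls("./data/article_url.json"):
    documents.extend(article_documents(url))

index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./data/indexes")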
Then index the RSS and article data:
python3 ./llamaindex_indexing.py -u rss
Input query: <INPUT_YOUR_QUERY>
To query the existing index, run the script without the -u option:
python3 ./llamaindex_indexing.py
Input query: <INPUT_YOUR_QUERY>
You will get an answer like the following:
==========
Query:
<THE_QUERY_YOU_INPUT>
Answer:
<ANSWER_FROM_AI>
==========
node.node.id_='876f8bdb-xxxx-xxxx-xxxx-xxxxxxxxxxxx', node.score=0.8484xxxxxxxxxxxxxx
----------
Cosine Similarity:
0.84xxxxxxxxxxxxxx
Reference text:
<THE_PART_AI_REFERRED_TO>
To quit the interactive prompt, type exit:
Input query: exit
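For reference, the query loop and the cosine similarity output shown above can be sketched as follows, assuming the index was persisted with a llama-index 0.8.x StorageContext. This is a sketch of the idea, not the script's actual code.

# Hypothetical sketch of the interactive query loop (llama-index 0.8.x API).
from llama_index import StorageContext, load_index_from_storage

# Rebuild the index persisted by the indexing step
storage_context = StorageContext.from_defaults(persist_dir="./data/indexes")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()

while True:
    query = input("Input query: ")
    if query == "exit":
        break
    response = query_engine.query(query)
    print("=" * 10)
    print(f"Query:\n{query}")
    print(f"Answer:\n{response.response}")
    print("=" * 10)
    # Each source node carries the similarity score used for retrieval
    for node in response.source_nodes:
        print(f"{node.node.id_=}, {node.score=}")
        print("-" * 10)
        print(f"Cosine Similarity:\n{node.score}")
        print(f"Reference text:\n{node.node.get_content()}")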
The following environment is required:
- Python 3.11 or higher.
To create and activate a venv environment:
python3 -m venv .venv
source .venv/bin/activate
To deactivate:
deactivate
To install the dependencies:
pip3 install --upgrade pip
pip3 install -r requirements.txt
The main libraries installed are as follows:
pip freeze | grep -e "openai" -e "pypdf" -e "llama-index" -e "tiktoken"
llama-index==0.8.42
openai==0.28.1
pypdf==3.16.3
tiktoken==0.5.1
Set your OpenAI API key as an environment variable, or add it to a shell dotfile such as .zshenv:
export OPENAI_API_KEY='YOUR_OPENAI_API_KEY'
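The key is read from the environment at run time. A quick way to confirm it is visible to Python (a convenience check, not part of the script):

# Quick check that the key is exported (not part of the script itself).
import os

assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"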