A set of scripts to build a RAG from the videos of a YouTube channel.
This has evolved into a playground to explore RAG applications. Notably:
- Retrieval and generation parameters can be changed for each question
- Processing information is added to the response payload and can be visualized
- A number of files are dumped during processing with even more information (JSON format)


YouTube Data API: You need a Google Cloud account and a project set up in the Google Developer Console. Enable the YouTube Data API for your project and get an API key.
Get the video ID of any of the videos of the channel you want to analyze. You can extract this directly from the URL: for instance, in https://www.youtube.com/watch?v=AS2m2rRn9Cw&t=211s the video ID is `AS2m2rRn9Cw`.
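If you want to do this programmatically, the video ID is just the `v` query parameter of the URL. A quick way to pull it out in Python (a convenience snippet, not part of the repo's scripts):

```python
# Extract the "v" query parameter (the video ID) from a YouTube URL.
from urllib.parse import urlparse, parse_qs

url = "https://www.youtube.com/watch?v=AS2m2rRn9Cw&t=211s"
video_id = parse_qs(urlparse(url).query)["v"][0]
print(video_id)  # AS2m2rRn9Cw
```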
You also need Ollama installed, with at least one model pulled (e.g. `ollama pull mistral`). Mistral or Llama 2 are preferred.
```
pip install -r requirements.txt
```
We will execute the following steps:
- Get the list of all videos of the channel
- Download subtitles/captions for each video
- Load the subtitles/captions in our embedding database
Once this is done, you can run the web interface and ask questions to the channel host!
```
GOOGLE_API_KEY=XXXX ./src/list_videos.py AS2m2rRn9Cw
```
Of course, replace `XXXX` and `AS2m2rRn9Cw` with your own values. This will create a file called `videos.json` with all the information.
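For the curious, this is standard YouTube Data API v3 plumbing: resolve the video to its channel, find the channel's uploads playlist, then page through it. A rough sketch of that flow (the actual script may structure things differently):

```python
# Sketch of the YouTube Data API v3 calls needed to list a channel's videos
# starting from a single video ID. Requires google-api-python-client.
import os
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey=os.environ["GOOGLE_API_KEY"])

# 1. Video -> channel
video = youtube.videos().list(part="snippet", id="AS2m2rRn9Cw").execute()
channel_id = video["items"][0]["snippet"]["channelId"]

# 2. Channel -> uploads playlist
channel = youtube.channels().list(part="contentDetails", id=channel_id).execute()
uploads = channel["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

# 3. Page through the uploads playlist
videos, token = [], None
while True:
    page = youtube.playlistItems().list(
        part="snippet", playlistId=uploads, maxResults=50, pageToken=token
    ).execute()
    videos.extend(page["items"])
    token = page.get("nextPageToken")
    if not token:
        break

print(f"{len(videos)} videos found")
```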
```
./src/download_captions.py
```
This will create a folder `captions` and download two files for each video:
- `<id>.original.vtt`: original subtitles/captions
- `<id>.cleaned.vtt`: processed subtitles/captions (timestamps removed)
Note that if the original captions already exist, they will not be downloaded again. Existing files will still be processed again to recalculate the cleaned versions (useful after a rag-youtube upgrade).
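The cleaning step essentially strips the WEBVTT header, cue timestamps, and blank lines so that only the spoken text remains. A simplified illustration of that idea (the real script may do more, e.g. deduplicate rolling captions):

```python
# Simplified illustration of turning an original .vtt file into a
# "cleaned", text-only version by dropping headers and cue timestamps.
import re

TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> ")

def clean_vtt(path: str) -> str:
    lines = []
    for line in open(path, encoding="utf-8"):
        line = line.strip()
        if not line or line == "WEBVTT" or TIMESTAMP.match(line):
            continue
        lines.append(line)
    return "\n".join(lines)

print(clean_vtt("captions/AS2m2rRn9Cw.original.vtt"))
```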
```
./src/document_loader.py
```
This will load all documents and create a file called `loaded.json` recording the files that were correctly processed. This way, if you download new subtitles, you can re-run the script and only the new ones will be added to the database. To start over, simply delete the `db` folder and run the script again.
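Under the hood this is the usual split-embed-store pipeline. A minimal sketch of that kind of indexing, assuming a Chroma store persisted in `db` and the default HuggingFace Sentence Transformers embeddings (the actual implementation may differ):

```python
# Minimal sketch of the kind of indexing document_loader.py performs.
# The store type (Chroma) and chunking parameters are assumptions.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

text = open("captions/AS2m2rRn9Cw.cleaned.vtt", encoding="utf-8").read()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(text)

store = Chroma.from_texts(chunks, HuggingFaceEmbeddings(), persist_directory="db")
print(f"indexed {len(chunks)} chunks")
```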
```
./src/app.py
```
Then access http://localhost:5555.
You can change some defaults by creating a `rag-youtube.conf` file in the base folder. A good way to start is to copy `rag-youtube.sample.conf`: it contains all options commented out, with their default values specified. Feel free to play with them!
For the embeddings model, the default is to use a HuggingFace Sentence Transformers model. You can specify `ollama` to use Ollama embeddings, or `openai:xxxx` to use OpenAI embeddings.
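For reference, these settings correspond to embedding backends along these lines in langchain (the class names are langchain's, not rag-youtube's config values; how the app instantiates them is its own business):

```python
# Embedding backends as exposed by langchain.
from langchain_community.embeddings import (
    HuggingFaceEmbeddings,  # the default (Sentence Transformers)
    OllamaEmbeddings,       # the "ollama" setting
    OpenAIEmbeddings,       # the "openai:xxxx" setting
)

emb = HuggingFaceEmbeddings()  # downloads a small model on first use
print(emb.embed_query("hello world")[:5])  # first values of the embedding vector
```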
For enumerated options, acceptable values are:
- `chain_type`: `base`, `sources`, `conversation`
- `doc_chain_type`: `stuff`, `map_reduce` (`refine` and `map_rerank` have not been tested)
- `search_type`: `similarity`, `similarity_score_threshold`, `mmr`
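These values map more or less directly onto langchain concepts. A hedged sketch of how `doc_chain_type` and `search_type` are typically wired together (rag-youtube's own wiring may differ; its `chain_type` option selects the overall chain and is not shown here, and the model and store choices below are assumptions):

```python
# Sketch of how doc_chain_type and search_type typically map onto langchain.
from langchain.chains import RetrievalQA
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

store = Chroma(persist_directory="db", embedding_function=HuggingFaceEmbeddings())
retriever = store.as_retriever(
    search_type="mmr",  # or "similarity", "similarity_score_threshold"
    search_kwargs={"k": 4},
)
chain = RetrievalQA.from_chain_type(
    llm=Ollama(model="mistral"),
    chain_type="stuff",  # or "map_reduce"
    retriever=retriever,
)
print(chain.invoke({"query": "What does the host say about RAG?"}))
```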
You can enable langchain debugging through configuration. In that case, it is recommended to redirect the output to a text file and then replace every match of the regex `[ \t]*"context": \[[\d, \t\n]*\],\n` with nothing to clean up the trace.
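For instance, applied with Python (`trace.txt` below is just a placeholder for whatever file you redirected the debug output to):

```python
# Remove the bulky "context" arrays from a captured langchain debug trace.
import re

with open("trace.txt") as f:
    trace = f.read()

cleaned = re.sub(r'[ \t]*"context": \[[\d, \t\n]*\],\n', "", trace)

with open("trace.cleaned.txt", "w") as f:
    f.write(cleaned)
```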