Welcome to the first part of our workshop! Here, we'll dive into building a Retrieval-Augmented Generation (RAG) application.
What is RAG? Think of it as giving a powerful language model (LLM) an "open book" test. Instead of just using what the LLM already knows, we first find relevant information in our own documents (the "open book") and then give that information to the LLM along with the question. This helps the LLM generate answers that are specific to our data and more up-to-date.
We'll use LangChain to orchestrate the process, LangGraph to build it as a sequence of steps, and Google Vertex AI for our LLM and text embedding needs.
Before we start coding, make sure you have the necessary tools.
- Install Libraries: You'll need Python, and you can install the core libraries using uv: `uv sync --all-groups`
- Document Loader: Create a `utils` folder and, inside it, a file named `documents.py`. In this file, define a function `get_documents()`. This function's job is to load whatever documents you want to use (like PDFs, text files, etc.) and return them as a list. Each item in the list should be a `langchain_core.documents.Document` object. You MUST use LangChain's `RecursiveUrlLoader` document loader for this exercise and scrape your favorite section from Melexis' website (a sketch of such a loader follows this list).
- Google Cloud Authentication: Ensure you're authenticated with Google Cloud and have the Vertex AI API enabled in your project (cf. Vertex AI hands-on).
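A minimal sketch of `utils/documents.py`, assuming `beautifulsoup4` is installed; the starting URL and `max_depth` are placeholders you should adapt to the section you pick:

```python
# utils/documents.py
from bs4 import BeautifulSoup
from langchain_community.document_loaders import RecursiveUrlLoader
from langchain_core.documents import Document


def get_documents() -> list[Document]:
    """Scrape one section of the Melexis website and return it as a list of Documents."""
    loader = RecursiveUrlLoader(
        "https://www.melexis.com/en/products",  # hypothetical starting URL, pick your favorite section
        max_depth=2,  # how many links deep to follow from the starting page
        extractor=lambda html: BeautifulSoup(html, "html.parser").text,  # strip HTML tags to plain text
    )
    return loader.load()
```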
We will structure our code within a main function, let's call it `get_graph()`. This function will set up all the components and build our RAG application graph, returning the compiled graph ready to be used.
Inside `get_graph()`, we need to set up the brains and the "search engine" of our RAG system.
- The LLM: We need a language model to generate answers.
  - Hint: Use `ChatVertexAI` from `langchain_google_vertexai.chat_models`.
  - Think: Which model should you use (e.g., `gemini-2.5-flash-preview-05-20`)? What `location`? Should you set `temperature` to 0.0 for more predictable results?
- The Embedder: We need a way to turn text into numbers (vectors) so we can search for similar pieces of text.
  - Hint: Use `VertexAIEmbeddings` from `langchain_google_vertexai.embeddings`.
  - Think: Which embedding model (e.g., `text-embedding-005`) and `location`?
- The Vector Store: This is where we'll store our text chunks and their corresponding vectors for fast searching. For simplicity, we'll use one that works in memory.
  - Hint: Use `InMemoryVectorStore` from `langchain_core.vectorstores`.
  - Think: What does `InMemoryVectorStore` need when you initialize it? (It needs the embedder.)
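A minimal sketch of this setup inside `get_graph()`; the model names and `location` are assumptions, so substitute whatever your project uses:

```python
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_google_vertexai.chat_models import ChatVertexAI
from langchain_google_vertexai.embeddings import VertexAIEmbeddings

llm = ChatVertexAI(
    model_name="gemini-2.5-flash-preview-05-20",
    location="europe-west1",  # hypothetical region
    temperature=0.0,  # deterministic answers make debugging easier
)
embedder = VertexAIEmbeddings(model_name="text-embedding-005", location="europe-west1")
vector_store = InMemoryVectorStore(embedding=embedder)
```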
Now, let's get our documents ready for the RAG process.
- Load: Call your `get_documents()` function to load your source material.
- Split: LLMs have limited input windows, so we need to break our documents into smaller chunks.
  - Hint: Use `RecursiveCharacterTextSplitter` from `langchain_text_splitters`.
  - Think: What `chunk_size` (e.g., 1000 characters) and `chunk_overlap` (e.g., 200 characters) make sense? How do you call it to split your loaded `docs`?
- Store: Convert these chunks into vectors and save them in the vector store.
  - Hint: Your `vector_store` object has a method for this. Look for something like `add_documents`.
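A minimal sketch of the indexing step, using the example chunk sizes from the hints above and assuming the `vector_store` created earlier is in scope:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

from utils.documents import get_documents

docs = get_documents()  # load the scraped pages
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)  # break pages into overlapping chunks
vector_store.add_documents(chunks)  # embed and index the chunks
```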
LangGraph helps us define our application as a graph of steps.
- The Prompt: We need a template to structure the input for the LLM, telling it how to use the retrieved context.
  - Hint: Use `langchain.hub.pull`. Look for a common RAG prompt like `"rlm/rag-prompt"`.
- The State: LangGraph works by passing a "state" object between steps. We need to define what information our state should hold.
  - Hint: Use `TypedDict` from `typing_extensions`.
  - Think: What information needs to flow through our RAG process? We'll definitely need the `question`, the retrieved `context` (which should be a `List` of `Document` objects), and finally the `answer`.
- The Nodes (Steps): We need to define Python functions for each step in our RAG flow. Each function will take the current `State` as input and return a dictionary with the parts of the state it has updated.
  - `retrieve` Node: This function should take the `question` from the state and use the `vector_store` to find relevant documents.
    - Hint: Use the `similarity_search` method of your `vector_store`.
    - Think: What should this function return? (A dictionary: `{"context": ...}`.)
  - `generate` Node: This function should take the `question` and the `context` from the state, use the RAG `prompt` to format them, call the `llm` to get an answer, and return the answer.
    - Hint: You'll need to combine the `page_content` from all documents in the `context`. Then, use the `prompt.invoke()` method, followed by `llm.invoke()`.
    - Think: What should this function return? (A dictionary: `{"answer": ...}`.)
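A minimal sketch of the prompt, state, and nodes, assuming the `llm` and `vector_store` created earlier are in scope inside `get_graph()`:

```python
from typing import List

from langchain import hub
from langchain_core.documents import Document
from typing_extensions import TypedDict

prompt = hub.pull("rlm/rag-prompt")  # a common, community-shared RAG prompt


class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


def retrieve(state: State) -> dict:
    # Find the chunks most similar to the question.
    docs = vector_store.similarity_search(state["question"])
    return {"context": docs}


def generate(state: State) -> dict:
    # Stuff the retrieved chunks into the prompt and ask the LLM.
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": context_text})
    response = llm.invoke(messages)
    return {"answer": response.content}
```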
Now, let's assemble our nodes into a working graph.
- Initialize: Create an instance of `StateGraph` from `langgraph.graph`, passing your `State` definition.
- Add Nodes: Add your `retrieve` and `generate` functions as nodes.
  - Hint: Use the `add_node` method, giving each node a name (e.g., "retrieve", "generate") and a reference to your function.
- Define Edges: Define how the state flows from one node to the next. For a simple RAG, it's a sequence.
  - Hint: You can use `add_sequence` for a linear flow, or `add_edge` to define connections. You need to specify the `START` point (import `START` from `langgraph.graph`).
- Compile: Turn your graph definition into a runnable application.
  - Hint: Use the `compile()` method.
- Return: Make sure your `get_graph()` function returns the compiled graph.
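A minimal sketch of the assembly, using `add_sequence` for the linear flow and assuming `State`, `retrieve`, and `generate` from the previous step:

```python
from langgraph.graph import START, StateGraph

builder = StateGraph(State)
builder.add_sequence([retrieve, generate])  # nodes are named after the functions
builder.add_edge(START, "retrieve")  # entry point of the graph
graph = builder.compile()
# get_graph() should return this compiled graph
```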
Finally, let's make the script runnable so you can test it.
- Hint: Use the standard Python `if __name__ == "__main__":` block.
- Inside this block:
  - Call `get_graph()` to get your compiled RAG application.
  - Call the `invoke()` method on your graph.
  - Think: What does `invoke` need as input? (It needs a dictionary matching your `State`, at least with the initial `question`.)
  - Print the `answer` from the result!
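A minimal sketch of the entry point; the question is just an example:

```python
if __name__ == "__main__":
    rag = get_graph()
    result = rag.invoke({"question": "What products does Melexis make?"})  # hypothetical question
    print(result["answer"])
```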
Good luck with the coding! Don't hesitate to consult the LangChain and LangGraph documentation if you get stuck, or ask your workshop instructors for a nudge in the right direction. The goal is to understand how these pieces fit together by building it yourself.
Now that we have a RAG application, how do we know if it's working well? We need to test it! Manually creating a comprehensive test set (questions and expected answers) can be tedious. In this part, we'll use a library called Ragas to automatically generate a test set directly from our documents.
What is Ragas? Ragas is a framework specifically designed for evaluating RAG pipelines. One of its handy features is the ability to generate question/answer pairs from your documents, which we can then use as a "ground truth" (or close to it) for testing.
We need a few more libraries for this part.
- Install Libraries: If you haven't already, install `ragas` and `pandas`: `uv add ragas pandas`
- Document Loader: We'll use the same `utils/documents.py` and `get_documents()` function from Part 1.
Ragas needs its own connection to an LLM and an embedding model to generate the questions and answers.
- LLM & Embeddings: Just like in Part 1, you'll need to initialize `ChatVertexAI` and `VertexAIEmbeddings`. You can use the same models and settings.
- Ragas Wrappers: Ragas has its own way of interacting with models. We need to wrap our LangChain models so Ragas can understand them.
  - Hint: Look for `LangchainLLMWrapper` in `ragas.llms` and `LangchainEmbeddingsWrapper` in `ragas.embeddings`.
  - Think: How do you pass your initialized LangChain LLM and Embeddings into these wrappers?
With our wrapped models ready, we can create the main Ragas tool for this task.
- Hint: You'll need to instantiate `TestsetGenerator` from `ragas.testset`.
- Think: What arguments does `TestsetGenerator` likely need? (It needs the `llm` and `embedding_model`; make sure to use your wrapped versions!)
As before, load your documents using your `get_documents()` function. Keep track of how many documents you've loaded (`len(docs)`).
Generating a test set can be resource-intensive, especially with many documents. It's often safer and more manageable to process documents in smaller batches.
- Plan:
  - Decide on a total `testset_size` you want (e.g., 50 questions).
  - Decide on a `batch_size` (e.g., 50 documents per batch).
  - Create an empty list (e.g., `test_set`) to hold all the generated questions.
  - You'll need a loop that goes through your `docs` list in steps of `batch_size`.
- Inside the Loop:
  - Get the current batch of documents.
  - Calculate how many test questions to generate for this specific batch. This should be proportional to the total number of documents (e.g., `(batch_doc_count / total_docs) * testset_size`). Remember to round it and make sure it's at least 1.
  - Call the generator.
    - Hint: Use the `generate_with_langchain_docs` method of your `generator` object.
    - Think: What arguments does it need? You'll need to provide the `documents` (your current batch) and the `testset_size` (your calculated batch test set size). You might also want to look into `RunConfig` (from `ragas.run_config`) to potentially speed things up with `max_workers`.
- Error Handling: Generation can sometimes fail, especially with diverse documents. It's wise to wrap the generation call in a `try...except` block (especially for `ValueError`) and print a message if a batch fails, allowing the loop to continue.
- Collect Results: The generator returns a special `Testset` object. You need to convert it into a more usable format.
  - Hint: Look for a `.to_list()` method on the result.
  - Add the results from this batch to your main `test_set` list.
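A minimal sketch of the batched loop, assuming the `generator` and `docs` from the previous steps; the sizes and worker count are just the example values:

```python
from ragas.run_config import RunConfig

testset_size = 50  # total number of questions we want
batch_size = 50  # documents per batch
test_set = []

for start in range(0, len(docs), batch_size):
    batch = docs[start : start + batch_size]
    # Proportional share of the total test set, but at least 1 question per batch.
    batch_testset_size = max(1, round(len(batch) / len(docs) * testset_size))
    try:
        result = generator.generate_with_langchain_docs(
            documents=batch,
            testset_size=batch_testset_size,
            run_config=RunConfig(max_workers=4),  # hypothetical worker count
        )
        test_set.extend(result.to_list())
    except ValueError as err:
        print(f"Skipping batch starting at document {start}: {err}")
```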
Once the loop finishes, you'll have a list of generated questions and answers. Let's clean it up and save it.
- Convert to DataFrame: It's much easier to work with this data as a table.
  - Hint: Use `pandas.DataFrame()` to convert your `test_set` list.
- Remove Duplicates: Synthetic generation can sometimes create very similar or identical questions or answers. Let's remove them.
  - Hint: Use the `.drop_duplicates()` method on your DataFrame.
  - Think: Which columns should you check for duplicates? The question itself (often called `user_input` or similar by Ragas) and the ground truth answer (`reference`) are good candidates.
- Save to CSV: Save your cleaned dataset so you can use it in the next step.
  - Hint: Use the `.to_csv()` method on your DataFrame.
  - Think: Give it a filename (like `evaluation_dataset.csv`). Should you include the index column? (Probably not, so use `index=False`.)
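A minimal sketch of the cleanup, assuming the column names `user_input` and `reference` that Ragas typically produces:

```python
import pandas as pd

df = pd.DataFrame(test_set)
df = df.drop_duplicates(subset=["user_input", "reference"])  # drop near-identical Q&A pairs
df.to_csv("evaluation_dataset.csv", index=False)  # no index column in the file
```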
Great! If everything ran correctly, you should now have an `evaluation_dataset.csv` file filled with questions and reference answers based on your documents. In the final part, we'll use this dataset to evaluate the RAG application we built earlier.
Here are the instructions for the final part of the workshop: evaluating the RAG application.
We've built a RAG application (Part 1) and generated a test dataset (Part 2). Now it's time to put our RAG app to the test and see how well it performs! We'll use Ragas again, this time to calculate various metrics that tell us about the quality of our RAG system's answers and its retrieval process.
Why Evaluate? Evaluation tells us if our RAG application is Faithful (doesn't make things up), Relevant (answers the question), and if its Context Retrieval is effective (finds the right information). This helps us understand its strengths and weaknesses and guide improvements.
- Make sure you have your `evaluation_dataset.csv` file from Part 2.
- Ensure your RAG application code (e.g., `rag_app.py`) with the `get_graph()` function is available to be imported.
- You'll need `pandas` and `ragas` installed.
The first step is to load the questions and reference answers we generated earlier.
- Hint: Use `pandas.read_csv()` to load your `evaluation_dataset.csv` into a DataFrame.
We need to run every question from our test set through the RAG application we built in Part 1 to get its actual answers and the documents it retrieved. Since running questions one by one can be slow, we'll try to run them in a batch.
- Create a Helper Function: Define a function, say `get_answers(questions: list[str])`, that takes a list of question strings.
- Inside the Function:
  - Import your `get_graph` function from your Part 1 code.
  - Call `get_graph()` to get your compiled RAG application.
  - Your graph's `batch` method expects a list of dictionaries, not just strings. You'll need to transform your input list into `[{"question": q1}, {"question": q2}, ...]`.
  - Hint: Use the `graph.batch()` method. This is much faster than calling `invoke` in a loop.
  - Return the list of results from the `batch` call.
- Get Results: Call your new `get_answers` function, passing it the `user_input` column from your DataFrame (convert it to a list first using `.tolist()`).
- Add to DataFrame: The results will be a list of dictionaries (each matching your `State` from Part 1). You need to extract the `answer` and the `context` from each result and add them as new columns to your DataFrame.
  - Think: How do you access values in a list of dictionaries? For the `context`, Ragas expects a list of strings (the `page_content`), not `Document` objects. You'll need to process the context list accordingly.
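A minimal sketch of this step; `rag_app` is the hypothetical module name from the prerequisites above:

```python
import pandas as pd

from rag_app import get_graph


def get_answers(questions: list[str]) -> list[dict]:
    graph = get_graph()
    # batch() runs all questions through the graph concurrently, instead of one invoke() per question.
    return graph.batch([{"question": q} for q in questions])


df = pd.read_csv("evaluation_dataset.csv")
results = get_answers(df["user_input"].tolist())
df["answer"] = [r["answer"] for r in results]
# Ragas expects plain strings for the retrieved contexts, not Document objects.
df["retrieved_contexts"] = [[doc.page_content for doc in r["context"]] for r in results]
```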
Ragas needs the data in a specific structure to perform the evaluation. We need to convert our DataFrame into a `ragas.EvaluationDataset`.
- Handle `reference_contexts`: The `reference_contexts` column in your CSV might be stored as a string representation of a list (e.g., `"['context1', 'context2']"`). Ragas needs it as an actual Python list.
  - Hint: Write a small helper function `parse_reference_contexts(value)`. Inside it, try using `ast.literal_eval` (you'll need to `import ast`). Since `ast.literal_eval` can be strict, you might want a `try...except` block that falls back to `json.loads` (`import json`) or just returns the value if parsing fails. Make sure to handle cases where it might already be a list (if you re-run this without saving/loading).
- Create `SingleTurnSample` Objects: Ragas uses `SingleTurnSample` (from `ragas`) to represent each Q&A pair along with its context.
  - Hint: You'll need to iterate through each row of your DataFrame (using `.iterrows()` is one way) and create a `SingleTurnSample` for each.
  - Think: Look at the `SingleTurnSample` documentation or its signature. You'll need to map your DataFrame columns (`user_input`, `reference`, `answer`, `retrieved_contexts`) to its parameters. Remember to use your `parse_reference_contexts` function for the `reference_contexts`.
- Create `EvaluationDataset`: Collect all your `SingleTurnSample` objects into a list and use it to create an `EvaluationDataset` (from `ragas`).
  - Hint: `EvaluationDataset(samples=your_list_of_samples)`.
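A minimal sketch of the conversion, assuming the DataFrame columns named above (`user_input`, `reference`, `reference_contexts`, `answer`, `retrieved_contexts`):

```python
import ast

from ragas import EvaluationDataset, SingleTurnSample


def parse_reference_contexts(value):
    # Values read back from CSV look like "['context1', 'context2']"; turn them into real lists.
    if isinstance(value, list):
        return value
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return [value]  # fall back to treating the raw string as a single context


samples = [
    SingleTurnSample(
        user_input=row["user_input"],
        reference=row["reference"],
        reference_contexts=parse_reference_contexts(row["reference_contexts"]),
        response=row["answer"],  # the RAG app's answer maps to the `response` field
        retrieved_contexts=row["retrieved_contexts"],
    )
    for _, row in df.iterrows()
]
dataset = EvaluationDataset(samples=samples)
```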
Now, let's choose which aspects of our RAG system we want to measure.
- Initialize LLM/Embeddings: Ragas needs LLM and embedding models (just like before) for some of its metrics, which use LLMs to judge the quality.
  - Hint: Initialize `ChatVertexAI` and `VertexAIEmbeddings`. You might need `LangchainLLMWrapper` for certain metrics.
- Select Metrics: Choose a set of metrics from `ragas.metrics`. Good starting points include:
  - `ResponseRelevancy`: Is the answer relevant to the question?
  - `Faithfulness`: Does the answer stick to the provided context?
  - `LLMContextPrecisionWithReference`: Are the retrieved contexts relevant, judged by an LLM against a reference?
  - `LLMContextRecall`: Did we retrieve all the necessary context, judged by an LLM?
  - `ContextEntityRecall`: Did we retrieve documents containing key entities from the reference answer? (May need an LLM passed in.)
  - `NoiseSensitivity`: Does the RAG system's answer change significantly if noisy (irrelevant) documents are added to the context? (Needs an LLM.)
  - Think: Create a list containing instances of these metric classes. Check if any require you to pass the `llm` during initialization.
- Configure Run: Set up how Ragas should perform the evaluation.
  - Hint: Use `RunConfig` from `ragas`. You can set `max_workers` for parallelism and maybe a `timeout`.
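A minimal sketch of the metric selection; whether each metric takes the judge `llm` at construction time can vary between Ragas versions, so treat the keyword arguments as assumptions:

```python
from langchain_google_vertexai.chat_models import ChatVertexAI
from langchain_google_vertexai.embeddings import VertexAIEmbeddings
from ragas import RunConfig
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    ContextEntityRecall,
    Faithfulness,
    LLMContextPrecisionWithReference,
    LLMContextRecall,
    NoiseSensitivity,
    ResponseRelevancy,
)

llm = ChatVertexAI(model_name="gemini-2.5-flash-preview-05-20", temperature=0.0)
embeddings = VertexAIEmbeddings(model_name="text-embedding-005")
evaluator_llm = LangchainLLMWrapper(llm)  # judge LLM for the LLM-based metrics

metrics = [
    ResponseRelevancy(),
    Faithfulness(),
    LLMContextPrecisionWithReference(),
    LLMContextRecall(),
    ContextEntityRecall(llm=evaluator_llm),
    NoiseSensitivity(llm=evaluator_llm),
]
run_config = RunConfig(max_workers=4, timeout=120)  # hypothetical values
```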
This is the moment of truth!
- Call `evaluate`: Use the main `ragas.evaluate` function.
  - Think: What arguments will it need? You'll need to pass your `EvaluationDataset`, your list of `metrics`, the `RunConfig`, and likely the `llm` and `embeddings` you initialized in the previous step. You can also set a `batch_size` here to control how many evaluations run at once.
- Print Results: The `evaluate` function returns a dictionary (or a similar object) containing the scores for each metric. Print it out!
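A minimal sketch of the call, assuming the `dataset`, `metrics`, `run_config`, `evaluator_llm`, and `embeddings` built in the previous steps:

```python
from ragas import evaluate

result = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=evaluator_llm,  # judge LLM for the LLM-based metrics
    embeddings=embeddings,  # used by embedding-based metrics such as ResponseRelevancy
    run_config=run_config,
    batch_size=8,  # hypothetical: how many evaluations run at once
)
print(result)
```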
Congratulations! You've now not only built a RAG application but also systematically evaluated its performance using a synthetically generated dataset. The scores you see give you valuable insights. Low faithfulness might mean you need to adjust your prompt or LLM settings. Low context recall might mean your chunking or retrieval strategy needs a rethink. This is the starting point for iterating and improving your Generative AI solution!