diff --git a/docs/core_docs/.gitignore b/docs/core_docs/.gitignore
index b88cb13fc1c6..7e987abbe326 100644
--- a/docs/core_docs/.gitignore
+++ b/docs/core_docs/.gitignore
@@ -176,4 +176,4 @@ docs/how_to/assign.mdx
 docs/how_to/agent_executor.md
 docs/how_to/agent_executor.mdx
 docs/integrations/llms/mistral.md
-docs/integrations/llms/mistral.mdx
+docs/integrations/llms/mistral.mdx
\ No newline at end of file
diff --git a/docs/core_docs/docs/concepts.mdx b/docs/core_docs/docs/concepts.mdx
index 9f66e0cb1ac1..7485b9cc9500 100644
--- a/docs/core_docs/docs/concepts.mdx
+++ b/docs/core_docs/docs/concepts.mdx
@@ -919,30 +919,151 @@ For a full list of model providers that support tool calling, [see this table](/
 ### Retrieval
-LangChain provides several advanced retrieval types. A full list is below, along with the following information:
+LLMs are trained on a large but fixed dataset, limiting their ability to reason over private or recent information. Fine-tuning an LLM with specific facts is one way to mitigate this, but is often [poorly suited for factual recall](https://www.anyscale.com/blog/fine-tuning-is-for-form-not-facts) and [can be costly](https://www.glean.com/blog/how-to-build-an-ai-assistant-for-the-enterprise).
+Retrieval is the process of providing relevant information to an LLM to improve its response for a given input. Retrieval augmented generation (RAG) is the process of grounding the LLM generation (output) using the retrieved information.
-**Name**: Name of the retrieval algorithm.
+:::tip
-**Index Type**: Which index type (if any) this relies on.
+- See our RAG from Scratch [video series](https://youtube.com/playlist?list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x&feature=shared).
+  The code examples are in Python, but the series is useful for a general overview of RAG concepts for visual learners.
+- For a high-level guide on retrieval, see this [tutorial on RAG](/docs/tutorials/rag/).
-**Uses an LLM**: Whether this retrieval method uses an LLM.
+:::
+
+RAG is only as good as the retrieved documents’ relevance and quality. Fortunately, an emerging set of techniques can be employed to design and improve RAG systems. We've focused on taxonomizing and summarizing many of these techniques (see the figure below) and will share some high-level strategic guidance in the following sections.
+You can and should experiment with using different pieces together. You might also find [this LangSmith guide](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application) useful, as it shows how to evaluate different iterations of your app.
+
+![](/img/rag_landscape.png)
+
+#### Query Translation
+
+First, consider the user input(s) to your RAG system. Ideally, a RAG system can handle a wide range of inputs, from poorly worded questions to complex multi-part queries.
+**Using an LLM to review and optionally modify the input is the central idea behind query translation.** This serves as a general buffer, optimizing raw user inputs for your retrieval system.
+For example, this can be as simple as extracting keywords or as complex as generating multiple sub-questions for a complex query.
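+
+As a rough illustration of the multi-query approach listed in the table below (the `MemoryVectorStore`, `OpenAIEmbeddings`, a Claude 3 Haiku model, and the rewrite prompt are all illustrative stand-ins), you can ask a model for a few rewordings of the question, retrieve documents for each rewording, and keep the unique results:
+
+```typescript
+// Sketch of multi-query retrieval: rewrite the question, retrieve per rewrite, de-duplicate.
+import { ChatAnthropic } from "@langchain/anthropic";
+import { OpenAIEmbeddings } from "@langchain/openai";
+import { ChatPromptTemplate } from "@langchain/core/prompts";
+import { StringOutputParser } from "@langchain/core/output_parsers";
+import { MemoryVectorStore } from "langchain/vectorstores/memory";
+
+const vectorStore = await MemoryVectorStore.fromTexts(
+  ["Buildings are made out of brick", "Buildings are made out of wood"],
+  [{ id: 1 }, { id: 2 }],
+  new OpenAIEmbeddings()
+);
+const retriever = vectorStore.asRetriever();
+
+// Ask the model for alternative phrasings of the user question, one per line.
+const rewriteChain = ChatPromptTemplate.fromTemplate(
+  `Generate 3 different rewordings of the following question, one per line.\n\nQuestion: {question}`
+)
+  .pipe(new ChatAnthropic({ model: "claude-3-haiku-20240307" }))
+  .pipe(new StringOutputParser());
+
+const question = "What are buildings made of?";
+const rewrites = (await rewriteChain.invoke({ question })).split("\n").filter(Boolean);
+
+// Retrieve for the original question and each rewrite, then keep unique documents.
+const docLists = await Promise.all(
+  [question, ...rewrites].map((q) => retriever.invoke(q))
+);
+const seen = new Set<string>();
+const uniqueDocs = docLists.flat().filter((doc) => {
+  if (seen.has(doc.pageContent)) return false;
+  seen.add(doc.pageContent);
+  return true;
+});
+console.log(uniqueDocs.map((doc) => doc.pageContent));
+```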
+
+| Name | When to use | Description |
+| --- | --- | --- |
+| [Multi-query](/docs/how_to/multiple_queries/) | When you need to cover multiple perspectives of a question. | Rewrite the user question from multiple perspectives, retrieve documents for each rewritten question, and return the unique documents across all queries. |
+| [Decomposition (Python cookbook)](https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb) | When a question can be broken down into smaller subproblems. | Decompose a question into a set of subproblems / questions, which can either be solved sequentially (use the answer from the first, plus retrieval, to answer the second) or in parallel (consolidate each answer into a final answer). |
+| [Step-back (Python cookbook)](https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb) | When a higher-level conceptual understanding is required. | First prompt the LLM to ask a generic step-back question about higher-level concepts or principles, and retrieve relevant facts about them. Use this grounding to help answer the user question. |
+| [HyDE (Python cookbook)](https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb) | If you have challenges retrieving relevant documents using the raw user inputs. | Use an LLM to convert questions into hypothetical documents that answer the question. Use the embedded hypothetical documents to retrieve real documents, with the premise that doc-doc similarity search can produce more relevant matches. |
+
+:::tip
+
+See our Python RAG from Scratch videos for a few different specific approaches:
+
+- [Multi-query](https://youtu.be/JChPi0CRnDY?feature=shared)
+- [Decomposition](https://youtu.be/h0OPWlEOank?feature=shared)
+- [Step-back](https://youtu.be/xn1jEjRyJ2U?feature=shared)
+- [HyDE](https://youtu.be/SaDzIVkYqyY?feature=shared)
+
+:::
+
+#### Routing
+
+Second, consider the data sources available to your RAG system. You may need to query across more than one database, or across both structured and unstructured data sources. **Using an LLM to review the input and route it to the appropriate data source is a simple and effective approach for querying across sources.**
+
+| Name | When to use | Description |
+| --- | --- | --- |
+| [Logical routing](/docs/how_to/routing/) | When you can prompt an LLM with rules to decide where to route the input. | Logical routing can use an LLM to reason about the query and choose which datastore is most appropriate. |
+| [Semantic routing](/docs/how_to/routing/#routing-by-semantic-similarity) | When semantic similarity is an effective way to determine where to route the input. | Semantic routing embeds both the query and, typically, a set of prompts, then chooses the appropriate prompt based on similarity. |
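+
+As a rough sketch of logical routing (the datasource names and the `gpt-3.5-turbo` model are purely illustrative), you can ask a model to pick a datastore via structured output and then dispatch to the matching retriever:
+
+```typescript
+import { z } from "zod";
+import { ChatOpenAI } from "@langchain/openai";
+
+// The model must choose one of these (hypothetical) datasources.
+const routeSchema = z.object({
+  datasource: z
+    .enum(["typescript_docs", "python_docs"])
+    .describe("The datasource most relevant to the user question."),
+});
+
+const router = new ChatOpenAI({
+  model: "gpt-3.5-turbo",
+  temperature: 0,
+}).withStructuredOutput(routeSchema);
+
+const route = await router.invoke(
+  "Why am I getting a type error when I pipe a prompt into a chat model?"
+);
+console.log(route);
+// e.g. { datasource: "typescript_docs" } -- use this to select which retriever to query.
+```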
+
+:::tip
+
+See our Python RAG from Scratch video on [routing](https://youtu.be/pfpIndq7Fi8?feature=shared).
+
+:::
+
+#### Query Construction
+
+Third, consider whether any of your data sources require specific query formats. Many structured databases use SQL. Vector stores often have specific syntax for applying keyword filters to document metadata. **Using an LLM to convert a natural language query into a query syntax is a popular and powerful approach.**
+In particular, [text-to-SQL](/docs/tutorials/sql_qa/), [text-to-Cypher](/docs/tutorials/graph/), and [query analysis for metadata filters](/docs/tutorials/query_analysis/#query-analysis) are useful ways to interact with structured, graph, and vector databases respectively.
+
+| Name | When to Use | Description |
+| --- | --- | --- |
+| [Text-to-SQL](/docs/tutorials/sql_qa/) | If users are asking questions that require information housed in a relational database, accessible via SQL. | This uses an LLM to transform user input into a SQL query. |
+| [Text-to-Cypher](/docs/tutorials/graph/) | If users are asking questions that require information housed in a graph database, accessible via Cypher. | This uses an LLM to transform user input into a Cypher query. |
+| [Self Query](/docs/how_to/self_query/) | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filter to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself). |
+
+:::tip
+
+See our [blog post overview](https://blog.langchain.dev/query-construction/) and RAG from Scratch video on [query construction](https://youtu.be/kl6NwWYxvbM?feature=shared), the process of text-to-DSL, where a DSL is a domain-specific language required to interact with a given database. This converts user questions into structured queries.
+
+:::
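+
+As a rough sketch of query analysis for metadata filters (the schema fields and the model are hypothetical), you can have a model produce both a semantic search string and a structured filter:
+
+```typescript
+import { z } from "zod";
+import { ChatOpenAI } from "@langchain/openai";
+
+// A structured query: text to embed plus an optional metadata filter.
+const searchSchema = z.object({
+  query: z.string().describe("Text to search for semantically."),
+  year: z.number().optional().describe("Restrict results to documents from this year."),
+});
+
+const queryAnalyzer = new ChatOpenAI({
+  model: "gpt-3.5-turbo",
+  temperature: 0,
+}).withStructuredOutput(searchSchema);
+
+const structuredQuery = await queryAnalyzer.invoke("videos on RAG published in 2023");
+console.log(structuredQuery);
+// e.g. { query: "RAG", year: 2023 } -- translate `year` into your vector store's metadata filter syntax.
+```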
+
+#### Indexing
-**When to Use**: Our commentary on when you should considering using this retrieval method.
+Fourth, consider the design of your document index. A simple and powerful idea is to **decouple the documents that you index for retrieval from the documents that you pass to the LLM for generation.** Indexing frequently uses embedding models with vector stores, which [compress the semantic information in documents to fixed-size vectors](/docs/concepts/#embedding-models).
-**Description**: Description of what this retrieval algorithm is doing.
+Many RAG approaches focus on splitting documents into chunks and retrieving some number of them for the LLM based on similarity to the input question. But chunk size and chunk number can be difficult to set, and they affect results if they do not provide full context for the LLM to answer a question. Furthermore, LLMs are increasingly capable of processing millions of tokens.
-| Name | Index Type | Uses an LLM | When to Use | Description |
-| --- | --- | --- | --- | --- |
-| [Vectorstore](/docs/how_to/vectorstore_retriever/) | Vectorstore | No | If you are just getting started and looking for something quick and easy. | This is the simplest method and the one that is easiest to get started with. It involves creating embeddings for each piece of text. |
-| [ParentDocument](/docs/how_to/parent_document_retriever/) | Vectorstore + Document Store | No | If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together. | This involves indexing multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks). |
-| [Multi Vector](/docs/how_to/multi_vector/) | Vectorstore + Document Store | Sometimes during indexing | If you are able to extract information from documents that you think is more relevant to index than the text itself. | This involves creating multiple vectors for each document. Each vector could be created in a myriad of ways - examples include summaries of the text and hypothetical questions. |
-| [Self Query](/docs/how_to/self_query/) | Vectorstore | Yes | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filer to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself). |
-| [Contextual Compression](/docs/how_to/contextual_compression/) | Any | Sometimes | If you are finding that your retrieved documents contain too much irrelevant information and are distracting the LLM. | This puts a post-processing step on top of another retriever and extracts only the most relevant information from retrieved documents. This can be done with embeddings or an LLM. |
-| [Time-Weighted Vectorstore](/docs/how_to/time_weighted_vectorstore/) | Vectorstore | No | If you have timestamps associated with your documents, and you want to retrieve the most recent ones. | This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at timestamps of indexed documents) |
-| [Multi-Query Retriever](/docs/how_to/multiple_queries/) | Any | Yes | If users are asking questions that are complex and require multiple pieces of distinct information to respond. | This uses an LLM to generate multiple queries from the original one. This is useful when the original query needs pieces of information about multiple topics to be properly answered. By generating multiple queries, we can then fetch documents for each of them. |
-| [Ensemble](/docs/how_to/ensemble_retriever) | Any | No | If you have multiple retrieval methods and want to try combining them. | This fetches documents from multiple retrievers and then combines them. |
+Two approaches can address this tension: (1) The [Multi Vector](/docs/how_to/multi_vector/) retriever uses an LLM to translate documents into a form that is well-suited for indexing (often a summary), but returns the full documents to the LLM for generation. (2) The [ParentDocument](/docs/how_to/parent_document_retriever/) retriever embeds document chunks, but also returns full documents. The idea is to get the best of both worlds: use concise representations (summaries or chunks) for retrieval, but use the full documents for answer generation.
-For a high-level guide on retrieval, see this [tutorial on RAG](/docs/tutorials/rag/).
+
+| Name | Index Type | Uses an LLM | When to Use | Description |
+| --- | --- | --- | --- | --- |
+| [Vector store](/docs/how_to/vectorstore_retriever/) | Vector store | No | If you are just getting started and looking for something quick and easy. | This is the simplest method and the one that is easiest to get started with. It involves creating embeddings for each piece of text. |
+| [ParentDocument](/docs/how_to/parent_document_retriever/) | Vector store + Document Store | No | If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together. | This involves indexing multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks). |
+| [Multi Vector](/docs/how_to/multi_vector/) | Vector store + Document Store | Sometimes during indexing | If you are able to extract information from documents that you think is more relevant to index than the text itself. | This involves creating multiple vectors for each document. Each vector could be created in a myriad of ways - examples include summaries of the text and hypothetical questions. |
+| [Time-Weighted Vector store](/docs/how_to/time_weighted_vectorstore/) | Vector store | No | If you have timestamps associated with your documents, and you want to retrieve the most recent ones. | This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at timestamps of indexed documents). |
+
+:::tip
+
+- See our Python RAG from Scratch video on [indexing fundamentals](https://youtu.be/bjb_EMsTDKI?feature=shared)
+- See our Python RAG from Scratch video on [multi vector retriever](https://youtu.be/gTCU9I6QqCE?feature=shared)
+
+:::
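+
+As a rough sketch of this decoupling (the summaries are hardcoded and a `Map` stands in for a real document store), you can index a concise representation of each document but return the full document for generation:
+
+```typescript
+import { OpenAIEmbeddings } from "@langchain/openai";
+import { MemoryVectorStore } from "langchain/vectorstores/memory";
+import { Document } from "@langchain/core/documents";
+
+// Full documents, keyed by id, standing in for a document store.
+const fullDocs = new Map<string, Document>([
+  ["doc1", new Document({ pageContent: "A long report about evaluation...", metadata: { id: "doc1" } })],
+  ["doc2", new Document({ pageContent: "A long report about retrieval...", metadata: { id: "doc2" } })],
+]);
+
+// Index short summaries (often LLM-generated in practice) instead of the full text.
+const summaries = [
+  new Document({ pageContent: "Summary: how to evaluate LLM applications.", metadata: { id: "doc1" } }),
+  new Document({ pageContent: "Summary: how retrieval pipelines fetch documents.", metadata: { id: "doc2" } }),
+];
+const vectorStore = await MemoryVectorStore.fromDocuments(summaries, new OpenAIEmbeddings());
+
+// Retrieve by summary similarity, but hand the full documents to the LLM.
+const hits = await vectorStore.similaritySearch("How do I evaluate my app?", 1);
+const docsForLlm = hits.map((hit) => fullDocs.get(hit.metadata.id));
+console.log(docsForLlm[0]?.pageContent);
+```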
+
+Fifth, consider ways to improve the quality of your similarity search itself. Embedding models compress text into fixed-length (vector) representations that capture the semantic content of the document. This compression is useful for search / retrieval, but puts a heavy burden on that single vector representation to capture the semantic nuance / detail of the document. In some cases, irrelevant or redundant content can dilute the semantic usefulness of the embedding.
+
+There are some additional tricks to improve the quality of your retrieval. Embeddings excel at capturing semantic information, but may struggle with keyword-based queries. Many [vector stores](/docs/integrations/retrievers/supabase-hybrid/) offer built-in [hybrid search](https://docs.pinecone.io/guides/data/understanding-hybrid-search) to combine keyword and semantic similarity, which marries the benefits of both approaches. Furthermore, many vector stores support [maximal marginal relevance](https://api.js.langchain.com/interfaces/langchain_core_vectorstores.VectorStoreInterface.html#maxMarginalRelevanceSearch), which attempts to diversify the results of a search to avoid returning similar and redundant documents.
+
+| Name | When to use | Description |
+| --- | --- | --- |
+| [Hybrid search](/docs/integrations/retrievers/supabase-hybrid/) | When combining keyword-based and semantic similarity. | Hybrid search combines keyword and semantic similarity, marrying the benefits of both approaches. |
+| [Maximal Marginal Relevance (MMR)](/docs/integrations/vectorstores/mongodb_atlas/#maximal-marginal-relevance) | When needing to diversify search results. | MMR attempts to diversify the results of a search to avoid returning similar and redundant documents. |
+
+#### Post-processing
+
+Sixth, consider ways to filter or rank retrieved documents. This is very useful if you are [combining documents returned from multiple sources](/docs/how_to/ensemble_retriever), since it can down-rank less relevant documents and / or [compress similar documents](/docs/how_to/contextual_compression/#more-built-in-compressors-filters).
+
+| Name | Index Type | Uses an LLM | When to Use | Description |
+| --- | --- | --- | --- | --- |
+| [Contextual Compression](/docs/how_to/contextual_compression/) | Any | Sometimes | If you are finding that your retrieved documents contain too much irrelevant information and are distracting the LLM. | This puts a post-processing step on top of another retriever and extracts only the most relevant information from retrieved documents. This can be done with embeddings or an LLM. |
+| [Ensemble](/docs/how_to/ensemble_retriever/) | Any | No | If you have multiple retrieval methods and want to try combining them. | This fetches documents from multiple retrievers and then combines them. |
+| [Re-ranking](/docs/integrations/document_compressors/cohere_rerank/) | Any | Yes | If you want to rank retrieved documents based upon relevance, especially if you want to combine results from multiple retrieval methods. | Given a query and a list of documents, re-ranking orders the documents from most to least semantically relevant to the query. |
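+
+As a small, self-contained sketch of the rank-based combination mentioned in the tip below, here is one way to implement Reciprocal Rank Fusion over several ranked result lists (the constant `k = 60` is a commonly used default, and documents are assumed to be de-duplicable by page content):
+
+```typescript
+import { Document } from "@langchain/core/documents";
+
+// Merge several ranked result lists into a single ranking using RRF.
+function reciprocalRankFusion(resultLists: Document[][], k = 60): Document[] {
+  const scores = new Map<string, { doc: Document; score: number }>();
+  for (const results of resultLists) {
+    results.forEach((doc, rank) => {
+      const entry = scores.get(doc.pageContent) ?? { doc, score: 0 };
+      entry.score += 1 / (k + rank + 1);
+      scores.set(doc.pageContent, entry);
+    });
+  }
+  return [...scores.values()]
+    .sort((a, b) => b.score - a.score)
+    .map((entry) => entry.doc);
+}
+```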
+
+:::tip
+
+See our Python RAG from Scratch video on [RAG-Fusion](https://youtu.be/77qELPbNgxA?feature=shared), an approach for post-processing across multiple queries: Rewrite the user question from multiple perspectives, retrieve documents for each rewritten question, and combine the ranks of multiple search result lists to produce a single, unified ranking with [Reciprocal Rank Fusion (RRF)](https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1).
+
+:::
+
+#### Generation
+
+**Finally, consider ways to build self-correction into your RAG system.** RAG systems can suffer from low quality retrieval (e.g., if a user question is out of the domain for the index) and / or hallucinations in generation. A naive retrieve-generate pipeline has no ability to detect or self-correct these kinds of errors. The concept of ["flow engineering"](https://x.com/karpathy/status/1748043513156272416) has been introduced [in the context of code generation](https://arxiv.org/abs/2401.08500): iteratively build an answer to a code question with unit tests to check and self-correct errors. Several works have applied this to RAG, such as Self-RAG and Corrective-RAG. In both cases, checks for document relevance, hallucinations, and / or answer quality are performed in the RAG answer generation flow.
+
+We've found that graphs are a great way to reliably express logical flows and have implemented ideas from several of these papers [using LangGraph](https://github.com/langchain-ai/langgraphjs/tree/main/examples/rag), as shown in the figure below (red - routing, blue - fallback, green - self-correction):
+
+- **Routing:** Adaptive RAG ([paper](https://arxiv.org/abs/2403.14403)). Route questions to different retrieval approaches, as discussed above.
+- **Fallback:** Corrective RAG ([paper](https://arxiv.org/pdf/2401.15884.pdf)). Fall back to web search if the docs are not relevant to the query.
+- **Self-correction:** Self-RAG ([paper](https://arxiv.org/abs/2310.11511)). Fix answers that contain hallucinations or don't address the question.
+
+![](/img/langgraph_rag.png)
+
+| Name | When to use | Description |
+| --- | --- | --- |
+| Self-RAG | When needing to fix answers with hallucinations or irrelevant content. | Self-RAG performs checks for document relevance, hallucinations, and answer quality during the RAG answer generation flow, iteratively building an answer and self-correcting errors. |
+| Corrective-RAG | When needing a fallback mechanism for low relevance docs. | Corrective-RAG includes a fallback (e.g., to web search) if the retrieved documents are not relevant to the query, ensuring higher quality and more relevant retrieval. |
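+
+One of the simplest self-correction checks these approaches perform is grading retrieved documents before generation. A rough, standalone sketch (the prompt and schema are hypothetical; in the cookbooks below this kind of check is wired into a LangGraph flow):
+
+```typescript
+import { z } from "zod";
+import { ChatOpenAI } from "@langchain/openai";
+import { ChatPromptTemplate } from "@langchain/core/prompts";
+
+// Grade whether a retrieved document is actually relevant to the question.
+const gradeSchema = z.object({
+  relevant: z.boolean().describe("Whether the document is relevant to the question."),
+});
+
+const gradePrompt = ChatPromptTemplate.fromTemplate(
+  `You are grading whether a retrieved document is relevant to a user question.
+
+Document:
+{document}
+
+Question: {question}`
+);
+
+const grader = gradePrompt.pipe(
+  new ChatOpenAI({ model: "gpt-3.5-turbo", temperature: 0 }).withStructuredOutput(gradeSchema)
+);
+
+const grade = await grader.invoke({
+  document: "LangSmith lets you trace and evaluate LLM applications.",
+  question: "How do I evaluate my chain?",
+});
+// If not relevant, rewrite the query, re-retrieve, or fall back to web search.
+console.log(grade.relevant);
+```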
+
+:::tip
+
+See several videos and cookbooks showcasing RAG with LangGraph:
+
+- [LangGraph Corrective RAG](https://www.youtube.com/watch?v=E2shqsYwxck)
+- [LangGraph combining Adaptive, Self-RAG, and Corrective RAG](https://www.youtube.com/watch?v=-ROS6gfYIts)
+- [Cookbooks for RAG using LangGraph.js](https://github.com/langchain-ai/langgraphjs/tree/main/examples/rag)
+
+:::
 ### Text splitting
diff --git a/docs/core_docs/docs/how_to/routing.mdx b/docs/core_docs/docs/how_to/routing.mdx
index 7dcf0df8feda..258e01b6675b 100644
--- a/docs/core_docs/docs/how_to/routing.mdx
+++ b/docs/core_docs/docs/how_to/routing.mdx
@@ -28,7 +28,6 @@ We'll illustrate both methods using a two step sequence where the first step cla
 You can use a custom function to route between different outputs. Here's an example:
 import CodeBlock from "@theme/CodeBlock";
-import BranchExample from "@examples/guides/expression_language/how_to_routing_runnable_branch.ts";
 import IntegrationInstallTooltip from "@mdx_components/integration_install_tooltip.mdx";
@@ -42,6 +41,14 @@ import FactoryFunctionExample from "@examples/guides/expression_language/how_to_
 {FactoryFunctionExample}
+## Routing by semantic similarity
+
+One especially useful technique is to use embeddings to route a query to the most relevant prompt. Here's an example:
+
+import SemanticSimilarityExample from "@examples/guides/expression_language/how_to_routing_semantic_similarity.ts";
+
+{SemanticSimilarityExample}
+
 ## Using a RunnableBranch
 A `RunnableBranch` is initialized with a list of (condition, runnable) pairs and a default runnable. It selects which branch by passing each condition the input it's invoked with. It selects the first condition to evaluate to True, and runs the corresponding runnable to that condition with the input.
@@ -50,6 +57,8 @@ If no provided conditions match, it runs the default runnable.
 Here's an example of what it looks like in action:
+import BranchExample from "@examples/guides/expression_language/how_to_routing_runnable_branch.ts";
+
 {BranchExample}
 ## Next steps
diff --git a/docs/core_docs/static/img/langgraph_rag.png b/docs/core_docs/static/img/langgraph_rag.png
new file mode 100644
index 000000000000..0dfcbb743a71
Binary files /dev/null and b/docs/core_docs/static/img/langgraph_rag.png differ
diff --git a/docs/core_docs/static/img/rag_landscape.png b/docs/core_docs/static/img/rag_landscape.png
new file mode 100644
index 000000000000..417d2e6c7eb0
Binary files /dev/null and b/docs/core_docs/static/img/rag_landscape.png differ
diff --git a/examples/src/guides/expression_language/how_to_routing_semantic_similarity.ts b/examples/src/guides/expression_language/how_to_routing_semantic_similarity.ts
new file mode 100644
index 000000000000..fd1be870b83e
--- /dev/null
+++ b/examples/src/guides/expression_language/how_to_routing_semantic_similarity.ts
@@ -0,0 +1,68 @@
+import { ChatAnthropic } from "@langchain/anthropic";
+import { OpenAIEmbeddings } from "@langchain/openai";
+import { StringOutputParser } from "@langchain/core/output_parsers";
+import { ChatPromptTemplate } from "@langchain/core/prompts";
+import { RunnableSequence } from "@langchain/core/runnables";
+import { cosineSimilarity } from "@langchain/core/utils/math";
+
+const physicsTemplate = `You are a very smart physics professor.
+You are great at answering questions about physics in a concise and easy to understand manner.
+When you don't know the answer to a question you admit that you don't know.
+Do not use more than 100 words.
+
+Here is a question:
+{query}`;
+
+const mathTemplate = `You are a very good mathematician. You are great at answering math questions.
+You are so good because you are able to break down hard problems into their component parts,
+answer the component parts, and then put them together to answer the broader question.
+Do not use more than 100 words.
+
+Here is a question:
+{query}`;
+
+const embeddings = new OpenAIEmbeddings({});
+
+const templates = [physicsTemplate, mathTemplate];
+const templateEmbeddings = await embeddings.embedDocuments(templates);
+
+const promptRouter = async (query: string) => {
+  const queryEmbedding = await embeddings.embedQuery(query);
+  const similarity = cosineSimilarity([queryEmbedding], templateEmbeddings)[0];
+  const isPhysicsQuestion = similarity[0] > similarity[1];
+  let promptTemplate: ChatPromptTemplate;
+  if (isPhysicsQuestion) {
+    console.log(`Using physics prompt`);
+    promptTemplate = ChatPromptTemplate.fromTemplate(templates[0]);
+  } else {
+    console.log(`Using math prompt`);
+    promptTemplate = ChatPromptTemplate.fromTemplate(templates[1]);
+  }
+  return promptTemplate.invoke({ query });
+};
+
+const chain = RunnableSequence.from([
+  promptRouter,
+  new ChatAnthropic({ model: "claude-3-haiku-20240307" }),
+  new StringOutputParser(),
+]);
+
+console.log(await chain.invoke("what's a black hole?"));
+
+/*
+  Using physics prompt
+*/
+
+/*
+  A black hole is a region in space where the gravitational pull is so strong that nothing, not even light, can escape from it. It is the result of the gravitational collapse of a massive star, creating a singularity surrounded by an event horizon, beyond which all information is lost. Black holes have fascinated scientists for decades, as they provide insights into the most extreme conditions in the universe and the nature of gravity itself. While we understand the basic properties of black holes, there are still many unanswered questions about their behavior and their role in the cosmos.
+*/
+
+console.log(await chain.invoke("what's a path integral?"));
+
+/*
+  Using math prompt
+*/
+
+/*
+  A path integral is a mathematical formulation in quantum mechanics used to describe the behavior of a particle or system. It considers all possible paths the particle can take between two points, and assigns a probability amplitude to each path. By summing up the contributions from all paths, it provides a comprehensive understanding of the particle's quantum mechanical behavior. This approach allows for the calculation of complex quantum phenomena, such as quantum tunneling and interference effects, making it a powerful tool in theoretical physics.
+*/