Contextual compression for RAG-based applications.
- Clone this repo
- Install the requirements:
```
pip install -r venv_requirements.txt
```
- Make a .env file in the root folder with the following credentials (a sketch of how they can be loaded is shown after these setup steps):
```
API_KEY='IBM_cloud_API_Key'           # or any other LLM API key of your choice; initialize accordingly
PROJECT_ID=<Watsonx_Project_id>
IBM_CLOUD_URL='IBM cloud url'         # change the URL according to your region
GENAI_KEY=<BAM_API_Key>
GENAI_API=https://bam-api.res.ibm.com
```
- Run the main.py file, or experiment with any of the other files
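A minimal sketch of how these environment variables might be loaded at startup, assuming python-dotenv is among the installed requirements (the variable names follow the .env template above; use whichever credentials your chosen LLM provider needs):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads the .env file in the project root

# Pick up the credentials required by your LLM provider
api_key = os.getenv("API_KEY")
project_id = os.getenv("PROJECT_ID")
ibm_cloud_url = os.getenv("IBM_CLOUD_URL")
genai_key = os.getenv("GENAI_KEY")
genai_api = os.getenv("GENAI_API")
```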
Limitations
One problem with this approach is that when you ingest data into your document storage system, you often don’t know what specific queries will be used to retrieve those documents. In our notes Q&A example, we simply partitioned our text into equally sized chunks. That means that when we get a specific user question and retrieve a document, even if the document has some relevant text it likely has some irrelevant text as well.
Inserting irrelevant information into the LLM prompt is bad because:
- It might distract the LLM from the relevant information
- It takes up precious space that could be used to insert other relevant information.
Methods to improve
- Chunking strategies
- Cleaning data
- Prompt engineering
- Picking a better, domain-specific embedding model
- Fine-tuning the embedding model
- Compressing the context
The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned. “Compressing” here refers to both compressing the contents of an individual document and filtering out documents wholesale.
Approaches (experimented)
- LangChain: base_compressor options
  - LLMChainExtractor
  - EmbeddingsFilter
  - DocumentCompressorPipeline
- LLMLingua
  - LongLLMLinguaPostprocessor (takes time to process)
- In-context auto-encoder: https://arxiv.org/pdf/2307.06945.pdf
- Semantic compression using topic modeling: https://arxiv.org/pdf/2312.09571.pdf
- and many more...
Note: the last 2 methods need pre-trained models.
How do these work?
To use the Contextual Compression Retriever, you’ll need:
- a base retriever &
- a Document Compressor
The Contextual Compression Retriever passes queries to the base retriever, takes the initial documents, and passes them through the Document Compressor. The Document Compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.
LangChain
Create a base retriever using any vector store (FAISS is used here); this base retriever is then wrapped with a document compressor, as shown in the examples below.
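A minimal sketch of such a base retriever, assuming FAISS and a BGE embedding model through LangChain (the file name, chunk size, and k are illustrative choices, not the repo's exact settings):

```python
from langchain.document_loaders import TextLoader
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

# Load and chunk the source documents
docs = TextLoader("notes.txt").load()
chunks = CharacterTextSplitter(chunk_size=500, chunk_overlap=0).split_documents(docs)

# Index the chunks in FAISS and expose them as a retriever
embeddings = HuggingFaceBgeEmbeddings()
retriever = FAISS.from_documents(chunks, embeddings).as_retriever(search_kwargs={"k": 4})
```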
Contextual compression with an LLMChainExtractor
We’ll wrap our base retriever with a ContextualCompressionRetriever. We’ll add an LLMChainExtractor, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Wrap the base retriever with an LLM-based extractor
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.get_relevant_documents(query_str)
```
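To inspect what survives compression, the returned documents can be printed with a small helper (a local convenience function, not part of LangChain's API):

```python
def pretty_print_docs(docs):
    """Print the remaining content of each compressed document."""
    print("\n\n".join(f"Document {i + 1}:\n\n{d.page_content}" for i, d in enumerate(docs)))

pretty_print_docs(compressed_docs)
```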
Contextual compression with an EmbeddingsFilter
Making an extra LLM call over each retrieved document is expensive and slow. The EmbeddingsFilter provides a cheaper and faster option by embedding the documents and query and only returning those documents that have sufficiently similar embeddings to the query.
```python
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter

embeddings = HuggingFaceBgeEmbeddings()  # could be any embedding model of your choice
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter, base_retriever=retriever
)
compressed_docs = compression_retriever.get_relevant_documents(query_str)
```
Stringing compressors and document transformers together - DocumentCompressorPipeline
Using the DocumentCompressorPipeline we can also easily combine multiple compressors in sequence. Along with compressors, we can add BaseDocumentTransformers to our pipeline, which don’t perform any contextual compression but simply perform some transformation on a set of documents. For example, TextSplitters can be used as document transformers to split documents into smaller pieces, and the EmbeddingsRedundantFilter can be used to filter out redundant documents based on embedding similarity between documents.
- splitter (creates small chunks)
- redundant filter (removes near-duplicate docs based on embedding similarity)
- relevant filter (keeps docs relevant to the query)
```python
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline, EmbeddingsFilter
from langchain.text_splitter import CharacterTextSplitter

embeddings = HuggingFaceBgeEmbeddings()
splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0, separator=". ")
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline_compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.get_relevant_documents(query_str)
```
LLMLingua:
LongLLMLinguaPostprocessor
There are 3 components to it:
a) Budget controller (uses smaller LLMs, e.g. GPT-2 or LLaMA): computes the perplexity, i.e. the "surprise factor", of each context chunk or demonstration, calculated as the exponential of the negative mean log-likelihood of the tokens in the input sequence, and keeps the highest-scoring chunks (see the perplexity sketch after this list).
b) Iterative token-level prompt compression (ITPC):
- segment the prompt
- use the small LLM to determine the perplexity distribution across these segments
- retain tokens with high perplexity, ensuring key information is preserved by taking the conditional dependence between tokens into account
c) An optional instruction-tuning-based method that aligns the distribution patterns of the small language model with those of the large one.
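A minimal sketch of the perplexity scoring behind the budget controller (a), assuming GPT-2 through Hugging Face transformers; it illustrates the idea only, not LLMLingua's exact implementation, and chunks is a placeholder list of context strings:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def chunk_perplexity(text: str) -> float:
    """Perplexity = exp of the mean negative log-likelihood of the chunk's tokens."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids the model returns the mean cross-entropy (negative log-likelihood)
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Placeholder chunks; in practice these are the retrieved documents or demonstrations
chunks = ["first retrieved chunk ...", "second retrieved chunk ..."]

# Keep the most "surprising" (highest-perplexity) chunks within the token budget
ranked = sorted(chunks, key=chunk_perplexity, reverse=True)
```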
```python
# Create a base retriever and retrieve base nodes <>

# Setup LLMLingua
from llama_index.indices.postprocessor import LongLLMLinguaPostprocessor

node_postprocessor = LongLLMLinguaPostprocessor(
    device_map='cpu',
    instruction_str="Given the context, please answer the final question",
    target_token=300,
    rank_method="longllmlingua",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # enable document reorder
        "dynamic_context_compression_ratio": 0.3,
    },
)

from llama_index.indices.query.schema import QueryBundle

new_retrieved_nodes = node_postprocessor.postprocess_nodes(
    retrieved_nodes, query_bundle=QueryBundle(query_str)
)
```
- MS-MARCO-200, SQuAD-200
Pull requests to contribute to this asset are welcome ✨