Why Does Haystack Stop Grouping Related Chunks After Adding Metadata? #11216
Replies: 1 comment
The behaviour change is expected given the way you've added metadata: the retriever is now treating each chunk as a self-contained unit instead of letting siblings travel together. Two patterns to get the combined results back.

1. The cause: per-chunk metadata implicitly disabled the parent-document grouping. Haystack's default retrievers (

Quick check: print the raw retriever output before any postprocessing:

from haystack import Pipeline
result = retriever.run(query_embedding=embedding, top_k=10)
for d in result['documents']:
    print(d.meta.get('chunk_id'), d.score, d.content[:80])

If you see neighbouring

2. Switch to a parent-document retrieval pattern. The cleanest fix when each chunk has a

from haystack import component, Document
from typing import List

@component
class ChunkExpander:
    def __init__(self, document_store, neighbour_radius: int = 1):
        self.store = document_store
        self.r = neighbour_radius

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> dict:
        seen, out = set(), []
        for d in documents:
            idx = d.meta.get('index_id')
            cid = d.meta.get('chunk_id')
            if idx is None or cid is None:
                # no grouping metadata on this chunk: pass it through unchanged
                out.append(d)
                continue
            if (idx, cid) in seen:
                continue
            # pull this chunk + neighbours within the same index_id
            # (assumes chunk_id is stored as an integer, so range comparisons work)
            siblings = self.store.filter_documents(filters={
                "operator": "AND",
                "conditions": [
                    {"field": "meta.index_id", "operator": "==", "value": idx},
                    {"field": "meta.chunk_id", "operator": ">=", "value": cid - self.r},
                    {"field": "meta.chunk_id", "operator": "<=", "value": cid + self.r},
                ]
            })
            # keep neighbours in reading order; the store does not guarantee one
            for s in sorted(siblings, key=lambda s: s.meta['chunk_id']):
                key = (s.meta['index_id'], s.meta['chunk_id'])
                if key not in seen:
                    seen.add(key)
                    out.append(s)
        return {"documents": out}

Add this between the retriever and the prompt builder. Ranking still happens at chunk granularity, but the prompt now sees neighbouring sections together.

3. If you need actual section merging (concatenation), do it in a separate component.

@component
class SectionMerger:
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> dict:
        # group the incoming chunks by their parent index_id
        by_index = {}
        for d in documents:
            by_index.setdefault(d.meta['index_id'], []).append(d)
        merged = []
        for idx, chunks in by_index.items():
            # restore reading order, then join into one document per index_id
            chunks.sort(key=lambda c: c.meta['chunk_id'])
            content = "\n\n".join(c.content for c in chunks)
            # chunks fetched from the store have no score, so treat None as 0
            best_score = max(c.score or 0 for c in chunks)
            merged.append(Document(content=content, meta={'index_id': idx},
                                   score=best_score))
        merged.sort(key=lambda d: d.score, reverse=True)
        return {"documents": merged}

This collapses all retrieved chunks per

4. Increase If you currently call retriever with

5. Confirm metadata isn't filtering before retrieval. If you added the metadata as actual filter clauses (e.g.,

Recipe: dump raw retriever output to confirm rankings still co-locate -> add a
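To see what the expand-then-merge idea does without needing a document store, here is a minimal, framework-free sketch. It is only an illustration: `Chunk`, `expand_neighbours` and `merge_sections` are hypothetical stand-ins written for this example, not Haystack classes or functions, and the toy corpus is made up. It assumes `chunk_id` is an integer position within each `index_id`.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved document (not a Haystack class)."""
    index_id: str
    chunk_id: int
    content: str
    score: float = 0.0


def expand_neighbours(hits: List[Chunk], corpus: List[Chunk], radius: int = 1) -> List[Chunk]:
    """Return each hit plus the corpus chunks within `radius` positions in the same index_id."""
    by_key: Dict[Tuple[str, int], Chunk] = {(c.index_id, c.chunk_id): c for c in corpus}
    seen, out = set(), []
    for hit in hits:
        for cid in range(hit.chunk_id - radius, hit.chunk_id + radius + 1):
            key = (hit.index_id, cid)
            if key in by_key and key not in seen:
                seen.add(key)
                out.append(by_key[key])
    return out


def merge_sections(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Concatenate chunks per index_id in chunk_id order; returns (index_id, text) pairs."""
    grouped: Dict[str, List[Chunk]] = {}
    for c in chunks:
        grouped.setdefault(c.index_id, []).append(c)
    return [
        (idx, "\n\n".join(c.content for c in sorted(group, key=lambda c: c.chunk_id)))
        for idx, group in grouped.items()
    ]


# Toy corpus: one document split into four chunks, plus an unrelated document.
corpus = [
    Chunk("doc-a", 0, "Intro"),
    Chunk("doc-a", 1, "Setup"),
    Chunk("doc-a", 2, "Usage"),
    Chunk("doc-a", 3, "Limits"),
    Chunk("doc-b", 0, "Other"),
]
# Pretend the retriever returned only this single chunk.
hits = [Chunk("doc-a", 2, "Usage", score=0.9)]

expanded = expand_neighbours(hits, corpus, radius=1)
merged = merge_sections(expanded)
print(merged)
```

With a radius of 1, the single hit pulls in the chunk before and after it from the same `index_id`, and the merge step joins them into one block of text. The neighbour radius is the knob to tune: a larger value gives the prompt more surrounding context at the cost of more tokens.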
Need help!
I am using Haystack for retrieving relevant chunks from documents. When a user sends a query, the system returns the top 3 most relevant chunks from the complete document.
Now, I have added some metadata to the documents. For example, each section belongs to a specific chunk_id and index_id. After adding this metadata, when I run the same query again, the system only returns results at the section level.
Previously, the response could include multiple related parts together (for example, two sections combined in one answer). But now, it does not return those related parts together anymore—it only returns individual section-wise results.
Does anyone have an idea where I might be making a mistake? Or is this expected behavior? Is it possible to get combined results again?