Description
[ ] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug
I'm using TestsetGenerator.generate_with_langchain_docs to generate 3 questions per document from a list of 9 documents in a books.json file. However, only 2 of them successfully return questions. The others go through the scenario generation phase but end up generating 0 samples, without any exceptions being raised.
This issue also occurs with other input files, so it's not specific to a single dataset. All documents are long and contain meaningful content; none of them is trivially short.
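To back up the point about document length, the size of each entry can be printed before running the generator. The sketch below is illustrative only: length_report is a hypothetical helper (not part of Ragas), it assumes each item has "title" and "content" keys as in the reproduction code, and the whitespace word count is only a rough proxy for token count.

```python
def length_report(items, min_chars=50):
    """Return one (index, title, n_chars, n_words, too_short) tuple per item.

    min_chars mirrors the 50-character cutoff used in the reproduction code.
    """
    report = []
    for i, item in enumerate(items, start=1):
        content = item.get("content", "")
        n_chars = len(content)
        n_words = len(content.split())  # rough proxy, not a real token count
        report.append((i, item.get("title", ""), n_chars, n_words, n_chars < min_chars))
    return report


if __name__ == "__main__":
    # Made-up sample data; in practice pass the list loaded from books.json.
    sample = [
        {"title": "A", "content": "word " * 100},
        {"title": "B", "content": "tiny"},
    ]
    for row in length_report(sample):
        print(row)
```

Running this on the real input and attaching the output would show whether the failing documents differ in size from documents 1 and 3.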
The logs mention that summary and summary_embedding properties already exist in some nodes, which might be related, but it's unclear if that is interfering with the generation process.
Ragas version: 0.2.15
Python version: 3.13.2
Code to Reproduce
import json
from langchain_core.documents import Document
from ragas.testset import TestsetGenerator
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer
with open("books.json", "r") as f:
    data = json.load(f)

documents = [
    Document(
        page_content=item["content"],
        metadata={
            "title": item["title"],
            "page_ranges": item["page_ranges"],
        }
    )
    for item in data
]
azure_configs = {
    "base_url": "",  # your endpoint
    "model_deployment": "gpt-4o",
    "model_name": "gpt-4o",
    "embedding_deployment": "text-embedding-3-small",
    "embedding_name": "text-embedding-3-small",
}

generator_llm = LangchainLLMWrapper(AzureChatOpenAI(
    openai_api_version="2024-10-01-preview",
    azure_endpoint=azure_configs["base_url"],
    azure_deployment=azure_configs["model_deployment"],
    model=azure_configs["model_name"],
    validate_base_url=False,
))

generator_embeddings = LangchainEmbeddingsWrapper(AzureOpenAIEmbeddings(
    openai_api_version="2024-10-01-preview",
    azure_endpoint=azure_configs["base_url"],
    azure_deployment=azure_configs["embedding_deployment"],
    model=azure_configs["embedding_name"],
))
query_distribution = [(SingleHopSpecificQuerySynthesizer(llm=generator_llm), 1.0)]
for i, doc in enumerate(documents):
    # Skip trivially short documents.
    if len(doc.page_content) < 50:
        continue
    try:
        # A fresh generator per document, so each run processes one document only.
        doc_generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
        dataset = doc_generator.generate_with_langchain_docs(
            [doc],
            testset_size=3,
            query_distribution=query_distribution,
            with_debugging_logs=True,
            raise_exceptions=False
        )
        df = dataset.to_pandas()
        print(f"Successfully generated {len(df)} questions for document {i+1}")
    except Exception as e:
        print(f"Error processing document {i+1}: {e}")
Error trace
Applying SummaryExtractor: 50%|...|Property 'summary' already exists in node 'XXXX'. Skipping!
Property 'summary_embedding' already exists in node 'XXXX'. Skipping!
Generating Scenarios: 100%
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document X
Detailed Error trace
Processing document 1
Generating Scenarios: 100%|██████████████| 1/1 [00:02<00:00, 2.09s/it]
Generating Samples: 100%|██████████████| 3/3 [00:03<00:00, 1.01s/it]
Successfully generated 3 questions for document 1
Processing document 2
Applying SummaryExtractor: 50%|██████████████ | 1/2 [00:03<00:03, 3.38s/it]Property 'summary' already exists in node 'ca2b66'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]: 0%| | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node 'ca2b66'. Skipping!
Generating Scenarios: 100%|██████████████| 1/1 [00:01<00:00, 1.32s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 2
Processing document 3
Generating Scenarios: 100%|██████████████| 1/1 [00:03<00:00, 3.37s/it]
Generating Samples: 100%|██████████████ 4/4 [00:02<00:00, 1.84it/s]
Successfully generated 3 questions for document 3
Processing document 4
Applying SummaryExtractor: 50%██████████████ | 1/2 [00:02<00:02, 2.27s/it]Property 'summary' already exists in node 'a695e6'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]: 0%| | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node 'a695e6'. Skipping!
Generating Scenarios: 100%|██████████████| 1/1 [00:02<00:00, 2.55s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 4
Processing document 5
Applying SummaryExtractor: 50%██████████ | 1/2 [00:01<00:01, 1.70s/it]Property 'summary' already exists in node 'fd615d'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]: 0%| | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node 'fd615d'. Skipping!
Generating Scenarios: 100%|██████████████| 1/1 [00:01<00:00, 1.45s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 5
Processing document 6
Applying SummaryExtractor: 50%|██████████████ | 1/2 [00:02<00:02, 2.25s/it]Property 'summary' already exists in node 'e1e270'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]: 0%| | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node 'e1e270'. Skipping!
Generating Scenarios: 100%|██████████████| 1/1 [00:02<00:00, 2.90s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 6
Processing document 7
Applying SummaryExtractor: 50%|██████████████ | 1/2 [00:02<00:02, 2.79s/it]Property 'summary' already exists in node '4c63de'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]: 0%| | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node '4c63de'. Skipping!
Generating Scenarios: 100%|██████████████| 1/1 [00:01<00:00, 1.35s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 7
Processing document 8
Applying SummaryExtractor: 50%|██████████████ | 1/2 [00:02<00:02, 2.41s/it]Property 'summary' already exists in node '587336'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]: 0%| | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node '587336'. Skipping!
Generating Scenarios: 100%|█████████████| 1/1 [00:01<00:00, 1.31s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 8
Processing document 9
Applying SummaryExtractor: 50%██████████████ | 1/2 [00:02<00:02, 2.23s/it]Property 'summary' already exists in node '6b7911'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]: 0%| | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node '6b7911'. Skipping!
Generating Scenarios: 100%|██████████████| 1/1 [00:01<00:00, 1.32s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 9
Generated 6 questions in total
Expected behavior
All documents with sufficient content should yield 3 generated questions each. Instead, only two documents (1 and 3) return questions; the others silently skip sample generation despite seemingly valid content and processing steps.
Additional context
Using Azure OpenAI (gpt-4o) with SingleHopSpecificQuerySynthesizer.
The recurring "Property already exists" messages may be related, but they don't raise an error or explain the failure to generate questions.
Setting raise_exceptions=True does not expose the root cause either.
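To check whether the "already exists" messages line up with the empty results, the captured output can be tallied per document. This is a minimal sketch using only the standard library; tally is a hypothetical helper, and it assumes the log contains the "Processing document N" and "Successfully generated K questions for document N" lines shown in the detailed trace. The sample text is abridged from that trace with the progress bars removed.

```python
import re


def tally(log_text):
    """Map document number -> (saw_already_exists_message, questions_generated)."""
    results = {}
    saw_skip = False
    for line in log_text.splitlines():
        # A new "Processing document N" line starts a fresh record.
        if re.match(r"Processing document \d+", line):
            saw_skip = False
            continue
        if "already exists in node" in line:
            saw_skip = True
        m = re.search(r"Successfully generated (\d+) questions for document (\d+)", line)
        if m:
            results[int(m.group(2))] = (saw_skip, int(m.group(1)))
    return results


if __name__ == "__main__":
    # Abridged from the detailed trace (documents 1 and 2 only).
    sample = """Processing document 1
Successfully generated 3 questions for document 1
Processing document 2
Property 'summary' already exists in node 'ca2b66'. Skipping!
Property 'summary_embedding' already exists in node 'ca2b66'. Skipping!
Successfully generated 0 questions for document 2"""
    for doc, (skipped, n) in sorted(tally(sample).items()):
        print(f"document {doc}: already-exists message={skipped}, questions={n}")
```

In the full trace, every document that logs the "already exists" messages ends with 0 questions, while documents 1 and 3, which do not log them, end with 3. That is only a correlation and does not by itself establish the cause.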