
Skipping creating questions for documents in Testset Generation for RAG #2033

Open
@AbdelrahmanZeidan5


[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
I'm using TestsetGenerator.generate_with_langchain_docs to generate 3 questions per document from a list of 9 documents in a books.json file. However, only 2 of them successfully return questions. The others go through the scenario generation phase but end up generating 0 samples, without any exceptions being raised.

This issue also occurs with other input files, so it's not specific to a single dataset. All documents are long and contain meaningful content — they are not trivially short.
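To back up the claim that the inputs are not trivially short, a per-document length report can be attached to the issue. This is only a minimal sketch: `length_report` is a hypothetical helper, and the 50-character threshold simply mirrors the check in the reproduction script.

```python
def length_report(texts, min_chars=50):
    """Return character and word counts for each text, plus a below-threshold flag."""
    rows = []
    for i, text in enumerate(texts, start=1):
        rows.append({
            "doc": i,
            "chars": len(text),
            "words": len(text.split()),
            "below_min": len(text) < min_chars,  # same cutoff as the repro script
        })
    return rows


# Stand-in strings; in the real script these would be doc.page_content values.
sample_texts = ["short", "word " * 400]
report = length_report(sample_texts)
for row in report:
    print(row)
```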

The logs mention that summary and summary_embedding properties already exist in some nodes, which might be related, but it’s unclear if that is interfering with the generation process.

Ragas version: 0.2.15
Python version: 3.13.2

Code to Reproduce

import json
from langchain_core.documents import Document
from ragas.testset import TestsetGenerator
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer

with open("books.json", "r") as f:
    data = json.load(f)

documents = [
    Document(
        page_content=item["content"],
        metadata={
            "title": item["title"],
            "page_ranges": item["page_ranges"],
        }
    )
    for item in data
]

azure_configs = {
    "base_url": "",  # your endpoint
    "model_deployment": "gpt-4o",
    "model_name": "gpt-4o",
    "embedding_deployment": "text-embedding-3-small",
    "embedding_name": "text-embedding-3-small",
}

generator_llm = LangchainLLMWrapper(AzureChatOpenAI(
    openai_api_version="2024-10-01-preview",
    azure_endpoint=azure_configs["base_url"],
    azure_deployment=azure_configs["model_deployment"],
    model=azure_configs["model_name"],
    validate_base_url=False,
))

generator_embeddings = LangchainEmbeddingsWrapper(AzureOpenAIEmbeddings(
    openai_api_version="2024-10-01-preview",
    azure_endpoint=azure_configs["base_url"],
    azure_deployment=azure_configs["embedding_deployment"],
    model=azure_configs["embedding_name"],
))

query_distribution = [(SingleHopSpecificQuerySynthesizer(llm=generator_llm), 1.0)]

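# Generate questions one document at a time so that each result can be traced to a single input.
for i, doc in enumerate(documents):
    # Skip trivially short documents.
    if len(doc.page_content) < 50:
        continue
    try:
        # Create a fresh generator for each document.
        doc_generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
        dataset = doc_generator.generate_with_langchain_docs(
            [doc],
            testset_size=3,
            query_distribution=query_distribution,
            with_debugging_logs=True,
            raise_exceptions=False
        )
        df = dataset.to_pandas()
        # Report how many questions were actually produced; this prints 0 for the failing documents.
        print(f"Successfully generated {len(df)} questions for document {i+1}")
    except Exception as e:
        print(f"Error processing document {i+1}: {e}")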

Error trace

Applying SummaryExtractor:  50%|...|Property 'summary' already exists in node 'XXXX'. Skipping!
Property 'summary_embedding' already exists in node 'XXXX'. Skipping!
Generating Scenarios: 100%
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document X

Detailed Error trace

Processing document 1
Generating Scenarios: 100%|██████████████| 1/1 [00:02<00:00,  2.09s/it]
Generating Samples: 100%|██████████████| 3/3 [00:03<00:00,  1.01s/it]
Successfully generated 3 questions for document 1
Processing document 2
Applying SummaryExtractor:  50%|██████████████          | 1/2 [00:03<00:03,  3.38s/it]Property 'summary' already exists in node 'ca2b66'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|               | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node 'ca2b66'. Skipping!
Generating Scenarios: 100%|██████████████| 1/1 [00:01<00:00,  1.32s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 2
Processing document 3
Generating Scenarios: 100%|██████████████| 1/1 [00:03<00:00,  3.37s/it]
Generating Samples: 100%|██████████████ 4/4 [00:02<00:00,  1.84it/s]
Successfully generated 3 questions for document 3
Processing document 4
Applying SummaryExtractor:  50%██████████████                | 1/2 [00:02<00:02,  2.27s/it]Property 'summary' already exists in node 'a695e6'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node 'a695e6'. Skipping!
Generating Scenarios: 100%|██████████████| 1/1 [00:02<00:00,  2.55s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 4
Processing document 5
Applying SummaryExtractor:  50%██████████               | 1/2 [00:01<00:01,  1.70s/it]Property 'summary' already exists in node 'fd615d'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node 'fd615d'. Skipping!
Generating Scenarios: 100%|██████████████| 1/1 [00:01<00:00,  1.45s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 5
Processing document 6
Applying SummaryExtractor:  50%|██████████████                  | 1/2 [00:02<00:02,  2.25s/it]Property 'summary' already exists in node 'e1e270'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|        | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node 'e1e270'. Skipping!
Generating Scenarios: 100%|██████████████| 1/1 [00:02<00:00,  2.90s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 6
Processing document 7
Applying SummaryExtractor:  50%|██████████████              | 1/2 [00:02<00:02,  2.79s/it]Property 'summary' already exists in node '4c63de'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|       | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node '4c63de'. Skipping!
Generating Scenarios: 100%|██████████████| 1/1 [00:01<00:00,  1.35s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 7
Processing document 8
Applying SummaryExtractor:  50%|██████████████              | 1/2 [00:02<00:02,  2.41s/it]Property 'summary' already exists in node '587336'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|         | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node '587336'. Skipping!
Generating Scenarios: 100%|█████████████| 1/1 [00:01<00:00,  1.31s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 8
Processing document 9
Applying SummaryExtractor:  50%██████████████         | 1/2 [00:02<00:02,  2.23s/it]Property 'summary' already exists in node '6b7911'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|     | 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node '6b7911'. Skipping!
Generating Scenarios: 100%|██████████████| 1/1 [00:01<00:00,  1.32s/it]
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 9
Generated 6 questions in total
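In the trace above, every document that logged the "already exists" messages produced 0 samples, while documents 1 and 3 logged none and produced 3 each. The sketch below shows one way to tabulate that pattern from a captured log. It is an illustration only: `summarize` and the regular expressions are hypothetical helpers written for this log format, and the excerpt is an abbreviated copy of the first two documents with the progress bars shortened.

```python
import re

# Abbreviated excerpt of the trace (progress bars shortened).
LOG_EXCERPT = """\
Processing document 1
Generating Samples: 100%| 3/3 [00:03<00:00,  1.01s/it]
Successfully generated 3 questions for document 1
Processing document 2
Applying SummaryExtractor:  50%| 1/2 [00:03<00:03,  3.38s/it]Property 'summary' already exists in node 'ca2b66'. Skipping!
Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%| 0/2 [00:00<?, ?it/s]Property 'summary_embedding' already exists in node 'ca2b66'. Skipping!
Generating Samples: 0it [00:00, ?it/s]
Successfully generated 0 questions for document 2
"""

START = re.compile(r"^Processing document (\d+)")
SKIP = re.compile(r"Property '(\w+)' already exists in node '(\w+)'")
DONE = re.compile(r"Successfully generated (\d+) questions for document (\d+)")


def summarize(log_text):
    """Map each document number to its sample count and skipped property names."""
    summary = {}
    current = None
    for line in log_text.splitlines():
        start = START.match(line)
        if start:
            current = int(start.group(1))
            summary[current] = {"generated": None, "skipped": []}
            continue
        if current is None:
            continue
        skip = SKIP.search(line)
        if skip:
            summary[current]["skipped"].append(skip.group(1))
        done = DONE.search(line)
        if done:
            summary[current]["generated"] = int(done.group(1))
    return summary


summary = summarize(LOG_EXCERPT)
for doc_number, info in summary.items():
    print(doc_number, info)
```

This only makes the correlation easier to see; it does not show that the skipped properties cause the empty results.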

Expected behavior
All documents with sufficient content should yield 3 generated questions each. Instead, only two documents (1 and 3) return questions; the others silently skip sample generation despite seemingly valid content and processing steps.
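Because the empty results raise no exception, the loop has to check the counts itself to surface the problem. The following is a minimal sketch of that check, not part of the original script: `run_per_document` is a hypothetical helper, and `generate` stands in for a callable that would wrap `generate_with_langchain_docs(...).to_pandas()` for one document. A fake generator is used here so the example runs without any API access.

```python
def run_per_document(documents, generate, expected=3):
    """Call generate(doc) per document and collect those that fall short of `expected`."""
    total = 0
    shortfalls = []  # (1-based document index, number of samples produced)
    for i, doc in enumerate(documents, start=1):
        try:
            count = len(generate(doc))
        except Exception as exc:
            print(f"Error processing document {i}: {exc}")
            count = 0
        total += count
        if count < expected:
            shortfalls.append((i, count))
    return total, shortfalls


def fake_generate(doc):
    """Stand-in generator: returns three items for 'ok' inputs, none otherwise."""
    return ["question"] * 3 if doc.startswith("ok") else []


total, shortfalls = run_per_document(["ok a", "empty b", "ok c"], fake_generate)
print(f"Generated {total} questions in total; shortfalls: {shortfalls}")
```

Recording the shortfalls explicitly turns a silent "0 questions" line into a list of documents that can be inspected or retried.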

Additional context
Using Azure OpenAI (gpt-4o) with SingleHopSpecificQuerySynthesizer.

The recurring "Property already exists" messages may be related, but they don't raise an error or explain the failure to generate questions.

Setting raise_exceptions=True does not help expose the root cause.

Labels: bug (Something isn't working), module-testsetgen (Module testset generation)