[x] I checked the documentation and related resources and couldn't find an answer to my question.
Question:
Hello guys,
I'm using the TestsetGenerator with a list of personas to generate evaluation samples for my dataset. After generation, however, I can't find a reliable way to tell which persona was used to generate each question in the resulting testset. I need this for analysis: I want to correlate the generated questions and references with the specific persona that guided their creation.
I have tried the following approaches:
- Adding a custom field (e.g., persona_name) to the SingleTurnSample via a custom synthesizer, but this field is not preserved in the final testset or when using .to_pandas()/.to_list().
```python
from typing import Optional

from ragas.testset.synthesizers.single_hop.specific import (
    SingleHopSpecificQuerySynthesizer,
)
from ragas.testset.persona import Persona
from ragas.testset.graph import Node
from ragas.dataset_schema import SingleTurnSample
from pydantic import Field


class PersonaSingleTurnSample(SingleTurnSample):
    # Optional[str] rather than str, because the default is None.
    persona_name: Optional[str] = Field(default=None)


class PersonaAwareSingleHopSpecificQuerySynthesizer(SingleHopSpecificQuerySynthesizer):
    def _generate_question(
        self, node: Node, persona: Persona
    ) -> PersonaSingleTurnSample:
        original_sample = super()._generate_question(node, persona)
        # Copy every field of the generated sample and attach the persona name.
        return PersonaSingleTurnSample(
            user_input=original_sample.user_input,
            retrieved_contexts=original_sample.retrieved_contexts,
            reference_contexts=original_sample.reference_contexts,
            response=original_sample.response,
            multi_responses=original_sample.multi_responses,
            reference=original_sample.reference,
            rubrics=original_sample.rubrics,
            persona_name=persona.name,
        )
```
- Prefixing the persona name in the user_input string and then extracting it via regex, but this feels like a workaround rather than a robust solution and degrades the quality of the prompt.
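```python
# Inside the same synthesizer subclass as above.
def _generate_question(self, node: Node, persona: Persona) -> SingleTurnSample:
    original_sample = super()._generate_question(node, persona)
    return SingleTurnSample(
        # Persona name is smuggled into the question text itself.
        user_input=f"[{persona.name}]: {original_sample.user_input}",
        ...
    )
```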
- Attempting to associate personas by order or count, but the generator does not guarantee any alignment between the persona list and the generated samples.
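For reference, the extraction side of the prefix workaround (second attempt above) might look roughly like the sketch below. It assumes the exact `[name]: question` shape produced by the f-string in that attempt; `split_persona_prefix` is a hypothetical helper name and not part of ragas.

```python
import re
from typing import Optional, Tuple

# Matches "[persona name]: question". DOTALL lets the question span several lines.
_PREFIX = re.compile(r"^\[(?P<persona>[^\]]+)\]:\s*(?P<question>.*)$", re.DOTALL)


def split_persona_prefix(user_input: str) -> Tuple[Optional[str], str]:
    """Return (persona_name, question); persona_name is None if no prefix is found."""
    match = _PREFIX.match(user_input)
    if match is None:
        # Leave unprefixed inputs untouched instead of raising.
        return None, user_input
    return match.group("persona"), match.group("question")


persona, question = split_persona_prefix("[Data Engineer]: How do I shard a table?")
```

This also shows why the approach feels fragile: a persona name containing `]` would not round-trip, and any later step that rewrites `user_input` would silently drop the tag.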
Is there an official or recommended way to reliably track and retrieve the persona used for each generated question/sample in the testset?
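Absent an official mechanism, one possible stopgap is a side mapping from question text to persona name, filled in from the synthesizer override and joined back onto the exported rows afterwards. This is only a sketch under assumptions: `PersonaRegistry` is a hypothetical helper, not a ragas API, and it assumes each exported row is a dict with a `user_input` key whose text is unchanged since generation.

```python
from typing import Dict, Iterable, List, Optional


class PersonaRegistry:
    """Hypothetical helper: remembers which persona produced which question text."""

    def __init__(self) -> None:
        self._by_question: Dict[str, str] = {}

    def record(self, user_input: str, persona_name: str) -> None:
        # Intended to be called from the synthesizer override once the sample exists.
        self._by_question[user_input] = persona_name

    def lookup(self, user_input: str) -> Optional[str]:
        return self._by_question.get(user_input)

    def annotate(self, rows: Iterable[dict]) -> List[dict]:
        # Returns copies of the rows with a persona_name key added (None if unknown).
        return [{**row, "persona_name": self.lookup(row["user_input"])} for row in rows]


registry = PersonaRegistry()
registry.record("How do I rotate API keys?", "Security Auditor")

rows = [{"user_input": "How do I rotate API keys?"}, {"user_input": "Unseen question"}]
annotated = registry.annotate(rows)
```

The join is keyed on the question string, so two personas producing identical text would collide, which is part of why a built-in field would be preferable.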
Thank you for your help and for this great library!