Add hallucination detection dataset operator #437
rootfs wants to merge 4 commits into OpenDCAI:main
New operators for creating hallucination detection datasets:

1. LongContextFilterOperator
   - Filter samples by token count (8K+, 12K+, 16K+, etc.)
   - Uses HuggingFace tokenizers
   - Adds num_tokens column to output
2. HallucinationInjectionOperator
   - Inject RAGTruth-style hallucinations using LLM
   - Supports: Evident Conflict, Evident Baseless, Subtle Baseless, Subtle Conflict
   - Parses <hal>...</hal> tags to extract span positions
   - Configurable hallucination ratio
3. SpanAnnotationOperator
   - Convert document-level labels to span-level using NLI
   - Uses DeBERTa-v3-mnli-fever-anli by default
   - Identifies contradicting sentences

Also includes:
- Example pipeline in dataflow/example/HallucinationDetectionPipeline/
- Unit tests in test/test_hallucination_detection.py
- README documentation

Related: llm-semantic-router/longcontext-haldetect dataset
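To make the span-extraction step concrete, here is a minimal sketch of how <hal>...</hal> tags could be parsed into character offsets. It is not the code from this PR: the function name extract_hal_spans and the span dictionary layout are assumptions made for illustration, and nested tags are not handled.

```python
import re

# Non-greedy match so that several tagged spans in one answer stay separate.
HAL_TAG = re.compile(r"<hal>(.*?)</hal>", re.DOTALL)


def extract_hal_spans(tagged: str):
    """Strip <hal> tags and return (clean_text, spans).

    Hypothetical helper: each span holds character offsets into clean_text.
    """
    clean_parts = []
    spans = []
    cursor = 0      # read position in the tagged input
    clean_len = 0   # length of the clean text assembled so far
    for match in HAL_TAG.finditer(tagged):
        before = tagged[cursor:match.start()]
        clean_parts.append(before)
        clean_len += len(before)
        inner = match.group(1)
        spans.append(
            {"start": clean_len, "end": clean_len + len(inner), "text": inner}
        )
        clean_parts.append(inner)
        clean_len += len(inner)
        cursor = match.end()
    clean_parts.append(tagged[cursor:])
    return "".join(clean_parts), spans


tagged_answer = "The ship left in 1912 and <hal>sank near Iceland</hal> in April."
clean_answer, hal_spans = extract_hal_spans(tagged_answer)
```

The offsets refer to the text with the tags removed, so a detector trained on the clean answer can use them directly as labels.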
scripts/generate_with_dataflow.py: Complete pipeline for generating long-context hallucination detection datasets using DataFlow operators

Features:
- Filters NarrativeQA by token count (8K-24K)
- Generates answers via vLLM API
- Injects RAGTruth-style hallucinations (50%)
- Outputs JSON with span annotations

Tested: Generated 50 samples (25 hal, 25 supported) in 12K-14K range
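The token-count filtering step can be sketched as follows. This is an illustration rather than the script's actual code: filter_by_token_count, the "context" key, and the whitespace counter are hypothetical stand-ins, and the tokenizer is passed in as a plain function so the sketch runs without downloading a model.

```python
from typing import Callable, Dict, List, Optional


def filter_by_token_count(
    samples: List[Dict],
    count_tokens: Callable[[str], int],
    text_key: str = "context",
    min_tokens: int = 8000,
    max_tokens: Optional[int] = None,
) -> List[Dict]:
    """Keep samples whose text falls inside [min_tokens, max_tokens].

    Each kept sample is copied and gains a num_tokens field.
    """
    kept = []
    for sample in samples:
        n = count_tokens(sample[text_key])
        if n < min_tokens:
            continue
        if max_tokens is not None and n > max_tokens:
            continue
        kept.append({**sample, "num_tokens": n})
    return kept


def whitespace_token_count(text: str) -> int:
    """Toy counter (assumption): a real run would use a HuggingFace tokenizer."""
    return len(text.split())


samples = [{"context": "short text"}, {"context": "word " * 10}]
kept = filter_by_token_count(
    samples, whitespace_token_count, min_tokens=5, max_tokens=20
)
```

With a HuggingFace tokenizer loaded through AutoTokenizer.from_pretrained, the counter could be something like lambda text: len(tokenizer.encode(text)), and the bounds would be set to the 8K-24K range mentioned above.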
Signed-off-by: Huamin Chen <hchen@redhat.com>
Hi, thank you very much for your interest in DataFlow and for submitting this PR.
Overall, we think this version still requires significant adjustments before it can fully align with the design patterns and conventions of DataFlow operators. One of the core goals in DataFlow’s operator design is to ensure that operators are clear, structured, and easy for the DataFlow Agent to understand, so that they can be composed, rewritten, or further optimized at the prompt level.
In addition, we aim to keep the operator set in the main repository as converged and minimal as possible, avoiding excessive overlap or inconsistent design patterns in the core operator library.
Based on these considerations, we would like to offer the following suggestions:
1. If you would like this operator to be included in the DataFlow main operator library
Some further refinement may be needed in the following areas:
1.1 Operator categorization
- Please clarify the functional category of this operator:
  - whether it can fit into the existing operator taxonomy,
  - or whether introducing a separate hallucination (or similar) category is truly necessary.
This decision has a direct impact on the overall structure and long-term maintainability of the operator system.
1.2 Operator conventions and consistency
- The prompt_template, get_desc, and related interfaces may need further refinement to comply with the coding and design conventions of DataFlow core operators;
- Existing operators and documentation in the repository can be used as references to ensure consistency in style and behavior:
  https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/core_text/generate/format_str_prompted_generator.py#L69
2. Alternative approach: publishing via the DataFlow Extension / Ecosystem
If incorporating this operator into the main repository feels too restrictive, we strongly recommend using the DataFlow Extension / Ecosystem approach instead:
- Please refer to our documentation on building DataFlow Extensions:
  https://opendcai.github.io/DataFlow-Doc/en/guide/df_ecosystem/
- By publishing the operator in a separate repository, you can adopt more flexible design and release conventions;
- We plan to maintain an Extension index document (like "Awesome projects using Dataflow") in the main repository or the documentation, and you will be able to submit a PR to add your Extension repository link to this index.
This model is similar to the PyTorch ecosystem: not every implementation needs to live in the core repository, and independent modules maintained in other repositories are encouraged.
We hope these suggestions are helpful, and we would be happy to further discuss the design and positioning of this operator. Thanks again for your contribution 🙌
The dataflow/example/* directory is for example datasets, not example Python scripts. Please refer to the existing implementation of DataFlow pipelines.
For third-party scripts, we recommend placing the pipeline scripts under dataflow/statics/thirdparty/HallucinationDetection/*. This fits our default usage, in which the dataflow init command generates all the starter scripts.
This README may also be placed in the same directory as the example pipeline (dataflow/statics/thirdparty/.../).
Since the lazy loader already exists, this file should be removed.
Same as above. Since the lazy loader already exists, this file should be removed.
Since the lazy loader already exists, this file should be removed.
@staticmethod
def get_desc(lang: str = "en") -> tuple:
    """Returns a description of the operator's functionality."""
    if lang == "zh":
To help DF-Agent understand how to execute an operator, we need a more detailed get_desc for each operator. It needs to describe every parameter of __init__ and run. You can reference
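As a rough illustration of the level of detail being asked for, a get_desc that documents every __init__ and run parameter might look like the sketch below. The class name, parameter names, and the choice to return a plain string are assumptions for this example; the repository's existing operators remain the authoritative reference for the expected format.

```python
class LongContextFilterSketch:
    """Hypothetical skeleton; only get_desc matters for this illustration."""

    def __init__(self, tokenizer_name: str, min_tokens: int = 8000):
        self.tokenizer_name = tokenizer_name
        self.min_tokens = min_tokens

    @staticmethod
    def get_desc(lang: str = "en") -> str:
        """Describe the operator and each parameter of __init__ and run."""
        if lang == "zh":
            return (
                "按 token 数量过滤样本，并写入 num_tokens 列。\n"
                "初始化参数：\n"
                "- tokenizer_name：用于计数的分词器名称\n"
                "- min_tokens：保留样本所需的最少 token 数，默认 8000\n"
                "运行参数：\n"
                "- input_key：待计数文本所在的列名\n"
                "- output_key：写入 token 数的列名，默认 num_tokens\n"
            )
        return (
            "Filters samples by token count and writes a num_tokens column.\n"
            "Init parameters:\n"
            "- tokenizer_name: name of the tokenizer used for counting\n"
            "- min_tokens: minimum token count to keep a sample, default 8000\n"
            "Run parameters:\n"
            "- input_key: name of the column holding the text to count\n"
            "- output_key: name of the column that receives the count, "
            "default num_tokens\n"
        )


english_desc = LongContextFilterSketch.get_desc("en")
```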
def run(
    self,
    storage: DataFlowStorage,
    input_key: str = "dataframe",
This should be the name of a column in the dataframe, not the whole dataframe.
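A minimal sketch of that convention, under stated assumptions: the operator loads the whole table through the storage object, and input_key and output_key only name columns in it. InMemoryStorage and TokenCountSketch are hypothetical stand-ins written for this example (a list of dicts plays the role of the dataframe), not DataFlow classes.

```python
class InMemoryStorage:
    """Stand-in for a DataFlow storage object; exposes only read and write."""

    def __init__(self, rows):
        self._rows = rows

    def read(self, output_type="dataframe"):
        # Return copies so the operator cannot mutate stored rows in place.
        return [dict(row) for row in self._rows]

    def write(self, rows):
        self._rows = rows
        return rows


class TokenCountSketch:
    """Hypothetical operator: counts whitespace tokens in one column."""

    def run(self, storage, input_key="context", output_key="num_tokens"):
        rows = storage.read("dataframe")  # the whole table comes from storage
        if rows and input_key not in rows[0]:
            raise KeyError(f"column {input_key!r} not found")
        for row in rows:
            # input_key selects a column; it is never the table itself.
            row[output_key] = len(row[input_key].split())
        storage.write(rows)
        return [output_key]


storage = InMemoryStorage([{"context": "a b c"}, {"context": "d e"}])
written_keys = TokenCountSketch().run(storage, input_key="context")
```

Keeping the table access inside storage.read and storage.write means the operator needs nothing from the storage class beyond those two calls.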
def get_desc(lang: str = "en") -> tuple:
    """Returns a description of the operator's functionality."""
    if lang == "zh":
        return (
scripts/generate_with_dataflow.py
We don't have this directory, so this file may be redundant.
@SunnyHaze sounds good! I'll go through this feedback and get back to you soon.
Signed-off-by: Huamin Chen <hchen@redhat.com>
SunnyHaze left a comment
Hi, thanks for your revision. However, the current implementation adds new functions to our core storage class. The operator implementations may also need revision to follow the read and write convention of DataFlow file storage.
Please don't modify the core FileStorage class. The operator should only call read and write on the storage class instead of adding new functions to it.
We are building a ModernBERT-based hallucination detector, inspired by LettuceDetect. The training dataset is based on RAGTruth, with LLM augmentation. In addition, the HaluEval dataset is converted to spans using NLI.
All these operators are included in this PR.
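The NLI-based conversion from document-level labels to spans can be sketched roughly as follows. This is an assumption-laden illustration, not the operator in this PR: annotate_spans, the regex sentence splitter, and toy_nli are hypothetical, and the NLI model is abstracted as a function that returns a label string so the sketch runs without loading DeBERTa.

```python
import re
from typing import Callable, Dict, List

# Naive sentence splitter: a run of non-terminator characters plus an
# optional terminator. Real data would need something more robust.
SENTENCE = re.compile(r"[^.!?]+[.!?]?")


def annotate_spans(
    context: str,
    answer: str,
    nli_label: Callable[[str, str], str],
) -> List[Dict]:
    """Return character spans of answer sentences that contradict the context."""
    spans = []
    for match in SENTENCE.finditer(answer):
        raw = match.group()
        sentence = raw.strip()
        if not sentence:
            continue
        if nli_label(context, sentence) == "contradiction":
            # Skip leading whitespace so offsets point at the sentence itself.
            start = match.start() + (len(raw) - len(raw.lstrip()))
            spans.append(
                {"start": start, "end": start + len(sentence), "text": sentence}
            )
    return spans


def toy_nli(premise: str, hypothesis: str) -> str:
    """Toy stand-in for an NLI model; flags one hard-coded case only."""
    return "contradiction" if "Spain" in hypothesis else "entailment"


context = "Paris is the capital of France."
answer = "Paris is in France. Paris is in Spain."
nli_spans = annotate_spans(context, answer, toy_nli)
```

In practice nli_label would wrap an NLI classifier with the context as premise and each answer sentence as hypothesis, and long contexts would likely need chunking to fit the model's input limit.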