Your first line of defense against low-quality documents in RAG systems.
Cerberus is a lightweight pipeline for filtering, normalizing, and ingesting documents into RAG (Retrieval-Augmented Generation) systems. It focuses on the "gatekeeper" step of the pipeline: ensuring that only high-quality, relevant, and deduplicated documents get chunked, embedded, and stored in your vector database.
By using topic modeling (BERTopic) and clustering techniques, Cerberus helps you:
- Filter low-quality or irrelevant documents before they pollute your vector store
- Normalize content from various sources (web, text files, etc.)
- Cluster documents by semantic similarity to identify coherent topics
- Infer topic assignments for new documents against a trained seed model
## Features

- Seed Model Training: Build a topic model from a corpus of high-quality seed documents
- Document Classification: Classify new documents against your trained model
- Web Adapter: Fetch and normalize content from URLs
- Flexible Configuration: YAML-based policy configuration for quality gating thresholds
- Statistics & Metrics: Track cluster-level statistics and document quality metrics
- CLI Interface: Simple command-line tools for all operations
## Requirements

- Python >= 3.12, < 3.13
## Installation

```bash
# Clone the repository
git clone https://github.com/arin-deloatch/Cerberus-DPL.git
cd Cerberus-DPL

# Install with uv
uv sync
```
## Quick Start

### Build a Seed Model

Train a topic model on a corpus of seed documents:

```bash
uv run cerberus build-seed /path/to/seed/documents \
  --output artifacts/ \
  --model ibm-granite/granite-embedding-30m-english \
  --min-cluster-size 5
```
Options:

- `input_path`: Directory containing seed documents (txt, md, html)
- `--output, -o`: Output directory for the trained model
- `--model, -m`: Embedding model name (HuggingFace-compatible)
- `--min-cluster-size, -k`: Minimum cluster size for HDBSCAN (default: 5)
- `--random-state`: Random seed for reproducibility (default: 42)
- `--overwrite, -f`: Overwrite an existing model directory
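Conceptually, `build-seed` wraps the standard BERTopic training flow. The following is a minimal sketch of the equivalent steps using BERTopic's public API directly, not Cerberus's actual internals; the `load_documents` helper and output path are illustrative:

```python
from pathlib import Path

from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from umap import UMAP

def load_documents(path: str) -> list[str]:
    # Hypothetical helper: read every txt/md/html file under the seed directory.
    return [
        p.read_text(errors="ignore")
        for ext in ("*.txt", "*.md", "*.html")
        for p in Path(path).rglob(ext)
    ]

docs = load_documents("/path/to/seed/documents")

topic_model = BERTopic(
    embedding_model=SentenceTransformer("ibm-granite/granite-embedding-30m-english"),
    umap_model=UMAP(random_state=42),                                 # --random-state
    hdbscan_model=HDBSCAN(min_cluster_size=5, prediction_data=True),  # --min-cluster-size
)
topics, probs = topic_model.fit_transform(docs)
topic_model.save("artifacts/bertopic_model")  # illustrative output path
```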
### Classify Documents

Infer topic assignments for new documents:

```bash
# Classify a document fetched from a URL
uv run cerberus infer \
  --model-path artifacts/bertopic_model_2025... \
  --url https://example.com/article \
  --show-terms 10 \
  --json-out results.json
```

```bash
# Classify raw text directly
uv run cerberus infer \
  --model-path artifacts/bertopic_model_2025... \
  --text "Your document text here..." \
  --show-terms 10
```
Options:

- `--model-path, -m`: Path to the trained BERTopic model
- `--url, -u`: URL to fetch and classify
- `--text, -t`: Direct text input to classify
- `--min-words`: Minimum word count warning threshold (default: 40)
- `--show-terms`: Show the top-N c-TF-IDF terms for the predicted topic (default: 10)
- `--use-hf-tokenizer/--no-hf-tokenizer`: Use the HuggingFace tokenizer for token counting
- `--json-out`: Save full JSON results to a file
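The same classification can also be scripted with BERTopic's standard load/transform API. This is an illustrative sketch, not the CLI's code; the model path is a placeholder for the timestamped directory that `build-seed` produces:

```python
from bertopic import BERTopic

MODEL_PATH = "artifacts/bertopic_model"  # placeholder: use your build-seed output

topic_model = BERTopic.load(MODEL_PATH)

doc = "Your document text here..."
topics, probs = topic_model.transform([doc])

topic_id = topics[0]
print(f"topic={topic_id} probability={probs[0]:.2f}")

# Top c-TF-IDF terms for the predicted topic (-1 is the HDBSCAN outlier cluster).
if topic_id != -1:
    for word, score in topic_model.get_topic(topic_id)[:10]:
        print(f"{word}: {score:.2f}")
```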
Example output:

```json
{
  "source": "https://example.com/article",
  "topic": 5,
  "probability": 0.87,
  "rejected": false,
  "token_count": 1247,
  "word_count": 982,
  "text_length": 6543,
  "top_terms": [
    {"word": "machine", "score": 0.45},
    {"word": "learning", "score": 0.42},
    ...
  ]
}
```

## Configuration

Cerberus uses environment variables and YAML configuration files:
- Environment Settings: Configure via `config.py` or environment variables
- Policy Configuration: Define quality gating thresholds in `policy.yaml`
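The actual schema of `policy.yaml` is defined in the repository. Purely as an illustration of the pattern, a gating policy could be modeled with Pydantic and loaded with PyYAML like this; every field name below is hypothetical, not Cerberus's real schema:

```python
import yaml
from pydantic import BaseModel

class GatingPolicy(BaseModel):
    # Hypothetical fields -- the real thresholds live in the repo's policy.yaml.
    min_probability: float = 0.5   # reject documents below this topic probability
    min_words: int = 40            # mirrors the CLI's --min-words default
    reject_outliers: bool = True   # drop documents assigned to topic -1

with open("policy.yaml") as f:
    policy = GatingPolicy(**yaml.safe_load(f))
```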
## Project Structure

```
cerberus_dpl/
├── adapters/     # Input adapters (web, file, etc.)
├── seed/         # BERTopic model training
├── metrics/      # Statistics and inference
├── utils/        # Utility functions
├── models.py     # Pydantic data models
├── config.py     # Settings management
├── logging.py    # Structured logging
└── cli.py        # Command-line interface
```
## Built With

- BERTopic: Topic modeling with HDBSCAN clustering and UMAP dimensionality reduction
- Sentence Transformers: Document embeddings
- Pydantic: Data validation and serialization
- Typer: CLI framework
Core dependencies:
- `bertopic` - Topic modeling
- `docling` - Document processing
- `hdbscan` - Density-based clustering
- `sentence-transformers` - Embeddings
- `scikit-learn` - Machine learning utilities
- `pydantic` - Data validation
- `typer` - CLI framework
Dev dependencies:
- `pyright` - Static type checking
- `ruff` - Linting and formatting
## Use Cases

- RAG Pipeline Gating: Filter documents before ingestion to maintain vector store quality
- Content Clustering: Organize large document collections by semantic topics
- Quality Control: Identify and reject low-quality or off-topic documents
- Topic Analysis: Understand the thematic structure of your document corpus
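For the gating use case, a downstream ingestion script could consume the `--json-out` results shown above and decide whether to chunk and embed a document. A hypothetical sketch, with an illustrative threshold:

```python
import json

# Gate a document using the inference result saved with --json-out.
with open("results.json") as f:
    result = json.load(f)

MIN_PROBABILITY = 0.5  # illustrative threshold, not a Cerberus default

accept = (
    not result["rejected"]
    and result["topic"] != -1                     # -1 is HDBSCAN's outlier bucket
    and result["probability"] >= MIN_PROBABILITY
)

if accept:
    print(f"ingest {result['source']}")  # hand off to chunking/embedding
else:
    print(f"skip {result['source']}")
```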
## License

Apache-2.0
This repository is under active development.
