Detect and automatically merge duplicate Freshservice tickets using state-of-the-art sentence embeddings, FAISS Approximate-Nearest-Neighbours (ANN) search and department-aware filtering with domain-specific heuristics.
The repository contains everything needed to train a similarity index from your historical tickets and serve a low-latency API that can be wired to a Freshservice webhook.
• End-to-end pipeline: data harvest → embedding → ANN index → real-time duplicate detection
• Language-agnostic: powered by Sentence-Transformers
• ⚡ <10 ms query latency with in-memory FAISS
• Hot reload of new snapshots without downtime
• Smart merge workaround via the Freshservice REST API (adds notes + closes the duplicate)
• Department-aware duplicate detection: only compares tickets within the same department
• Human-in-the-loop review console (Streamlit UI) for manual validation
• Active learning with pair labeling and confidence tracking
• Immutable snapshots with atomic updates and rollback capability
• Prometheus metrics & structured, colourised logging
• 100% reproducible via Docker Compose: no local Python required
• Extensive test suite (pytest) & modular design for rapid iteration
```
┌─────────────────────────────────────┐
│ Trainer (batch)                     │
│                                     │
│ 1. Fetch N days of tickets          │
│ 2. Embed unseen tickets             │
│ 3. Build FAISS IP index             │
│ 4. Write immutable snapshot         │
└───────────────┬─────────────────────┘
                │ shared volume (/shared)
                ▼
┌─────────────────────────────────────┐
│ Serve (FastAPI)                     │
│                                     │
│ • /webhook      → duplicate check   │
│ • /review       → human review UI   │
│ • /candidates   → active learning   │
│ • /label        → pair feedback     │
│ • /reloadIndex  → hot reload        │
│ • /metrics      → Prometheus        │
│ • /healthz      → readiness         │
└───────────────┬─────────────────────┘
                │
                ▼
┌─────────────────────────────────────┐
│ Review Console (Streamlit)          │
│                                     │
│ • Manual duplicate validation       │
│ • Confidence-based filtering        │
│ • Active learning feedback          │
│ • Batch review workflows            │
└─────────────────────────────────────┘
```
Immutable snapshots are laid out as:

```
/shared/
├── snapshots/2024-06-01T12-00-00Z/
│   ├── vectors.faiss   # binary FAISS index
│   └── meta.json       # tickets + metadata
├── snapshots/2024-06-01T18-30-15Z/
│   ├── vectors.faiss   # newer snapshot
│   └── meta.json       # with more tickets
└── current → snapshots/2024-06-01T18-30-15Z/   # symlink flipped atomically
```

Serve mounts /shared read-only; the symlink ensures zero-downtime upgrades and easy rollbacks.
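The atomic flip of the `current` symlink can be sketched as follows. This is a minimal illustration of the pattern, not the trainer's actual code; the function and temp-link names are made up:

```python
import os

def flip_current_symlink(shared_dir: str, snapshot_name: str) -> None:
    """Atomically repoint <shared_dir>/current at snapshots/<snapshot_name>.

    The new link is created under a temporary name first; os.replace()
    then swaps it in atomically (POSIX rename semantics), so readers of
    /shared/current always see either the old or the new snapshot,
    never a missing link.
    """
    target = os.path.join("snapshots", snapshot_name)   # relative link target
    tmp_link = os.path.join(shared_dir, ".current.tmp")
    if os.path.lexists(tmp_link):                       # clean up a stale temp link
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, os.path.join(shared_dir, "current"))
```

Because the swap is a single rename, a crash mid-update leaves the old snapshot live, and rollback is just flipping the link back.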
| Path | Purpose |
|---|---|
| `src/` | Library code shared by trainer & API |
| `src/similarity/` | Core similarity detection & active learning |
| `src/freshservice/` | Freshservice API client & merge logic |
| `src/ann_index.py` | FAISS index loader with hot reload |
| `src/review_console.py` | Streamlit-based manual review UI |
| `src/webhook_server.py` | FastAPI server with all endpoints |
| `trainer/` | Batch job entry point & Dockerfile |
| `serve/` | Startup script & Dockerfile for FastAPI |
| `tests/` | pytest test suite |
| `requirements.txt` | Runtime & training dependencies |
| `dev-requirements.txt` | Development & testing dependencies |
| `docker-compose.yml` | One-command deployment |
1. Copy `.env.example` → `.env` and fill at least:

   ```
   FS_DOMAIN=acme.freshservice.com
   FS_API_KEY=xxxxxxxxxxxxxxxx
   GROUP_IDS=1234,5678
   ```

2. Run:

   ```
   docker compose up --build
   ```

   • The trainer runs once, writes the first snapshot and exits.
   • The api container waits until the snapshot is present, then exposes HTTP :8000.

3. Send a test request:

   ```
   curl -X POST http://localhost:8000/webhook \
     -H "Content-Type: application/json" \
     -d '{"ticket_id": 809188}'
   ```

   Response:

   ```json
   {"merged": true}
   ```

All options are supplied via environment variables (Docker reads them from `.env`).
| Variable | Description | Default |
|---|---|---|
| FS_DOMAIN | Freshservice sub-domain (without https) | – (required) |
| FS_API_KEY / FS_API_KEY_FILE | API key or path to a file containing the key | – (required) |
| GROUP_IDS / GROUP_ID | Comma-separated list of ticket group IDs | – (required) |
| INDEX_PATH | Shared volume for snapshots | /shared |
| DAYS_BACK | Look-back window for trainer | 60 |
| EMBEDDING_MODEL | sentence-transformers model name | all-MiniLM-L6-v2 |
| SIMILARITY_THRESHOLD | Probability cut-off for automatic merge (serve) | 0.9 |
| RELOAD_TOKEN | Secret header for /reloadIndex | (unset) |
| WEBHOOK_SECRET | HMAC secret for webhook authentication | (unset) |
| HF_TOKEN | Hugging Face access token (avoids rate limits) | (unset) |
| HF_TOKEN_FILE | Path to file containing HF token | (unset) |
| MODEL_CACHE_DIR | Directory for caching downloaded models | /app/models |
| FS_PER_PAGE | Pagination size when fetching tickets | 100 |
| FS_MAX_PAGES | Max pages per trainer run | 10 |
| UNMERGED_LOG | Path for tickets that were not auto-merged | /data/unmerged_tickets.log |
| REVIEW_USER | Username for review console attribution | anonymous |
| REVIEW_SIM_LOWER | Lower similarity threshold for review candidates | 0.65 |
| REVIEW_SIM_UPPER | Upper similarity threshold for review candidates | 0.90 |
| REVIEW_MAX | Maximum candidates to show in review console | 100 |
| REVIEW_DAYS_BACK | Days to look back for review console tickets | 60 |
To avoid rate limiting and enable offline model usage, configure Hugging Face authentication:
```
# Add to your .env file
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx

# Store token in a file (useful for Docker secrets)
echo "hf_xxxxxxxxxxxxxxxxxxxxxxxxx" > /path/to/hf_token.txt
HF_TOKEN_FILE=/path/to/hf_token.txt
```

Models are automatically cached after first download:

```
# Default cache directory (will be created if it doesn't exist)
MODEL_CACHE_DIR=/app/models

# In Docker, mount this as a volume for persistence:
# volumes:
#   - model_cache:/app/models
```

Benefits:
- Avoids rate limits when downloading models from Hugging Face
- Faster startup after first run (models loaded from cache)
- Offline usage once models are cached
- Bandwidth savings in production deployments
When WEBHOOK_SECRET is set, the /webhook endpoint requires HMAC-SHA256 authentication:
```
# Generate signature
payload='{"ticket_id": 12345}'
signature=$(echo -n "$payload" | openssl dgst -sha256 -hmac "$WEBHOOK_SECRET" -hex | cut -d" " -f2)

# Send authenticated request
curl -X POST http://localhost:8000/webhook \
  -H "Content-Type: application/json" \
  -H "X-Webhook-Signature: $signature" \
  -d "$payload"
```

If WEBHOOK_SECRET is unset, authentication is disabled (backward compatibility).
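Server-side verification is the mirror image of the openssl recipe above. A minimal sketch (the function name is assumed, not the shipped handler):

```python
import hashlib
import hmac

def verify_signature(payload: bytes, header_sig: str, secret: str) -> bool:
    """Validate an X-Webhook-Signature header against the raw request body.

    Mirrors the openssl pipeline: hex-encoded HMAC-SHA256 over the exact
    payload bytes. hmac.compare_digest() avoids timing side channels.
    """
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_sig)
```

Note that the signature must be computed over the raw body bytes, before any JSON parsing, or re-serialisation differences will break verification.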
Run manually (outside Docker) with your local Python:
```
pip install -r requirements.txt
python -m trainer.train_freshservice
```

• Only new tickets are embedded: the script loads previous metadata for incremental updates.
• After finishing, the current symlink is updated atomically.
• Log output is written to the console; failures exit non-zero (ideal for CI schedulers).
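The incremental behaviour can be illustrated like this. It is a sketch with a pluggable `embed_fn` and an in-memory `cache`; the real trainer uses sentence-transformers for embedding and feeds the normalised matrix into a FAISS IndexFlatIP:

```python
import numpy as np

def incremental_embed(tickets, cache, embed_fn):
    """Embed only tickets whose IDs are not yet in `cache`, then return
    all vectors in a stable ID order, L2-normalised so that inner
    product equals cosine similarity (what an IndexFlatIP expects).

    `tickets` are dicts with "id" and "text"; `cache` maps id -> vector
    and stands in for the metadata the trainer reloads between runs.
    """
    for t in tickets:
        if t["id"] not in cache:                        # skip already-seen tickets
            v = np.asarray(embed_fn(t["text"]), dtype="float32")
            cache[t["id"]] = v / np.linalg.norm(v)      # normalise for IP search
    ids = sorted(cache)
    return ids, np.stack([cache[i] for i in ids])
```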
Open `src/freshservice/freshservice_client.py` and tweak `_calculate_duplicate_probability()`, or adjust `ignore_subject_phrases` in `src/config.py`.
| Method | Path | Body | Description |
|---|---|---|---|
| POST | `/webhook` | `{ "ticket_id": 123 }` | Fetches the ticket, compares it to the ANN index within the same department, merges if probability ≥ threshold. Returns `{ "merged": bool }`. |
| GET | `/review` | – | Human-in-the-loop review UI (HTML page). |
| GET | `/candidates` | `?max_pairs=25` | Returns potential duplicate pairs for manual review (JSON). |
| POST | `/label` | `{ "ticket1_id": "123", "ticket2_id": "456", "is_duplicate": true }` | Label a ticket pair for active learning. |
| POST | `/reloadIndex` | – | Reload snapshot & embedding model. Requires header `X-Reload-Token`. |
| GET | `/healthz` | – | Readiness probe, returns vector count. |
| GET | `/metrics` | – | Prometheus/OpenMetrics exposition. |
1. Trainer writes a new snapshot.
2. An external process calls `/reloadIndex` (or rely on the watchdog).
3. Serve clears internal lru-caches and memory-maps the new FAISS file, all in <200 ms.
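The caching mechanism behind step 3 can be sketched as follows. It resolves the `current` symlink instead of loading FAISS files (the real loader in `src/ann_index.py` would call `faiss.read_index()` at that path), which is enough to show the pattern:

```python
import functools
import os

@functools.lru_cache(maxsize=1)
def load_snapshot(shared_dir: str = "/shared") -> str:
    """Resolve the `current` symlink once and cache the result.

    Illustrative stand-in: the real loader would read vectors.faiss and
    meta.json from the resolved snapshot directory.
    """
    return os.path.realpath(os.path.join(shared_dir, "current"))

def reload_index() -> None:
    """What /reloadIndex boils down to: drop the cache so the next
    request re-resolves the symlink and maps the new snapshot files."""
    load_snapshot.cache_clear()
```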
The system automatically enforces department boundaries when detecting potential duplicates. This ensures data privacy and more relevant results:

- Automatic extraction: department IDs are read from the department_id field of Freshservice API responses
- Similarity filtering: only tickets within the same department are compared for similarity
- Cross-department isolation: tickets from different departments never influence each other's duplicate detection
- Backward compatibility: tickets without department IDs fall back to global comparison

Benefits:

- Data privacy: prevents cross-department information leakage
- Relevant results: more accurate duplicate detection within organizational boundaries
- Performance: a reduced search space speeds up similarity calculations
- Better accuracy: department context improves probability calculations

Department filtering is automatic and requires no additional configuration. The system:

- Reads department_id from Freshservice ticket API responses
- Gracefully handles tickets with missing or null department IDs
- Maintains full backward compatibility with existing deployments
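The filtering rule above amounts to a simple predicate over ANN candidates. A sketch with illustrative field names and dict shapes, not the shipped data model:

```python
def filter_by_department(candidates, new_ticket):
    """Drop ANN hits from other departments before scoring.

    If the new ticket has no department_id, fall back to the global
    pool, per the backward-compatibility rule above.
    """
    dept = new_ticket.get("department_id")
    if dept is None:
        return list(candidates)                         # global comparison fallback
    return [c for c in candidates if c.get("department_id") == dept]
```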
The system includes a Streamlit-based review console for manual validation of potential duplicates that fall below the automatic merge threshold. This enables continuous improvement through active learning.
- Confidence-based filtering: Shows tickets with similarity between configurable thresholds (default: 65%-90%)
- Session persistence: Tracks already-reviewed pairs within a session to avoid duplicates
- One-click actions: Approve merges or mark as not-duplicates with single button clicks
- Real-time feedback: Labels are immediately fed back to the training system
- Batch workflows: Process multiple candidates efficiently in sequence
```
# Install additional dependencies
pip install streamlit

# Set environment variables for the console
export REVIEW_USER="your-name"
export REVIEW_SIM_LOWER="0.65"   # Lower similarity threshold
export REVIEW_SIM_UPPER="0.90"   # Upper similarity threshold
export REVIEW_MAX="100"          # Max candidates to show
export REVIEW_DAYS_BACK="60"     # Days to look back for tickets

# Launch the console
streamlit run src/review_console.py --server.port 8501
```

The console will be available at http://localhost:8501 and integrates with the same Freshservice configuration as the main system.
Yes, the system automatically merges tickets when configured with a Freshservice webhook. Here's how it works:

1. Webhook trigger: Freshservice sends a webhook when a new ticket is created
2. Similarity analysis: the system compares the new ticket against existing tickets in the same department
3. Probability calculation: cosine similarity + domain heuristics (email matching, department context), weighted as 85% similarity score + 10% email bonus + 5% department bonus
4. Automatic merge: if probability ≥ threshold (default 90%), the tickets are merged automatically
5. Merge implementation: adds private notes to both tickets and closes the duplicate
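The weighting described above can be written out as a tiny function. This is an illustrative version only; the production heuristic lives in `_calculate_duplicate_probability()` and may differ in detail:

```python
def duplicate_probability(cosine_sim: float,
                          same_requester_email: bool,
                          same_department: bool) -> float:
    """Illustrative weighting: 85% similarity score + 10% email bonus
    + 5% department bonus, capped at 1.0."""
    score = 0.85 * cosine_sim
    if same_requester_email:
        score += 0.10                 # requester emails match
    if same_department:
        score += 0.05                 # same department context
    return min(score, 1.0)
```

With the default SIMILARITY_THRESHOLD of 0.9, a ticket pair at 0.9 cosine similarity only auto-merges when at least the email bonus applies.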
Since Freshservice doesn't have a native merge API, the system implements a merge workaround:
- Adds explanatory private note to the winning (original) ticket
- Adds reference note to the duplicate ticket pointing to the master
- Closes the duplicate ticket (status = 5)
- All future correspondence happens on the winning ticket
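The steps above boil down to three REST calls. The sketch below returns them as (method, url, json) tuples rather than sending them; endpoint paths follow the public Freshservice v2 API, but the note texts are placeholders and the shipped client may structure its requests differently:

```python
def merge_plan(domain: str, winner_id: int, dup_id: int):
    """Build the sequence of Freshservice v2 API calls for the merge
    workaround, as (method, url, json) tuples (dry-run sketch)."""
    base = f"https://{domain}/api/v2/tickets"
    return [
        # 1) explanatory private note on the winning ticket
        ("POST", f"{base}/{winner_id}/notes",
         {"body": f"Merged duplicate ticket #{dup_id} into this ticket.",
          "private": True}),
        # 2) reference note on the duplicate, pointing at the master
        ("POST", f"{base}/{dup_id}/notes",
         {"body": f"Closed as duplicate of ticket #{winner_id}.",
          "private": True}),
        # 3) close the duplicate (Freshservice status 5 = Closed)
        ("PUT", f"{base}/{dup_id}", {"status": 5}),
    ]
```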
- High Threshold: Default 90% confidence prevents false positives
- Department Isolation: Only merges tickets within the same department
- Audit Trail: All merges are logged with detailed probability scores
- Manual Override: Unmerged tickets (below threshold) are logged for human review
```
SIMILARITY_THRESHOLD=0.9     # 90% confidence required for auto-merge
WEBHOOK_SECRET=your_secret   # Optional webhook authentication
```

Note: automatic merging only occurs when tickets are submitted via the /webhook endpoint. The trainer and review console do not perform automatic merges.
• Logs: unified stdlib + Loguru formatter. Colourised in the console, plain text (no JSON) in files.
• Metrics: two counters + one histogram exported via prometheus_client; no background threads.
```
pip install -r requirements.txt -r dev-requirements.txt
pytest -q
```

The suite uses pytest + coverage and mocks HTTP calls: no network dependency.
1. Test Suite – runs tests across Python 3.10, 3.11, and 3.12
   - Installs dependencies with pip caching
   - Executes pytest with coverage reporting (75% minimum)
   - Uploads coverage to Codecov
2. Code Quality – enforces code standards
   - Runs pre-commit hooks (Black, isort, flake8)
   - Type checking with mypy
   - Security scanning with Bandit
3. Docker Build – validates containerization
   - Builds both trainer and serve images
   - Uses Docker layer caching for speed
   - Validates the docker-compose configuration
4. Security Scan – vulnerability detection
   - Trivy filesystem scanning
   - Results uploaded to the GitHub Security tab
```
# Install pre-commit hooks (one-time setup)
pip install pre-commit
pre-commit install

# Run quality checks locally
pre-commit run --all-files

# Run tests with coverage
pytest --cov=src --cov=trainer --cov-report=html

# Type checking
mypy src/ trainer/

# Security scan
bandit -r src/ trainer/
```

All tools are configured via pyproject.toml for consistency across environments.
1. Fork & clone, create a branch.
2. Adhere to PEP-8 and run pytest before pushing.
3. For new modules, write unit tests under tests/.
4. Open a pull request: CI will run lint + tests.
Feel free to open issues for bugs or feature requests. PRs are warmly welcomed!
Apache 2.0 β see LICENSE. Commercial use, distribution and modification are permitted.
• Sentence-Transformers
• FAISS
• FastAPI & Uvicorn
• Loguru