Redacta is an AI-powered document anonymization and summarization platform designed to help organizations handle sensitive documents while ensuring compliance with privacy regulations. The platform combines intelligent document processing with user-friendly interfaces to provide an end-to-end solution for document privacy management.
Redacta aims to address critical challenges faced by organizations handling sensitive documents:
- Privacy Compliance: Organizations struggle to comply with GDPR, HIPAA, and other privacy regulations when sharing or processing documents containing personal information
- Manual Anonymization: Traditional manual redaction processes are time-consuming, error-prone, and inconsistent
- Document Processing Efficiency: Large volumes of documents require significant human resources to review and anonymize
- Information Sharing: Organizations need to share documents for collaboration while protecting sensitive information
- Audit Trail: Need for maintaining proper documentation of anonymization processes for compliance purposes
- Support for PDF document formats with drag-and-drop interface
- Automatic document parsing and text extraction
- Three-tier anonymization levels:
  - Light: Names and direct identifiers only
  - Medium: Names, contact details, and locations
  - Heavy: All potential personal and sensitive information
- Automatic detection of PII (Personally Identifiable Information)
- Smart replacement with contextually appropriate placeholders
- Real-time preview of anonymized content
- Interactive document editor with highlighted sensitive sections
- Click-to-edit functionality for anonymized text
- Manual text selection for additional anonymization
- Visual differentiation between original (red) and anonymized (green) text
- AI-generated summaries with three length options (short, medium, long)
- Context-aware summarization that preserves key information
- Available for both original and anonymized documents
- Downloadable summary reports
- Complete document history tracking
- Search functionality by filename
- Status tracking (Original, Anonymized)
- OAuth2/JWT-based authentication via Keycloak
- Role-based access control
- Secure session management
- Multi-user support with isolated data
- Interactive chat widget for querying documents
- Context-aware responses based on document content
- Utilizes RAG (Retrieval-Augmented Generation)
Redacta leverages OpenAI GPT-4 models through a LangGraph-based workflow:
- Structured Output Processing: Uses Pydantic models to ensure consistent anonymization results
- Context-Aware Detection: AI first analyzes document context to identify sensitive information beyond simple pattern matching
- Smart Replacement: Generates contextually appropriate replacements for the identified terms (e.g., "Person A", "Location B", "Date C")
- Multi-level Processing: Dynamically adjusts detection sensitivity based on selected anonymization level
- ChromaDB Vector Store: Documents are chunked and embedded for semantic search
- Context Retrieval: Relevant document sections are retrieved based on summarization requirements
- Conversation Chain: Maintains context across multiple interactions with documents
- Document Chat: Users can interact with documents through natural language queries via the chat interface
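As a rough illustration of the structured-output and smart-replacement steps above, the sketch below uses LangChain's structured-output support with a Pydantic schema. The class names, prompt, and model name are assumptions for illustration and are not Redacta's actual code.

```python
# Hedged sketch: Pydantic-typed anonymization output via LangChain.
# Class names, prompt, and model are illustrative assumptions only.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class Replacement(BaseModel):
    original: str = Field(description="Exact term found in the document")
    replacement: str = Field(description="Placeholder such as 'Person A'")


class AnonymizationResult(BaseModel):
    replacements: list[Replacement]


llm = ChatOpenAI(model="gpt-4o", temperature=0)
structured_llm = llm.with_structured_output(AnonymizationResult)

result = structured_llm.invoke(
    "Medium anonymization level: find names, contact details, and locations "
    "in the text below and propose placeholders.\n\n"
    "John Smith met Anna Becker in Munich on 12 May."
)
print(result.replacements)
```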
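Similarly, the RAG pipeline described above can be sketched with LangChain and Chroma. Chunk sizes, the collection name, and the prompt are assumptions rather than the service's actual configuration.

```python
# Minimal RAG sketch with LangChain + Chroma; names and parameters are assumed,
# not taken from Redacta's genai-service.
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Chunk and embed the extracted document text
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents(["...extracted PDF text..."])
store = Chroma.from_documents(chunks, OpenAIEmbeddings(), collection_name="doc-123")

# 2. Retrieve the sections relevant to a user question
question = "Who signed the contract and when?"
context = store.similarity_search(question, k=4)

# 3. Let the model answer strictly from the retrieved context
llm = ChatOpenAI(model="gpt-4o", temperature=0)
answer = llm.invoke(
    "Answer using only this context:\n"
    + "\n\n".join(doc.page_content for doc in context)
    + f"\n\nQuestion: {question}"
)
print(answer.content)
```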
Note: Formal user stories from the initial project documentation can be found in the docs/system_overview.md file. Please consult this for various diagrams and images that illustrate the user flows as well as the architecture of the system.
- Upload: User drags PDF file to upload area
- Processing: System extracts text and displays document preview
- AI Analysis: User selects anonymization level (Light/Medium/Heavy)
- Anonymization: AI identifies and replaces sensitive information
- Review: User reviews highlighted changes in document editor
- Manual Editing: User can modify anonymized terms or add additional anonymizations
- Save: System stores anonymization mappings and metadata
- Export: User downloads anonymized PDF with proper formatting
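To make the save and export steps concrete, the sketch below applies a replacement mapping to extracted text. The real anonymization-service is a Spring Boot (Java) service, so this Python version is purely illustrative.

```python
# Illustrative only: applying an {original -> replacement} mapping to text.
# The actual anonymization-service implements this logic in Java.
import re

def apply_replacements(text: str, replacements: list[dict]) -> str:
    # Replace longer terms first so "John Smith" is handled before "John".
    for item in sorted(replacements, key=lambda r: len(r["original"]), reverse=True):
        text = re.sub(re.escape(item["original"]), item["replacement"], text)
    return text

print(apply_replacements(
    "John Smith lives in Munich.",
    [{"original": "John Smith", "replacement": "Person A"},
     {"original": "Munich", "replacement": "Location B"}],
))  # -> "Person A lives in Location B."
```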
- Source Selection: User chooses between original or anonymized document
- Summary Configuration: Select summary length (short/medium/long)
- AI Processing: System generates contextual summary
- Review: User reviews generated summary in dedicated panel
- Export: Download summary as standalone document
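A minimal sketch of length-controlled summarization is shown below; the prompt wording and length hints are assumptions, not the genai-service's actual prompts.

```python
# Hedged sketch of the summarization step; prompts and length hints are assumed.
from langchain_openai import ChatOpenAI

LENGTH_HINTS = {
    "short": "in two to three sentences",
    "medium": "in one concise paragraph",
    "long": "in several detailed paragraphs",
}

def summarize(text: str, length: str = "medium") -> str:
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    prompt = (
        f"Summarize the following document {LENGTH_HINTS[length]}, "
        "preserving key parties, dates, and obligations:\n\n" + text
    )
    return llm.invoke(prompt).content

print(summarize("...original or anonymized document text...", "short"))
```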
- Chat Interface: User opens chat widget for document interaction
- Query Input: User types natural language questions about the document
- Context Retrieval: System retrieves relevant document sections using RAG
- AI Response: AI generates context-aware answers based on document content
Hamza Chaouki:
- Created the oops-ops Kubernetes namespace
- Developed GitHub workflows to:
  - Build and push container images
  - Automatically deploy services to Kubernetes
- Wrote unit, integration, and service application tests for:
  - authentication-service
  - document-service
- Implemented the full authentication system:
  - Frontend UI for registration and login
  - authentication-service using Keycloak and OAuth2
- Added JWT token propagation across all services for user identification and access control
- Document Service:
  - Built the complete document handling flow:
    - Implemented database setup, entities, DTOs, controller
    - Developed document upload and text extraction features
- Database Setup:
  - Configured PostgreSQL databases for all services
  - Set up pgAdmin for visual DB management and debugging
Siddharth Khattar:
- Initial project breakdown and architecture design, along with ticket creation
- Creation of the various stages and overall orchestration workflows for the CI/CD pipeline
- Test development for the complete GenAI service
- Design and mock-data-based implementation of the frontend pages: Document Uploader, Editor, Archive, and the Chat interface
- Connection of the Document Archive frontend to the Spring Boot backend
- Implementation of the complete GenAI chat backend (Python + LangChain + vector database + API)
- AWS deployment along with its CD GitHub workflow
Yosr Nabli:
- Created the "oopsops-test" GitHub Actions workflow to build and push container images automatically
- Wrote unit tests, integration tests, and service application tests for the anonymization-service
- Implemented the complete anonymization pipeline:
  - Frontend adjustments tailored to anonymization requirements
  - Backend development of the anonymization-service (including database setup, entities, DTOs, controller, and service layer)
  - Integrated flow with genai-service
- Implemented the summarization pipeline in the genai-service
- Integrated summarization with the frontend and adapted the UI/UX accordingly
- Set up the entire monitoring stack, including:
  - Prometheus (metrics collection)
  - Grafana (dashboard provisioning)
  - Alertmanager (alerts and notifications)
- Docker and Docker Compose installed, or Docker Desktop (Mac/Windows)
- OpenAI API key
- 8GB+ RAM recommended
- The following ports must be available: 8000, 8081, 8085, 8091, 8092, 8094, 3000, 5432, 5050
- Clone the Repository

  ```bash
  git clone https://github.com/AET-DevOps25/team-oopsops.git
  cd team-oopsops
  ```

- Environment Configuration

  ```bash
  # Create environment file
  cp .env.example .env

  # Set required environment variables in the .env file
  OPENAI_API_KEY="your-openai-api-key"
  ```

- Start All Services

  ```bash
  # Launch complete stack
  docker-compose up -d

  # View logs
  docker-compose logs -f
  ```

- Access the Platform
  - Main Application: http://localhost:3000
  - API Gateway: http://localhost:8081
  - Database Admin: http://localhost:5050 (admin@admin.com / admin)
  - Keycloak Admin: http://localhost:8085 (admin / admin)
| Service | Port | Description | OpenAPI Documentation | Health Check |
|---|---|---|---|---|
| API Gateway (Nginx) | 8081 | Routes requests to microservices | - | http://localhost:8081 |
| Client (React SPA) | 3000 | Frontend application | - | http://localhost:3000 |
| Document Service | 8091 | Document upload and management | http://localhost:8091/swagger-ui/index.html | http://localhost:8091/actuator/health |
| Authentication Service | 8092 | User auth and JWT management | http://localhost:8092/swagger-ui/index.html | http://localhost:8092/actuator/health |
| Anonymization Service | 8094 | Document anonymization logic | http://localhost:8094/swagger-ui/index.html | http://localhost:8094/actuator/health |
| GenAI Service | 8000 | AI processing and RAG | http://localhost:8000/docs | http://localhost:8000/health |
| PostgreSQL | 5432 | Primary database | - | pg_isready -U dev_user |
| PgAdmin | 5050 | Database administration | - | http://localhost:5050 |
| Keycloak | 8085 | Identity and access management | - | http://localhost:8085 |
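For a quick local sanity check, the health endpoints from the table above can be polled with a few lines of Python. The assumption that the GenAI /health endpoint returns JSON with a status field mirrors the Actuator format but is not guaranteed.

```python
# Polls the health endpoints listed in the table above (local docker-compose ports).
import requests

HEALTH_URLS = {
    "document-service": "http://localhost:8091/actuator/health",
    "authentication-service": "http://localhost:8092/actuator/health",
    "anonymization-service": "http://localhost:8094/actuator/health",
    "genai-service": "http://localhost:8000/health",  # response shape assumed
}

for name, url in HEALTH_URLS.items():
    try:
        status = requests.get(url, timeout=5).json().get("status", "UNKNOWN")
    except requests.RequestException as exc:
        status = f"DOWN ({exc})"
    print(f"{name:25s} {status}")
```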
Authentication Service:
- POST /register - User registration with Keycloak integration
- POST /login - User authentication, returns JWT tokens
- POST /refresh - Refresh access token using refresh token

Document Service:
- GET / - List all user documents with metadata
- GET /{id} - Retrieve specific document by ID
- POST /upload - Upload PDF file for processing

Anonymization Service:
- GET / - List user's anonymization records
- POST /{documentId}/add - Save anonymization for document
- POST /replace - Process text anonymization with term replacements
- GET /{id}/download - Download anonymized document as PDF

GenAI Service:
- POST /anonymize - AI-powered anonymization with level selection. Returns a list of terms to replace (e.g., { original: "John", replacement: "Person A" }) rather than generating a fully anonymized text.
- POST /summarize - Generate document summaries
- POST /chat - Interactive document chat using RAG
- POST /documents/upload - Upload documents to vector store
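As an example of calling these endpoints, the snippet below posts text to the anonymization endpoint through the gateway. The exact path (taken from the /api/v1/genai/... routes referenced in the monitoring section) and the request field names are assumptions.

```python
# Hypothetical request to the GenAI anonymize endpoint; path and payload
# field names are assumptions. The JWT comes from the authentication service.
import requests

token = "..."  # obtained via POST /login

resp = requests.post(
    "http://localhost:8081/api/v1/genai/anonymize",
    headers={"Authorization": f"Bearer {token}"},
    json={"text": "John Smith lives in Munich.", "level": "medium"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # e.g. [{"original": "John Smith", "replacement": "Person A"}, ...]
```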
Note: More detailed diagrams and images can be found in the docs/system_overview.md file.

```
┌─────────────────┐ ┌──────────────────┐
│ React Client │────│ Nginx Gateway │
│ (Port 3000) │ │ (Port 8081) │
└─────────────────┘ └──────────────────┘
│
┌───────────┼───────────┐
│ │ │
┌───────▼──┐ ┌──────▼──┐ ┌──────▼──────┐
│Document │ │Auth │ │Anonymization│
│Service │ │Service │ │Service │
│(8091) │ │(8092) │ │(8094) │
└──────────┘ └─────────┘ └─────────────┘
│ │ │
└───────────┼───────────┘
│
┌───────────▼───────────┐
│ PostgreSQL │
│ (Port 5432) │
│ (Multi-database) │
└───────────────────────┘
┌─────────────────┐ ┌──────────────────┐
│ GenAI Service │────│ ChromaDB │
│ (Port 8000) │ │ Vector Store │
│ (FastAPI) │ │ (Embedded) │
└─────────────────┘ └──────────────────┘
```

- Request Flow: Client → Nginx → Microservice → Database
- Authentication Flow: Client → Auth Service → Keycloak → JWT
- Document Processing: Upload → Document Service → Text Extraction → Storage
- AI Processing: Document → GenAI Service → OpenAI API → ChromaDB → Response
- Anonymization Flow: Extracted Text + Anonymization Level → GenAI Service (identifies terms to replace) → Anonymization Service (applies replacements) → Response
- Framework: React 19.1.0 with TypeScript
- Styling: Tailwind CSS 3.4.17 with custom components
- UI Library: Radix UI components with shadcn/ui
- State Management: TanStack Query for API state management
- Routing: React Router DOM 7.6.2
- Build Tool: Vite 6.3.5
- Form Handling: React Hook Form with Zod validation
- Framework: Spring Boot 3.5.0 with Java 21
- Security: Spring Security with OAuth2 Resource Server
- Database: PostgreSQL with JPA/Hibernate
- Documentation: OpenAPI/Swagger integration
- Monitoring: Spring Actuator with Prometheus metrics, visualization in Grafana dashboards, and alert rules defined in Prometheus and managed via Alertmanager
- Build Tool: Gradle 8.14
- Framework: FastAPI 0.115.12 with Python 3.9+
- AI/ML Libraries:
- LangChain 0.3.25 for LLM orchestration
- LangGraph 0.4.8 for workflow management
- ChromaDB 1.0.15 for vector storage
- LLM Integration: OpenAI GPT-4 via langchain-openai
- Document Processing: PyPDF 5.7.0, ReportLab 4.4.2
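As a small illustration of the document-processing libraries listed above, text extraction with PyPDF can look like the following; the actual extraction and cleanup logic in the services may differ.

```python
# Minimal PyPDF text extraction; Redacta's real pipeline may add cleanup/chunking.
from pypdf import PdfReader

def extract_text(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

print(extract_text("contract.pdf")[:500])
```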
- Containerization: Docker with multi-stage builds
- Orchestration: Docker Compose for local, Kubernetes for production
- Reverse Proxy: Nginx for API Gateway
- Database: PostgreSQL 14 Alpine
- Identity Provider: Keycloak 22.0.3
- Monitoring: Prometheus + Grafana + Alertmanager in Kubernetes (configured but disabled in docker-compose)
- Container Registry: GitHub Container Registry (GHCR)
- CI/CD: GitHub Actions with automated testing
- Deployment: AWS EC2 + Rancher Kubernetes
- Load Balancing: Traefik for production deployments
- SSL: Let's Encrypt certificate management
```yaml
# Key services configuration
services:
  api-gateway:              # Nginx routing
  postgres:                 # Multi-database setup
  document-service:         # Spring Boot services
  authentication-service:
  anonymization-service:
  genai-service:            # AI processing
  client:                   # React SPA
  keycloak:                 # Identity management
```

- Infrastructure: AWS EC2 + Rancher Kubernetes
- Load Balancing: Traefik with automatic SSL
- DNS: Custom domains with automated certificate management
- Scaling: Horizontal pod autoscaling based on CPU/memory
- Monitoring: Prometheus metrics collection + Grafana Dashboards + Alertmanager
- Logging: Structured logging with log aggregation
The project implements a step-by-step CI/CD pipeline with the following stages:
- Continuous Integration
  - Automated testing for Spring Boot services
  - Python GenAI service testing with pytest
  - Docker image building and pushing to GHCR
- Continuous Deployment
  - Automated deployment to AWS EC2 environment
  - Kubernetes deployment via Rancher
  - Environment-specific configuration management

Note: The test results as well as the coverage reports are available in the GitHub Actions workflow run logs in the "Actions" tab of the repository.

- Post-Deployment Validation
  - Health check verification
  - API endpoint testing
  - Service integration validation
- Pipeline Orchestration

```yaml
# Main workflow stages
- test-springboot       # Unit/integration tests
- test-genai            # Python service testing
- build-and-push        # Docker image creation
- deploy-aws            # EC2 deployment
- deploy-rancher        # Kubernetes deployment
- validate-deployment   # Post-deployment checks
```
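The test-genai stage runs pytest against the FastAPI service; a test in that suite might look roughly like the sketch below (the module path app.main and the /health route are assumptions).

```python
# Hedged sketch of a GenAI service test; the import path is hypothetical.
from fastapi.testclient import TestClient

from app.main import app  # hypothetical module path for the FastAPI app

client = TestClient(app)

def test_health_endpoint_reports_ok():
    response = client.get("/health")
    assert response.status_code == 200
```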
- OAuth2/JWT: Keycloak-based identity management
- Role-Based Access: Service account roles with proper permissions
- API Security: All endpoints require valid JWT tokens
- CORS: Properly configured cross-origin resource sharing
- Encryption: TLS/HTTPS for all communications
- Database Security: Isolated databases per service
- File Storage: Secure file handling with proper validation
- API Key Management: Secure environment variable handling
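For the API key handling, a minimal sketch of reading OPENAI_API_KEY from the environment is shown below; the use of python-dotenv here is an assumption about how the genai-service loads its configuration.

```python
# Assumed configuration loading: read the key from .env / environment,
# never hard-code it in source or container images.
import os
from dotenv import load_dotenv

load_dotenv()  # picks up OPENAI_API_KEY from the .env created during setup

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; see the Quick Start section")
```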
- Spring Actuator endpoints for service health
- Custom health checks for all components
- Dependency health validation
- Prometheus metrics integration
- Custom application metrics (e.g., request latency, error rates)
- Performance monitoring
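The custom latency and error metrics could be exported from the FastAPI service roughly as sketched below using prometheus_client; the metric and label names are assumptions, chosen to match the http_request_duration_seconds series queried in the dashboards.

```python
# Illustrative FastAPI middleware exporting latency and error metrics;
# metric names and labels are assumptions, not the service's actual ones.
import time
from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency", ["method", "path"]
)
REQUEST_ERRORS = Counter(
    "http_request_errors_total", "5xx responses", ["method", "path"]
)

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # endpoint scraped by Prometheus

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    REQUEST_LATENCY.labels(request.method, request.url.path).observe(
        time.perf_counter() - start
    )
    if response.status_code >= 500:
        REQUEST_ERRORS.labels(request.method, request.url.path).inc()
    return response
```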
- Error Rate Dashboard: Tracks 4xx/5xx HTTP error rates across all services
- GenAI Latency Dashboard: Monitors response times for summarization and anonymization requests
- Traffic Summary Dashboard: Aggregates and visualizes traffic for all service endpoints
- Prometheus alert rules for critical thresholds (e.g., high error rate, slow response time)
- AlertManager handles alert routing and notification (via email)
- Note: Email notifications via Gmail SMTP are currently not functional due to authentication issues — despite using an app password, Gmail rejects the credentials with a “username and password not accepted” error.
- Structured logging across all services
- Centralized log aggregation capability
- Error tracking and alerting
- Microservice separation by domain
- Clean architecture patterns
- Comprehensive testing strategies
- API-first development approach
- Unit and integration testing
- Code coverage reporting
- Automated quality gates
- Documentation as code
To deploy the monitoring stack (Prometheus, Grafana, Alertmanager) via Helm:
```bash
cd helm/monitoring
helm upgrade --install oopsops-monitoring-app . \
  --namespace oopsops-monitoring \
  --create-namespace
```

- Prometheus UI: You can create and run custom PromQL queries directly in Prometheus: https://prometheus.monitoring.student.k8s.aet.cit.tum.de/
- Grafana Dashboards: Pre-configured dashboards are available in Grafana: https://grafana.monitoring.student.k8s.aet.cit.tum.de/

The following dashboards are currently available:

- Error Rate Dashboard:
  - Shows 5xx and 4xx error rates for:
    - Spring Boot services
    - genai-service (FastAPI)
  - Based on PromQL queries such as:

    ```
    rate(http_server_requests_seconds_count{status=~"5.."}[5m])
    rate(http_request_duration_seconds_count{status=~"5..", job="kubernetes-genai-service"}[5m])
    ```

- GenAI Service Latency:
  - Shows average request latency for the GenAI service's endpoints:
    - /api/v1/genai/anonymize
    - /api/v1/genai/summarize
  - Calculated as:

    ```
    rate(http_request_duration_seconds_sum{...}) / rate(http_request_duration_seconds_count{...})
    ```

- Traffic Summary by Service & Endpoint:
  - Visualizes total request count per service and endpoint in the last 24 hours.
  - Colored bars grouped by microservice:
    - Green → auth
    - Orange → document
    - Purple → anonymization
    - Red → genai
- Port Conflicts: Ensure all required ports are available
- OpenAI API: Verify API key is properly set in environment
- Database Connection: Check PostgreSQL service startup
- Keycloak must be ready before authentication service
- PostgreSQL must be available before Spring Boot services
- All services should have proper health checks
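A small readiness poll like the sketch below can help diagnose the start-up ordering issues above; in the actual stack this ordering is expected to be handled by container health checks rather than a script, and the host/port values are the local docker-compose defaults.

```python
# Illustrative readiness poll for dependent services (ports from the table above).
import socket
import time

DEPENDENCIES = {"postgres": 5432, "keycloak": 8085}

def wait_for(host: str, port: int, timeout: float = 120.0) -> None:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                print(f"{host}:{port} is accepting connections")
                return
        except OSError:
            time.sleep(2)
    raise TimeoutError(f"{host}:{port} not reachable after {timeout}s")

for name, port in DEPENDENCIES.items():
    print(f"waiting for {name}...")
    wait_for("localhost", port)
```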