University Course Project - Data Science
An advanced NLP toolkit built on state-of-the-art Transformer architectures for document summarization and mental health analysis, using PEFT (LoRA) techniques.
Faboulous-Interpretr is a production-ready NLP platform designed to address two complex tasks: summarizing extensive technical documentation and identifying mental-health-related patterns in text.
The project stands out for its adoption of advanced optimization techniques such as Map-Reduce for managing long texts and LoRA (Low-Rank Adaptation) for efficient model fine-tuning.
- 📄 Structured Summarization: Intelligent synthesis of technical documents (PDF, API Specs, Web) while maintaining logical coherence through recursive chunking.
- 🧠 Mental Health Analysis: Text classification for identifying emotional and psychological states (e.g., Anxiety, Depression, Stress) using XLM-RoBERTa models adapted with LoRA.
The system is modular and designed to scale, with a clear separation between data ingestion, inference logic, and user interface.
To overcome the context window limits of standard Transformers, we implemented a custom pipeline:
- Agnostic Ingestion: Dedicated adapters for PDF (PyMuPDF), Web (Trafilatura), and JSON/YAML files (OpenAPI).
- Recursive Chunking: Semantic text segmentation that preserves sentence boundaries to avoid abrupt truncation.
- Map-Reduce Strategy: Each segment is summarized individually (Map) and results are structurally aggregated (Reduce), ensuring no technical detail is lost.
- Backbone: `it5-base-summarization`, fine-tuned specifically for the Italian language.
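The Map-Reduce strategy above can be sketched in plain Python. This is a minimal illustration, not the project's actual implementation: `summarize_chunk` is a stand-in for a call to the summarization model, and the character budget is an arbitrary assumption.

```python
import re

def recursive_chunk(text: str, max_chars: int = 400) -> list[str]:
    """Split text into chunks that respect sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk instead of truncating mid-sentence.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def map_reduce_summary(text: str, summarize_chunk) -> str:
    """Map: summarize each chunk independently; Reduce: aggregate the partials."""
    partials = [summarize_chunk(chunk) for chunk in recursive_chunk(text)]
    return " ".join(partials)
```

In the real pipeline the Reduce step can itself be a summarization pass over the concatenated partial summaries when they are still too long.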
A highly specialized classification module:
- Model Architecture: `XLM-RoBERTa Base` enhanced with LoRA adapters. This allows for a high-performance model with a reduced memory footprint, updating less than 1% of total parameters during training.
- Fine-Tuning Pipeline: Dedicated training script (`train_sentiment.py`) managing the model lifecycle, from dataset preprocessing to adapter saving.
- Target Classes: Configured to detect complex nuances (e.g., Normal, Depression, Anxiety) beyond classic positive/negative sentiment.
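A LoRA setup of this kind can be sketched with the Hugging Face `peft` library. The hyperparameters below (`r`, `lora_alpha`, dropout, target modules) are illustrative assumptions, not necessarily the values used by `train_sentiment.py`:

```python
def build_lora_classifier(num_labels: int = 3):
    """Attach LoRA adapters to XLM-RoBERTa for sequence classification.

    Imports are kept inside the function so the sketch reads without
    transformers/peft installed; calling it downloads the base model.
    """
    from transformers import AutoModelForSequenceClassification
    from peft import LoraConfig, TaskType, get_peft_model

    base = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=num_labels
    )
    config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=8,                                # illustrative adapter rank
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules=["query", "value"],  # attention projections
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # typically well under 1% trainable
    return model
```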
```
Faboulous-Interpretr/
├── app.py                # Streamlit entry point (UI & orchestration)
├── requirements.txt      # Production dependencies
├── data/
│   ├── external/         # Data from external sources
│   ├── processed/        # Cleaned datasets ready for training
│   └── raw/              # Raw data (CSV, PDF, JSON)
├── docs/                 # Technical and academic documentation
├── models/               # Local model registry (LoRA checkpoints, HF cache)
├── notebooks/            # Jupyter notebooks for EDA and experimentation
│   ├── 1_EDA_and_Baseline.ipynb
│   └── sentiment_analysis_nn.ipynb
└── src/                  # Source code
    ├── data_ingestion.py   # Loaders for PDF, URL, and OpenAPI
    ├── preprocessing.py    # Text cleaning and recursive token chunker
    ├── summarization.py    # Summarization inference logic
    ├── sentiment.py        # Sentiment inference logic (LoRA loading)
    ├── train_sentiment.py  # PEFT/LoRA training pipeline
    ├── evaluation.py       # Metrics validation script (ROUGE)
    └── utils.py            # Hardware detection and centralized logging
```
- Frontend: Streamlit
- Modeling: PyTorch, Hugging Face Transformers, PEFT (Parameter-Efficient Fine-Tuning)
- Data Processing: Pandas, Scikit-learn
- NLP Utils: PyMuPDF (Fitz), Trafilatura
- Hardware Acceleration: Automatic support for CUDA (NVIDIA) and MPS (Apple Silicon).
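Automatic device selection of this kind might look like the following sketch (a hypothetical helper, falling back to CPU when PyTorch or an accelerator is unavailable):

```python
def detect_device() -> str:
    """Pick the best available compute device: CUDA, MPS, or CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"   # Apple Silicon
    return "cpu"
```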
- Python 3.9+
- Virtual Environment (recommended)
- Clone the repository:

  ```bash
  git clone https://github.com/DataScience-Golddiggers/Faboulous-Interpretr.git
  cd Faboulous-Interpretr
  ```

- Create and activate a virtual environment:

  ```bash
  # Windows
  python -m venv .venv
  .venv\Scripts\activate

  # Unix/macOS
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Start the Web App:

  ```bash
  streamlit run app.py
  ```
The project includes a complete pipeline for fine-tuning. To train a new adapter on your own data:
```bash
python src/train_sentiment.py \
  --data_path "data/processed/mental_balanced.csv" \
  --text_col "text" \
  --label_col "label" \
  --epochs 5 \
  --batch_size 16 \
  --output_dir "models/my_custom_lora"
```

The system will automatically save the adapters in the specified folder, ready to be loaded by the inference module.
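Loading a saved adapter back for inference could look like the following sketch (the adapter path, base model name, and label count are assumptions; imports stay inside the function so the snippet reads without the libraries installed):

```python
def load_lora_classifier(adapter_dir: str = "models/my_custom_lora"):
    """Load the base model, then attach the saved LoRA adapters."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=3  # assumed label count
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    return model.eval(), tokenizer
```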
Model performance is monitored via quantitative metrics:
- Summarization: ROUGE-1, ROUGE-2, ROUGE-L.
- Classification: Accuracy, F1-Score (Weighted).
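For reference, the weighted F1 (normally computed with scikit-learn's `f1_score(..., average="weighted")`) averages per-class F1 scores with weights proportional to class support. A minimal pure-Python equivalent:

```python
from collections import Counter

def weighted_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
        fp = sum(p == cls and t != cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (n / total) * f1
    return score
```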
To run the evaluation suite:
```bash
python -m src.evaluation
```

Authors: Data Science Golddiggers Team

