A multi-task learning pipeline for analyzing participatory democracy data, developed as part of the Participedia capstone project.
This project develops a multi-task learning framework for classification and text analysis on participatory democracy datasets. It uses a pretrained DistilBERT language model as the shared base encoder and attaches task-specific heads for classification and embedding generation. The pipeline covers data preprocessing, fine-tuning, evaluation, and deployment, with all experiments tracked using DVC and MLflow.
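The shared-encoder, task-specific-heads pattern described above can be sketched as follows. This is a minimal illustration, not the project's code: a toy embedding encoder stands in for the pretrained DistilBERT model, and all names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared encoder with two task-specific heads (sketch only;
    the real project uses pretrained DistilBERT as the encoder)."""

    def __init__(self, vocab_size=100, hidden=32, n_classes=4):
        super().__init__()
        # Toy stand-in for DistilBERT: embedding lookup + mean pooling.
        self.encoder = nn.Embedding(vocab_size, hidden)
        # Head 1: classify the input text into n_classes labels.
        self.classifier = nn.Linear(hidden, n_classes)
        # Head 2: project pooled states into a contextual embedding.
        self.projector = nn.Linear(hidden, hidden)

    def forward(self, input_ids):
        pooled = self.encoder(input_ids).mean(dim=1)  # [batch, hidden]
        return self.classifier(pooled), self.projector(pooled)

model = MultiTaskModel()
logits, embeddings = model(torch.randint(0, 100, (2, 8)))
```

Because both heads share one encoder, gradients from every task update the same contextual representation, which is the core idea behind the multi-task fine-tuning used here.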
- Data preparation: Cleans and preprocesses datasets from Participedia, performing tokenization and formatting for transformer inputs.
- Multi-task learning: Fine-tunes a shared DistilBERT model across classification tasks to extract contextual embeddings and classify input texts.
- Experiment tracking & versioning: Uses DVC and MLflow to version data and models, log metrics and manage experiments.
- Deployment: Provides deployment scripts for serving the model on Vertex AI and Kubernetes.
- Reproducible infrastructure: Includes containerization via Docker and infrastructure-as-code for consistent environments.
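Versioning with DVC is typically driven by a `dvc.yaml` pipeline file that chains the stages listed above. A hypothetical layout (stage names, scripts, and paths are illustrative, not the project's actual files):

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py   # hypothetical script path
    deps:
      - data/raw
    outs:
      - data/processed
  train:
    cmd: python src/train.py        # fine-tunes the shared encoder
    deps:
      - data/processed
    outs:
      - models/model.pt
```

With such a file, `dvc repro` re-runs only the stages whose dependencies changed, while MLflow logs the metrics produced by each run.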
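The data-preparation step above might look roughly like the sketch below. The function name, cleaning rules, and truncation length are illustrative assumptions, not the project's actual preprocessing code.

```python
import re

def clean_text(text: str, max_chars: int = 512) -> str:
    """Normalize a raw Participedia-style entry before tokenization
    (hypothetical rules; the real cleaning pipeline is project-specific)."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop stray HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text[:max_chars]                   # truncate long inputs

print(clean_text("  A <b>public</b>   assembly\n case study "))
# → A public assembly case study
```

The cleaned string would then be passed to a transformer tokenizer to produce the model's input IDs.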
This repository contains only a high-level description. The complete codebase and dataset remain proprietary. To learn more about the project or discuss collaboration opportunities, please contact me or explore the summary on my portfolio site.