This project was developed during the NitroNLP Hackathon for the “Mickey & Donald” challenge. The goal was to classify shuffled Romanian/Moldovan text messages along two dimensions:
- Dialect classification – distinguish Romanian from Moldovan.
- Semantic category classification – assign text to one of six topics (culture, finance, politics, science, sports, technology).
The challenge was made harder by two constraints:
- Shuffled words within each sentence (destroying normal word order).
- Entity masking (names replaced by $NE$).
- Provided data: train/validation/test splits from the hackathon (~8,000 samples).
- Augmentation: additional samples from MOROCO (Moldavian and Romanian Dialectal Corpus).
- Preprocessing steps (a minimal sketch follows this list):
  - Standardized $NE$ tokens.
  - Minimal text cleaning (e.g., whitespace).
  - Stratified splits to preserve class balance.
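A minimal preprocessing sketch, assuming pandas DataFrames; the file name and the `text`/`label` column names are hypothetical:

```python
import re

import pandas as pd
from sklearn.model_selection import train_test_split

NE_PATTERN = re.compile(r"\$\s*NE\s*\$")  # tolerate spacing variants such as "$ NE $"

def preprocess(text: str) -> str:
    """Normalize $NE$ placeholders and collapse runs of whitespace."""
    text = NE_PATTERN.sub("$NE$", text)
    return " ".join(text.split())

df = pd.read_csv("train.csv")  # hypothetical file and column names
df["text"] = df["text"].map(preprocess)

# Stratified split keeps the label distribution identical across splits.
train_df, val_df = train_test_split(
    df, test_size=0.1, stratify=df["label"], random_state=42
)
```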
Goal: Decide whether a given shuffled text comes from the Romanian or Moldovan dialect.
- Random Forest + TF-IDF + handcrafted features (sketch below)
  - Focused on orthographic cues (â/î usage, letter frequencies).
  - Provided interpretability and a strong baseline.
  - Accuracy: 86.7%, F1 = 0.86.
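A sketch of this baseline, assuming `train_texts`/`train_labels` and `val_texts`/`val_labels` come from the preprocessing step above; the character n-gram range and the cue-letter set are illustrative choices, not the exact configuration:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

DIALECT_LETTERS = "âîășț"  # orthographic cues; â vs. î usage differs across standards

def letter_features(texts):
    """Relative frequency of each cue letter per text."""
    rows = [[t.lower().count(c) / max(len(t), 1) for c in DIALECT_LETTERS]
            for t in texts]
    return csr_matrix(np.array(rows))

tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4), max_features=50_000)
X_train = hstack([tfidf.fit_transform(train_texts), letter_features(train_texts)])
X_val = hstack([tfidf.transform(val_texts), letter_features(val_texts)])

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
clf.fit(X_train, train_labels)
print("val accuracy:", clf.score(X_val, val_labels))
```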
- Fine-tuned RoBERT (sketch below)
  - Leveraged the Romanian BERT model pre-trained on millions of Romanian sentences.
  - Captured deeper contextual and subword-level dialect patterns.
  - Accuracy: 89.2%, F1 = 0.89.
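A fine-tuning sketch with Hugging Face transformers; the `readerbench/RoBERT-base` checkpoint and all hyperparameters are assumptions, not necessarily what was used here:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "readerbench/RoBERT-base"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# Hypothetical columns: "text" and an integer "label" (0 = RO, 1 = MD).
train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
val_ds = Dataset.from_pandas(val_df).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="robert-dialect",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```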
- Multi-Channel CNN (sketch below)
  - Modeled text at different granularities (character n-grams of 1–7).
  - Effective at handling shuffled words, since intra-word patterns remain intact.
  - Best results: Accuracy = 89.8%, F1 = 0.90.
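A PyTorch sketch of the idea; the vocabulary size, filter counts, and pooling are illustrative, not the exact architecture:

```python
import torch
import torch.nn as nn

class MultiChannelCNN(nn.Module):
    """Parallel character convolutions with kernel widths 1..7."""

    def __init__(self, vocab_size=200, emb_dim=64, n_filters=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=k) for k in range(1, 8)
        )
        self.fc = nn.Linear(7 * n_filters, n_classes)

    def forward(self, char_ids):                # (batch, seq_len) of char indices
        x = self.emb(char_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        # Each channel detects character n-grams of one width; global max
        # pooling makes the classifier indifferent to word order.
        pooled = [conv(x).relu().amax(dim=2) for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

model = MultiChannelCNN()
logits = model(torch.randint(1, 200, (8, 300)))  # dummy batch: 8 texts, 300 chars
```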
Goal: Assign each sample to one of six semantic classes (culture, finance, politics, science, sports, technology).
- Imbalanced categories (e.g., sports had far more examples than finance).
- Overlapping vocabulary between categories such as politics & culture or science & technology.
- Fine-tuned RoBERT with focal loss (sketch below)
  - Adapted the loss function to give more weight to minority classes and harder examples.
  - Solid baseline: Accuracy = 79.6%, F1 = 0.80.
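A sketch of focal loss (Lin et al., 2017) plugged into a `Trainer` subclass; the value of gamma and the optional class weights `alpha` are illustrative:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Cross-entropy scaled by (1 - p_t)^gamma, so easy examples contribute less."""
    ce = F.cross_entropy(logits, targets, weight=alpha, reduction="none")
    pt = torch.exp(-ce)  # model's probability for the true class
    return ((1.0 - pt) ** gamma * ce).mean()

class FocalTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = focal_loss(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```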
- RoBERT + Random Forest ensemble (sketch below)
  - Combined three types of features: TF-IDF vectors, RoBERT embeddings, and cosine similarities with category labels.
  - A Random Forest was used to integrate these heterogeneous features.
  - Improved performance: Accuracy = 81.6%, F1 = 0.82.
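A sketch of the feature stack; the mean-pooling, the Romanian label words, the checkpoint name, and the `train_texts`/`train_topics` variables are all assumptions:

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")  # assumed checkpoint
enc = AutoModel.from_pretrained("readerbench/RoBERT-base").eval()

@torch.no_grad()
def embed(texts, batch_size=32):
    """Mean-pooled last-layer RoBERT embeddings."""
    chunks = []
    for i in range(0, len(texts), batch_size):
        batch = tok(texts[i:i + batch_size], padding=True, truncation=True,
                    max_length=256, return_tensors="pt")
        hidden = enc(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1)
        chunks.append(((hidden * mask).sum(1) / mask.sum(1)).numpy())
    return np.vstack(chunks)

# Romanian label words (assumed) embedded once, then compared to each sample.
LABELS = ["cultură", "finanțe", "politică", "știință", "sport", "tehnologie"]
label_vecs = embed(LABELS)

tfidf = TfidfVectorizer(max_features=50_000).fit(train_texts)
emb = embed(train_texts)
sims = cosine_similarity(emb, label_vecs)  # six label-affinity features
X_train = hstack([tfidf.transform(train_texts), csr_matrix(emb), csr_matrix(sims)])

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rf.fit(X_train, train_topics)
```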
- RoBERT + XGBoost ensemble (sketch below)
  - Similar feature setup, but with XGBoost for more sophisticated boosting and feature-interaction modeling.
  - Outperformed the other approaches across most categories.
  - Best results: Accuracy = 82.1%, F1 = 0.82.
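A minimal sketch swapping the Random Forest for XGBoost on the same stacked features; the hyperparameters are illustrative:

```python
from xgboost import XGBClassifier

# X_train / train_topics: the same stacked features as in the
# Random Forest ensemble sketch above.
xgb = XGBClassifier(
    n_estimators=600,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="mlogloss",
)
xgb.fit(X_train, train_topics)
# Validation features would be built with the same tfidf.transform / embed
# pipeline before calling xgb.predict.
```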