
Dialect and semantic classification of shuffled Romanian/Moldovan text — Hackathon project exploring ML, transformers, and ensembles.


CapitanuAndreea/NitroNLP-Hackathon


Automatic Dialect & Semantic Detection

This project was developed during the NitroNLP Hackathon for the “Mickey & Donald” challenge. The goal was to classify shuffled Romanian/Moldovan text messages along two dimensions:

  1. Dialect classification – distinguish Romanian from Moldovan.
  2. Semantic category classification – assign text to one of six topics (culture, finance, politics, science, sports, technology).

The challenge was made harder by two constraints:

  • Words shuffled within each sentence, destroying normal word order.
  • Named entities masked (names replaced by the $NE$ token).

Dataset

  • Provided data: train/validation/test splits from the hackathon (~8,000 samples).
  • Augmentation: additional samples from MOROCO (the Moldavian and Romanian Dialectal Corpus).
  • Preprocessing steps:
    • Standardized $NE$ tokens.
    • Minimal text cleaning (e.g., whitespace).
    • Stratified splits to preserve class balance.
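The cleaning steps above can be sketched roughly as follows (the function name and exact regexes are illustrative, not taken from the repository):

```python
import re

def preprocess(text: str) -> str:
    """Standardize entity-mask tokens and perform minimal whitespace cleaning."""
    # Normalize variants of the entity mask (e.g. "$ NE $", "$ne$") to "$NE$".
    text = re.sub(r"\$\s*ne\s*\$", "$NE$", text, flags=re.IGNORECASE)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(preprocess("  Guvernul $ ne $  a anuntat   noi masuri "))
# -> "Guvernul $NE$ a anuntat noi masuri"
```

The stratified splits would then be produced from the cleaned texts, e.g. with scikit-learn's `train_test_split(..., stratify=labels)`.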

Task 1: Dialect Classification

Goal: Decide whether a given shuffled text comes from the Romanian or Moldovan dialect.

Approaches

  • Random Forest + TF-IDF + handcrafted features

    • Focused on orthographic cues (â/î usage, letter frequencies).
    • Provided interpretability and a strong baseline.
    • Accuracy: 86.7%, F1 = 0.86.
  • Fine-tuned RoBERT

    • Leveraged the Romanian BERT model pre-trained on millions of Romanian sentences.
    • Captured deeper contextual and subword-level dialect patterns.
    • Accuracy: 89.2%, F1 = 0.89.
  • Multi-Channel CNN

    • Modeled text at different granularities (character n-grams of sizes 1 through 7).
    • Effective at handling shuffled words since intra-word patterns remain intact.
    • Best results: Accuracy = 89.8%, F1 = 0.90.
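Because shuffling destroys word order but leaves each word intact, character-level cues like the ones the Random Forest baseline relies on survive unchanged. A minimal sketch of such handcrafted orthographic features (feature names are illustrative assumptions, not the project's actual feature set):

```python
from collections import Counter

def orthographic_features(text: str) -> dict:
    """Character-level features; order-invariant, so robust to word shuffling."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = max(len(letters), 1)
    return {
        # â vs. î usage is a well-known Romanian/Moldovan orthographic cue.
        "a_circumflex_ratio": counts["â"] / total,
        "i_circumflex_ratio": counts["î"] / total,
        # Relative frequencies of vowels and Romanian diacritics.
        **{f"freq_{c}": counts[c] / total for c in "aeiouăâîșț"},
    }
```

A vector of such features per sample would then feed the Random Forest alongside TF-IDF values.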

Task 2: Semantic Category Classification

Goal: Assign each sample to one of six semantic classes (culture, finance, politics, science, sports, technology).

Challenges

  • Imbalanced categories (e.g., sports had far more examples than finance).
  • Overlapping vocabulary between categories like politics & culture or science & technology.

Approaches

  • Fine-tuned RoBERT with focal loss

    • Adapted loss function to give more weight to minority classes and harder examples.
    • Solid baseline: Accuracy = 79.6%, F1 = 0.80.
  • RoBERT + Random Forest ensemble

    • Combined three types of features: TF-IDF vectors, RoBERT embeddings, and cosine similarities between text embeddings and category-label embeddings.
    • Random Forest used to integrate these heterogeneous features.
    • Improved performance: Accuracy = 81.6%, F1 = 0.82.
  • RoBERT + XGBoost ensemble

    • Similar feature setup, but with XGBoost for more sophisticated boosting and feature interaction modeling.
    • Outperformed other approaches across most categories.
    • Best results: Accuracy = 82.1%, F1 = 0.82.
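The focal loss used in Task 2 down-weights well-classified examples so that training focuses on hard and minority-class samples. A minimal sketch for a single prediction (the project presumably uses a PyTorch implementation; this plain-Python version only illustrates the formula):

```python
import math

def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one example, where p_true is the predicted probability
    of the true class.

    With gamma = 0 and alpha = 1 this reduces to ordinary cross-entropy;
    larger gamma shrinks the loss contribution of easy, high-confidence
    examples, leaving harder ones to dominate the gradient.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

easy, hard = focal_loss(0.9), focal_loss(0.3)
# The hard example (p = 0.3) contributes far more loss than the easy one.
```

The alpha weight can additionally be set per class to counter the category imbalance described above.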
