
Dialect and semantic classification of shuffled Romanian/Moldovan text — Hackathon project exploring ML, transformers, and ensembles.


CapitanuAndreea/NitroNLP-Hackathon


Automatic Dialect & Semantic Detection

This project was developed during the NitroNLP Hackathon for the “Mickey & Donald” challenge. The goal was to classify shuffled Romanian/Moldovan text messages along two dimensions:

  1. Dialect classification – distinguish Romanian from Moldovan.
  2. Semantic category classification – assign text to one of six topics (culture, finance, politics, science, sports, technology).

The challenge was made harder by two constraints:

  • Words shuffled within each sentence, destroying normal word order.
  • Named entities masked (names replaced by the $NE$ token).

Dataset

  • Provided data: train/validation/test splits from the hackathon (~8,000 samples).
  • Augmentation: additional samples from MOROCO (the Moldavian and Romanian Dialectal Corpus).
  • Preprocessing steps:
    • Standardized $NE$ tokens.
    • Minimal text cleaning (e.g., whitespace).
    • Stratified splits to preserve class balance.
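The cleaning steps above can be sketched roughly as follows (the function name and exact regexes are illustrative, not taken from the repository):

```python
import re

def preprocess(text: str) -> str:
    """Standardize entity-mask tokens and perform minimal whitespace cleaning."""
    # Normalize variants of the entity mask (e.g. "$ NE $", "$ne$") to "$NE$".
    text = re.sub(r"\$\s*ne\s*\$", "$NE$", text, flags=re.IGNORECASE)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(preprocess("  Guvernul $ ne $  a anuntat   noi masuri "))
# -> "Guvernul $NE$ a anuntat noi masuri"
```

The stratified splits would then be produced from the cleaned texts, e.g. with scikit-learn's `train_test_split(..., stratify=labels)`.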

Task 1: Dialect Classification

Goal: Decide whether a given shuffled text comes from the Romanian or Moldovan dialect.

Approaches

  • Random Forest + TF-IDF + handcrafted features

    • Focused on orthographic cues (â/î usage, letter frequencies).
    • Provided interpretability and a strong baseline.
    • Accuracy: 86.7%, F1 = 0.86.
  • Fine-tuned RoBERT

    • Leveraged the Romanian BERT model pre-trained on millions of Romanian sentences.
    • Captured deeper contextual and subword-level dialect patterns.
    • Accuracy: 89.2%, F1 = 0.89.
  • Multi-Channel CNN

    • Modeled text at different granularities (character n-grams of sizes 1 through 7).
    • Effective at handling shuffled words since intra-word patterns remain intact.
    • Best results: Accuracy = 89.8%, F1 = 0.90.
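Because shuffling destroys word order but leaves each word intact, character-level cues like the ones the Random Forest baseline relies on survive unchanged. A minimal sketch of such handcrafted orthographic features (feature names are illustrative assumptions, not the project's actual feature set):

```python
from collections import Counter

def orthographic_features(text: str) -> dict:
    """Character-level features; order-invariant, so robust to word shuffling."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = max(len(letters), 1)
    return {
        # â vs. î usage is a well-known Romanian/Moldovan orthographic cue.
        "a_circumflex_ratio": counts["â"] / total,
        "i_circumflex_ratio": counts["î"] / total,
        # Relative frequencies of vowels and Romanian diacritics.
        **{f"freq_{c}": counts[c] / total for c in "aeiouăâîșț"},
    }
```

A vector of such features per sample would then feed the Random Forest alongside TF-IDF values.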

Task 2: Semantic Category Classification

Goal: Assign each sample to one of six semantic classes (culture, finance, politics, science, sports, technology).

Challenges

  • Imbalanced categories (e.g., sports had far more examples than finance).
  • Overlapping vocabulary between categories like politics & culture or science & technology.

Approaches

  • Fine-tuned RoBERT with focal loss

    • Adapted loss function to give more weight to minority classes and harder examples.
    • Solid baseline: Accuracy = 79.6%, F1 = 0.80.
  • RoBERT + Random Forest ensemble

    • Combined three types of features: TF-IDF vectors, RoBERT embeddings, and cosine similarities between text embeddings and category-label embeddings.
    • Random Forest used to integrate these heterogeneous features.
    • Improved performance: Accuracy = 81.6%, F1 = 0.82.
  • RoBERT + XGBoost ensemble

    • Similar feature setup, but with XGBoost for more sophisticated boosting and feature interaction modeling.
    • Outperformed other approaches across most categories.
    • Best results: Accuracy = 82.1%, F1 = 0.82.
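The focal loss used in Task 2 down-weights well-classified examples so that training focuses on hard and minority-class samples. A minimal sketch for a single prediction (the project presumably uses a PyTorch implementation; this plain-Python version only illustrates the formula):

```python
import math

def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one example, where p_true is the predicted probability
    of the true class.

    With gamma = 0 and alpha = 1 this reduces to ordinary cross-entropy;
    larger gamma shrinks the loss contribution of easy, high-confidence
    examples, leaving harder ones to dominate the gradient.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

easy, hard = focal_loss(0.9), focal_loss(0.3)
# The hard example (p = 0.3) contributes far more loss than the easy one.
```

The alpha weight can additionally be set per class to counter the category imbalance described above.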
