This repository contains my solution for the Kaggle competition Automated Essay Scoring 2.0. The goal is to develop an automated system that evaluates essays based on their content and quality using advanced machine learning techniques.
Multiple approaches were considered, including:
- Fine-tuning DeBERTa, a transformer-based language model.
- Ensembling multiple DeBERTa models trained across different folds.
- Combining LightGBM & XGBoost with feature engineering, model optimization, and hyperparameter tuning.
The best Quadratic Weighted Kappa (QWK) score was achieved with the LightGBM + XGBoost ensemble, with more weight assigned to LightGBM's predictions. The details of each approach and its results are given in the Results & Performance section.
## Table of Contents
- Data Loading & Preprocessing
- Feature Engineering
- Feature Selection
- Model Building & Training
- Inference
- Results & Performance
- Conclusion
- Acknowledgements
## Data Loading & Preprocessing

This phase prepares the dataset for further analysis and model training.
- Loading Data: Essays are loaded with `pandas` and stored in a structured format.
- Text Cleaning: A `dataPreprocessing` function is applied to:
  - Convert text to lowercase.
  - Remove HTML tags, URLs, mentions (@user), and numeric values.
  - Replace consecutive spaces, commas, and periods with single instances.
  - Trim whitespace for a clean, structured output.
- Handling Missing Values: Any missing data is handled to maintain data integrity.
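The cleaning steps above can be sketched as a small regex pipeline. This is a hypothetical re-implementation of the repository's `dataPreprocessing` function, not its exact code:

```python
import re

def data_preprocessing(text: str) -> str:
    """Illustrative essay-cleaning pipeline (assumed, not the repo's exact code)."""
    text = text.lower()                        # lowercase everything
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"@\w+", " ", text)          # strip @user mentions
    text = re.sub(r"\d+", " ", text)           # strip numeric values
    text = re.sub(r",{2,}", ",", text)         # collapse repeated commas
    text = re.sub(r"\.{2,}", ".", text)        # collapse repeated periods
    text = re.sub(r"\s+", " ", text)           # collapse whitespace
    return text.strip()                        # trim leading/trailing space
```

Order matters here: URLs and mentions must be removed before digit stripping, or fragments of them survive.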
## Feature Engineering

Feature engineering plays a crucial role in improving model performance. Multiple text-based features were extracted at different levels.
**Paragraph-level:**
- Number of paragraphs per essay.
- Average paragraph length.
- Coherence score between paragraphs.

**Sentence-level:**
- Number of sentences per essay.
- Average sentence length.
- Sentence complexity, derived from grammatical structure.

**Word-level:**
- Vocabulary richness.
- Word frequency distribution.
- Stop-word usage analysis.
- Sentiment polarity of the essay.

**Error-based:**
- Spelling errors detected using NLTK's WordNet lemmatizer and an English vocabulary set.
- Grammar mistakes identified using LanguageTool (via its Python wrapper).
- Counts of adjectives, adverbs, and grammatical errors from POS tagging.
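A few of the count-based features above can be computed with plain Python. This sketch is purely illustrative and omits the coherence, sentiment, and grammar features, which require NLTK and LanguageTool:

```python
import re

def basic_text_features(essay: str) -> dict:
    """Sketch of simple structural features (assumed names, not the repo's)."""
    paragraphs = [p for p in essay.split("\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", essay)
    return {
        "paragraph_count": len(paragraphs),
        "avg_paragraph_len": sum(len(p.split()) for p in paragraphs) / max(len(paragraphs), 1),
        "sentence_count": len(sentences),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "word_count": len(words),
        # crude proxy for vocabulary richness
        "unique_word_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }
```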
**Vectorization & model-based features:**
- TF-IDF Vectorizer: weights words by their frequency and importance across the corpus.
- Count Vectorizer: captures raw word frequencies in each essay.
- A fine-tuned DeBERTa transformer generates essay-score predictions, which are fed into LightGBM as additional features.
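Combining the two vectorizers can be sketched with scikit-learn as below; the `ngram_range`/`min_df` settings are illustrative, not the repository's tuned values:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

essays = [
    "the quick brown fox jumps over the lazy dog",
    "a well structured essay develops one idea per paragraph",
    "the essay repeats the same idea in every paragraph",
]

# Illustrative hyperparameters, not the repo's exact configuration.
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
count = CountVectorizer(ngram_range=(1, 1), min_df=1)

# Stack both sparse representations side by side; the handcrafted and
# DeBERTa features would be appended the same way before training.
X = hstack([tfidf.fit_transform(essays), count.fit_transform(essays)])
```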
## Feature Selection

To enhance model efficiency, only the most important features are selected:
- A 10-fold Stratified CV trains a LightGBM regressor with a custom QWK objective.
- Feature importance scores are accumulated across folds.
- The top 13,000 most important features are retained.
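The selection loop can be sketched as follows, using scikit-learn's `GradientBoostingRegressor` as a lightweight stand-in for LightGBM and plain `KFold` in place of the stratified split:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

# Toy data standing in for the essay feature matrix.
X, y = make_regression(n_samples=200, n_features=50, random_state=0)

importance = np.zeros(X.shape[1])
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    importance += model.feature_importances_  # accumulate across folds

top_k = 10  # the repository keeps the top 13,000
selected = np.argsort(importance)[::-1][:top_k]
X_selected = X[:, selected]
```

Accumulating importances across folds rather than trusting a single fit makes the ranking less sensitive to any one train/validation split.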
## Model Building & Training

Two gradient-boosted models are ensembled: LightGBM and XGBoost.
- Stratified K-Fold (n_splits=20) ensures class balance across training & validation sets.
- **LightGBM Regressor:**
  - Initialized with tuned hyperparameters (learning rate, depth, regularization).
  - Trained with a quadratic weighted kappa (QWK) objective.
- **XGBoost Regressor:**
  - Uses early stopping and a QWK-based loss function.
  - Pre-tuned learning rate, depth, and colsample parameters.
- **Model Ensembling:**
  - Final prediction = 76% LightGBM + 24% XGBoost.
  - Predictions are shifted by a tuned constant `a` and clipped between 1 and 6.
- **Performance Metrics:**
  - Evaluated using the F1 score and Cohen's kappa.
  - Memory is freed between folds with explicit garbage collection.
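The ensembling and post-processing steps can be sketched as below; the default `a=0.0` is a placeholder, since the actual shift constant is tuned in the repository:

```python
import numpy as np

def blend_and_round(lgb_pred, xgb_pred, a=0.0):
    """Blend raw model outputs 76/24, shift by the tuned constant `a`
    (placeholder default here), then clip to the valid 1-6 score range
    and round to integer essay scores."""
    blended = 0.76 * np.asarray(lgb_pred) + 0.24 * np.asarray(xgb_pred) + a
    return np.clip(np.round(blended), 1, 6).astype(int)

scores = blend_and_round([2.4, 5.9, 6.7], [2.6, 5.5, 6.3])  # → [2, 6, 6]
```

Clipping after rounding guarantees that out-of-range regression outputs (like 6.7 above) still map to legal scores.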
## Inference

- Data Transformation: new essays pass through the same preprocessing and feature-engineering pipeline.
- Prediction: the trained LightGBM + XGBoost ensemble predicts essay scores.
- Post-Processing: scores are rounded and clipped to the valid 1–6 range.
- Output: final predictions are saved for submission.
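Submissions are scored with Quadratic Weighted Kappa, which scikit-learn exposes as `cohen_kappa_score` with quadratic weights. A toy check on made-up scores:

```python
from sklearn.metrics import cohen_kappa_score

# Toy true vs. predicted essay scores on the 1-6 scale (illustrative data).
y_true = [1, 2, 3, 4, 5, 6, 3, 4]
y_pred = [1, 2, 3, 3, 5, 6, 4, 4]

# Quadratic weighting penalizes a prediction two grades off four times
# as heavily as one grade off, which is why QWK-aware objectives help.
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(round(qwk, 4))
```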
## Results & Performance

| # | Method | Leaderboard Score (QWK) | Validation Score (QWK) |
|---|---|---|---|
| 1 | DeBERTa only | 0.7507 | 0.77816 |
| 2 | DeBERTa only (5-fold CV) | 0.7900 | 0.8201 |
| 3 | LightGBM + XGBoost + Feature Engineering (spelling errors, word count, etc.) | 0.81434 | 0.82712 |
| 4 | LightGBM + XGBoost + Feature Engineering (DeBERTa predictions, spelling errors, word count, etc.) + Vectorization (TF-IDF) | 0.8169 | 0.8315 |
| 5 | LightGBM + XGBoost + Feature Engineering (DeBERTa predictions, spelling errors, word count, etc.) + Vectorization (TF-IDF) + StandardScaler | 0.8175 | 0.8318 |
| 6 | LightGBM + XGBoost + Feature Engineering (DeBERTa predictions, spelling errors, word count, etc.) + Vectorization (TF-IDF, Count) + StandardScaler | 0.8178 | 0.8320 |
| 7 | LightGBM + XGBoost + Feature Engineering (DeBERTa predictions, spelling errors, word count, grammar, adjectives, pronouns, etc.) + Vectorization (TF-IDF, Count) + StandardScaler | 0.8182 | 0.83269 |
| 8 | LightGBM (LR 0.1) + XGBoost (LR 0.05 ↓) + Feature Engineering (DeBERTa predictions, spelling errors, word count, grammar, adjectives, pronouns, etc.) + Vectorization (TF-IDF, Count) + StandardScaler | 0.8199 | 0.8324 |
| 9 | LightGBM (n-gram change) + XGBoost (n-gram change) + Feature Engineering (DeBERTa predictions, spelling errors, word count, grammar, adjectives, pronouns, etc.) + Vectorization (TF-IDF, Count) + StandardScaler | 0.8019 | 0.8124 |
| 10 | LightGBM + XGBoost + Feature Engineering (DeBERTa predictions, spelling errors, word count, grammar, adjectives, pronouns, etc.) + Vectorization (TF-IDF, Count) + StandardScaler + CV 10 ↓ | 0.8165 | 0.8122 |
| 11 | LightGBM (LR 0.1, max depth 10) + XGBoost (LR 0.05, max depth 10) + Feature Engineering (DeBERTa predictions, spelling errors, word count, grammar, adjectives, pronouns, etc.) + Vectorization (TF-IDF, Count) + StandardScaler + CV 20 ↑ | 0.8224 | 0.8275 |
| 12 | LightGBM (LR 0.1, max depth 8) + XGBoost (LR 0.05, max depth 8) + Feature Engineering (DeBERTa predictions, spelling errors, word count, grammar, adjectives, pronouns, etc.) + Vectorization (TF-IDF, Count) + StandardScaler + CV 20 | 0.8243 | 0.8299 |
## Conclusion

This project presents a comprehensive approach to automated essay scoring by combining:
- A state-of-the-art transformer (DeBERTa)
- Tree-based models (LightGBM & XGBoost)
- Advanced feature engineering
- Custom optimization strategies for the QWK metric

By leveraging multiple models, ensembling techniques, and rigorous evaluation, this approach achieves strong accuracy and robustness in essay scoring.
## Acknowledgements

Special thanks to the Learning Agency Lab for providing the dataset and hosting the competition, and to the open-source community for developing the tools that made this work possible.
🔗 Competition Link: Kaggle: Automated Essay Scoring 2.0