Beyond Grading: Automated Answer Correctness Evaluation

Course Project: Deep Learning & NLP | Focus: Educational Data Mining, Semantic Textual Similarity (STS), & Synthetic Data Generation.

Overview

Beyond Grading is an NLP framework that automates the evaluation of open-ended student answers in STEM domains (Computer Science, Algorithms, Data Structures). Rather than relying on keyword matching or simple binary classification, the system uses Transformer-based cross-encoders and GenAI data augmentation to assess the semantic correctness of a student's response.

The goal is to address the "scalability vs. quality" dilemma in modern education by producing automated, accurate, and consistent grading scores in the range 0 to 1 that reflect how well an answer demonstrates understanding.

Key Features & Novelty

1. Cross-Encoder Architecture (Triple Context)

Instead of comparing independently computed embeddings with cosine similarity, the model reads three inputs jointly so that it can capture nuance:

Context: The Exam Question.
Ground Truth: The Expert/Reference Answer.
Target: The Student's Response.
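
One way the three inputs could be packed into a single cross-encoder input is sketched below. This is a minimal illustration, not the repository's actual preprocessing: the function name `encode_triple`, the choice to join question and reference into the first segment, and the `max_length` default are all assumptions. The tokenizer is passed in, so any Hugging Face tokenizer with a `sep_token` should work.

```python
from typing import Any


def encode_triple(tokenizer: Any, question: str, reference: str,
                  student: str, max_length: int = 512) -> Any:
    """Pack (question, reference answer, student answer) into one input.

    Hypothetical layout (an assumption, not confirmed by the project):
      segment A = question + separator + reference answer
      segment B = student response
    """
    # Join question and reference with the tokenizer's own separator token.
    context = f"{question.strip()} {tokenizer.sep_token} {reference.strip()}"
    # Passing two texts makes the tokenizer build a sentence-pair input,
    # so the model attends across all three pieces at once.
    return tokenizer(
        context,
        student.strip(),
        truncation=True,
        max_length=max_length,
    )


# Usage (downloads a tokenizer, so it is left commented out):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
# enc = encode_triple(tok, "What is a stack?",
#                     "A LIFO data structure.",
#                     "A structure where the first item in is the first out.")
```

Placing the student response alone in the second segment keeps it clearly separated from the material it is graded against; other layouts are possible.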

2. Adversarial Synthetic Data Strategy

To address the scarcity of labeled educational data and the resulting class imbalance, we developed a GenAI data pipeline:

Hard Negatives: We use GPT-4o to deliberately generate "plausible but wrong" answers (responses that use correct terminology but contain logical flaws).
Implicit Labeling: Labels are assigned automatically based on the generation prompt (e.g., specific misconception prompts = low score), removing the need for manual annotation.
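
The implicit-labeling idea can be sketched as a table of prompt types, each carrying the label that every answer generated from it will receive. The prompt wording, the category names, and the specific label values below are hypothetical; only the principle (the prompt determines the label) comes from the description above. No API call is made here; the returned prompt would be sent to the generation model separately.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PromptSpec:
    template: str   # instruction sent to the generator model
    label: float    # score assigned to every answer this prompt produces


# Hypothetical prompt types and label values, for illustration only.
PROMPT_SPECS: Dict[str, PromptSpec] = {
    "correct_paraphrase": PromptSpec(
        "Answer the question correctly, using different wording than the "
        "reference.\nQuestion: {question}\nReference: {reference}",
        1.0,
    ),
    "misconception": PromptSpec(
        "Write a plausible answer that uses correct terminology but contains "
        "a logical flaw.\nQuestion: {question}\nReference: {reference}",
        0.1,
    ),
}


def build_request(kind: str, question: str, reference: str) -> Dict[str, object]:
    """Return the generation prompt together with its pre-assigned label."""
    spec = PROMPT_SPECS[kind]
    prompt = spec.template.format(question=question, reference=reference)
    return {"kind": kind, "prompt": prompt, "label": spec.label}


request = build_request("misconception", "What is a stack?", "A LIFO structure.")
print(request["label"])
```

Because the label travels with the prompt, each generated answer can be written to the synthetic dataset with its score without a manual annotation pass.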

3. Fine-Tuned Regression Head

The model is fine-tuned on a hybrid dataset (real + synthetic) to predict a continuous correctness score (0.0 - 1.0) using MSE loss, so that partially correct answers receive intermediate scores rather than a binary right/wrong label.
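
As a rough sketch of what regression with MSE loss means here, the snippet below computes the loss on a small batch of scores in plain Python and shows one way a single-output head could be loaded. The helper names are hypothetical, and the project's actual head may differ. The loader relies on the Hugging Face behaviour that a sequence-classification model created with `num_labels=1` is treated as a regressor and trained with MSE when labels are supplied; it is defined but not called, since calling it downloads model weights.

```python
from typing import Sequence


def mse_loss(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predicted and gold correctness scores."""
    if len(predictions) != len(targets) or len(predictions) == 0:
        raise ValueError("predictions and targets must be non-empty and equal length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


def clamp_score(raw: float) -> float:
    """Clip a raw regression output into the reported 0.0 - 1.0 range."""
    return min(1.0, max(0.0, raw))


def load_regression_model(name: str = "microsoft/deberta-v3-base"):
    """Assumed setup: one output unit on top of the backbone (not executed here)."""
    from transformers import AutoModelForSequenceClassification
    return AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)


# Two predictions against gold scores 1.0 and 0.0.
print(mse_loss([0.9, 0.2], [1.0, 0.0]))
```

A raw regression output is unbounded, so clamping (or a sigmoid on the output) is one option for keeping reported scores inside 0.0 - 1.0.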

Tech Stack & Pipeline

[Figure: machine learning pipeline diagram]

Language: Python
DL Framework: PyTorch, Hugging Face Transformers
Backbone Models: microsoft/deberta-v3-base, bert-base-uncased
Data Generation: GPT-4o (OpenAI API) / Llama-3
Experiment Tracking: WandB (Weights & Biases) / TensorBoard

Dataset Specification

We utilize a hybrid data strategy:

Dataset Source	Description	Size	Role
MohlerASAG	Benchmark Short Answer Grading dataset.	~2,300 pairs	Training & Testing (Real World)
GenAI Augmented	Custom synthetic dataset focusing on "Hard Negatives".	~1,000+ pairs	Training (Robustness)
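
The roles in the table suggest a split in which synthetic pairs are used only for training while evaluation stays on real data. A minimal sketch of that idea follows; the function name, the 20% test fraction, the seed, and the record fields are assumptions for illustration, not the repository's actual loading code.

```python
import random
from typing import Dict, List, Sequence, Tuple

Example = Dict[str, object]


def hybrid_split(real: Sequence[Example], synthetic: Sequence[Example],
                 test_fraction: float = 0.2,
                 seed: int = 42) -> Tuple[List[Example], List[Example]]:
    """Hold out part of the real data for testing; train on the rest plus synthetic."""
    rng = random.Random(seed)
    real_shuffled = list(real)
    rng.shuffle(real_shuffled)

    n_test = int(len(real_shuffled) * test_fraction)
    test = real_shuffled[:n_test]            # real examples only
    train = real_shuffled[n_test:] + list(synthetic)
    rng.shuffle(train)                       # mix real and synthetic in training
    return train, test


# Toy records standing in for benchmark and generated pairs.
real_rows = [{"id": i, "source": "real"} for i in range(10)]
synthetic_rows = [{"id": 100 + i, "source": "synthetic"} for i in range(5)]
train_rows, test_rows = hybrid_split(real_rows, synthetic_rows)
print(len(train_rows), len(test_rows))
```

Keeping generated answers out of the test set means reported results reflect performance on real student responses rather than on the generator's style.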

OrelCohenTech/Automated_Feedback
