A toolkit for analyzing and modeling educational content in a low-resource language setting (Danish) using the FineWeb datasets. The project focuses on two threads: understanding human annotation quality and inter-annotator agreement, and developing scalable, consistent content classification pipelines by fine-tuning encoder models and investigating cross-lingual transfer strategies.
For the paper accompanying this repository, see `docs/FP_25.pdf`.
```
danish-edu-llm-classifier/
├── data/                  # Raw, interim, and processed datasets
│   ├── raw/
│   ├── interim/
│   └── processed/
├── docs/
│   ├── FP_25.pdf          # The paper using this code
│   └── visualizations/    # Generated plots and figures for reports
├── notebooks/             # Jupyter notebooks for analysis and exploration
├── src/
│   ├── annotation/        # Annotation tools, annotation data, and scripts
│   ├── data_processing/   # Data processing, merging, and dataloader scripts
│   ├── evaluation/        # Evaluation utilities and metrics
│   └── training/          # Model training scripts and configs
├── results/               # Model predictions, evaluation results, and summaries
├── archive/               # Old/legacy scripts, notebooks, and results
├── requirements.txt       # Python dependencies
└── README.md              # This file
```
- All datasets are in `data/` (with subfolders for `raw/`, `interim/`, and `processed/`).
- Use scripts in `src/data_processing/` to prepare and merge datasets (see the sketch below).
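
The actual entry points live in `src/data_processing/`; the following is only a hedged illustration of a typical raw-to-processed merge step. The file pattern (`*.jsonl` exports) and the `text` column are hypothetical placeholders, not the repository's real schema:

```python
"""Illustrative sketch of a raw -> processed merge step.

Not the repository's actual code: the glob pattern, output file name,
and `text` column are assumptions for illustration only.
"""
from pathlib import Path

import pandas as pd

RAW = Path("data/raw")
PROCESSED = Path("data/processed")


def merge_annotation_rounds() -> pd.DataFrame:
    # Concatenate every raw annotation export into one frame.
    frames = [pd.read_json(p, lines=True) for p in sorted(RAW.glob("*.jsonl"))]
    df = pd.concat(frames, ignore_index=True)

    # Drop exact duplicate texts, keeping the most recent annotation.
    df = df.drop_duplicates(subset="text", keep="last")

    PROCESSED.mkdir(parents=True, exist_ok=True)
    df.to_json(PROCESSED / "merged.jsonl", orient="records", lines=True)
    return df


if __name__ == "__main__":
    print(f"Merged {len(merge_annotation_rounds())} examples")
```
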
- Main training scripts are in `src/training/`.
- Configurations are in `src/training/config/`.
- Example: `python src/training/train.py src/training/config/base.yaml` (a sketch of this config-driven pattern follows).
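
`train.py` takes a single YAML config as its argument. The sketch below shows the general shape of config-driven encoder fine-tuning with Hugging Face `transformers`; it is not the repository's actual script, and the config keys (`model_name`, `train_file`, ...), their defaults, and the expected `text`/`label` fields in the training data are all assumptions:

```python
"""Sketch of config-driven encoder fine-tuning (illustration only).

Not the repository's train.py: config keys, defaults, and the expected
`text`/`label` fields in the training JSONL are assumptions.
"""
import sys

import yaml
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)


def main(config_path: str) -> None:
    with open(config_path) as f:
        cfg = yaml.safe_load(f)

    model_name = cfg.get("model_name", "xlm-roberta-base")  # assumed default
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=cfg.get("num_labels", 6),  # FineWeb-Edu scores run 0-5
    )

    # Assumes a JSONL file with `text` and `label` fields.
    dataset = load_dataset("json", data_files=cfg["train_file"])["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=cfg.get("output_dir", "results/run"),
            num_train_epochs=cfg.get("epochs", 3),
            per_device_train_batch_size=cfg.get("batch_size", 16),
            learning_rate=float(cfg.get("learning_rate", 2e-5)),
        ),
        train_dataset=dataset,
        data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding
    )
    trainer.train()


if __name__ == "__main__":
    main(sys.argv[1])
```
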
- Manual annotation tool: `src/annotation/annotation.py` (Streamlit app; a bare-bones sketch of the pattern is shown below)
- Annotation guidelines and example files are in `src/annotation/`
- Run with: `streamlit run src/annotation/annotation.py`
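
For a feel of what a Streamlit annotation loop looks like, here is a bare-bones sketch. It is not the repository's `annotation.py`; the sample/output paths and the 0-5 score scale (mirroring the FineWeb-Edu rubric) are assumptions:

```python
"""Bare-bones Streamlit annotation loop (illustration only, not the
repository's annotation.py; input/output paths are placeholders)."""
import json

import streamlit as st


@st.cache_data
def load_samples(path: str = "src/annotation/samples.jsonl") -> list[dict]:
    # Hypothetical input: one JSON object per line with a `text` field.
    with open(path) as f:
        return [json.loads(line) for line in f]


samples = load_samples()
idx = st.session_state.setdefault("idx", 0)

if idx < len(samples):
    st.write(samples[idx]["text"])
    score = st.radio("Educational score (0-5)", list(range(6)), horizontal=True)
    if st.button("Save and next"):
        # Append the annotation and advance to the next sample.
        with open("annotations.jsonl", "a") as f:
            record = {"id": samples[idx].get("id", idx), "score": score}
            f.write(json.dumps(record) + "\n")
        st.session_state.idx = idx + 1
        st.rerun()
else:
    st.success("All samples annotated.")
```
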
- Evaluation scripts: `src/evaluation/`
- Model predictions and evaluation results: `results/`
- Notebooks for analysis: `notebooks/`
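
Since the project studies inter-annotator agreement, a typical check pairs two annotators' scores on the same documents and computes a weighted Cohen's kappa. A minimal sketch with scikit-learn (the CSV path and column names are hypothetical, not the repository's schema):

```python
"""Sketch of an inter-annotator agreement check (illustration only;
the real metrics live in src/evaluation/)."""
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical file: one row per document, one column per annotator.
df = pd.read_csv("results/annotations_paired.csv")

# Quadratic weighting treats the 0-5 scale as ordinal, so a 2-vs-3
# disagreement is penalized less than a 0-vs-5 one.
kappa = cohen_kappa_score(df["annotator_a"], df["annotator_b"], weights="quadratic")
exact = (df["annotator_a"] == df["annotator_b"]).mean()

print(f"Weighted Cohen's kappa: {kappa:.3f}")
print(f"Exact agreement rate:   {exact:.1%}")
```
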
- All generated figures and summary tables for reports are in `docs/visualizations/`.
- Old experiments, scripts, and results are kept in `archive/` for reference.
- Install dependencies: `pip install -r requirements.txt`
- Prepare data as needed using scripts in `src/data_processing/`.
- Train models using scripts in `src/training/`.
- Annotate or evaluate as needed.
- For legacy scripts and results, see the `archive/` directory.