An evaluation framework designed to measure how effectively Large Language Models (LLMs) reason and translate across low-resource languages. This project explores whether structured prompting strategies and linguistic guidance improve model performance on languages with limited training data.
The framework evaluates model outputs using automated scoring techniques and compares reasoning quality through translation and question-answering tasks.
- Python
- LLaMA 3.1
- Large Language Models (LLMs)
- Natural Language Processing (NLP)
- Prompt Engineering
- Chain-of-Thought Prompting
- SentenceTransformers
- BERTScore
- Cosine Similarity
- Semantic Similarity Evaluation
- Pandas
- Matplotlib
- Regular Expressions (Regex)
- Low-Resource Languages
- Translation Evaluation
- Interlinear Glossed Text (IGT)
- Morphological Analysis
Evaluates model performance across languages with limited available training data.
Provides linguistic gloss information to guide translation and reasoning tasks.
Measures performance through:
- Multiple-choice grammar questions
- Open-ended translation tasks
Generates evaluation metrics automatically using:
- Accuracy
- Cosine Similarity
- BERTScore
Creates plots and CSV summaries to compare language performance.
Runs the same evaluation process across multiple datasets and languages.
The project was built as a structured evaluation pipeline:
Language datasets were prepared using:
- Multiple-choice grammar questions
- Interlinear glossed text (IGT)
- Ground-truth translations
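
As a rough illustration of the IGT preparation step, the sketch below parses glossed text into aligned records. The on-disk layout is an assumption (blank-line-separated blocks of three lines: source, gloss, translation), and `IGTEntry` and `parse_igt` are hypothetical names, not taken from the project. The Turkish sample is used only because its gloss is easy to verify; it is not one of the evaluated languages.

```python
import re
from dataclasses import dataclass


@dataclass
class IGTEntry:
    source: list[str]   # source-language tokens
    gloss: list[str]    # one gloss per source token
    translation: str    # ground-truth translation


def parse_igt(text: str) -> list[IGTEntry]:
    """Parse blocks of three lines (source, gloss, translation).

    Assumed layout: blocks are separated by blank lines. Malformed
    blocks are skipped rather than repaired.
    """
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) != 3:
            continue  # not a source/gloss/translation triple
        source, gloss = lines[0].split(), lines[1].split()
        if len(source) != len(gloss):
            continue  # tokens and glosses must align one-to-one
        entries.append(IGTEntry(source, gloss, lines[2]))
    return entries


SAMPLE = """kitab-ı oku-du-m
book-ACC read-PST-1SG
I read the book.
"""

entries = parse_igt(SAMPLE)
```

Skipping misaligned blocks keeps downstream prompts from pairing a token with the wrong gloss; logging the skipped blocks would be a reasonable addition.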
Two prompting approaches were implemented:
- Direct multiple-choice prompting
- Glossary-guided Chain-of-Thought prompting
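
The two prompting approaches could be built along these lines. This is a minimal sketch: the template wording, the function names, and the `Answer: <letter>` convention are all assumptions made for illustration, not the project's actual prompts.

```python
def format_options(options: dict[str, str]) -> str:
    """Render options as 'A. text' lines, in the order given."""
    return "\n".join(f"{label}. {text}" for label, text in options.items())


def build_direct_prompt(question: str, options: dict[str, str]) -> str:
    """Direct multiple-choice prompt: question, options, answer request."""
    return "\n".join([
        question,
        format_options(options),
        "Respond with 'Answer: <letter>'.",
    ])


def build_gloss_cot_prompt(
    source_tokens: list[str],
    glosses: list[str],
    question: str,
    options: dict[str, str],
) -> str:
    """Glossary-guided Chain-of-Thought prompt (hypothetical template).

    The token-by-token glossary is shown first, then the model is asked
    to reason step by step before committing to an answer.
    """
    glossary = "\n".join(
        f"{tok} = {gl}" for tok, gl in zip(source_tokens, glosses)
    )
    return "\n".join([
        "Glossary:",
        glossary,
        "",
        question,
        format_options(options),
        "Reason step by step using the glossary, "
        "then finish with 'Answer: <letter>'.",
    ])


options = {"A": "accusative", "B": "dative"}
direct = build_direct_prompt("Which case marks 'kitab-ı'?", options)
guided = build_gloss_cot_prompt(
    ["kitab-ı", "oku-du-m"],
    ["book-ACC", "read-PST-1SG"],
    "Which case marks 'kitab-ı'?",
    options,
)
```

Keeping the question and option text identical across both builders means any difference in accuracy can be attributed to the glossary and reasoning instructions rather than to wording changes.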
Prompts were sent to an LLM to generate predictions and translations.
Outputs were evaluated using:
- Accuracy for multiple-choice tasks
- Cosine similarity for semantic comparison
- BERTScore for translation quality
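
The first two metrics can be sketched with the standard library alone. The example assumes model outputs end with an `Answer: <letter>` line and that options are labelled A to D; both are illustrative conventions, and `extract_choice` is a hypothetical helper. The cosine function takes plain vectors; in the project these would presumably be sentence embeddings (SentenceTransformers is listed in the stack). BERTScore is not shown here.

```python
import math
import re
from typing import Optional, Sequence

# Assumed output convention: "... Answer: B" or "... Answer: (B)".
ANSWER_RE = re.compile(r"Answer:\s*\(?([A-D])\b", re.IGNORECASE)


def extract_choice(output: str) -> Optional[str]:
    """Return the last answer letter in the output, or None if absent."""
    matches = ANSWER_RE.findall(output)
    return matches[-1].upper() if matches else None


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of predictions equal to the gold label.

    Unparseable outputs (None) count as wrong.
    """
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


outputs = [
    "The suffix marks the object. Answer: A",
    "Could be A, but on reflection... answer: (b)",
    "I am not sure.",
]
predictions = [extract_choice(o) for o in outputs]
score = accuracy(predictions, ["A", "C", "D"])
```

Taking the last match rather than the first matters for Chain-of-Thought outputs, where the model may mention candidate letters while reasoning before it commits to one.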
Results were exported into:
- CSV summary files
- Visualization charts
- Aggregate evaluation reports
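
A per-language CSV summary might be produced as follows. This is a sketch under assumptions: the record shape `(language, metric, value)`, the column names, and the function names are hypothetical, and the standard `csv` module is used here to keep the example dependency-free, although the project lists Pandas in its stack.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

FIELDS = ["language", "metric", "n", "mean"]


def summarize(records):
    """Average each metric per language.

    records: iterable of (language, metric, value) tuples.
    Returns one row per (language, metric), sorted for stable output.
    """
    grouped = defaultdict(list)
    for language, metric, value in records:
        grouped[(language, metric)].append(value)
    return [
        {"language": lang, "metric": metric, "n": len(vals),
         "mean": round(mean(vals), 4)}
        for (lang, metric), vals in sorted(grouped.items())
    ]


def write_summary_csv(summary, handle):
    """Write summary rows to any text file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(summary)


# Placeholder language names and scores, for illustration only.
records = [
    ("lang_a", "accuracy", 1.0),
    ("lang_a", "accuracy", 0.0),
    ("lang_b", "cosine", 0.8),
]
summary = summarize(records)

buffer = io.StringIO()
write_summary_csv(summary, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead of a buffer, open the target with `open(path, "w", newline="")`, as the `csv` module documentation recommends.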
- Designing evaluation frameworks for LLMs
- Applying prompt engineering techniques to structured reasoning tasks
- Measuring language understanding beyond simple accuracy
- Working with embedding-based similarity metrics
- Building automated benchmarking pipelines
- Understanding challenges associated with low-resource languages
- Compare additional LLM architectures
- Expand evaluation across more languages
- Add human evaluation for translation quality
- Introduce retrieval-augmented prompting
- Support multilingual embeddings
- Add experiment tracking and dashboards
- Build an interactive interface for running evaluations
git clone <repo-url>
cd low-resource-language-reasoning
pip install -r requirements.txt
Create an environment file:
API_KEY=your_api_key
Place language datasets in the project directory:
language_mcq.txt
language_igt.txt
python main.py
Generated results will appear inside:
outputs/
Including:
- CSV summaries
- Evaluation metrics
- Performance plots
This project focuses on evaluating reasoning behavior rather than training models. Results may vary across models, prompts, and language families.