This project builds an exam generation pipeline that turns long PDF documents into structured exam questions.
It uses gemma-3-4b-it for both keyword/keyphrase extraction and question generation, producing question types such as multiple-choice, short-answer, and essay.
The exam generation pipeline proceeds in four major stages:
- Extract text from input `.pdf` or `.txt` files; the text is tokenized and segmented into manageable chunks for later processing.
- Use TextRank to identify important concepts and representative phrases.
- Default extraction limits: 15 keywords, 30 keyphrases.
- A predefined XML-based prompt template is used to instruct the language model.
- Extracted keywords `{KEYS}` and keyphrases `{PHRS}` are injected into the prompt.
- The prompt enforces strict output formatting and generation rules (5 exam problems: 2 multiple-choice, 2 short-answer, 1 essay).
- Rules ensure XML-only responses, special character escaping, and non-repetitive concepts.
Follow the rules below and generate 5 university exam questions in XML format.
**Response format (must be followed):**
<problems>
<problem>
<number>1</number>
<type>multiple-choice</type>
<content>question content</content>
<description>solution process</description>
<answer>answer</answer>
</problem>
<problem>
<number>2</number>
<type>multiple-choice</type>
<content>question content</content>
<description>solution process</description>
<answer>answer</answer>
</problem>
<problem>
<number>3</number>
<type>short-answer</type>
<content>question content</content>
<description>solution process</description>
<answer>answer</answer>
</problem>
<problem>
<number>4</number>
<type>short-answer</type>
<content>question content</content>
<description>solution process</description>
<answer>answer</answer>
</problem>
<problem>
<number>5</number>
<type>essay</type>
<content>question content</content>
<answer>answer</answer>
</problem>
</problems>
**Absolute rules (responses that violate them are invalid):**
1. Output only the XML tag structure. Do not include any other text, explanations, or comments.
2. Write all content as plain text, without CDATA sections.
3. Write special characters as XML entities. (<, >, &, ", ')
**Question generation rules:**
- Generate exactly 5 questions, strictly keeping this ratio: 2 multiple-choice, 2 short-answer, 1 essay.
- For each question's <type>, use the exact value already given in the response format above.
- Write multiple-choice options with the markers ①, ②, ③, ④, ⑤.
- Every question must cover a different topic; do not reuse the same concept, person, or event across questions.
- Write the solution process and the answer concretely.
- Question content may freely use quotations, formulas, special characters, etc.
- Vary the difficulty and presentation style of the questions.
**Important keywords:**
{KEYS}
**Important sentences:**
{PHRS}
- The prompt is passed to gemma-3-4b-it, which generates five exam questions following the given schema.
- The output XML is parsed into structured JSON objects containing:
  `{number, type, content, description (optional), answer}`
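The parsing step above can be sketched as follows; the helper name `parse_problems` and the error handling are illustrative, not the project's actual implementation:

```python
import xml.etree.ElementTree as ET

def parse_problems(xml_text: str) -> list[dict]:
    """Parse a <problems> response into question dicts of the form
    {number, type, content, description (optional), answer}.

    Raises ValueError if the model returned malformed XML.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"model returned invalid XML: {exc}") from exc

    problems = []
    for node in root.findall("problem"):
        item = {
            "number": int(node.findtext("number")),
            "type": node.findtext("type"),
            "content": node.findtext("content"),
            "answer": node.findtext("answer"),
        }
        desc = node.findtext("description")  # absent for essay questions
        if desc is not None:
            item["description"] = desc
        problems.append(item)
    return problems
```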
project_root/
├── app/
│   ├── api/              # inference endpoints
│   ├── parser/           # PDF/text parser
│   ├── processor/        # prompt pipeline logic
│   ├── Dockerfile
│   └── requirements.txt
│
├── dataset/
│   ├── crawl/            # crawling wiki or pdf text
│   ├── extract/          # extracting keyword/keyphrase
│   └── generate.py       # question generation script (Llama-70B)
│
├── train/
│   ├── data/             # process training data
│   ├── results/          # training results
│   ├── reward.py         # reward logic (GRPO)
│   ├── train_sft.py      # SFT trainer
│   ├── train_dpo.py      # DPO trainer
│   └── train_grpo.py     # GRPO trainer
│
├── README.md
└── requirements.txt
The model training process consists of three sequential stages:
(1) Supervised Fine-Tuning (SFT),
(2) Direct Preference Optimization (DPO), and
(3) Group Relative Policy Optimization (GRPO).
All stages were trained on datasets derived from historical text sources such as Wikipedia, Yale open materials, and public textbooks.
To construct high-quality training data, text materials were collected from three primary sources:
- Open-access Yale University course materials,
- Publicly available PDF-format textbooks, and
- Wikipedia articles related to world history.
Since general topics were too broad, the data collection was restricted to the domain of history for consistency and focus.
From the collected Wikipedia corpus, keywords and keyphrases were extracted using the TextRank algorithm.
These were inserted into a fixed prompt template, which was then passed to the Llama-3.3-70B-Instruct model to generate exam-style problems.
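The TextRank idea behind this extraction step can be sketched with a minimal, dependency-free implementation: rank words by PageRank over a co-occurrence graph. Real implementations add part-of-speech filtering and merge adjacent top words into keyphrases; the function name and parameters here are illustrative, not the project's actual code.

```python
import re
from collections import defaultdict

def textrank_keywords(text: str, top_k: int = 15, window: int = 4,
                      damping: float = 0.85, iters: int = 30) -> list[str]:
    """Rank words by PageRank over a word co-occurrence graph (TextRank)."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]{3,}", text)]

    # Build an undirected co-occurrence graph within a sliding window.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)

    # Power iteration of the PageRank update on the unweighted graph.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - damping) + damping * sum(
                score[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)[:top_k]
```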
Each generated response was manually reviewed:
- High-quality responses were converted into SFT (Supervised Fine-Tuning) pairs.
- Incorrect or partially flawed responses were used as negative samples to construct DPO datasets, enabling contrastive preference learning.
The model was first fine-tuned on the high-quality, instruction-following questionβanswer pairs derived from Wikipedia data.
This step taught the model to follow the prompt format and generate well-structured, domain-relevant questions and answers.
Next, the model underwent DPO training using pairs of βgoodβ vs βbadβ generations.
This alignment stage optimized the model to prefer answers that adhered more closely to factual correctness, formatting rules, and logical completeness.
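For reference, the "good" vs. "bad" pairs are consumed by the standard DPO objective, where \(y_w\) and \(y_l\) are the preferred and rejected generations for prompt \(x\), \(\pi_\theta\) is the policy being trained, \(\pi_{\text{ref}}\) is the frozen SFT reference model, and \(\beta\) controls how far the policy may drift from the reference:

```latex
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
  \right)\right]
```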
Finally, GRPO training was applied to improve format accuracy, language regulation, and content diversity.
For this, a new dataset was built from Yale and open-source textbook materials, again following the same keyword/keyphrase → prompt → generation pipeline.
A custom reward function was implemented to measure:
- XML format validity,
- Language clarity and variety, and
- Topical diversity across questions.
The GRPO phase refined the model to produce more varied and precise exam questions, completing the overall training pipeline.
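A minimal sketch of what such a reward might look like, covering two of the three criteria (XML validity and topical diversity); the weights and the function name are illustrative assumptions, and the project's actual logic lives in `reward.py`:

```python
import xml.etree.ElementTree as ET

def exam_reward(xml_text: str) -> float:
    """Score a generated exam: 0.5 weight on format validity,
    0.5 on topical diversity (illustrative weights)."""
    # Format validity: the response must parse and contain 5 <problem> tags.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return 0.0  # invalid XML earns no reward
    problems = root.findall("problem")
    format_score = 1.0 if root.tag == "problems" and len(problems) == 5 else 0.5

    # Topical diversity: penalize questions with repeated contents.
    contents = [(p.findtext("content") or "").strip() for p in problems]
    diversity = len(set(contents)) / max(len(contents), 1)
    return 0.5 * format_score + 0.5 * diversity
```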
Three crawlers were implemented to collect historical text data for model training.
| Script | Source | Description |
|---|---|---|
| `crawl_wiki.py` | Wikipedia | Fetches article sections by category (EN/KR) via MediaWiki API. |
| `crawl_book.py` | PDF textbooks | Splits large PDF files into section-level documents using JSON-based page mapping. |
| `crawl_yale.py` | Yale Open Courses | Scrapes full lecture transcripts using Selenium and BeautifulSoup. |
Each script saves extracted texts under the `courses/` directory:
Run:
```shell
python crawl_wiki.py   # Wikipedia
python crawl_book.py   # PDF textbooks
python crawl_yale.py   # Yale lectures
```
Output:
```
courses/
├── WIKI/
├── Books/
└── Yale/
```
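For the Wikipedia crawler, a minimal sketch of fetching one article's plain-text extract through the public MediaWiki Action API; the function names are illustrative, not the script's actual interface:

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def build_extract_query(title: str) -> str:
    """Build a MediaWiki Action API URL returning a plain-text extract."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": "1",   # strip HTML markup from the extract
        "format": "json",
        "titles": title,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def fetch_extract(title: str) -> str:
    """Download and return the article body (requires network access)."""
    with urllib.request.urlopen(build_extract_query(title)) as resp:
        data = json.load(resp)
    pages = data["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")
```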
After collecting and cleaning historical texts, the next step is to generate exam-style problems using Llama-3.3-70B-Instruct.
This script extracts key terms from each text file, builds prompts, and saves generated problems in JSONL format.
Run:
```shell
python generate_questions.py \
    --input_root ./courses/WIKI \
    --prompt_file ./extract/prompt.txt \
    --out ./answers/results.jsonl \
    --temps 0.3,0.7,1.0
```
Output:
```
answers/
├── results.T0.3.jsonl
├── results.T0.7.jsonl
└── results.T1.0.jsonl
```
Run:
```shell
python pre_sup_sc.py
python pre_grpo.py
```
Output:
```
data/
├── sft/
│   ├── sft_train.jsonl
│   └── sft_eval.jsonl
├── dpo/
│   ├── dpo_train.jsonl
│   └── dpo_eval.jsonl
└── grpo/
    ├── grpo_train.jsonl
    └── grpo_eval.jsonl
```
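The preprocessing scripts serialize reviewed generations as JSONL training records. A plausible sketch of the record shapes (the `prompt`/`completion` and `chosen`/`rejected` field names are assumptions for illustration, not the project's verified schema):

```python
import json

def to_sft_record(prompt: str, xml_response: str) -> str:
    """Serialize one approved (prompt, response) pair as a JSONL line."""
    return json.dumps({"prompt": prompt, "completion": xml_response},
                      ensure_ascii=False)

def to_dpo_record(prompt: str, chosen: str, rejected: str) -> str:
    """Serialize one preference pair (good vs. bad generation)."""
    return json.dumps({"prompt": prompt, "chosen": chosen,
                       "rejected": rejected}, ensure_ascii=False)

def write_jsonl(path: str, lines: list[str]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```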
Once the questionβanswer datasets are prepared, model alignment is performed through three stages:
Supervised Fine-Tuning (SFT) → Direct Preference Optimization (DPO) → Group Relative Policy Optimization (GRPO).
Each script below trains the model with progressively stronger alignment objectives.
1. Supervised Fine-Tuning
Fine-tunes gemma-3-4b-it (or any compatible instruction model) on high-quality questionβanswer pairs.
Input:
`data/sft_train.jsonl`, `data/sft_eval.jsonl`
Run:
```shell
python train_sft.py \
    --model google/gemma-3-4b-it \
    --dataset_path ./data/sft \
    --use_lora
```
Output:
```
results_sft/
└── sft-4b-lora/
    ├── checkpoints/
    └── sft-final/
```
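The `--use_lora` flag points at parameter-efficient fine-tuning: instead of updating the full weight matrix W, LoRA trains a low-rank delta BA added on top of the frozen weight. A numpy sketch of the idea (the dimensions, rank, and scaling here are illustrative, not the training configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 8, 2, 16           # hidden size, LoRA rank, scaling numerator
W = rng.normal(size=(d, d))      # frozen pretrained weight

# Trainable adapters: A starts random, B starts at zero, so the
# adapted layer is initially identical to the pretrained one.
A = rng.normal(size=(r, d))
B = np.zeros((d, r))

def adapted_forward(x: np.ndarray) -> np.ndarray:
    """y = x W^T + (alpha/r) * x (BA)^T — only A and B receive gradients."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(1, d))
# With B = 0 the adapter contributes nothing yet:
assert np.allclose(adapted_forward(x), x @ W.T)
```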
2. Direct Preference Optimization
Aligns the model's preferences using "good" vs. "bad" generations from Wikipedia-derived data.
Input:
`data/dpo_train.jsonl`, `data/dpo_eval.jsonl`
Run:
```shell
python train_dpo.py \
    --model <sft_model> \
    --dataset_path ./data/dpo \
    --use_lora
```
Output:
```
results_dpo/
└── dpo-4b-lora/
    ├── checkpoints/
    └── dpo-final/
```
3. Group Relative Policy Optimization
Final alignment phase applying custom reward functions to refine format accuracy, language regulation, and content diversity.
Input:
`data/grpo_train.jsonl`, `data/grpo_eval.jsonl`
Run:
```shell
python train_grpo.py \
    --model <dpo_model> \
    --dataset_path ./data/grpo \
    --use_lora
```
Output:
```
results_grpo/
└── grpo-4b-lora/
    ├── checkpoints/
    └── grpo-final/
```
Despite promising results, the current system has several practical and methodological limitations:
1. Domain Constraint
- The dataset is limited to historical materials (Yale OYC lectures, open-source textbooks, Wikipedia).
- Generalization to other academic or technical domains has not yet been verified.
2. Evaluation Gap
- Quantitative evaluation metrics (e.g., RQUGE, NACo, factuality/coverage tests) are not yet integrated.
- Generated exams are only validated qualitatively.
3. Prompt Dependency
- The model relies heavily on a single handcrafted XML prompt template.
- Even small deviations or missing tokens may lead to invalid XML outputs.
To address the above constraints and extend the project, the following directions are planned:
1. Automated Evaluation Pipeline
- Integrate RQUGE and NACo metrics for reference-free exam quality scoring, and develop an evaluation suite that scores factuality, coverage, and answer correctness.
2. Domain Expansion
- Extend crawling and generation to other subjects (e.g., economics, biology, computer science) to test domain adaptability and dataset diversity.
3. Adaptive Prompting
- Explore dynamic prompt generation or retrieval-augmented prompts that automatically adapt to input topic and context length.
- **Model:** The fine-tuned model is based on Google Gemma-3-4B-it and thus inherits the original Gemma License.
- **Code & Training Pipeline:** All scripts, prompts, and reward logic in this repository are released under the Apache License 2.0.