This project builds an exam generation pipeline that turns long PDF documents into structured exam questions.
It uses gemma-3-4b-it for both keyword/keyphrase extraction and question generation, producing question types such as multiple-choice, short-answer, and essay.
The exam generation pipeline proceeds in four major stages:
- Extract text from input `.pdf` or `.txt` files; the text is tokenized and segmented into manageable chunks for later processing.
- Use TextRank to identify important concepts and representative phrases.
- Default extraction limits: 15 keywords, 30 keyphrases.
- A predefined XML-based prompt template is used to instruct the language model.
- Extracted keywords `{KEYS}` and keyphrases `{PHRS}` are injected into the prompt.
- The prompt enforces strict output formatting and generation rules (5 exam problems: 2 multiple-choice, 2 short-answer, 1 essay).
- Rules ensure XML-only responses, special character escaping, and non-repetitive concepts.
Follow the rules below and generate 5 university exam questions in XML format.
**Response format (must be followed):**
<problems>
<problem>
<number>1</number>
<type>multiple-choice</type>
<content>question content</content>
<description>solution process</description>
<answer>answer</answer>
</problem>
<problem>
<number>2</number>
<type>multiple-choice</type>
<content>question content</content>
<description>solution process</description>
<answer>answer</answer>
</problem>
<problem>
<number>3</number>
<type>short-answer</type>
<content>question content</content>
<description>solution process</description>
<answer>answer</answer>
</problem>
<problem>
<number>4</number>
<type>short-answer</type>
<content>question content</content>
<description>solution process</description>
<answer>answer</answer>
</problem>
<problem>
<number>5</number>
<type>essay</type>
<content>question content</content>
<answer>answer</answer>
</problem>
</problems>
**Absolute rules (responses that violate them are invalid):**
1. Output only the XML tag structure. Do not include any other text, explanations, or comments.
2. Write all content as plain text, without CDATA sections.
3. Write special characters as XML entities. (<, >, &, ", ')
**Question generation rules:**
- Generate exactly 5 questions, strictly keeping this ratio: 2 multiple-choice, 2 short-answer, 1 essay.
- For each question's <type>, use the exact value already given in the response format above.
- Write multiple-choice options with the markers ①, ②, ③, ④, ⑤.
- Every question must cover a different topic; do not reuse the same concept, person, or event across questions.
- Write the solution process and the answer concretely.
- Question content may freely use quotations, formulas, special characters, etc.
- Vary the difficulty and presentation style of the questions.
**Important keywords:**
{KEYS}
**Important sentences:**
{PHRS}
- The prompt is passed to gemma-3-4b-it, which generates five exam questions following the given schema.
- The output XML is parsed into structured JSON objects containing:
  `{number, type, content, description (optional), answer}`
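The parsing step above can be sketched as follows; the helper name `parse_problems` and the error handling are illustrative, not the project's actual implementation:

```python
import xml.etree.ElementTree as ET

def parse_problems(xml_text: str) -> list[dict]:
    """Parse a <problems> response into question dicts of the form
    {number, type, content, description (optional), answer}.

    Raises ValueError if the model returned malformed XML.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"model returned invalid XML: {exc}") from exc

    problems = []
    for node in root.findall("problem"):
        item = {
            "number": int(node.findtext("number")),
            "type": node.findtext("type"),
            "content": node.findtext("content"),
            "answer": node.findtext("answer"),
        }
        desc = node.findtext("description")  # absent for essay questions
        if desc is not None:
            item["description"] = desc
        problems.append(item)
    return problems
```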
project_root/
├── app/
│   ├── api/              # inference endpoints
│   ├── parser/           # PDF/text parser
│   ├── processor/        # prompt pipeline logic
│   ├── Dockerfile
│   └── requirements.txt
│
├── dataset/
│   ├── crawl/            # crawling wiki or pdf text
│   ├── extract/          # extracting keyword/keyphrase
│   └── generate.py       # question generation script (Llama-70B)
│
├── train/
│   ├── data/             # process training data
│   ├── results/          # training results
│   ├── reward.py         # reward logic (GRPO)
│   ├── train_sft.py      # SFT trainer
│   ├── train_dpo.py      # DPO trainer
│   └── train_grpo.py     # GRPO trainer
│
├── README.md
└── requirements.txt
The model training process consists of three sequential stages:
(1) Supervised Fine-Tuning (SFT),
(2) Direct Preference Optimization (DPO), and
(3) Group Relative Policy Optimization (GRPO).
All stages were trained on datasets derived from historical text sources such as Wikipedia, Yale open materials, and public textbooks.
To construct high-quality training data, text materials were collected from three primary sources:
- Open-access Yale University course materials,
- Publicly available PDF-format textbooks, and
- Wikipedia articles related to world history.
Since general topics were too broad, the data collection was restricted to the domain of history for consistency and focus.
From the collected Wikipedia corpus, keywords and keyphrases were extracted using the TextRank algorithm.
These were inserted into a fixed prompt template, which was then passed to the Llama-3.3-70B-Instruct model to generate exam-style problems.
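The TextRank idea behind this extraction step can be sketched with a minimal, dependency-free implementation: rank words by PageRank over a co-occurrence graph. Real implementations add part-of-speech filtering and merge adjacent top words into keyphrases; the function name and parameters here are illustrative, not the project's actual code.

```python
import re
from collections import defaultdict

def textrank_keywords(text: str, top_k: int = 15, window: int = 4,
                      damping: float = 0.85, iters: int = 30) -> list[str]:
    """Rank words by PageRank over a word co-occurrence graph (TextRank)."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]{3,}", text)]

    # Build an undirected co-occurrence graph within a sliding window.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)

    # Power iteration of the PageRank update on the unweighted graph.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - damping) + damping * sum(
                score[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)[:top_k]
```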
Each generated response was manually reviewed:
- High-quality responses were converted into SFT (Supervised Fine-Tuning) pairs.
- Incorrect or partially flawed responses were used as negative samples to construct DPO datasets, enabling contrastive preference learning.
The model was first fine-tuned on the high-quality, instruction-following questionβanswer pairs derived from Wikipedia data.
This step taught the model to follow the prompt format and generate well-structured, domain-relevant questions and answers.
Next, the model underwent DPO training using pairs of βgoodβ vs βbadβ generations.
This alignment stage optimized the model to prefer answers that adhered more closely to factual correctness, formatting rules, and logical completeness.
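For reference, the "good" vs. "bad" pairs are consumed by the standard DPO objective, where \(y_w\) and \(y_l\) are the preferred and rejected generations for prompt \(x\), \(\pi_\theta\) is the policy being trained, \(\pi_{\text{ref}}\) is the frozen SFT reference model, and \(\beta\) controls how far the policy may drift from the reference:

```latex
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
  \right)\right]
```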
Finally, GRPO training was applied to improve format accuracy, language regulation, and content diversity.
For this, a new dataset was built from Yale and open-source textbook materials, again following the same keyword/keyphrase → prompt → generation pipeline.
A custom reward function was implemented to measure:
- XML format validity,
- Language clarity and variety, and
- Topical diversity across questions.
The GRPO phase refined the model to produce more varied and precise exam questions, completing the overall training pipeline.
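A minimal sketch of what such a reward might look like, covering two of the three criteria (XML validity and topical diversity); the weights and the function name are illustrative assumptions, and the project's actual logic lives in `reward.py`:

```python
import xml.etree.ElementTree as ET

def exam_reward(xml_text: str) -> float:
    """Score a generated exam: 0.5 weight on format validity,
    0.5 on topical diversity (illustrative weights)."""
    # Format validity: the response must parse and contain 5 <problem> tags.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return 0.0  # invalid XML earns no reward
    problems = root.findall("problem")
    format_score = 1.0 if root.tag == "problems" and len(problems) == 5 else 0.5

    # Topical diversity: penalize questions with repeated contents.
    contents = [(p.findtext("content") or "").strip() for p in problems]
    diversity = len(set(contents)) / max(len(contents), 1)
    return 0.5 * format_score + 0.5 * diversity
```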
Three crawlers were implemented to collect historical text data for model training.
| Script | Source | Description |
|---|---|---|
| `crawl_wiki.py` | Wikipedia | Fetches article sections by category (EN/KR) via MediaWiki API. |
| `crawl_book.py` | PDF textbooks | Splits large PDF files into section-level documents using JSON-based page mapping. |
| `crawl_yale.py` | Yale Open Courses | Scrapes full lecture transcripts using Selenium and BeautifulSoup. |
Each script saves extracted texts under the `courses/` directory:
Run:
```shell
python crawl_wiki.py   # Wikipedia
python crawl_book.py   # PDF textbooks
python crawl_yale.py   # Yale lectures
```
Output:
```
courses/
├── WIKI/
├── Books/
└── Yale/
```
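For the Wikipedia crawler, a minimal sketch of fetching one article's plain-text extract through the public MediaWiki Action API; the function names are illustrative, not the script's actual interface:

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def build_extract_query(title: str) -> str:
    """Build a MediaWiki Action API URL returning a plain-text extract."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": "1",   # strip HTML markup from the extract
        "format": "json",
        "titles": title,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def fetch_extract(title: str) -> str:
    """Download and return the article body (requires network access)."""
    with urllib.request.urlopen(build_extract_query(title)) as resp:
        data = json.load(resp)
    pages = data["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")
```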
After collecting and cleaning historical texts, the next step is to generate exam-style problems using Llama-3.3-70B-Instruct.
This script extracts key terms from each text file, builds prompts, and saves generated problems in JSONL format.
Run:
```shell
python generate_questions.py \
    --input_root ./courses/WIKI \
    --prompt_file ./extract/prompt.txt \
    --out ./answers/results.jsonl \
    --temps 0.3,0.7,1.0
```
Output:
```
answers/
├── results.T0.3.jsonl
├── results.T0.7.jsonl
└── results.T1.0.jsonl
```
Run:
```shell
python pre_sup_sc.py
python pre_grpo.py
```
Output:
```
data/
├── sft/
│   ├── sft_train.jsonl
│   └── sft_eval.jsonl
├── dpo/
│   ├── dpo_train.jsonl
│   └── dpo_eval.jsonl
└── grpo/
    ├── grpo_train.jsonl
    └── grpo_eval.jsonl
```
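The preprocessing scripts serialize reviewed generations as JSONL training records. A plausible sketch of the record shapes (the `prompt`/`completion` and `chosen`/`rejected` field names are assumptions for illustration, not the project's verified schema):

```python
import json

def to_sft_record(prompt: str, xml_response: str) -> str:
    """Serialize one approved (prompt, response) pair as a JSONL line."""
    return json.dumps({"prompt": prompt, "completion": xml_response},
                      ensure_ascii=False)

def to_dpo_record(prompt: str, chosen: str, rejected: str) -> str:
    """Serialize one preference pair (good vs. bad generation)."""
    return json.dumps({"prompt": prompt, "chosen": chosen,
                       "rejected": rejected}, ensure_ascii=False)

def write_jsonl(path: str, lines: list[str]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```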
Once the questionβanswer datasets are prepared, model alignment is performed through three stages:
Supervised Fine-Tuning (SFT) → Direct Preference Optimization (DPO) → Group Relative Policy Optimization (GRPO).
Each script below trains the model with progressively stronger alignment objectives.
1. Supervised Fine-Tuning
Fine-tunes gemma-3-4b-it (or any compatible instruction model) on high-quality questionβanswer pairs.
Input:
`data/sft_train.jsonl`, `data/sft_eval.jsonl`
Run:
```shell
python train_sft.py \
    --model google/gemma-3-4b-it \
    --dataset_path ./data/sft \
    --use_lora
```
Output:
```
results_sft/
└── sft-4b-lora/
    ├── checkpoints/
    └── sft-final/
```
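The `--use_lora` flag points at parameter-efficient fine-tuning: instead of updating the full weight matrix W, LoRA trains a low-rank delta BA added on top of the frozen weight. A numpy sketch of the idea (the dimensions, rank, and scaling here are illustrative, not the training configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 8, 2, 16           # hidden size, LoRA rank, scaling numerator
W = rng.normal(size=(d, d))      # frozen pretrained weight

# Trainable adapters: A starts random, B starts at zero, so the
# adapted layer is initially identical to the pretrained one.
A = rng.normal(size=(r, d))
B = np.zeros((d, r))

def adapted_forward(x: np.ndarray) -> np.ndarray:
    """y = x W^T + (alpha/r) * x (BA)^T — only A and B receive gradients."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(1, d))
# With B = 0 the adapter contributes nothing yet:
assert np.allclose(adapted_forward(x), x @ W.T)
```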
2. Direct Preference Optimization
Aligns the model's preferences using "good" vs. "bad" generations from Wikipedia-derived data.
Input:
`data/dpo_train.jsonl`, `data/dpo_eval.jsonl`
Run:
```shell
python train_dpo.py \
    --model <sft_model> \
    --dataset_path ./data/dpo \
    --use_lora
```
Output:
```
results_dpo/
└── dpo-4b-lora/
    ├── checkpoints/
    └── dpo-final/
```
3. Group Relative Policy Optimization
Final alignment phase applying custom reward functions to refine format accuracy, language regulation, and content diversity.
Input:
`data/grpo_train.jsonl`, `data/grpo_eval.jsonl`
Run:
```shell
python train_grpo.py \
    --model <dpo_model> \
    --dataset_path ./data/grpo \
    --use_lora
```
Output:
```
results_grpo/
└── grpo-4b-lora/
    ├── checkpoints/
    └── grpo-final/
```
Despite promising results, the current system has several practical and methodological limitations:
1. Domain Constraint
- The dataset is limited to historical materials (Yale OYC lectures, open-source textbooks, Wikipedia).
- Generalization to other academic or technical domains has not yet been verified.
2. Evaluation Gap
- Quantitative evaluation metrics (e.g., RQUGE, NACo, factuality/coverage tests) are not yet integrated.
- Generated exams are only validated qualitatively.
3. Prompt Dependency
- The model relies heavily on a single handcrafted XML prompt template.
- Even small deviations or missing tokens may lead to invalid XML outputs.
To address the above constraints and extend the project, the following directions are planned:
1. Automated Evaluation Pipeline
- Integrate RQUGE and NACo metrics for reference-free exam quality scoring, and develop an evaluation suite that scores factuality, coverage, and answer correctness.
2. Domain Expansion
- Extend crawling and generation to other subjects (e.g., economics, biology, computer science) to test domain adaptability and dataset diversity.
3. Adaptive Prompting
- Explore dynamic prompt generation or retrieval-augmented prompts that automatically adapt to input topic and context length.
- **Model:** The fine-tuned model is based on Google Gemma-3-4B-it and thus inherits the original Gemma License.
- **Code & Training Pipeline:** All scripts, prompts, and reward logic in this repository are released under the Apache License 2.0.