K-UniLab/Improfessor-AI


PDF-to-Exam using Gemma-3

This project builds an exam generation pipeline that turns long PDF documents into structured exam questions.

It uses gemma-3-4b-it for both keyword/keyphrase extraction and question generation, producing various question types such as multiple-choice, short-answer, and essay.

Project framework


🧐 Pipeline Overview

The exam generation pipeline proceeds in four major stages:

1. Text Extraction

  • Extract text from input .pdf or .txt files.
  • Tokenize and segment the text into manageable chunks for later processing.
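As a rough sketch of this stage (the actual implementation lives in app/parser/; the chunk size, overlap value, and use of the pypdf library here are assumptions, not the repository's exact choices):

```python
def extract_text(path: str) -> str:
    """Read an input file into one string (.txt directly, .pdf page by page)."""
    if path.endswith(".txt"):
        with open(path, encoding="utf-8") as f:
            return f.read()
    # pypdf is one common choice of PDF library; imported lazily so the
    # chunking helper below works without it installed.
    from pypdf import PdfReader
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks so no concept is cut at a boundary."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```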

2. Keyword and Keyphrase Extraction

  • Use TextRank to identify important concepts and representative phrases.
  • Default extraction limits: 15 keywords, 30 keyphrases.

Extracting framework
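A minimal, self-contained TextRank sketch illustrates the idea (the real extraction code lives in dataset/extract/; the window size, damping factor, iteration count, and English-only tokenizer here are assumptions):

```python
import re
from collections import defaultdict

def textrank_keywords(text: str, top_k: int = 15, window: int = 4,
                      damping: float = 0.85, iters: int = 50) -> list[str]:
    """Rank words by PageRank over a sliding-window co-occurrence graph."""
    words = re.findall(r"[a-zA-Z]{3,}", text.lower())
    # Build an undirected co-occurrence graph: words within `window`
    # positions of each other become neighbors.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # Power iteration (plain PageRank) over the graph.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {w: (1 - damping) + damping *
                    sum(score[u] / len(neighbors[u]) for u in neighbors[w])
                 for w in neighbors}
    return sorted(score, key=score.get, reverse=True)[:top_k]
```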

3. Prompt Construction

  • A predefined XML-based prompt template is used to instruct the language model.
  • Extracted keywords {KEYS} and keyphrases {PHRS} are injected into the prompt.
  • The prompt enforces strict output formatting and generation rules (5 exam problems: 2 multiple-choice, 2 short-answer, 1 essay).
  • Rules ensure XML-only responses, special character escaping, and non-repetitive concepts.
λ‹€μŒμ˜ κ·œμΉ™μ„ μ€€μˆ˜ν•˜μ—¬ λŒ€ν•™κ΅ μ‹œν—˜ 문제 5개λ₯Ό XML ν˜•μ‹μœΌλ‘œ μƒμ„±ν•˜μ„Έμš”.
    
**응닡 ν˜•μ‹ (λ°˜λ“œμ‹œ μ€€μˆ˜):**
<problems>
    <problem>
        <number>1</number>
        <type>객관식</type>
        <content>λ¬Έμ œλ‚΄μš©</content>
        <description>풀이과정</description>
        <answer>λ‹΅</answer>
    </problem>
    <problem>
        <number>2</number>
        <type>객관식</type>
        <content>λ¬Έμ œλ‚΄μš©</content>
        <description>풀이과정</description>
        <answer>λ‹΅</answer>
    </problem>
    
    <problem>
        <number>3</number>
        <type>λ‹¨λ‹΅ν˜•</type>
        <content>λ¬Έμ œλ‚΄μš©</content>
        <description>풀이과정</description>
        <answer>λ‹΅</answer>
    </problem>
    <problem>
        <number>4</number>
        <type>λ‹¨λ‹΅ν˜•</type>
        <content>λ¬Έμ œλ‚΄μš©</content>
        <description>풀이과정</description>
        <answer>λ‹΅</answer>
    </problem>
    
    <problem>
        <number>5</number>
        <type>주관식</type>
        <content>λ¬Έμ œλ‚΄μš©</content>
        <answer>λ‹΅</answer>
    </problem>
</problems>
    
**μ ˆλŒ€ κ·œμΉ™ (μœ„λ°˜ μ‹œ 응닡 무효):**
1. XML νƒœκ·Έ ꡬ쑰만 좜λ ₯ν•©λ‹ˆλ‹€. λ‹€λ₯Έ ν…μŠ€νŠΈ, μ„€λͺ…, 주석은 ν¬ν•¨ν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€.
2. λͺ¨λ“  λ‚΄μš©μ€ CDATA μ„Ήμ…˜ 없이 일반 ν…μŠ€νŠΈλ‘œ μž‘μ„±ν•©λ‹ˆλ‹€.
3. νŠΉμˆ˜λ¬ΈμžλŠ” XML μ—”ν‹°ν‹°λ‘œ μž‘μ„±ν•©λ‹ˆλ‹€. (&lt;, &gt;, &amp;, &quot;, &apos;)

**문제 생성 κ·œμΉ™:**
- 총 5문제λ₯Ό μƒμ„±ν•˜λ©°, 문제 μœ ν˜•μ€ λ‹€μŒ λΉ„μœ¨μ„ λ°˜λ“œμ‹œ μ§€ν‚΅λ‹ˆλ‹€: 객관식 2문제, λ‹¨λ‹΅ν˜• 2문제, 주관식 1문제.
- 각 문제의 <type>은 μœ„ 응닡 ν˜•μ‹μ—μ„œ 이미 μ§€μ •λœ 값을 κ·ΈλŒ€λ‘œ μ‚¬μš©ν•©λ‹ˆλ‹€.
- 객관식 λ¬Έμ œλŠ” 보기 기호λ₯Ό β‘ , β‘‘, β‘’, β‘£, β‘€ ν˜•μ‹μœΌλ‘œ μž‘μ„±ν•©λ‹ˆλ‹€.
- λͺ¨λ“  λ¬Έμ œλŠ” μ„œλ‘œ λ‹€λ₯Έ μ£Όμš” κ°œλ…μ„ μ‚¬μš©ν•΄μ•Ό ν•˜λ©°, 동일 κ°œλ…μ΄λ‚˜ 동일 인물, 동일 사건을 λ‹€λ₯Έ λ¬Έμ œμ—μ„œ μž¬μ‚¬μš©ν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€.
- 풀이과정과 닡을 ꡬ체적으둜 μž‘μ„±ν•©λ‹ˆλ‹€.
- 문제 λ‚΄μš©μ— λ”°μ˜΄ν‘œ, μˆ˜μ‹, 특수문자 등을 자유둭게 μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
- λ¬Έμ œλŠ” λ‚œμ΄λ„μ™€ ν‘œν˜„ 방식을 λ‹€μ–‘ν•˜κ²Œ κ΅¬μ„±ν•©λ‹ˆλ‹€.

**μ€‘μš”ν•œ ν‚€μ›Œλ“œ:**
{KEYS}
**μ€‘μš”ν•œ λ¬Έμž₯λ“€:**
{PHRS}
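The injection step itself is simple string substitution. A sketch, assuming the template is loaded as a string with literal {KEYS} and {PHRS} placeholders (the joining format is an assumption; the real logic lives in app/processor/):

```python
def build_prompt(template: str, keywords: list[str], phrases: list[str]) -> str:
    """Fill the {KEYS}/{PHRS} slots of the prompt template."""
    # str.replace is used instead of str.format so literal braces elsewhere
    # in the template cannot break formatting.
    return (template
            .replace("{KEYS}", ", ".join(keywords))
            .replace("{PHRS}", "\n".join(f"- {p}" for p in phrases)))

# Hypothetical miniature template, for illustration only.
demo = build_prompt(
    "**Important keywords:**\n{KEYS}\n**Important phrases:**\n{PHRS}",
    ["French Revolution", "Estates-General"],
    ["The revolution began in 1789."],
)
```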

4. Exam Generation

  • The prompt is passed to gemma-3-4b-it, which generates five exam questions following the given schema.
  • The output XML is parsed into structured JSON objects containing: {number, type, content, description (optional), answer}
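The parsing step can be sketched with the standard library (the repository's parser may differ, e.g. in how it recovers from malformed tags):

```python
import xml.etree.ElementTree as ET

def parse_problems(xml_text: str) -> list[dict]:
    """Convert a <problems> XML response into a list of question dicts."""
    root = ET.fromstring(xml_text)
    problems = []
    for p in root.findall("problem"):
        item = {
            "number": int(p.findtext("number")),
            "type": p.findtext("type"),
            "content": p.findtext("content"),
            "answer": p.findtext("answer"),
        }
        desc = p.findtext("description")  # essay questions may omit this tag
        if desc is not None:
            item["description"] = desc
        problems.append(item)
    return problems
```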

📂 Project Structure

project_root/
├── app/
│   ├── api/            # inference endpoints
│   ├── parser/         # PDF/text parser
│   ├── processor/      # prompt pipeline logic
│   ├── Dockerfile
│   └── requirements.txt
│
├── dataset/
│   ├── crawl/          # crawling wiki or pdf text
│   ├── extract/        # extracting keyword/keyphrase
│   └── generate.py     # question generation script (Llama-70B)
│
├── train/
│   ├── data/           # process training data
│   ├── results/        # training results
│   ├── reward.py       # reward logic (GRPO)
│   ├── train_sft.py    # SFT trainer
│   ├── train_dpo.py    # DPO trainer
│   └── train_grpo.py   # GRPO trainer
│
├── README.md
└── requirements.txt

💪 Training Overview

Training framework

The model training process consists of three sequential stages:
(1) Supervised Fine-Tuning (SFT),
(2) Direct Preference Optimization (DPO), and
(3) Group Relative Policy Optimization (GRPO).

All stages were trained on datasets derived from historical text sources such as Wikipedia, Yale open materials, and public textbooks.

1. Data Collection

To construct high-quality training data, text materials were collected from three primary sources:

  • Open-access Yale University course materials,
  • Publicly available PDF-format textbooks, and
  • Wikipedia articles related to world history.

Since general topics were too broad, the data collection was restricted to the domain of history for consistency and focus.

2. Synthetic Data Generation

From the collected Wikipedia corpus, keywords and keyphrases were extracted using the TextRank algorithm.
These were inserted into a fixed prompt template, which was then passed to the Llama-3.3-70B-Instruct model to generate exam-style problems.

Each generated response was manually reviewed:

  • High-quality responses were converted into SFT (Supervised Fine-Tuning) pairs.
  • Incorrect or partially flawed responses were used as negative samples to construct DPO datasets, enabling contrastive preference learning.
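The split described above can be sketched as follows (the field names — prompt/completion for SFT, chosen/rejected for DPO — are common conventions assumed here; see dataset/ and train/data/ for the actual preprocessing):

```python
def build_records(reviewed: list[dict]) -> tuple[list[dict], list[dict]]:
    """Turn manually reviewed generations into SFT and DPO records.

    `reviewed` items are dicts {"prompt", "response", "ok"}, where `ok`
    marks a generation that passed manual review.
    """
    sft, by_prompt = [], {}
    for r in reviewed:
        bucket = by_prompt.setdefault(r["prompt"], {"good": [], "bad": []})
        bucket["good" if r["ok"] else "bad"].append(r["response"])
        if r["ok"]:
            # High-quality responses become supervised pairs.
            sft.append({"prompt": r["prompt"], "completion": r["response"]})
    # Flawed responses are paired against good ones for the same prompt,
    # yielding contrastive preference pairs.
    dpo = [{"prompt": p, "chosen": g, "rejected": b}
           for p, d in by_prompt.items()
           for g in d["good"] for b in d["bad"]]
    return sft, dpo
```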

3. Supervised Fine-Tuning (SFT)

The model was first fine-tuned on the high-quality, instruction-following question–answer pairs derived from Wikipedia data.
This step taught the model to follow the prompt format and generate well-structured, domain-relevant questions and answers.

4. Direct Preference Optimization (DPO)

Next, the model underwent DPO training using pairs of "good" vs. "bad" generations.
This alignment stage optimized the model to prefer answers that adhered more closely to factual correctness, formatting rules, and logical completeness.

5. Group Relative Policy Optimization (GRPO)

Finally, GRPO training was applied to improve format accuracy, language regulation, and content diversity.
For this, a new dataset was built from Yale and open-source textbook materials, again following the same keyword/keyphrase → prompt → generation pipeline.
A custom reward function was implemented to measure:

  • XML format validity,
  • Language clarity and variety, and
  • Topical diversity across questions.

The GRPO phase refined the model to produce more varied and precise exam questions, completing the overall training pipeline.
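A sketch of such a reward function (the actual logic lives in train/reward.py; the point values are assumptions, and the language-quality check is omitted here for brevity):

```python
import xml.etree.ElementTree as ET

def reward(completion: str) -> float:
    """Score one generation on format validity and topical diversity."""
    score = 0.0
    # 1) XML format validity: the response must parse at all.
    try:
        root = ET.fromstring(completion.strip())
    except ET.ParseError:
        return 0.0
    score += 1.0
    problems = root.findall("problem")
    # 2) Schema compliance: exactly five <problem> elements.
    if len(problems) == 5:
        score += 1.0
    # 3) Topical diversity: penalize repeated question texts.
    contents = [p.findtext("content") or "" for p in problems]
    if contents and len(set(contents)) == len(contents):
        score += 1.0
    return score
```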


✍️ Training Scripts

Data Crawling

Three crawlers were implemented to collect historical text data for model training.

| Script | Source | Description |
| --- | --- | --- |
| crawl_wiki.py | Wikipedia | Fetches article sections by category (EN/KR) via the MediaWiki API. |
| crawl_book.py | PDF textbooks | Splits large PDF files into section-level documents using JSON-based page mapping. |
| crawl_yale.py | Yale Open Courses | Scrapes full lecture transcripts using Selenium and BeautifulSoup. |
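For illustration, a standard-library sketch of the kind of MediaWiki API request crawl_wiki.py makes (the exact parameters and category traversal in the script may differ; this only shows a plain-text article extract):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def build_extract_url(title: str) -> str:
    """URL asking the MediaWiki API for the plain-text extract of one article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "format": "json",
        "titles": title,
    }
    return API + "?" + urlencode(params)

def fetch_extract(title: str) -> str:
    """Download and unwrap the extract for a single article title."""
    with urlopen(build_extract_url(title)) as resp:
        data = json.load(resp)
    pages = data["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")
```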

Each script saves extracted texts under the courses/ directory:

Run:

python crawl_wiki.py       # Wikipedia
python crawl_book.py       # PDF textbooks
python crawl_yale.py       # Yale lectures

Output:

courses/
├── WIKI/
├── Books/
└── Yale/

Data Generation

After collecting and cleaning historical texts, the next step is to generate exam-style problems using Llama-3.3-70B-Instruct.
This script extracts key terms from each text file, builds prompts, and saves generated problems in JSONL format.

Run:

python generate_questions.py \
  --input_root ./courses/WIKI \
  --prompt_file ./extract/prompt.txt \
  --out ./answers/results.jsonl \
  --temps 0.3,0.7,1.0

Output:

answers/
├── results.T0.3.jsonl
├── results.T0.7.jsonl
└── results.T1.0.jsonl

Data Preprocessing

Run:

python pre_sup_sc.py
python pre_grpo.py

Output:

data/
├── sft/
│   ├── sft_train.jsonl
│   └── sft_eval.jsonl
├── dpo/
│   ├── dpo_train.jsonl
│   └── dpo_eval.jsonl
└── grpo/
    ├── grpo_train.jsonl
    └── grpo_eval.jsonl

Overall Training

Once the question–answer datasets are prepared, model alignment is performed through three stages:
Supervised Fine-Tuning (SFT) → Direct Preference Optimization (DPO) → Group Relative Policy Optimization (GRPO).

Each script below trains the model with progressively stronger alignment objectives.

1. Supervised Fine-Tuning
Fine-tunes gemma-3-4b-it (or any compatible instruction model) on high-quality question–answer pairs.

Input:
data/sft/sft_train.jsonl, data/sft/sft_eval.jsonl

Run:

python train_sft.py \
  --model google/gemma-3-4b-it \
  --dataset_path ./data/sft \
  --use_lora

Output:

results_sft/
└── sft-4b-lora/
    ├── checkpoints/
    └── sft-final/

2. Direct Preference Optimization
Aligns the model's preferences using good vs. bad generations from Wikipedia-derived data.

Input:
data/dpo/dpo_train.jsonl, data/dpo/dpo_eval.jsonl

Run:

python train_dpo.py \
  --model <sft_model> \
  --dataset_path ./data/dpo \
  --use_lora

Output:

results_dpo/
└── dpo-4b-lora/
    ├── checkpoints/
    └── dpo-final/

3. Group Relative Policy Optimization
Final alignment phase applying custom reward functions to refine format accuracy, language regulation, and content diversity.

Input:
data/grpo/grpo_train.jsonl, data/grpo/grpo_eval.jsonl

Run:

python train_grpo.py \
  --model <dpo_model> \
  --dataset_path ./data/grpo \
  --use_lora

Output:

results_grpo/
└── grpo-4b-lora/
    ├── checkpoints/
    └── grpo-final/

⚠️ Limitations

Despite promising results, the current system has several practical and methodological limitations:

1. Domain Constraint

  • The dataset is limited to historical materials (Yale OYC lectures, open-source textbooks, Wikipedia).
  • Generalization to other academic or technical domains has not yet been verified.

2. Evaluation Gap

  • Quantitative evaluation metrics (e.g., RQUGE, NACo, factuality/coverage tests) are not yet integrated.
  • Generated exams are only validated qualitatively.

3. Prompt Dependency

  • The model relies heavily on a single handcrafted XML prompt template.
  • Even small deviations or missing tokens may lead to invalid XML outputs.

✨ Future Work

To address the above constraints and extend the project, the following directions are planned:

1. Automated Evaluation Pipeline

  • Integrate RQUGE and NACo metrics for reference-free exam quality scoring.
  • Develop an evaluation suite that scores factuality, coverage, and answer correctness.

2. Domain Expansion

  • Extend crawling and generation to other subjects (e.g., economics, biology, computer science) to test domain adaptability and dataset diversity.

3. Adaptive Prompting

  • Explore dynamic prompt generation or retrieval-augmented prompts that automatically adapt to input topic and context length.

🎫 License

  • Model:
    The fine-tuned model is based on Google Gemma-3-4B-it,
    and thus inherits the original Gemma License.

  • Code & Training Pipeline:
    All scripts, prompts, and reward logic in this repository are released under the Apache License 2.0.
