CIE A-Level Past Paper Question Extractor & Organizer
Automatically downloads CIE A-Level past papers, extracts individual questions, and provides a UI for organizing them by topic.
- Automatically Download Past Papers
- Automatically Slice Questions into Images
- Manually sort the questions in the beautiful UI
- Python 3.8+
- Chrome/Chromium browser (for Selenium)
- ChromeDriver (matching your Chrome version)
# Ubuntu/Debian/Linux Mint
sudo apt update
sudo apt install chromium-chromedriver
# Or download manually from:
# https://chromedriver.chromium.org/downloads
Do this if you're on Windows.
pip install -r requirements.txt
Installs packages:
- selenium: Web scraping
- Pillow: Image processing
- PyMuPDF: PDF parsing
- pdf2image: PDF to image conversion
- watchdog: File system monitoring
python3 main.py
Select option 1 to run all workers together.
Mode 1: Full Pipeline (Recommended for first run)
python3 main.py
# Select: 1
Downloads papers, slices them up, and asks you to sort them, with all three steps running in parallel.
Mode 2: Download Only
python3 main.py
# Select: 2
Just downloads PDFs (useful for batch downloading overnight)
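To get a feel for how many files an overnight download covers, the sketch below lists the PDF names implied by the year range, sessions, and paper types from the config.py example in this README. The name pattern is inferred from the sample file 9702_s24_qp_42.pdf; the `expected_filenames` helper and its `variants` argument (the second digit of the paper number) are assumptions for illustration, not the project's actual downloader logic.

```python
from itertools import product

# Values mirror the config.py example in this README.
START_YEAR = 2015
END_YEAR = 2024
SESSIONS = ["s", "w"]
PAPER_TYPES = ["qp", "ms"]


def expected_filenames(subject_code, papers, variants=("1", "2", "3")):
    """List PDF names such as 9702_s24_qp_42.pdf for the configured range.

    `variants` is an assumption: the second digit of the paper number.
    """
    years = range(START_YEAR, END_YEAR + 1)
    names = []
    for year, session, ptype, paper, variant in product(
        years, SESSIONS, PAPER_TYPES, papers, variants
    ):
        # Two-digit year, e.g. 2024 -> "24"
        names.append(
            f"{subject_code}_{session}{year % 100:02d}_{ptype}_{paper}{variant}.pdf"
        )
    return names
```

Calling `expected_filenames("9702", ["4"])` would cover every paper 4 variant across the configured years, both question papers and mark schemes.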
Mode 3: Slice Only
python3 main.py
# Select: 3
Processes existing PDFs to extract questions (useful for sociopaths who don't like saving time by using option 1)
Mode 4: Sort Only
python3 main.py
# Select: 4
Opens the UI to tag already-extracted questions
Mode 5: Slice + Sort (Recommended if you have PDFs)
python3 main.py
# Select: 5
Processes PDFs and tags questions (skips downloading)
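The five modes boil down to a choice of which workers to start. The sketch below shows one way such a menu could be wired up, with each selected worker running in its own thread. `MODE_WORKERS` and `run_mode` are hypothetical names, and main.py may well be organized differently (a real UI toolkit may, for instance, need the sorter on the main thread).

```python
import threading

# Which workers each menu option starts, as described in the modes above.
MODE_WORKERS = {
    "1": ("download", "slice", "sort"),
    "2": ("download",),
    "3": ("slice",),
    "4": ("sort",),
    "5": ("slice", "sort"),
}


def run_mode(choice, workers):
    """Run every worker for the chosen mode in its own thread, then wait.

    `workers` maps a worker name to a zero-argument callable.
    """
    names = MODE_WORKERS.get(choice)
    if names is None:
        raise ValueError(f"unknown mode: {choice!r}")
    threads = [threading.Thread(target=workers[name], name=name) for name in names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return names
```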
Each worker runs independently:
- Downloader saves PDFs to data/pdfs/
- Slicer watches pdfs/ and extracts questions as images to raw_questions/
- Sorter displays questions from raw_questions/ for tagging
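Because the workers only communicate through folders, the hand-off between downloader and slicer can be pictured as "find PDFs that have no question images yet". The sketch below does this with a plain directory scan from the standard library rather than watchdog events, so it is a simplified stand-in, not the project's slicer. The `pending_pdfs` helper is hypothetical, and the image naming (`<pdf name>_<question number>.png`) is inferred from the sample image_path in the database example.

```python
from pathlib import Path


def pending_pdfs(pdf_dir, raw_dir):
    """Return names of PDFs in pdf_dir with no sliced images in raw_dir yet."""
    pdf_dir, raw_dir = Path(pdf_dir), Path(raw_dir)
    pending = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # Assumed naming: images are "<pdf stem>_<question number>.png"
        if not any(raw_dir.glob(f"{pdf.stem}_*.png")):
            pending.append(pdf.name)
    return pending
```

A slicer loop could call this periodically and process whatever it returns; an event-based watcher avoids the polling but follows the same idea.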
- 1-9: Quick assign to topic (based on current subject's topic list)
- Left Arrow or Space: Skip question
- Right Arrow: Go back to previous question
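The number-key shortcut amounts to indexing into the current subject's topic list. A minimal sketch, assuming keys arrive as single-character strings; `topic_for_key` is a hypothetical helper and is independent of whichever UI toolkit the sorter uses.

```python
def topic_for_key(key, topics):
    """Map a pressed key "1".."9" to a topic name, or None if it does not apply."""
    if len(key) == 1 and key in "123456789":
        index = int(key) - 1  # key "1" selects the first topic
        if index < len(topics):
            return topics[index]
    return None
```

Keys beyond the end of the topic list (or any non-digit key) return None, so the caller can simply ignore them.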
- Question displays with metadata (subject, session, paper, question number)
- Select the subject using radio buttons (if needed)
- Click a topic button or press number key
- Question auto-saves and moves to next
- Progress indicator in the top right shows the completion percentage
- All tagged questions are automatically saved to the database (it's just a JSON file lol)
- Images copied to topic-specific folders
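The save step described above could look roughly like the sketch below: set the topic on the matching record in the JSON database, then copy the image into a folder named after the topic. It assumes the record layout shown in this README's questions_db.json example; `tag_question` and the `sorted_root` folder are hypothetical and not taken from the project's code.

```python
import json
import shutil
from pathlib import Path


def tag_question(db_path, question_id, topic, sorted_root):
    """Set a question's topic in the JSON database and copy its image.

    Returns True if the question was found, False otherwise.
    """
    db_path = Path(db_path)
    db = json.loads(db_path.read_text(encoding="utf-8"))
    for paper in db.values():
        for question in paper.get("questions", []):
            if question["id"] != question_id:
                continue
            question["topic"] = topic
            # Assumed layout: one folder per topic under sorted_root
            dest_dir = Path(sorted_root) / topic
            dest_dir.mkdir(parents=True, exist_ok=True)
            image = Path(question["image_path"])
            shutil.copy2(image, dest_dir / image.name)
            db_path.write_text(json.dumps(db, indent=2), encoding="utf-8")
            return True
    return False
```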
Edit config.py to customize:
# Year range for downloads
START_YEAR = 2015
END_YEAR = 2024
# Paper types
SESSIONS = ["s", "w"]
PAPER_TYPES = ["qp", "ms"]
# Browser mode
HEADLESS_MODE = False # Set True for background operation
# Add more topics per subject
SUBJECTS = {
    "subject code": {
        "name": "name of sub",
        "label": "human readable name",
        "papers": ["1", "2", "3", "4", "5", ...],
        "topics": [
            "topic1",
            "topic2",
            "topic3",
            "topic4",
            ...
        ],
    },
}
All metadata is stored in data/questions_db.json:
{
  "9702_s24_qp_42.pdf": {
    "processed": true,
    "questions": [
      {
        "id": "9702_s24_qp_42_1",
        "subject_code": "9702",
        "session": "s24",
        "paper_num": "42",
        "question_num": "1",
        "page_num": 1,
        "image_path": "data/raw_questions/9702_s24_qp_42_1.png",
        "topic": "Mechanics",
        "marks": null
      }
    ],
    "total_questions": 12
  }
}
Found a bug? Have a feature request? Want to add support for more subjects?
Open an issue or submit a pull request!
Happy studying!
Made with love for CIE students worldwide
Please star the project, this took way too long