CAPM PDF → JSON Extractor

A specialized Python tool that extracts CAPM (Certified Associate in Project Management) practice test questions and answers from PDF files and transforms them into a standardized, machine-readable JSON format for exam preparation applications.

Overview

This tool eliminates the manual, error-prone process of extracting practice test data from PDFs by:

  • Parsing dual PDF sources: Extracting questions/options from one PDF and answers/explanations from another
  • Intelligent matching: Automatically combining questions with their corresponding answers and rationales
  • Smart classification: Categorizing questions into CAPM knowledge domains and estimating difficulty levels
  • Quality validation: Ensuring output integrity through comprehensive validation checks
  • Batch processing: Organizing questions into configurable batches for efficient consumption

OpenSpec is used to plan and track changes. See openspec/.

Key Features

PDF Processing

  • Dual PDF extraction: Processes separate question and answer PDFs
  • Multi-line text handling: Correctly accumulates questions, options, and explanations spanning multiple lines
  • Extended option support: Handles up to 7 answer choices (A-G), not just standard 4-option formats
  • Multi-select questions: Supports questions with multiple correct answers (e.g., "A,B,C")
  • Robust parsing: Uses pdfplumber as the primary parser with pypdf as a fallback for reliability (see the sketch after this list)
  • Artifact cleaning: Removes page numbers and formatting artifacts automatically
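
The fallback behavior can be illustrated with a minimal sketch; the real logic lives in scripts/extract_questions.py, and the helper name below is hypothetical:

def extract_pdf_text(path: str) -> str:
    """Return the concatenated text of all pages, trying pdfplumber first."""
    try:
        import pdfplumber
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    except Exception:
        # Fall back to pypdf if pdfplumber is unavailable or fails on this file.
        from pypdf import PdfReader
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)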

Intelligence Layer

  • Topic classification: Automatically categorizes questions into 4 CAPM ECO domains (see the sketch after this list):
    • Project Management Fundamentals and Core Concepts
    • Predictive, Plan-Based Methodologies
    • Agile Frameworks/Methodologies
    • Business Analysis Frameworks
  • Difficulty estimation: Analyzes question complexity, length, and keywords to estimate difficulty (easy/medium/hard)
  • Special case detection: Flags questions with images, drag-and-drop, or other special requirements
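
A simplified illustration of the keyword-based classification idea (the actual keyword lists and scoring live in extraction_config.py and extract_questions.py; the names below are assumptions):

TOPIC_KEYWORDS = {
    "Agile Frameworks/Methodologies": ["scrum", "sprint", "kanban", "retrospective"],
    "Predictive, Plan-Based Methodologies": ["waterfall", "wbs", "gantt", "critical path"],
    "Business Analysis Frameworks": ["requirements", "business case", "roi", "swot"],
}

def classify_topic(question_text: str) -> str:
    # Count keyword hits per domain and pick the highest-scoring one.
    text = question_text.lower()
    scores = {topic: sum(kw in text for kw in kws) for topic, kws in TOPIC_KEYWORDS.items()}
    best_topic, best_score = max(scores.items(), key=lambda item: item[1])
    # Default to the general domain when nothing matches.
    return best_topic if best_score else "Project Management Fundamentals and Core Concepts"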

Quality & Reliability

  • Comprehensive validation: Checks for required fields, answer consistency, explanation quality, and more
  • Path auto-resolution: Intelligently finds missing PDF files based on naming conventions
  • Batch organization: Configurable batch sizes (default: 15 questions per batch)
  • Error handling: Detailed logging and graceful failure recovery
  • Testing: Unit tests with mocked PDF fixtures (no real PDFs required)

Production Ready

  • CI/CD integration: GitHub Actions workflow with automated testing
  • Dependency management: Automated updates via Dependabot
  • Flexible configuration: JSON-based config supporting multiple practice tests
  • Combining outputs: Tools to merge batches into single or multi-test collections

Quickstart

  1. (Optional) Create/activate the Python venv
  2. Install dependencies
  3. Place PDFs under input/questions/ and input/answers/
  4. Run the extractor
# 1) (Optional) create venv
# python3 -m venv .venv && source .venv/bin/activate

# 2) install deps
pip install -r requirements.txt

# 3) run extractor (uses config.json)
python scripts/extract_questions.py --config config.json

# or use the provided local config pointing to the included sample files
python scripts/extract_questions.py --config config.local.json --skip-tests

# to avoid auto-resolving missing paths (only process tests whose paths exist)
python scripts/extract_questions.py --config config.json --strict-paths

Config

See config.json. Update the tests array with your actual question/answer PDF paths.

  • A convenience config.local.json is provided, pointing to the sample Practice Test 2 PDFs included in input/.

Naming tips for best auto-resolution when paths are missing (a sketch of the matching idea follows the list):

  • Place question PDFs under input/questions/ with names containing practice_test or test and the test number.
  • Place answer PDFs under input/answers/ with names containing answers or rationale and the test number.
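
A rough sketch of how naming-based lookup could work (the actual resolution logic in scripts/extract_questions.py may differ):

from pathlib import Path
from typing import Optional

def resolve_questions_pdf(test_number: int, questions_dir: str = "input/questions") -> Optional[Path]:
    # Return the first question PDF whose name mentions "test" and the test number.
    for pdf in sorted(Path(questions_dir).glob("*.pdf")):
        name = pdf.name.lower()
        if "test" in name and str(test_number) in name:
            return pdf
    return None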

Output Format

Each extracted question is formatted as a JSON object with the following structure:

{
  "text": "[Practice Test 2] Q18) What is the primary benefit of using an Agile approach?",
  "options": {
    "A": "Reduced project costs",
    "B": "Early and continuous delivery of value",
    "C": "Elimination of all project risks",
    "D": "Guaranteed project success"
  },
  "correctAnswer": "B",
  "explanation": "Agile methodologies focus on early and continuous delivery of valuable software through iterative development cycles. This allows stakeholders to see working increments regularly and provide feedback, ensuring the product meets their evolving needs...",
  "topic": "Agile Frameworks/Methodologies",
  "difficulty": "medium"
}

Output Files:

  • Batches are saved to output/<test_name>/<test_name>_batch_N.json (see the loading sketch after this list)
  • Combined outputs can be generated with combine_outputs.py
  • All output files include comprehensive validation metadata
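
To consume a batch file, standard json loading is enough. This sketch assumes each batch file is a JSON list of question objects (the exact wrapper and metadata fields may differ), and the path shown is illustrative:

import json
from pathlib import Path

# Adjust the path to match your actual output layout.
batch_path = Path("output/Practice_Test_2/Practice_Test_2_batch_1.json")
questions = json.loads(batch_path.read_text(encoding="utf-8"))

for q in questions:
    print(q["text"], "->", q["correctAnswer"], f"({q['topic']}, {q['difficulty']})")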

How It Works

The extraction pipeline follows these steps:

  1. Configuration Loading: Reads test definitions from config.json (PDF paths, batch sizes)
  2. Question Extraction:
    • Opens question PDF using pdfplumber (or pypdf fallback)
    • Uses regex patterns to identify question numbers, options (A-G), and text
    • Accumulates multi-line content until next question boundary
    • Cleans formatting artifacts and page numbers
  3. Answer Extraction:
    • Parses answer PDF with format: "NUMBER LETTER(s) Explanation"
    • Handles single answers (e.g., "51 C") and multiple answers (e.g., "56 A,B,C"); see the regex sketch after this list
    • Accumulates multi-line explanations
  4. Matching: Combines questions with corresponding answers and validates consistency
  5. Classification:
    • Categorizes questions into CAPM knowledge domains using keyword matching
    • Estimates difficulty based on length, complexity, and multi-answer factors
  6. JSON Generation: Creates batched output files with standardized structure
  7. Validation: Performs comprehensive checks on required fields, answer consistency, and explanation quality
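
The answer-line format from step 3 can be illustrated with a small regex sketch (the production patterns in extract_questions.py may differ):

import re

# "NUMBER LETTER(s) Explanation..." lines, e.g. "51 C ..." or "56 A,B,C ..."
ANSWER_LINE = re.compile(r"^(\d+)\s+([A-G](?:\s*,\s*[A-G])*)\s+(.*)$")

for line in ["51 C The critical path is the longest path through the network.",
             "56 A,B,C All three are Scrum events."]:
    match = ANSWER_LINE.match(line)
    if match:
        number, letters, explanation = match.groups()
        answers = [letter.strip() for letter in letters.split(",")]
        print(number, answers, explanation[:40])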

Configuration & Customization

Extraction Parameters

Fine-tune extraction behavior by editing extraction_config.py:

# Validation thresholds
MIN_EXPLANATION_LENGTH = 100      # Minimum explanation length in characters
LONG_QUESTION_THRESHOLD = 200     # Question length for difficulty scoring
LONG_EXPLANATION_THRESHOLD = 500  # Explanation length for difficulty scoring

# Batch processing
DEFAULT_BATCH_SIZE = 15           # Questions per batch file

# Difficulty scoring
DIFFICULTY_EASY_MAX = 1           # Max score for 'easy' (0-1)
DIFFICULTY_MEDIUM_MAX = 3         # Max score for 'medium' (2-3)
# Scores 4+ are classified as 'hard'
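
These constants translate an integer difficulty score into a label roughly as follows (a sketch; the features that produce the score live in extract_questions.py, and the import path may differ depending on where extraction_config.py sits):

from extraction_config import DIFFICULTY_EASY_MAX, DIFFICULTY_MEDIUM_MAX

def difficulty_label(score: int) -> str:
    if score <= DIFFICULTY_EASY_MAX:    # 0-1 -> easy
        return "easy"
    if score <= DIFFICULTY_MEDIUM_MAX:  # 2-3 -> medium
        return "medium"
    return "hard"                       # 4+ -> hard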

Topic Classification

Customize keyword lists for better classification accuracy:

  • Agile keywords: scrum, sprint, kanban, retrospective, etc.
  • Predictive keywords: waterfall, wbs, gantt, critical path, etc.
  • Business Analysis keywords: requirements, business case, roi, swot, etc.

Edit keyword lists in extraction_config.py to match your content.

Advanced Options

Strict path mode (skip tests with missing PDFs):

python scripts/extract_questions.py --config config.json --strict-paths

Skip quick tests (faster execution):

python scripts/extract_questions.py --config config.json --skip-tests

Custom batch size: Edit the batch_size value for each test in your config file.

Technical Stack

  • Python 3.9+: Core language
  • pdfplumber 0.11.8: Primary PDF text extraction library
  • pypdf 6.2.0: Fallback PDF parser for reliability
  • pytest 8.3.3+: Testing framework with fixture support
  • GitHub Actions: CI/CD automation
  • OpenSpec: Spec-driven development methodology

OpenSpec

  • Proposal and tasks: openspec/changes/add-capm-pdf-extraction/
  • Project context: openspec/project.md

Use openspec validate --strict to validate proposals/specs.

Repository Structure

PDFtoJSON/
├── scripts/
│   ├── extract_questions.py      # Main extraction engine (628 lines)
│   ├── combine_outputs.py        # Batch combiner utility
│   └── ci_validate_outputs.py    # CI validation helper
├── tests/
│   └── test_parsing.py           # Unit tests with mocked PDFs
├── input/
│   ├── questions/                # Question PDFs (gitignored)
│   └── answers/                  # Answer PDFs (gitignored)
├── output/                       # Generated JSON batches (gitignored)
├── openspec/                     # Spec-driven development docs
│   ├── project.md                # Project context
│   └── changes/                  # Change proposals and specs
├── config.json                   # Multi-test configuration
├── config.local.json             # Local development config
├── requirements.txt              # Python dependencies
└── README.md                     # This file

Testing

Run unit tests (uses monkeypatched pdfplumber, no real PDFs required):

pytest -q

The test suite includes:

  • Multi-option parsing tests (A-G format)
  • Multi-line explanation handling
  • Various answer format validation
  • All tests use mocked PDF fixtures for fast, reproducible testing (see the sketch below)
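
A minimal sketch of the mocking approach (the real fixtures in tests/test_parsing.py differ in detail; the fake classes below are illustrative):

class FakePage:
    def __init__(self, text):
        self._text = text
    def extract_text(self):
        return self._text

class FakePDF:
    def __init__(self, pages):
        self.pages = pages
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False

def test_parses_options(monkeypatch):
    import pdfplumber
    fake = FakePDF([FakePage("1. What is a sprint?\nA. A timebox\nB. A document")])
    monkeypatch.setattr(pdfplumber, "open", lambda path: fake)
    # ...call the extractor's question-parsing function here and assert on the result.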

Combining Outputs

After extraction, you can combine batch files in several ways:

Combine all batches for a single test:

python scripts/combine_outputs.py --test "Practice Test 2"

Output: output/combined_Practice_Test_2.json

Auto-detect and combine all tests:

python scripts/combine_outputs.py --all-tests

This automatically detects all tests in output/ and creates a combined file with a testName field added to each question.

Specify multiple tests explicitly:

python scripts/combine_outputs.py --tests "Practice Test 1" "Practice Test 2" "Practice Test 3"

Custom output directory:

python scripts/combine_outputs.py --all-tests --output-dir /path/to/output

Advanced Usage

Command Line Arguments

extract_questions.py:

  • --config PATH: Path to configuration file (default: config.json)
  • --skip-tests: Skip running pytest validation after extraction
  • --strict-paths: Disable auto-resolution; only process tests with existing PDF paths

combine_outputs.py:

  • --test NAME: Combine batches for a single test
  • --tests NAME [NAME ...]: Combine batches for multiple specific tests
  • --all-tests: Auto-detect and combine all tests
  • --output-dir PATH: Custom output directory (default: output/)

Configuration Options

Edit config.json to customize extraction behavior:

{
  "tests": [
    {
      "name": "Practice Test 1",
      "questions_pdf": "input/questions/practice_test_1_questions.pdf",
      "answers_pdf": "input/answers/practice_test_1_answers.pdf",
      "batch_size": 15
    }
  ]
}
  • name: Identifier for the test (used in output filenames and question text)
  • questions_pdf: Path to PDF containing questions and options
  • answers_pdf: Path to PDF containing correct answers and explanations
  • batch_size: Number of questions per batch file (default: 15)

CI/CD Integration

The repository includes GitHub Actions workflow (.github/workflows/ci.yml) that:

  • Installs dependencies
  • Runs pytest test suite
  • Validates outputs automatically
  • Triggers on push and pull requests

Dependabot is configured for weekly dependency updates.

Troubleshooting

Common Issues

"No files matched pattern" error:

  • Cause: No batch files found for the specified test name
  • Solution: Check that extraction ran successfully and output files exist in output/ directory
  • Verify: ls output/CAPM_*_Questions_*.json

"Explanation too short" warnings:

  • Cause: Source PDFs have terse rationales (fewer than 100 characters) for some questions
  • Solution: This is a warning, not an error. Items are still included in output
  • Fix: Adjust MIN_EXPLANATION_LENGTH in extraction_config.py if needed

Missing questions in output:

  • Cause: Batching by count instead of by question-number ranges
  • Solution: The current code batches by number ranges, so this should not occur
  • Verify: Check that all 150 questions are present in the combined output

Import errors in CI:

  • Cause: scripts module not in PYTHONPATH
  • Solution: GitHub Actions workflow sets PYTHONPATH: ${{ github.workspace }}
  • Local fix: Run tests with PYTHONPATH=. pytest -q

PDF parsing fails:

  • Cause: pdfplumber cannot parse specific PDF format
  • Solution: Script automatically falls back to pypdf
  • Manual: Check PDF isn't corrupted or password-protected

Answer letter not in options:

  • Cause: Mismatch between extracted answer and question options
  • Solution: Review source PDFs for OCR errors or formatting issues
  • Log: Check extraction output for specific question numbers

Validation Failures

The validation step checks for:

  1. ✓ All required fields present and non-empty
  2. ✓ Text starts with [Practice Test X]
  3. ✓ Answer letters exist in question options
  4. ⚠️ Explanation length >= 100 characters (warning only)
  5. ✓ Topic is one of 4 valid CAPM domains
  6. ✓ Difficulty is 'easy', 'medium', or 'hard'

To adjust validation:

  • Edit thresholds in extraction_config.py
  • Modify the validate_json_output() function in extract_questions.py (a condensed sketch of the checks follows)
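
A condensed sketch of the kind of per-question checks involved (the real validate_json_output() is more thorough):

MIN_EXPLANATION_LENGTH = 100  # mirrors extraction_config.py
VALID_TOPICS = {
    "Project Management Fundamentals and Core Concepts",
    "Predictive, Plan-Based Methodologies",
    "Agile Frameworks/Methodologies",
    "Business Analysis Frameworks",
}

def validate_question(q: dict) -> list:
    issues = []
    required = ("text", "options", "correctAnswer", "explanation", "topic", "difficulty")
    missing = [field for field in required if not q.get(field)]
    if missing:
        return ["missing fields: " + ", ".join(missing)]
    if not q["text"].startswith("[Practice Test"):
        issues.append("text does not start with [Practice Test X]")
    for letter in q["correctAnswer"].split(","):
        if letter.strip() not in q["options"]:
            issues.append("answer letter not in options: " + letter.strip())
    if len(q["explanation"]) < MIN_EXPLANATION_LENGTH:
        issues.append("warning: explanation shorter than 100 characters")
    if q["topic"] not in VALID_TOPICS:
        issues.append("unknown topic")
    if q["difficulty"] not in {"easy", "medium", "hard"}:
        issues.append("invalid difficulty")
    return issues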

Debug Mode

For verbose output during extraction:

Structured debug logging is a planned enhancement; for now, add temporary print statements in the code while debugging.
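
If you want verbose output in the meantime, a standard logging setup like the following could be dropped into the script (a suggestion, not something the script currently provides):

import logging

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(levelname)s %(message)s")
logging.debug("parsed question %s with %d options", 18, 4)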

Getting Help

  1. Check GitHub Issues for similar problems
  2. Review openspec/project.md for project context
  3. Consult .github/copilot-instructions.md for code patterns
  4. Run validation: python scripts/ci_validate_outputs.py

Notes

  • Main extraction engine: scripts/extract_questions.py (628 lines)
  • Lazy imports: PDF parsers are imported lazily; the script works even without PDF libs when using sample fixtures or dry runs
  • Auto-resolution: Missing PDF paths are automatically resolved using intelligent naming conventions
  • Strict mode: Use --strict-paths flag to disable auto-resolution and only process tests with valid paths
  • Batch configuration: Default batch size is 15 questions; configurable per test in config.json

Publishing to GitHub

  1. Create a new GitHub repository and copy the HTTPS URL.

  2. Initialize git (if not already), add remote, and push:

git init
git add .
git commit -m "Initial CAPM PDF→JSON extractor"
git branch -M main
git remote add origin <YOUR_REPO_URL>
git push -u origin main
  3. CI will run on push (runs pytest). Large PDFs and outputs are ignored by .gitignore.

Contributing

Contributions are welcome! Please open issues or pull requests for:

Improvements:

  • Enhanced PDF parsing for edge cases (tables, complex layouts, special characters)
  • Additional CAPM topic classification keywords
  • Better difficulty estimation algorithms
  • Performance optimizations (caching, parallel processing)
  • Support for other exam formats beyond CAPM

Bug Reports:

  • PDF parsing failures or incorrect extractions
  • Answer matching issues
  • Validation false positives/negatives

Feature Requests:

  • Additional output formats (CSV, XML, database export)
  • Web UI for extraction monitoring
  • Real-time validation dashboards
  • Integration with exam prep platforms

Please follow existing code style and include tests for new features.

License

MIT License - See LICENSE file for details.
