CTFKnow is a comprehensive research framework designed to measure and enhance the capabilities of Large Language Models (LLMs) in solving Capture-the-Flag (CTF) cybersecurity challenges. This project provides a complete pipeline for automated data collection, knowledge extraction, question generation, and model evaluation in the cybersecurity domain.
- Automated CTF Write-up Collection: Scrapes high-quality write-ups from CTFtime.org
- Intelligent Knowledge Extraction: Uses LLMs to extract universal cybersecurity knowledge from write-ups
- Automated Question Generation: Creates both multiple-choice and open-ended questions
- Comprehensive Model Evaluation: Evaluates LLM performance on cybersecurity tasks
- Vulnerable Code Dataset: Builds datasets with vulnerable code snippets and exploitation scenarios
- Multi-Model Support: Compatible with various LLM providers (OpenAI, Replicate, etc.)
CTFKnow/
├── scraper/                        # Data collection module
│   ├── scraper.py                  # Main scraper orchestration
│   ├── ctft/                       # CTFtime.org specific scrapers
│   │   ├── get_writeup_url.py
│   │   ├── ctftime_scrape.py
│   │   └── souper.py
│   └── all.md                      # Competition list
├── dataset/                        # Data storage
│   ├── raw/                        # Raw write-up files
│   ├── list.json                   # Challenge metadata
│   ├── list_knwoledge_question.json
│   └── list_knwoledge_key.json
├── run.py                          # Main processing pipeline
├── prompts.py                      # LLM prompt templates
└── paper.pdf                       # Research paper
- Python 3.8+
- OpenAI API key (or other LLM provider)
- Required Python packages (see installation section)
- Clone the repository:
  git clone https://github.com/tszdanger/CTFKnow.git
  cd CTFKnow
- Install dependencies:
  pip install -r requirements.txt
- Set up API keys:
  export OPENAI_API_KEY="your-openai-api-key"  # For other providers, set the appropriate environment variables
# Collect write-ups from CTFtime.org
cd scraper
python scraper.py
# Extract knowledge from the collected write-ups
python run.py K -i dataset/list.json -o dataset/knowledge.json
# Generate questions from the extracted knowledge
python run.py Q -i dataset/knowledge.json -o dataset/questions.json -q dataset/question_list.json
# Multiple choice evaluation
python run.py E -M single -l gpt-4-0125-preview -o evaluation_log.json
# Open-ended question evaluation
python run.py E -M open -l gpt-4-0125-preview -o evaluation_log.json
The CTFKnow dataset includes:
- 13,000+ CTF Challenges from various competitions
- 6 Challenge Categories: Web, Pwn, Reverse, Crypto, Forensics, Misc
- 2019-2024 Time Span: Covers challenges from multiple years
- Difficulty Distribution: Normalized difficulty scores
- Quality Filtered: Only high-rated write-ups included
| Category | Count | Percentage |
|---|---|---|
| Web | ~2,500 | 19% |
| Pwn | ~2,000 | 15% |
| Reverse | ~2,200 | 17% |
| Crypto | ~2,800 | 22% |
| Forensics | ~1,800 | 14% |
| Misc | ~1,700 | 13% |
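To reproduce the category breakdown above from your own copy of the dataset, a minimal sketch along these lines tallies entries in dataset/list.json (the category field name used here is an assumption; adjust it to whatever key the metadata actually uses):
# Example: counting challenge categories (assumed metadata schema)
import json
from collections import Counter

# Load the challenge metadata collected by the scraper.
with open("dataset/list.json", "r", encoding="utf-8") as f:
    challenges = json.load(f)

# NOTE: "type" is an assumed field name for the challenge category.
counts = Counter(entry.get("type", "unknown") for entry in challenges)
total = sum(counts.values()) or 1

for category, count in counts.most_common():
    print(f"{category:<12} {count:>6} ({count / total:.1%})")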
The scraper module automatically collects CTF write-ups from CTFtime.org:
- Competition Discovery: Automatically finds CTF competitions
- Write-up Selection: Chooses highest-rated write-ups for each challenge
- Content Processing: Converts HTML to clean Markdown format
- Metadata Extraction: Captures challenge type, difficulty, and competition info
# Example: Scraping a specific competition
import asyncio
from scraper.ctft.get_writeup_url import list_writeups

# list_writeups is a coroutine, so run it inside an event loop
writeups = asyncio.run(list_writeups("https://ctftime.org/event/1234/tasks/"))
Extracts universal cybersecurity knowledge from write-ups:
- LLM-Powered Extraction: Uses GPT-3.5-turbo for knowledge identification
- Universal Knowledge: Focuses on transferable security concepts
- Structured Output: Generates standardized knowledge representations
- Payload Examples: Includes practical exploitation examples
# Example: Knowledge extraction
from run import Knowledge
extractor = Knowledge('dataset/list.json', 'dataset/knowledge.json')
extractor.extract()
Generates assessment questions from extracted knowledge:
- Multiple Choice Questions: Creates 4-option questions with distractors
- Open-Ended Questions: Generates short-answer questions
- Difficulty Scaling: Questions match original challenge difficulty
- Quality Control: Ensures question clarity and correctness
# Example: Question generation
from run import Question
generator = Question('dataset/knowledge.json', 'dataset/questions.json', 'dataset/question_list.json')
generator.generate()
Comprehensive evaluation of LLM performance:
- Multiple Metrics: Accuracy, precision, recall, F1-score
- Batch Processing: Efficient evaluation of large question sets
- Detailed Logging: Comprehensive evaluation logs
- Model Comparison: Easy comparison between different LLMs
# Example: Model evaluation
from run import Evaluation
evaluator = Evaluation('gpt-4-0125-preview', 'dataset/question_list.json')
results = evaluator.envaluate()
Evaluate how well different LLMs perform on cybersecurity tasks:
# Compare multiple models
python run.py E -M single -l gpt-4-0125-preview -o gpt4_results.json
python run.py E -M single -l claude-3-opus -o claude_results.json
python run.py E -M single -l llama-3-70b -o llama_results.json
Generate educational content for cybersecurity training:
# Generate questions for specific categories
python run.py Q -i dataset/web_knowledge.json -o web_questions.json -q web_question_list.json
Use as a standardized benchmark for security AI research:
# Full pipeline for research
python run.py K -i dataset/list.json -o dataset/knowledge.json
python run.py Q -i dataset/knowledge.json -o dataset/questions.json -q dataset/question_list.json
python run.py E -M single -l your-model -o research_results.json
Build datasets with vulnerable code for security research:
python run.py B -l dataset/list.json -o dataset/vulnerable_code.json
CTFKnow provides comprehensive evaluation metrics:
- Accuracy: Overall correct answer rate
- Category-wise Performance: Performance breakdown by challenge type
- Difficulty Analysis: Performance across different difficulty levels
- Question Type Analysis: Multiple choice vs. open-ended performance
- Confidence Analysis: Model confidence vs. accuracy correlation
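The exact metrics reported depend on the evaluation log format; as an illustration only, a sketch along these lines computes overall and per-category accuracy from a JSON log (the field names category, model_answer, and correct_answer are assumptions, not the actual schema written by run.py):
# Example: summarizing an evaluation log (assumed schema)
import json
from collections import defaultdict

with open("evaluation_log.json", "r", encoding="utf-8") as f:
    records = json.load(f)

totals, correct = defaultdict(int), defaultdict(int)
for rec in records:
    category = rec.get("category", "unknown")  # assumed field name
    totals[category] += 1
    if rec.get("model_answer") == rec.get("correct_answer"):  # assumed field names
        correct[category] += 1

overall = sum(correct.values()) / max(sum(totals.values()), 1)
print(f"Overall accuracy: {overall:.1%}")
for category in sorted(totals):
    print(f"  {category:<12} {correct[category] / totals[category]:.1%}")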
Modify prompts.py to customize LLM interactions:
# Example: Custom knowledge extraction prompt
CUSTOM_EXTRACTION_PROMPT = """
You are an expert cybersecurity analyst. Extract the core security concepts from this CTF write-up.
Focus on universal principles that apply across different scenarios.
"""Add support for new LLM providers in run.py:
# Example: Adding new model support
def query_custom_model(input, system_prompt):
    # Implement your model API call here
    response = your_model_api(input, system_prompt)
    return response, 1
Customize the data processing workflow:
# Example: Custom data preprocessing
from run import Knowledge

class CustomKnowledge(Knowledge):
    def preprocess_writeup(self, writeup_content):
        # Add your preprocessing logic here, e.g. strip boilerplate or normalize markdown
        processed_content = writeup_content
        return processed_content
Initializes the dataset and generates statistics.
init = Buildinit('dataset/list.json', 'dataset/raw', 'dataset/data.json')
init.build() # Build initial dataset
init.draw_graph()  # Generate statistics
Extracts cybersecurity knowledge from write-ups.
extractor = Knowledge('dataset/list.json', 'dataset/knowledge.json')
extractor.extract() # Extract knowledge
extractor.save_data()  # Save results
Generates questions from extracted knowledge.
generator = Question('dataset/knowledge.json', 'dataset/questions.json', 'dataset/question_list.json')
generator.generate() # Generate questions
generator.convert_questions()  # Convert to open-ended
Evaluates LLM performance on generated questions.
evaluator = Evaluation('model-name', 'dataset/question_list.json')
evaluator.envaluate() # Multiple choice evaluation
evaluator.envaluate_short_answer()  # Open-ended evaluation
We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch:
  git checkout -b feature/amazing-feature
- Commit your changes:
  git commit -m 'Add amazing feature'
- Push to the branch:
  git push origin feature/amazing-feature
- Open a Pull Request
# Clone and setup development environment
git clone https://github.com/tszdanger/CTFKnow.git
cd CTFKnow
pip install -r requirements-dev.txt
pre-commit install
This project is licensed under the MIT License - see the LICENSE file for details.
If you use CTFKnow in your research, please cite our paper:
@article{ji2025measuring,
title={Measuring and Augmenting Large Language Models for Solving Capture-the-Flag Challenges},
author={Ji, Zimo and Wu, Daoyuan and Jiang, Wenyuan and Ma, Pingchuan and Li, Zongjie and Wang, Shuai},
journal={arXiv preprint arXiv:2506.17644},
year={2025}
}