A comprehensive archive of FBI documents related to the infamous D.B. Cooper skyjacking case, extracted from the FBI Vault and converted to text.
FBI Vault link: https://vault.fbi.gov/D-B-Cooper%20
The D.B. Cooper Files Text repository provides a dataset of FBI case files related to the 1971 D.B. Cooper hijacking incident. It includes raw PDF documents obtained from the FBI Vault, along with scripts to convert these PDFs into plain text for research, analysis, and natural language processing.
.
├── download_script/ # Script to download D.B. Cooper PDFs from the FBI Vault
├── extraction_scripts/ # PDF-to-text conversion scripts
│ ├── linux/ # Linux-specific OCR (Tesseract)
│ └── macOS/ # macOS-specific OCR (Apple Vision)
├── extracted_text/ # Extracted text files from PDFs
├── azure/ # Azure AI Document Intelligence outputs
│ ├── pdf/ # PDF files used for Azure extraction
│ └── json/ # JSON outputs: extracted text, schema, and search backup
└── web-chat-ui/ # Chatbot frontend (Cloudflare Pages + Workers)
Date | Status | Extraction Method | Files Downloaded | Size | Total Files Listed |
---|---|---|---|---|---|
2025-05-12 | ✅ Complete | Apple Vision OCR | 106 | 1.86GB | 106 |
- Python 3.8+
- Tesseract OCR (for Linux scripts)
- macOS 10.15 (Catalina) or later (for Apple Vision scripts)
- Node.js 14+ (for web-chat-ui)
- Wrangler CLI (npm install -g wrangler)
- Clone the repository:
git clone https://github.com/noops888/db-cooper-files-text.git
cd db-cooper-files-text
- Install dependencies:
pip install -r requirements.txt
To fetch all PDF documents from the FBI Vault:
python download_script/download_cooper_docs.py
PDFs will be saved in download_script/pdfs/.
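To check a finished download against the status table above (106 files, 1.86GB), a short inventory script can help. This is a minimal sketch, not part of the repository: the helper name `summarize_pdfs` is hypothetical, and it assumes the PDFs sit directly inside download_script/pdfs/ with a lowercase `.pdf` extension.

```python
from pathlib import Path


def summarize_pdfs(pdf_dir):
    """Return (file_count, total_bytes) for the PDFs directly inside pdf_dir.

    Assumption: files use a lowercase ".pdf" extension and are not nested
    in subdirectories.
    """
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    total_bytes = sum(p.stat().st_size for p in pdfs)
    return len(pdfs), total_bytes


if __name__ == "__main__":
    count, total_bytes = summarize_pdfs("download_script/pdfs")
    # Report size in decimal gigabytes for comparison with the status table.
    print(f"{count} PDFs, {total_bytes / 1e9:.2f} GB")
```

If the count falls short of the table, re-running the download script is a reasonable first step.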
Convert PDFs to text:
- Linux (Tesseract):
python extraction_scripts/linux/tesseract_pdf_to_text.py \
  --input download_script/pdfs \
  --output extracted_text
- macOS (Apple Vision):
python extraction_scripts/macOS/apple_vision_ocr/apple_vision_pdf_to_text_parallel.py \
  --input download_script/pdfs \
  --output extracted_text
Plain text files are generated in extracted_text/, named after their source PDF.
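Because each text file is named after its source PDF, it is easy to spot PDFs that produced no output. The sketch below is illustrative only: the function name `missing_text_files` is hypothetical, and the `.txt` extension is an assumption, since the exact output naming is not spelled out here.

```python
from pathlib import Path


def missing_text_files(pdf_dir, text_dir, suffix=".txt"):
    """Return the sorted base names of PDFs that have no matching text file.

    Assumption: a PDF named "<name>.pdf" is converted to "<name><suffix>".
    """
    pdf_stems = {p.stem for p in Path(pdf_dir).glob("*.pdf")}
    text_stems = {p.stem for p in Path(text_dir).glob(f"*{suffix}")}
    return sorted(pdf_stems - text_stems)


if __name__ == "__main__":
    for name in missing_text_files("download_script/pdfs", "extracted_text"):
        print(f"No text output found for {name}.pdf")
```

An empty result suggests every PDF was converted; it does not say anything about OCR quality.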
JSON outputs from Azure AI Document Intelligence are stored in azure/json/, including the index schema and search backup.
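The layout of these JSON files is not described here, so a cautious first step is to list what each file contains before writing any parsing code. The following sketch makes no assumption about the schema; `describe_json_files` is a hypothetical helper that reports only the top-level keys (or the top-level type) of each file.

```python
import json
from pathlib import Path


def describe_json_files(json_dir):
    """Map each *.json file name to its top-level keys, or to its type name.

    No schema is assumed: objects are summarized by their sorted keys,
    anything else (list, string, number) by the Python type name.
    """
    summary = {}
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        if isinstance(data, dict):
            summary[path.name] = sorted(data.keys())
        else:
            summary[path.name] = type(data).__name__
    return summary


if __name__ == "__main__":
    for name, shape in describe_json_files("azure/json").items():
        print(f"{name}: {shape}")
```

Note that this loads each file fully into memory, which may be slow for very large outputs.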
The web-chat-ui/ directory contains a Cloudflare Pages site and API Functions for the chatbot.
- Navigate to the directory and install dependencies:
cd web-chat-ui
npm install
- Run locally:
wrangler pages dev
# or
npm run dev
- Open your browser to http://localhost:8787
- Modify functions/api/autoragConfig.js to adjust AI search parameters as needed.
Contributions are welcome! Please open issues or submit pull requests.
- Fork the repo.
- Create a branch:
git checkout -b feature/YourFeature
- Commit your changes.
- Push and open a PR.
The azure/ directory contains outputs from Azure AI Document Intelligence:
- azure/pdf/: Original PDF files supplied for extraction.
- azure/json/: JSON files containing extracted text, index schema, and a backup of search results.
All FBI documents are in the public domain. Scripts and code are licensed under the MIT License. See LICENSE for details.
- FBI Vault: D.B. Cooper
- Scripts from jfk-files-text