DB Cooper Files Text

A comprehensive archive of FBI documents related to the infamous D.B. Cooper skyjacking case, extracted from the FBI Vault and converted to text.

FBI Vault link: https://vault.fbi.gov/D-B-Cooper%20

Overview

The D.B. Cooper Files Text repository provides a dataset of FBI case files related to the 1971 D.B. Cooper hijacking incident. It includes raw PDF documents obtained from the FBI Vault, along with scripts to convert these PDFs into plain text for research, analysis, and natural language processing.

Project Structure

.
├── download_script/        # Script to download D.B. Cooper PDFs from the FBI Vault
├── extraction_scripts/     # PDF-to-text conversion scripts
│   ├── linux/              # Linux-specific OCR (Tesseract)
│   └── macOS/              # macOS-specific OCR (Apple Vision)
├── extracted_text/         # Extracted text files from PDFs
├── azure/                  # Azure AI Document Intelligence outputs
│   ├── pdf/                # PDF files used for Azure extraction
│   └── json/               # JSON outputs: extracted text, schema, and search backup
└── web-chat-ui/            # Chatbot frontend (Cloudflare Pages + Workers)

Current Status

| Date | Status | Extraction Method | Files Downloaded | Size | Total Files Listed |
| --- | --- | --- | --- | --- | --- |
| 2025-05-12 | ✅ Complete | Apple Vision OCR | 106 | 1.86 GB | 106 |

Prerequisites

Python 3.8+
Tesseract OCR (for Linux scripts)
macOS 10.15 (Catalina) or later (for Apple Vision scripts)
Node.js 14+ (for web-chat-ui)
Wrangler CLI (npm install -g wrangler)

Installation

Clone the repository:

git clone https://github.com/noops888/db-cooper-files-text.git
cd db-cooper-files-text

Install dependencies:
```
pip install -r requirements.txt
```

Usage

Download Script

To fetch all PDF documents from the FBI Vault:

python download_script/download_cooper_docs.py

PDFs will be saved in download_script/pdfs/.
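
After a download run, a quick sanity check can confirm that the files in download_script/pdfs/ are real PDFs rather than error pages. The following is a minimal sketch, not part of the repository: the helper name `check_pdfs` is hypothetical, it assumes the files use a lowercase `.pdf` extension, and the expected count of 106 is taken from the Current Status table.

```python
from pathlib import Path

# PDF files normally begin with these bytes; an HTML error page will not.
PDF_MAGIC = b"%PDF-"


def check_pdfs(pdf_dir, expected_count=None):
    """Return (valid, invalid) lists of *.pdf paths found in pdf_dir."""
    valid, invalid = [], []
    for path in sorted(Path(pdf_dir).glob("*.pdf")):
        with path.open("rb") as handle:
            header = handle.read(len(PDF_MAGIC))
        if header == PDF_MAGIC:
            valid.append(path)
        else:
            invalid.append(path)
    # Optional count check, e.g. against the Current Status table.
    if expected_count is not None and len(valid) != expected_count:
        print(f"expected {expected_count} valid PDFs, found {len(valid)}")
    return valid, invalid


if __name__ == "__main__":
    ok, bad = check_pdfs("download_script/pdfs", expected_count=106)
    print(f"{len(ok)} valid, {len(bad)} invalid")
```

Any path reported as invalid is a candidate for re-downloading before running the extraction scripts.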

Extraction Scripts

Convert PDFs to text:

Linux (Tesseract):

python extraction_scripts/linux/tesseract_pdf_to_text.py \
  --input download_script/pdfs \
  --output extracted_text

macOS (Apple Vision):

python extraction_scripts/macOS/apple_vision_ocr/apple_vision_pdf_to_text_parallel.py \
  --input download_script/pdfs \
  --output extracted_text

Extracted Text

Plain text files are generated in extracted_text/, named after their source PDF.
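
Because each text file is named after its source PDF, extraction coverage can be checked by comparing file names. This is a minimal sketch under two assumptions that the repository does not confirm here: that the text files share the PDF's base name, and that they use a `.txt` extension. The helper name `missing_text_files` is hypothetical.

```python
from pathlib import Path


def missing_text_files(pdf_dir, text_dir, suffix=".txt"):
    """List PDFs in pdf_dir that have no same-named text file in text_dir."""
    # Base names (without extension) of every extracted text file.
    text_stems = {p.stem for p in Path(text_dir).glob(f"*{suffix}")}
    return [
        pdf.name
        for pdf in sorted(Path(pdf_dir).glob("*.pdf"))
        if pdf.stem not in text_stems
    ]


if __name__ == "__main__":
    for name in missing_text_files("download_script/pdfs", "extracted_text"):
        print(f"no text output for {name}")
```

An empty result means every downloaded PDF has a matching text file; it does not say anything about OCR quality.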

Azure JSON

JSON outputs from Azure AI Document Intelligence are stored in azure/json/, including the index schema and search backup.
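
The layout of these JSON files is not described here, so a reasonable first step is to inspect their top-level structure before writing any parsing code. The sketch below assumes only that the files end in `.json` and are UTF-8 encoded; it makes no assumption about the Azure schema, and the helper name `summarize_json` is hypothetical.

```python
import json
from pathlib import Path


def summarize_json(json_dir):
    """Map each *.json file name to a short description of its top level."""
    summary = {}
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)  # loads the whole file into memory
        if isinstance(data, dict):
            summary[path.name] = sorted(data)  # sorted top-level keys
        elif isinstance(data, list):
            summary[path.name] = f"list of {len(data)} items"
        else:
            summary[path.name] = type(data).__name__
    return summary


if __name__ == "__main__":
    for name, description in summarize_json("azure/json").items():
        print(name, description)
```

Large files such as the search backup are read fully into memory by `json.load`, which may be slow on machines with little RAM.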

Web Chat UI

The web-chat-ui/ directory contains the chatbot frontend: a Cloudflare Pages site and the API Functions that back it.

Navigate to the directory and install dependencies:

cd web-chat-ui
npm install

Run locally:

wrangler pages dev
# or
npm run dev

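Open your browser to the local URL that Wrangler prints on startup, for example http://localhost:8787 (`wrangler pages dev` listens on port 8788 unless configured otherwise).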
Modify functions/api/autoragConfig.js to adjust AI search parameters as needed.

Contributing

Contributions are welcome! Please open issues or submit pull requests.

Fork the repo.
Create a branch: git checkout -b feature/YourFeature
Commit your changes.
Push and open a PR.

Azure Extraction

The azure/ directory contains outputs from Azure AI Document Intelligence:

azure/pdf/: Original PDF files supplied for extraction.
azure/json/: JSON files containing extracted text, index schema, and a backup of search results.

License

All FBI documents are in the public domain. Scripts and code are licensed under the MIT License. See LICENSE for details.

Acknowledgements

FBI Vault: D.B. Cooper
Scripts from jfk-files-text
