A powerful and reusable Python scraper framework designed for efficiently crawling documentation websites and preparing content for AI training. It features async operations, structured output formats, and intelligent content processing.

## Features
- 🔁 Recursive crawling from a single starting URL
- 🧠 Intelligent content extraction and structuring
- 📄 Multiple export formats (JSON, TXT, PDF)
- ⚡ Async operations for improved performance
- 🔄 Automatic retry mechanism
- 📊 Token counting and chunking
- 📁 Organized output structure
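The token counting and chunking feature can be sketched as below. This is a minimal illustration using a whitespace-based token estimate; the actual project may use a real tokenizer (e.g. tiktoken), and the names `count_tokens` and `chunk_text` are illustrative, not the project's API:

```python
def count_tokens(text: str) -> int:
    # Rough token estimate: whitespace-delimited words.
    # A production scraper would use a real tokenizer instead.
    return len(text.split())

def chunk_text(text: str, max_tokens: int = 100) -> list[str]:
    # Greedily pack words into chunks of at most max_tokens tokens each.
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[i:i + max_tokens]))
    return chunks
```

Each chunk's token count is then recorded alongside the text, which keeps exported chunks within a model-friendly size.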
## Requirements

- Python 3.8+
- Required packages listed in `requirements.txt`
## Installation

1. Clone the repository:

   ```bash
   git clone [repository-url]
   cd AIContextScraper
   ```

2. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: .\venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

1. Run the scraper:

   ```bash
   python main.py
   ```

2. Follow the prompts:
- Enter the documentation website URL
- Specify a project name (or use default)
- Choose whether to export PDFs
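The prompt handling can be sketched as follows; this is a hedged illustration of how defaults and yes/no answers might be read, not the project's actual implementation (the helper names `prompt_with_default` and `prompt_yes_no` are made up here):

```python
def prompt_with_default(label: str, default: str, reader=input) -> str:
    # Ask the user for a value; fall back to the default on empty input.
    value = reader(f"{label} [{default}]: ").strip()
    return value or default

def prompt_yes_no(label: str, reader=input) -> bool:
    # Interpret "y"/"yes" (any case) as True, anything else as False.
    return reader(f"{label} (y/n): ").strip().lower() in {"y", "yes"}
```

Injecting `reader` makes the prompts testable without a live terminal.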
## Output Structure

```
D:/AI_Training_Corpora/PROJECT_NAME/
├── raw_html/       # Original HTML content
├── json/           # Structured content with metadata
├── txt/            # Chunked text content
├── pdf/            # PDF exports (optional)
├── logs/           # Execution logs
└── metadata.json   # Run statistics and summary
```
## JSON Output Format

```json
{
  "title": "Page Title",
  "url": "https://example.com/docs/page",
  "content": "Extracted and cleaned content...",
  "tokens": 184,
  "timestamp": "2023-12-25T20:15:23Z"
}
```

## Configuration

Adjust settings in `config.py`:
- HTTP request parameters
- Crawling limits
- Content processing options
- Output formatting
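A `config.py` covering those areas might look like the sketch below; every option name and value here is an illustrative guess, not the project's actual settings:

```python
# config.py -- illustrative example; actual option names may differ.

# HTTP request parameters
REQUEST_TIMEOUT = 30          # seconds per request
MAX_RETRIES = 3               # automatic retry attempts
USER_AGENT = "AIContextScraper/1.0"

# Crawling limits
MAX_PAGES = 500               # stop after this many pages
MAX_DEPTH = 5                 # link depth from the start URL
CONCURRENT_REQUESTS = 10      # async workers

# Content processing options
MAX_TOKENS_PER_CHUNK = 1000   # chunk size for txt export
STRIP_NAVIGATION = True       # drop nav/sidebar elements

# Output formatting
EXPORT_PDF = False            # PDF export is opt-in
JSON_INDENT = 2
```

Keeping all tunables in one module means the crawler, exporters, and retry logic share a single source of truth.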
## Roadmap

- Direct embedding export for vector databases
- Automatic content classification
- Markdown export support
- Browser-based crawling for JS-heavy sites
- Scheduled updates for documentation sites
## License

MIT License - feel free to use and modify for your needs.