
Web Resource Scraper

A Python tool that automatically downloads PDF and PowerPoint files from websites, mirroring each site's hierarchy in the local folder structure.

Features

🔍 Smart Resource Detection

  • Automatically finds PDF, PPT, and PPTX files on web pages
  • Detects both direct links and embedded resources
  • Handles various URL formats and edge cases

📁 Intelligent Folder Organization

  • Creates nested folder structures based on website hierarchy
  • Uses page titles and URL paths for meaningful folder names
  • Maintains logical organization that mirrors the source website

🏷️ Smart File Naming

  • Generates descriptive filenames from link text
  • Cleans up filenames to be filesystem-compatible (see the sketch below)
  • Handles duplicate files automatically
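
A minimal sketch of the kind of clean-up involved; the rules the script actually applies may differ, and safe_filename is an illustrative helper, not part of the script:

import re

def safe_filename(link_text: str, extension: str) -> str:
    """Turn arbitrary link text into a filesystem-safe name."""
    name = re.sub(r"[^\w\-]+", "_", link_text.strip()).strip("_").lower()
    return f"{name or 'resource'}{extension}"

# safe_filename("Advent Hope (PDF)", ".pdf") -> "advent_hope_pdf.pdf"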

🛡️ Respectful Scraping

  • Includes delays between requests to avoid overwhelming servers
  • Uses appropriate headers to identify the scraper (sketched below)
  • Tracks downloaded files to avoid duplicates
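
A hedged sketch of what "appropriate headers" can look like; the exact User-Agent string the script sends may differ:

import requests

session = requests.Session()
# Identify the tool honestly; a contact address is a common courtesy.
session.headers.update({
    "User-Agent": "web-resource-scraper/1.0 (personal/educational use)"
})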

📋 Preview Mode

  • See what files will be downloaded before actually downloading
  • Preview the folder structure that will be created
  • Verify file names and locations

Installation

Prerequisites

  • Python 3.7 or higher
  • pip (Python package installer)

Step 1: Clone the Repository

git clone https://github.com/webpoint-solutions-llc/web-resource-scraper.git
cd web-resource-scraper

Step 2: Create a Virtual Environment (Recommended)

# Create virtual environment
python -m venv scraper_env

# Activate virtual environment
# On Windows:
scraper_env\Scripts\activate
# On macOS/Linux:
source scraper_env/bin/activate

Step 3: Install Dependencies

pip install -r requirements.txt

For a minimal installation with just the essential dependencies:

pip install -r requirements-minimal.txt

Or install dependencies manually:

pip install "requests>=2.31.0" "beautifulsoup4>=4.12.0"

Quick Start

Basic Usage

  1. Edit the script to add your target URLs:
pages_to_scrape = [
    "https://example.com/resources",
    "https://example.com/documents",
    # Add more URLs here
]
  2. Run in preview mode to see what will be downloaded:
python resource_scraper.py
  3. Enable actual downloading by uncommenting the download section in the script.

Example Usage

from resource_scraper import ResourceScraper

# Initialize scraper
scraper = ResourceScraper("https://example.com", download_folder="my_downloads")

# URLs to scrape
urls = [
    "https://example.com/page1",
    "https://example.com/page2"
]

# Preview what will be downloaded
scraper.scrape_multiple_pages(urls, preview_only=True)

# Actually download the files
downloaded = scraper.scrape_multiple_pages(urls)

Configuration Options

Scraper Parameters

scraper = ResourceScraper(
    base_url="https://example.com",           # Base URL for the website
    download_folder="downloaded_resources"    # Folder to save files
)

Customizing File Types

By default, the scraper looks for PDF, PPT, and PPTX files. To modify this, edit the find_resources() method:

# Look for additional file types
if any(full_url.lower().endswith(ext) for ext in [".pdf", ".ppt", ".pptx", ".doc", ".docx"]):
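
Links that carry query strings (e.g. report.pdf?download=1) will slip past a plain endswith() check. One hedged variant compares against the parsed URL path instead; is_resource is an illustrative helper, not in the script:

from urllib.parse import urlparse

RESOURCE_EXTENSIONS = (".pdf", ".ppt", ".pptx", ".doc", ".docx")

def is_resource(url: str) -> bool:
    # Compare against the path only, so query strings are ignored
    return urlparse(url).path.lower().endswith(RESOURCE_EXTENSIONS)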

Adjusting Delays

To speed scraping up or slow it down, adjust the delay values:

# In download_file method
time.sleep(0.5)  # Delay between individual file downloads

# In scrape_multiple_pages method  
time.sleep(1)    # Delay between pages

Folder Structure

The scraper creates folders based on the website's structure:

downloaded_resources/
├── special-sabbaths/
│   ├── advent_hope_pdf.pdf
│   └── mission_emphasis_pptx.pptx
├── evangelism/
│   ├── public_evangelism_guide.pdf
│   └── personal_witnessing_slides.pptx
└── health-ministry/
    ├── health_seminar_materials.pdf
    └── cooking_class_presentation.pptx
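
The actual rules live in get_folder_path() and also draw on page titles; a minimal sketch of the URL-path half of the idea (folder_for is illustrative, not the script's API):

import re
from pathlib import Path
from urllib.parse import urlparse

def folder_for(page_url: str, root: str = "downloaded_resources") -> Path:
    # /ministries/health-ministry -> downloaded_resources/ministries/health-ministry
    segments = [s for s in urlparse(page_url).path.split("/") if s]
    safe = [re.sub(r"[^\w\-]+", "_", s).strip("_") for s in segments]
    return Path(root, *safe)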

Advanced Usage

Custom Folder Naming

Override the folder naming logic by modifying the get_folder_path() method:

def get_folder_path(self, page_url, page_title):
    # Your custom logic here. Example: group files by page title
    # (self.download_folder is assumed from the constructor parameter).
    safe_title = page_title.lower().replace(" ", "_")
    return os.path.join(self.download_folder, safe_title)

Filtering Resources

Add custom filtering logic in the find_resources() method:

# Example: only download files whose link text contains specific keywords
# (link_text and resource_info come from the link loop inside find_resources)
if any(keyword in link_text.lower() for keyword in ['guide', 'manual', 'presentation']):
    resources.append(resource_info)

Handling Authentication

For sites requiring authentication, modify the session:

# Add authentication
scraper.session.auth = ('username', 'password')

# Or use cookies
scraper.session.cookies.update({'session_id': 'your_session_id'})

Troubleshooting

Common Issues

Permission Errors

# Make sure you have write permissions to the download folder
chmod 755 downloaded_resources/

SSL Certificate Errors

# Add to scraper initialization if needed
scraper.session.verify = False  # Not recommended for production

Rate Limiting

# Increase delays if you're being rate limited
time.sleep(2)  # Increase delay between requests
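
For sites that answer with HTTP 429, a small retry-with-backoff wrapper can help. This is a hedged sketch, not part of the scraper; it assumes you hold a requests.Session:

import time
import requests

def get_with_backoff(session: requests.Session, url: str, retries: int = 3) -> requests.Response:
    # Retry on 429 with exponentially growing waits: 1s, 2s, 4s, ...
    for attempt in range(retries):
        response = session.get(url, timeout=30)
        if response.status_code != 429:
            return response
        time.sleep(2 ** attempt)
    response.raise_for_status()  # give up: surface the final 429
    return response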

Debug Mode

Enable verbose output by adding print statements or using Python's logging module:

import logging
logging.basicConfig(level=logging.DEBUG)
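
If the script logs through logging.getLogger(__name__) (an assumption; it may rely on print statements only), you can stay verbose without drowning in library noise:

import logging

# DEBUG for the scraper's own logger, WARNING for requests/urllib3
logging.basicConfig(level=logging.WARNING)
logging.getLogger("resource_scraper").setLevel(logging.DEBUG)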

File Structure

web-resource-scraper/
├── resource_scraper.py          # Main scraper class
├── requirements.txt             # Full dependencies
├── requirements-minimal.txt     # Essential dependencies only
├── README.md                   # This file
├── examples/                   # Example scripts
│   ├── basic_usage.py
│   └── advanced_usage.py
└── tests/                      # Unit tests (optional)
    └── test_scraper.py

Dependencies

Core Dependencies

  • requests (>=2.31.0) - HTTP library for making web requests
  • beautifulsoup4 (>=4.12.0) - HTML/XML parser for extracting data

Optional Dependencies (in full requirements.txt)

  • lxml (>=4.9.0) - Faster XML/HTML parser backend
  • urllib3 (>=2.0.0) - Low-level HTTP client
  • certifi (>=2023.7.22) - SSL certificate bundle
  • chardet (>=5.2.0) - Character encoding detection

Standard Library Modules (no installation needed)

  • pathlib - Object-oriented filesystem paths (Python 3.4+)
  • time - Time-related functions
  • os - Operating system interface
  • re - Regular expressions

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Commit your changes (git commit -am 'Add new feature')
  4. Push to the branch (git push origin feature/new-feature)
  5. Create a Pull Request

Ethics and Legal Considerations

Responsible Usage

  • Respect robots.txt: Check the website's robots.txt file before scraping
  • Rate limiting: Don't overwhelm servers with too many requests
  • Terms of service: Review and comply with website terms of service
  • Copyright: Ensure you have permission to download copyrighted content
  • Personal use: This tool is designed for personal/educational use

Best Practices

  1. Always test with a small number of pages first
  2. Use preview mode to verify what will be downloaded
  3. Implement appropriate delays between requests
  4. Monitor your network usage and server response times
  5. Be prepared to stop scraping if requested by site administrators

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

If you encounter issues or have questions:

  1. Check the troubleshooting section above
  2. Search existing issues on GitHub
  3. Create a new issue with detailed information about your problem
  4. Include error messages, URLs being scraped, and your Python version

Changelog

v1.0.0

  • Initial release
  • Basic PDF/PPT scraping functionality
  • Nested folder structure creation
  • Preview mode
  • Smart filename generation

⚠️ Disclaimer: This tool is for educational and personal use only. Users are responsible for ensuring compliance with website terms of service, copyright laws, and ethical scraping practices.
