A Python tool for automatically downloading PDF and PowerPoint files from websites while maintaining the site's hierarchical folder structure.
🔍 Smart Resource Detection
- Automatically finds PDF, PPT, and PPTX files on web pages
- Detects both direct links and embedded resources
- Handles various URL formats and edge cases
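Under the hood, detection boils down to scanning anchor tags and resolving relative links. The helper below is only an illustrative sketch; the function name `find_document_links` is made up here and is not the scraper's actual `find_resources()` implementation:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def find_document_links(page_url):
    """Illustrative sketch: collect absolute URLs of PDF/PPT/PPTX links on a page."""
    response = requests.get(page_url, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        full_url = urljoin(page_url, anchor["href"])  # resolve relative links
        if full_url.lower().split("?")[0].endswith((".pdf", ".ppt", ".pptx")):
            links.append(full_url)
    return links
```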
📁 Intelligent Folder Organization
- Creates nested folder structures based on website hierarchy
- Uses page titles and URL paths for meaningful folder names
- Maintains logical organization that mirrors the source website
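Conceptually, the nesting mirrors the path segments of each page's URL. A rough sketch of the idea (not the scraper's exact `get_folder_path()` logic; `folder_for_page` is a hypothetical helper):

```python
from pathlib import Path
from urllib.parse import urlparse

def folder_for_page(page_url, download_root="downloaded_resources"):
    """Illustrative sketch: mirror a page's URL path as nested folders."""
    segments = [s for s in urlparse(page_url).path.split("/") if s]
    folder = Path(download_root).joinpath(*segments)
    folder.mkdir(parents=True, exist_ok=True)  # create any missing parent folders
    return folder
```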
🏷️ Smart File Naming
- Generates descriptive filenames from link text
- Cleans up filenames to be filesystem-compatible
- Handles duplicate files automatically
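Filename generation can be thought of as stripping unsafe characters from the link text and suffixing a counter when a name already exists. A minimal sketch (the `safe_filename` helper is hypothetical, not the scraper's actual code):

```python
import re
from pathlib import Path

def safe_filename(link_text, file_url, folder):
    """Illustrative sketch: build a filesystem-safe, non-colliding filename."""
    extension = Path(file_url.split("?")[0]).suffix  # e.g. ".pdf"
    base = re.sub(r"[^\w\- ]", "", link_text).strip().replace(" ", "_").lower() or "document"
    candidate = Path(folder) / f"{base}{extension}"
    counter = 1
    while candidate.exists():  # handle duplicates by appending a counter
        candidate = Path(folder) / f"{base}_{counter}{extension}"
        counter += 1
    return candidate
```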
🛡️ Respectful Scraping
- Includes delays between requests to avoid overwhelming servers
- Uses appropriate headers to identify the scraper
- Tracks downloaded files to avoid duplicates
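In practice this is the familiar pattern of a shared session with an explicit User-Agent, a pause after each request, and a set of already-downloaded URLs. A sketch of the idea (names such as `polite_download` are illustrative, not the scraper's API):

```python
import time
import requests

session = requests.Session()
# Identify the scraper; adjust the User-Agent string for your own use
session.headers.update({"User-Agent": "web-resource-scraper/1.0 (personal/educational use)"})

downloaded_urls = set()  # remember what has already been fetched

def polite_download(url, delay=0.5):
    """Illustrative sketch: skip duplicates and pause between requests."""
    if url in downloaded_urls:
        return None
    response = session.get(url, timeout=30)
    downloaded_urls.add(url)
    time.sleep(delay)  # be gentle on the server
    return response.content
```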
📋 Preview Mode
- See what files will be downloaded before actually downloading
- Preview the folder structure that will be created
- Verify file names and locations
- Python 3.7 or higher
- pip (Python package installer)
```bash
git clone https://github.com/yourusername/web-resource-scraper.git
cd web-resource-scraper

# Create virtual environment
python -m venv scraper_env

# Activate virtual environment
# On Windows:
scraper_env\Scripts\activate
# On macOS/Linux:
source scraper_env/bin/activate
```

```bash
pip install -r requirements.txt
```

For a minimal installation with just the essential dependencies:

```bash
pip install -r requirements-minimal.txt
```

Or install dependencies manually:
```bash
pip install "requests>=2.31.0" "beautifulsoup4>=4.12.0"
```

- Edit the script to add your target URLs:
```python
pages_to_scrape = [
    "https://example.com/resources",
    "https://example.com/documents",
    # Add more URLs here
]
```

- Run in preview mode to see what will be downloaded:
```bash
python resource_scraper.py
```

- Enable actual downloading by uncommenting the download section in the script.
```python
from resource_scraper import ResourceScraper

# Initialize scraper
scraper = ResourceScraper("https://example.com", download_folder="my_downloads")

# URLs to scrape
urls = [
    "https://example.com/page1",
    "https://example.com/page2"
]

# Preview what will be downloaded
scraper.scrape_multiple_pages(urls, preview_only=True)

# Actually download the files
downloaded = scraper.scrape_multiple_pages(urls)
```

The scraper's behavior is configured through its constructor:

```python
scraper = ResourceScraper(
    base_url="https://example.com",          # Base URL for the website
    download_folder="downloaded_resources"   # Folder to save files
)
```

By default, the scraper looks for PDF, PPT, and PPTX files. To modify this, edit the `find_resources()` method:
```python
# Look for additional file types
if any(full_url.lower().endswith(ext) for ext in [".pdf", ".ppt", ".pptx", ".doc", ".docx"]):
```

To scrape more or less aggressively, adjust the delay values:
```python
# In download_file method
time.sleep(0.5)  # Delay between individual file downloads

# In scrape_multiple_pages method
time.sleep(1)    # Delay between pages
```

The scraper creates folders based on the website's structure:
```
downloaded_resources/
├── special-sabbaths/
│   ├── advent_hope_pdf.pdf
│   └── mission_emphasis_pptx.pptx
├── evangelism/
│   ├── public_evangelism_guide.pdf
│   └── personal_witnessing_slides.pptx
└── health-ministry/
    ├── health_seminar_materials.pdf
    └── cooking_class_presentation.pptx
```
Override the folder naming logic by modifying the get_folder_path() method:
```python
def custom_folder_naming(self, page_url, page_title):
    # Your custom logic here
    return custom_folder_path
```
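For example, a replacement might build the folder name from a slug of the page title and fall back to the last URL path segment. This is only a sketch and assumes the scraper exposes a `download_folder` attribute matching the constructor argument shown earlier:

```python
import re
from pathlib import Path
from urllib.parse import urlparse

def custom_folder_naming(self, page_url, page_title):
    """Illustrative override: folder name from the page title, else the URL path."""
    name = re.sub(r"[^\w\- ]", "", page_title or "").strip().replace(" ", "-").lower()
    if not name:
        # Fall back to the last path segment of the page URL
        segments = [s for s in urlparse(page_url).path.split("/") if s]
        name = segments[-1] if segments else "misc"
    # self.download_folder is assumed to hold the configured download directory
    return Path(self.download_folder) / name
```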
Add custom filtering logic in the `find_resources()` method:

```python
# Example: Only download files with specific keywords
if any(keyword in link_text.lower() for keyword in ['guide', 'manual', 'presentation']):
    resources.append(resource_info)
```

For sites requiring authentication, modify the session:
```python
# Add authentication
scraper.session.auth = ('username', 'password')

# Or use cookies
scraper.session.cookies.update({'session_id': 'your_session_id'})
```

Permission Errors
```bash
# Make sure you have write permissions to the download folder
chmod 755 downloaded_resources/
```

SSL Certificate Errors
```python
# Add to scraper initialization if needed
scraper.session.verify = False  # Not recommended for production
```

Rate Limiting
```python
# Increase delays if you're being rate limited
time.sleep(2)  # Increase delay between requests
```

Enable verbose output by adding print statements or using Python's logging module:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

Project structure:

```
web-resource-scraper/
├── resource_scraper.py         # Main scraper class
├── requirements.txt            # Full dependencies
├── requirements-minimal.txt    # Essential dependencies only
├── README.md                   # This file
├── examples/                   # Example scripts
│   ├── basic_usage.py
│   └── advanced_usage.py
└── tests/                      # Unit tests (optional)
    └── test_scraper.py
```
- requests (>=2.31.0) - HTTP library for making web requests
- beautifulsoup4 (>=4.12.0) - HTML/XML parser for extracting data
- lxml (>=4.9.0) - Faster XML/HTML parser backend
- urllib3 (>=2.0.0) - Low-level HTTP client
- certifi (>=2023.7.22) - SSL certificate bundle
- chardet (>=5.2.0) - Character encoding detection
- pathlib - Object-oriented filesystem paths (Python 3.4+)
- time - Time-related functions
- os - Operating system interface
- re - Regular expressions
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Create a Pull Request
- Respect robots.txt: Check the website's robots.txt file before scraping (see the sketch after this list)
- Rate limiting: Don't overwhelm servers with too many requests
- Terms of service: Review and comply with website terms of service
- Copyright: Ensure you have permission to download copyrighted content
- Personal use: This tool is designed for personal/educational use
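For the robots.txt point above, Python's standard library can do the check; a minimal sketch (the `allowed_by_robots` helper is illustrative, not part of the scraper):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(page_url, user_agent="web-resource-scraper"):
    """Illustrative sketch: consult a site's robots.txt before scraping a URL."""
    parts = urlparse(page_url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, page_url)
```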
- Always test with a small number of pages first
- Use preview mode to verify what will be downloaded
- Implement appropriate delays between requests
- Monitor your network usage and server response times
- Be prepared to stop scraping if requested by site administrators
This project is licensed under the MIT License - see the LICENSE file for details.
If you encounter issues or have questions:
- Check the troubleshooting section above
- Search existing issues on GitHub
- Create a new issue with detailed information about your problem
- Include error messages, URLs being scraped, and your Python version
- Initial release
- Basic PDF/PPT scraping functionality
- Nested folder structure creation
- Preview mode
- Smart filename generation