A respectful Python scraper for downloading academic papers from economics conference websites for research purposes.
- Downloads PDFs from economics conference pages
- Respects robots.txt automatically with built-in compliance checking
- Maintains detailed logs
- Progress tracking with tqdm
- Fallback from Selenium to requests-based scraping
- Automatic paper ID extraction from JavaScript
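How paper IDs appear in a conference page's JavaScript is not described here, so the following is only a minimal sketch. It assumes, purely for illustration, that IDs are passed to a hypothetical `downloadPaper(<id>)` call in inline script; the function name and pattern are assumptions, not the scraper's actual logic.

```python
import re
from typing import List

# Hypothetical pattern: assumes IDs appear as downloadPaper(123) or downloadPaper('123').
PAPER_ID_PATTERN = re.compile(r"downloadPaper\(\s*['\"]?(\d+)['\"]?\s*\)")


def extract_paper_ids(html: str) -> List[str]:
    """Return unique paper IDs in order of first appearance."""
    seen = set()
    ids = []
    for match in PAPER_ID_PATTERN.finditer(html):
        paper_id = match.group(1)
        if paper_id not in seen:
            seen.add(paper_id)
            ids.append(paper_id)
    return ids


sample = """
<a href="#" onclick="downloadPaper(101)">Paper A</a>
<a href="#" onclick="downloadPaper('202')">Paper B</a>
<a href="#" onclick="downloadPaper(101)">Paper A (again)</a>
"""
print(extract_paper_ids(sample))
```

Deduplicating while preserving order keeps the download queue stable when the same paper is linked more than once on a page.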
- Clone the repository:
    git clone https://github.com/yourusername/econ-conference-scraper.git
    cd econ-conference-scraper
- Create a virtual environment:
    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
Run the quick start script:
    ./run_scraper.sh
Or use the scraper from Python:
    from conference_scraper import ConferenceScraper
    scraper = ConferenceScraper(
        conference_url='https://example-conference.org/your-conference-url',
        download_dir='downloads',
        log_dir='logs',
        delay=2.0  # Seconds between requests
    )
    # Extract and download papers
    papers = scraper.extract_papers_from_page()
    scraper.download_papers()
Edit the conference URL in run_scraper.sh or run directly:
    python conference_scraper.py
Configuration options:
- delay: Time between requests (default: 2 seconds, automatically respects robots.txt crawl-delay if specified)
- download_dir: Where PDFs are saved (default: downloads/)
- log_dir: Where logs are saved (default: logs/)
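The `delay` option amounts to a minimum gap between consecutive requests. A minimal sketch of that idea follows, using a hypothetical `RateLimiter` class that is an illustration only and is not taken from `conference_scraper.py`.

```python
import time


class RateLimiter:
    """Block until at least `delay` seconds have passed since the last call."""

    def __init__(self, delay: float = 2.0):
        self.delay = delay
        self._last_request = None  # monotonic timestamp of the previous request

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        slept = 0.0
        now = time.monotonic()
        if self._last_request is not None:
            remaining = self.delay - (now - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_request = time.monotonic()
        return slept


limiter = RateLimiter(delay=0.2)  # short delay so the demo finishes quickly
for url in ["page-1", "page-2", "page-3"]:
    limiter.wait()
    print("fetching", url)
```

Using `time.monotonic()` rather than wall-clock time keeps the gap correct even if the system clock is adjusted mid-run.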
This scraper is designed for academic research purposes only:
- robots.txt checking: Automatically fetches and respects robots.txt rules
- Rate limiting: Enforces delays between requests
- User-Agent identification: Clearly identifies itself as an educational/research tool
- Crawl-delay: Automatically respects crawl-delay if specified in robots.txt
- Cite properly: Always cite papers appropriately
- Personal use: Download papers for personal research only
- Don't redistribute: Papers are copyrighted by their respective institutions and authors
- Respect server resources: Don't run multiple instances simultaneously
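The robots.txt and crawl-delay checks listed above can be approximated with the standard library's `urllib.robotparser`. This is a sketch under stated assumptions: the `USER_AGENT` token and the sample robots.txt content are hypothetical, and the scraper's own implementation may differ.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "EconConferenceScraper"  # hypothetical User-Agent token

# Example robots.txt content; a real run would fetch it from the target site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def effective_delay(parser: RobotFileParser, configured: float) -> float:
    """Use the larger of the configured delay and the site's Crawl-delay."""
    crawl_delay = parser.crawl_delay(USER_AGENT)  # None if not specified
    if crawl_delay is None:
        return configured
    return max(configured, float(crawl_delay))


rules = load_rules(ROBOTS_TXT)
print(rules.can_fetch(USER_AGENT, "https://example-conference.org/papers/1.pdf"))
print(rules.can_fetch(USER_AGENT, "https://example-conference.org/private/1.pdf"))
print(effective_delay(rules, configured=2.0))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` would replace the inline sample text.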
- PDFs: Saved in the downloads/ directory as Paper_[id].pdf
- Logs: Detailed logs in the logs/ directory
- Summary CSV: Download summary with status for each paper
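The column layout of the summary CSV is not documented here. As a rough sketch, assuming hypothetical columns `paper_id`, `filename`, and `status`, a per-paper summary could be written with the standard `csv` module:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical column names; the actual summary layout is not documented here.
FIELDNAMES = ["paper_id", "filename", "status"]


def write_summary(results, path):
    """Write one row per paper to a CSV file and return the path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(results)
    return path


results = [
    {"paper_id": "101", "filename": "Paper_101.pdf", "status": "downloaded"},
    {"paper_id": "202", "filename": "", "status": "not_available"},
]

# Written to a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    summary_path = write_summary(results, Path(tmp) / "download_summary.csv")
    print(summary_path.read_text(encoding="utf-8"))
```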
- Python 3.7+
- Safari/Chrome browser (optional, falls back to requests if unavailable)
- Dependencies listed in requirements.txt
- Some papers may not be available for download yet
- Papers are identified by conference-specific IDs
- JavaScript-rendered content requires Selenium/WebDriver
- Papers not downloading: Some papers may not yet have been uploaded to the conference system
- Safari WebDriver on macOS: Enable "Allow Remote Automation" in Safari's Developer menu
- Chrome not found: The scraper will fall back to requests-based extraction
- Rate limiting: The scraper waits the configured delay between requests and respects any crawl-delay in robots.txt
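The fallback from a browser driver to requests-based extraction can be pictured as trying fetchers in order until one succeeds. The sketch below is an assumption about the general pattern, not the scraper's actual code; the function names `fetch_with_selenium`, `fetch_with_requests`, and `fetch_html` are hypothetical.

```python
import logging

logger = logging.getLogger("conference_scraper")


def fetch_with_selenium(url: str) -> str:
    """Render the page in Chrome; raises if Selenium or the browser is missing."""
    from selenium import webdriver  # imported lazily so the fallback works without it

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def fetch_with_requests(url: str) -> str:
    """Plain HTTP fetch; JavaScript-rendered content will be missing."""
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def fetch_html(url, fetchers=(fetch_with_selenium, fetch_with_requests)):
    """Try each fetcher in order; return (fetcher name, HTML) from the first success."""
    last_error = None
    for fetcher in fetchers:
        try:
            return fetcher.__name__, fetcher(url)
        except Exception as error:  # broad on purpose: any failure triggers the fallback
            logger.warning("%s failed: %s", fetcher.__name__, error)
            last_error = error
    raise RuntimeError(f"all fetchers failed for {url}") from last_error
```

Calling `fetch_html('https://example-conference.org/your-conference-url')` would return the name of the fetcher that worked along with the page HTML, which makes it easy to log when JavaScript-rendered content may be missing.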
econ-conference-scraper/
├── conference_scraper.py # Main scraper implementation
├── run_scraper.sh # Quick start script
├── requirements.txt # Python dependencies
├── README.md # This file
├── .gitignore # Git ignore rules
├── downloads/ # Downloaded PDFs (git-ignored)
└── logs/ # Scraping logs (git-ignored)
Pull requests are welcome. For major changes, please open an issue first.
MIT License - See LICENSE file for details
This tool is for educational and research purposes only. Users are responsible for complying with the target website's terms of service and copyright laws. Always respect intellectual property rights and use downloaded content appropriately.
- Economics research institutions for providing open access to conference papers
- All paper authors for their valuable research contributions