A Python-based Tumblr scraper that collects posts containing specified keywords from Tumblr blogs using the official Tumblr API. Supports saving results to CSV or JSON formats with configurable filtering options.
- Scrape posts by tag or from specific blogs
- Keyword filtering with case sensitivity options
- Multiple output formats (CSV, JSON)
- API rate limiting and error handling
- Progress tracking and comprehensive logging
- Configurable via command-line arguments

Clone the repository and install the dependencies:

    git clone https://github.com/enigmatronix13/Tumblr-Keyword-Scraper.git
    pip install -r requirements.txt

- Register at Tumblr OAuth Apps
- Edit `config_template.py`:

      TUMBLR_CONSUMER_KEY = "your_consumer_key"
      TUMBLR_CONSUMER_SECRET = "your_consumer_secret"
      TUMBLR_OAUTH_TOKEN = "your_oauth_token"
      TUMBLR_OAUTH_TOKEN_SECRET = "your_oauth_token_secret"

Run the scraper:

    python tumblr_scraper.py

| Option | Description | Example |
|---|---|---|
| `--tag TAG` | Scrape posts with specific tag | `--tag photography` |
| `--blog BLOG` | Target specific blog | `--blog myblog.tumblr.com` |
| `--keywords KW1 KW2` | Filter by keywords | `--keywords cat dog` |
| `--limit N` | Maximum posts to fetch (default: 50) | `--limit 100` |
| `--output FILE` | Output filename | `--output results.csv` |
| `--format FORMAT` | Output format: csv or json (default: csv) | `--format json` |
| `--case-sensitive` | Enable case-sensitive matching | |
| `--exact-match` | Require exact keyword matches | |
| `--min-notes N` | Minimum engagement threshold | `--min-notes 10` |
| `--exclude-reblogs` | Skip reblogged content | |

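The options in the table could be declared with Python's standard `argparse` module. The following is a minimal sketch, not the script's actual parser: the function name `build_parser` is hypothetical, defaults are taken from the table where it states them, and the `--min-notes` default of 0 is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options table (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Tumblr keyword scraper (sketch)")
    parser.add_argument("--tag", help="Scrape posts with a specific tag")
    parser.add_argument("--blog", help="Target a specific blog")
    parser.add_argument("--keywords", nargs="+", default=[], help="Filter by keywords")
    parser.add_argument("--limit", type=int, default=50, help="Maximum posts to fetch")
    parser.add_argument("--output", help="Output filename")
    parser.add_argument("--format", choices=["csv", "json"], default="csv",
                        help="Output format")
    parser.add_argument("--case-sensitive", action="store_true",
                        help="Enable case-sensitive matching")
    parser.add_argument("--exact-match", action="store_true",
                        help="Require exact keyword matches")
    # Assumption: no minimum unless the user asks for one.
    parser.add_argument("--min-notes", type=int, default=0,
                        help="Minimum engagement threshold")
    parser.add_argument("--exclude-reblogs", action="store_true",
                        help="Skip reblogged content")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--tag", "digital art", "--limit", "100", "--format", "json"])
print(args.tag, args.limit, args.format)
```

Flags with hyphens become attributes with underscores, so `--exclude-reblogs` is read as `args.exclude_reblogs`.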
Scrape posts by tag:

    python tumblr_scraper.py --tag "digital art" --limit 100 --format json

Scrape specific blog with keywords:

    python tumblr_scraper.py --blog coolblog.tumblr.com --keywords "travel" "photography" --limit 50

Advanced filtering:

    python tumblr_scraper.py --tag music --exact-match --min-notes 25 --exclude-reblogs --limit 200

Use default configuration:

    python tumblr_scraper.py

Edit these constants in the script for default behavior:

    KEYWORDS = [
        "keyword1", "keyword2", "keyword3"
    ]

    BLOGS = {
        "blog1.tumblr.com": KEYWORDS,
        "blog2.tumblr.com": KEYWORDS,
    }

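How a keyword list like the one above might be applied to post text can be sketched as follows. This is an illustration only: the helper `find_matches` is hypothetical, and it assumes that matching is case-insensitive unless `--case-sensitive` is given and that `--exact-match` means whole-word matching. The script's real behavior may differ.

```python
import re
from typing import List


def find_matches(text: str, keywords: List[str],
                 case_sensitive: bool = False,
                 exact_match: bool = False) -> List[str]:
    """Return the keywords found in text (hypothetical helper)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    matched = []
    for keyword in keywords:
        pattern = re.escape(keyword)
        if exact_match:
            # Assumption: "exact" means the keyword is not part of a longer word.
            pattern = r"(?<!\w)" + pattern + r"(?!\w)"
        if re.search(pattern, text, flags):
            matched.append(keyword)
    return matched


print(find_matches("My Cat loves catnip", ["cat", "dog"]))       # ['cat']
print(find_matches("I like catnip", ["cat"], exact_match=True))  # []
```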
- Files saved in `output/` directory
- Auto-generated filenames: `tumblr_posts_YYYYMMDD_HHMMSS.csv`
- Tag-based: `tag_photography_20240625.json`
- Blog-based: `blog_name_posts.csv`

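The timestamped pattern above corresponds to a `strftime` format of `%Y%m%d_%H%M%S`. A minimal sketch, assuming a hypothetical helper named `default_output_path`; only the `output/` directory and the `tumblr_posts_` prefix come from the list above.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def default_output_path(fmt: str = "csv", now: Optional[datetime] = None) -> Path:
    """Build output/tumblr_posts_YYYYMMDD_HHMMSS.<fmt> (hypothetical helper)."""
    now = now or datetime.now()
    return Path("output") / f"tumblr_posts_{now.strftime('%Y%m%d_%H%M%S')}.{fmt}"


print(default_output_path("json", datetime(2024, 6, 25, 14, 30, 5)).as_posix())
# output/tumblr_posts_20240625_143005.json
```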
Each post includes:
- `id`, `blog_name`, `post_url`, `type`, `timestamp`, `date`
- `title`, `body`, `caption`, `summary`, `tags`
- `note_count`, `matched_keywords`, `scraped_at`
- `is_reblog`, `reblog_key`

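Records with these fields can be written with the standard `csv` and `json` modules. The sketch below is an assumption about how the output step might look, not the script's code: `FIELDS` copies the names listed above, `save_posts` is a hypothetical helper, and joining list-valued fields with commas for CSV is an illustrative choice.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List

FIELDS = [
    "id", "blog_name", "post_url", "type", "timestamp", "date",
    "title", "body", "caption", "summary", "tags",
    "note_count", "matched_keywords", "scraped_at",
    "is_reblog", "reblog_key",
]


def save_posts(posts: List[Dict], path: Path, fmt: str = "csv") -> None:
    """Write posts to path as CSV or JSON (hypothetical helper)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    if fmt == "json":
        path.write_text(json.dumps(posts, indent=2, ensure_ascii=False), encoding="utf-8")
        return
    with path.open("w", newline="", encoding="utf-8") as handle:
        # restval fills fields a post lacks; extrasaction drops unknown keys.
        writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="", extrasaction="ignore")
        writer.writeheader()
        for post in posts:
            row = dict(post)
            for key in ("tags", "matched_keywords"):
                if isinstance(row.get(key), list):
                    row[key] = ",".join(row[key])  # flatten lists for CSV cells
            writer.writerow(row)
```

JSON keeps list fields such as `tags` as real arrays, which is one reason to prefer `--format json` when the data will be processed further.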
Authentication Issues
- Verify API credentials
- Check if OAuth tokens are valid
- Ensure app registration is active
Rate Limiting
- Script automatically handles API limits
- Increase delay with custom rate limiting if needed
- Default delay: 1.2 seconds between requests
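
A fixed delay between requests can be enforced with a small throttle. This is a sketch under assumptions, not the script's implementation: the class name `Throttle` is hypothetical, and only the 1.2 second default comes from the list above.

```python
import time


class Throttle:
    """Ensure at least `delay` seconds pass between calls to wait()."""

    def __init__(self, delay: float = 1.2) -> None:
        self.delay = delay
        self._last = None  # monotonic time of the previous request

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = time.monotonic()
        slept = 0.0
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept
```

Calling `wait()` before each API request keeps requests at least `delay` seconds apart; constructing the throttle with a larger `delay` is one way to slow the scraper down.
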
Empty Results
- Verify blog names include `.tumblr.com`
- Check keyword specificity
- Try broader search terms
Network Errors
- Check internet connection
- Verify firewall settings
- Review `tumblr_scraper.log` for details

- Respect Tumblr's rate limits (Tumblr's API documentation lists 1,000 requests per hour and 5,000 per day per consumer key for newly registered apps)
- Script includes automatic rate limiting
- Monitor API usage to avoid quota exhaustion
- Consider running scrapes during off-peak hours
- Use specific keywords to reduce noise
- Batch processing with reasonable limits (50-200 posts)
- Regular backups of important datasets
- Follow Tumblr's Terms of Service
- Respect content creators' rights

- `tumblr_scraper.log` - Activity and error log
- `output/*.csv` or `output/*.json` - Scraped data
- `config_template.py` - API configuration (create if missing)

- Python 3.7+
- Valid Tumblr API credentials
- Internet connection
- Required packages: pytumblr, requests