An intelligent Facebook scraper that monitors artist pages for event announcements and extracts detailed information from event flyers using OCR technology.
- Smart Event Detection: Automatically identifies event-related posts on Facebook pages
- OCR Flyer Analysis: Extracts dates, times, venues, and details from event flyers
- Artist-Specific Monitoring: Configurable artist pages with category assignments
- N8N Integration: Sends structured event data to N8N workflows
- Automated Scheduling: Runs on configurable intervals
- Docker Ready: Containerized for easy deployment
git clone https://github.com/yourusername/facebook-scraper.git
cd facebook-scraper
cp .env.example .env
nano .env # Edit with your settings
chmod +x deploy.sh
./deploy.sh
docker exec -it facebook-event-scraper python scraper.py
Copy .env.example to .env and configure:
# N8N Webhook URL - where to send event data
N8N_WEBHOOK_URL=https://your-n8n-instance.com/webhook/facebook-events
# Artist Configuration (JSON format)
FACEBOOK_ARTISTS_CONFIG={
"facebook.com/artist-page": {
"name": "Artist Name",
"category_id": 4
}
}
# Scraping interval (seconds, default: 21600 = 6 hours)
SCRAPE_INTERVAL=21600
# Venue list for detection in flyers
FACEBOOK_VENUES_CONFIG=["venue1", "venue2", "venue3"]
Two methods to configure artists: a single JSON object, or individually numbered variables.
FACEBOOK_ARTISTS_CONFIG={
"facebook.com/mike.broussard.627050": {
"name": "Mike Broussard",
"category_id": 4
},
"facebook.com/dustinsonniermusic": {
"name": "Dustin Sonnier",
"category_id": 6
}
}
FACEBOOK_ARTIST_COUNT=2
FACEBOOK_ARTIST_1_URL=facebook.com/mike.broussard.627050
FACEBOOK_ARTIST_1_NAME=Mike Broussard
FACEBOOK_ARTIST_1_CATEGORY_ID=4
FACEBOOK_ARTIST_2_URL=facebook.com/dustinsonniermusic
FACEBOOK_ARTIST_2_NAME=Dustin Sonnier
FACEBOOK_ARTIST_2_CATEGORY_ID=6
- Facebook Scraper: Uses Playwright to navigate Facebook pages
- OCR Engine: Tesseract extracts text from event flyers
- Event Parser: Intelligent parsing of dates, times, and venues
- N8N Integration: Sends structured data to workflow automation
- Docker Container: Isolated, reproducible environment
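Before any of these components run, the artist list has to be read from the environment. The following is a minimal sketch of a loader that accepts both artist configuration methods described earlier. The function name `load_artists`, the precedence (JSON first, numbered variables as fallback), and the default `category_id` of 0 are assumptions for illustration, not taken from scraper.py.

```python
import json


def load_artists(env):
    """Return {page_url: {"name": str, "category_id": int}} from an env mapping."""
    raw = env.get("FACEBOOK_ARTISTS_CONFIG", "").strip()
    if raw:
        # Method 1: a single JSON object keyed by page URL.
        return json.loads(raw)

    # Method 2: numbered FACEBOOK_ARTIST_<n>_* variables.
    artists = {}
    count = int(env.get("FACEBOOK_ARTIST_COUNT", "0"))
    for i in range(1, count + 1):
        url = env.get(f"FACEBOOK_ARTIST_{i}_URL")
        if not url:
            continue  # assumption: skip gaps in the numbering instead of failing
        artists[url] = {
            "name": env.get(f"FACEBOOK_ARTIST_{i}_NAME", url),
            # assumption: fall back to category 0 when no category is given
            "category_id": int(env.get(f"FACEBOOK_ARTIST_{i}_CATEGORY_ID", "0")),
        }
    return artists
```

In a real run this would be called as `load_artists(os.environ)`; passing a plain dict makes it easy to try out without touching the environment.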
Facebook Pages → Playwright Scraper → OCR Analysis → Event Parser → N8N Webhook → WordPress
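The Event Parser stage of this pipeline could look roughly like the sketch below, which pulls a date, time, venue, and cover charge out of OCR text with regular expressions. The patterns and the names `parse_time`, `find_venue`, and `parse_flyer` are illustrative assumptions; the actual parsing in scraper.py may differ and likely handles more formats.

```python
import re

# Matches "8 PM", "8:00 pm", "8 p.m."; requires an am/pm marker.
TIME_RE = re.compile(r"\b(1[0-2]|0?[1-9])(?::([0-5]\d))?\s*([AaPp])\.?[Mm]\b\.?")
# Matches "July 8", "Saturday, July 8th", in any letter case.
DATE_RE = re.compile(
    r"\b(?:(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day,?\s+)?"
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)"
    r"\s+\d{1,2}(?:st|nd|rd|th)?\b",
    re.IGNORECASE,
)
COVER_RE = re.compile(r"\$\d+(?:\.\d{2})?")


def parse_time(text):
    """Return the first time found, normalized to 'H:MM AM/PM', or None."""
    match = TIME_RE.search(text)
    if not match:
        return None
    hour = int(match.group(1))
    minute = match.group(2) or "00"
    return f"{hour}:{minute} {match.group(3).upper()}M"


def find_venue(text, venues):
    """Return the first configured venue whose name appears in the text."""
    lowered = text.lower()
    for venue in venues:
        if venue.lower() in lowered:
            return venue
    return None


def parse_flyer(flyer_text, venues):
    """Extract the event fields used in the webhook payload from OCR text."""
    date = DATE_RE.search(flyer_text)
    cover = COVER_RE.search(flyer_text)
    return {
        "extracted_date": date.group(0) if date else None,
        "extracted_time": parse_time(flyer_text),
        "extracted_venue": find_venue(flyer_text, venues),
        "cover_charge": cover.group(0) if cover else None,
    }
```

Matching venues against the configured list, rather than guessing them from free text, is why FACEBOOK_VENUES_CONFIG matters: OCR output is noisy, and a known-name lookup is far more reliable than pattern matching.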
The scraper sends this JSON structure to your N8N webhook:
{
"artist": "Artist Name",
"category_id": 4,
"title": "Artist Name at Venue Name",
"post_text": "Original Facebook post text",
"flyer_text": "OCR extracted text from flyer",
"extracted_date": "Saturday, July 8th",
"extracted_time": "8:00 PM",
"extracted_venue": "Blue Moon Saloon",
"cover_charge": "$10",
"description": "Combined description with all details"
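}
# Run the scraper manually
docker exec -it facebook-event-scraper python scraper.py
# Follow the logs
docker-compose logs -f facebook-event-scraper
# Restart the scraper
docker-compose restart facebook-event-scraper
# Edit the configuration, then restart to apply it
nano .env
docker-compose restart facebook-event-scraper
# Check container status
docker-compose ps
# Check resource usage
docker stats facebook-event-scraper
# Check the OCR installation
docker exec -it facebook-event-scraper tesseract --version
# Install Python dependencies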
pip install -r requirements.txt
# Install Playwright browsers
playwright install chromium
# Install system dependencies (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng
# Set environment variables
export N8N_WEBHOOK_URL="http://localhost:5678/webhook/facebook-events"
export FACEBOOK_ARTISTS_CONFIG='{"facebook.com/test": {"name": "Test", "category_id": 1}}'
# Run scraper
python scraper.py
- Create webhook node in N8N
- Set URL in N8N_WEBHOOK_URL environment variable
- Process incoming event data
- Create WordPress events with extracted information
Webhook → Process Data → Create WordPress Event → Send Notification
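On the scraper side, the first step of this workflow is a JSON POST to the webhook. A minimal sketch using only the standard library follows. The field names come from the payload structure documented above; the function names, the " | " separator used to build `description`, and the 10-second timeout are assumptions.

```python
import json
import urllib.request


def build_payload(artist, category_id, post_text, flyer_text, parsed):
    """Assemble the documented event payload from already-extracted fields."""
    venue = parsed.get("extracted_venue")
    details = [
        post_text,
        parsed.get("extracted_date"),
        parsed.get("extracted_time"),
        venue,
        parsed.get("cover_charge"),
    ]
    return {
        "artist": artist,
        "category_id": category_id,
        "title": f"{artist} at {venue}" if venue else artist,
        "post_text": post_text,
        "flyer_text": flyer_text,
        "extracted_date": parsed.get("extracted_date"),
        "extracted_time": parsed.get("extracted_time"),
        "extracted_venue": venue,
        "cover_charge": parsed.get("cover_charge"),
        # assumption: join whatever details were found with " | "
        "description": " | ".join(d for d in details if d),
    }


def send_event(webhook_url, payload, timeout=10):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping payload construction separate from the network call makes the payload easy to inspect or log before anything is sent to N8N.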
Edit .env file:
FACEBOOK_ARTISTS_CONFIG={
"existing-artists": "...",
"facebook.com/new-artist-page": {
"name": "New Artist",
"category_id": 7
}
}
Restart container:
docker-compose restart facebook-event-scraper
Update venue list in .env:
FACEBOOK_VENUES_CONFIG=["existing venues", "new venue name"]
Change interval in .env:
# 2 hours
SCRAPE_INTERVAL=7200
- No sensitive data in repository: All configuration via environment variables
- Private .env file: Never committed to Git
- Minimal permissions: Container runs with non-root user
- Rate limiting: Built-in delays to respect Facebook's servers
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Container won't start
- Check .env file exists and is configured
- Verify Docker and docker-compose are installed
No events detected
- Check artist Facebook URLs are accessible
- Verify N8N webhook URL is reachable
- Check logs: docker-compose logs facebook-event-scraper
- Test manually: docker exec -it facebook-event-scraper python scraper.py
OCR not working
- Verify Tesseract is installed: docker exec -it facebook-event-scraper tesseract --version
- Check image URLs in logs
- Ensure images contain readable text
Facebook blocking requests
- Scraper uses realistic delays to avoid blocking
- If blocked, wait and try again later
- Consider adjusting scraping frequency
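The "wait and try again later" advice can be automated with exponential backoff plus random jitter. The sketch below is one way to do it, not the scraper's actual delay logic: `scrape_once` is a hypothetical callable that returns None when Facebook blocks the request, and the base delay, cap, and attempt count are assumed values.

```python
import random
import time


def next_delay(attempt, base=60.0, cap=3600.0, rng=random):
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    ceiling = min(cap, base * (2 ** attempt))
    # Jitter: pick a random point in the upper half of the window so that
    # retries do not fall into a fixed, easily detected rhythm.
    return rng.uniform(ceiling / 2, ceiling)


def scrape_with_backoff(scrape_once, max_attempts=4, sleep=time.sleep):
    """Call scrape_once until it returns a result or attempts run out."""
    for attempt in range(max_attempts):
        result = scrape_once()  # hypothetical: returns None when blocked
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(next_delay(attempt))
    return None
```

With these defaults the waits grow from roughly a minute to at most an hour, which is in keeping with the advice to reduce scraping frequency when blocked.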
# Check container status
docker ps | grep facebook-event-scraper
# View recent logs
docker-compose logs --tail=50 facebook-event-scraper
# Interactive container access
docker exec -it facebook-event-scraper /bin/bash
# Test OCR functionality
docker exec -it facebook-event-scraper python -c "import pytesseract; print(pytesseract.get_tesseract_version())"
# Test webhook connectivity
curl -X POST "$N8N_WEBHOOK_URL" -H "Content-Type: application/json" -d '{"test": "data"}'
- Adjust scraping interval based on artist posting frequency
- Monitor resource usage with docker stats
- Limit concurrent processing for stability
- Use SSD storage for better Docker performance
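For the first tip, here is a rough sketch of how a scheduler might read SCRAPE_INTERVAL and loop on it. The 21600-second default comes from the configuration section; the 60-second floor, the tolerance for a trailing `# comment` in the value, and the names `get_interval` and `run_forever` are assumptions, and `scrape_once` is a hypothetical callable standing in for one scraping pass.

```python
import os
import time

DEFAULT_INTERVAL = 21600  # 6 hours, the documented default
MIN_INTERVAL = 60  # assumption: floor that prevents a tight loop on bad input


def get_interval(env=os.environ):
    """Read SCRAPE_INTERVAL in seconds, falling back to the default if invalid."""
    raw = env.get("SCRAPE_INTERVAL", "")
    try:
        # Tolerate a trailing inline comment such as "7200 # 2 hours".
        seconds = int(raw.split("#", 1)[0].strip())
    except ValueError:
        return DEFAULT_INTERVAL
    return max(seconds, MIN_INTERVAL)


def run_forever(scrape_once, env=os.environ, sleep=time.sleep):
    """Run scrape_once repeatedly, keeping runs roughly one interval apart."""
    while True:
        started = time.monotonic()
        scrape_once()  # hypothetical: one full pass over all artist pages
        elapsed = time.monotonic() - started
        sleep(max(0.0, get_interval(env) - elapsed))
```

Subtracting the elapsed time keeps the start of each run on a steady cadence even when a pass over many artists takes several minutes.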
git pull origin main
./deploy.sh
# Backup your .env file
cp .env .env.backup.$(date +%Y%m%d)
# Store in secure location outside repository
Modify event_keywords in scraper.py to detect different types of posts:
self.event_keywords = [
'tonight', 'show', 'live', 'performance', 'gig', 'concert',
'playing', 'music', 'venue', 'bar', 'club', 'festival',
# Add custom keywords here
'acoustic', 'unplugged', 'showcase'
]
Add local venues to your configuration:
FACEBOOK_VENUES_CONFIG=[
"your local venue",
"another music spot",
"community center"
]
Modify scraper to send to different endpoints based on artist or event type.
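One possible shape for that routing is sketched below. Everything in the tables is hypothetical: the extra webhook paths, the event-type names, and the rule that an artist-specific route wins over an event-type route are illustrative choices, not part of the project.

```python
# Default endpoint, matching the example N8N_WEBHOOK_URL value.
DEFAULT_WEBHOOK = "https://your-n8n-instance.com/webhook/facebook-events"

# Hypothetical routing tables; adjust names and URLs to your own N8N setup.
WEBHOOKS_BY_ARTIST = {
    "Mike Broussard": "https://your-n8n-instance.com/webhook/mike-broussard",
}
WEBHOOKS_BY_EVENT_TYPE = {
    "festival": "https://your-n8n-instance.com/webhook/festivals",
}
EVENT_TYPE_KEYWORDS = {
    "festival": ["festival"],
    "acoustic": ["acoustic", "unplugged"],
}


def detect_event_type(text):
    """Return the first event type whose keywords appear in the text, else None."""
    lowered = text.lower()
    for event_type, keywords in EVENT_TYPE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return event_type
    return None


def pick_webhook(artist, text):
    """Choose a webhook URL: artist route first, then event type, then default."""
    if artist in WEBHOOKS_BY_ARTIST:
        return WEBHOOKS_BY_ARTIST[artist]
    return WEBHOOKS_BY_EVENT_TYPE.get(detect_event_type(text), DEFAULT_WEBHOOK)
```

The keyword lookup uses the same case-insensitive substring matching as the event_keywords list, so the two lists can share vocabulary.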
- Music Venue Websites: Automatically populate event calendars
- Artist Management: Track all artists' events in one place
- Music Blogs: Generate content about upcoming shows
- Fan Notifications: Alert fans about new events
- Event Aggregation: Combine multiple sources into unified calendar
- Issues: Create GitHub issue with logs and configuration details
- Feature Requests: Open GitHub discussion
- Security Issues: Email privately (don't create public issues)
- Playwright Team: For excellent browser automation
- Tesseract OCR: For powerful text recognition
- Python Community: For amazing libraries and tools
- Acadiana Music Scene: For the inspiration
Made with ❤️ for the music community