An intelligent scraper that extracts company contact information using official Google APIs and advanced validation, with built-in Indian phone number verification and persistent history.
- ✅ Smart Deduplication - Auto-skips previously processed companies
- 📁 Multi-Format Reports - CSV & JSON outputs with timestamps
- 📞 Indian Number Validation - Strict TRAI-compliant phone verification
- 🔒 Secure Operations - Proxy support & SSL verification
- 📊 Priority Tagging - Customizable lead priority levels
- ⏳ Rate Limited - Google API-friendly pacing
- 🚫 No Duplicates - Persistent scrape history
# Clone repository
git clone https://github.com/yourusername/company-scraper.git
cd company-scraper
# Install dependencies
pip install -r requirements.txt
# Create environment file
cp .env.example .env
- Get Google API credentials:
  - Create a project in the Google Cloud Console
  - Enable the "Custom Search JSON API"
  - Create an API key and a Custom Search Engine ID (CX)
- Edit `.env`:
GOOGLE_API_KEY="your_api_key_here"
GOOGLE_CX="your_search_engine_id"
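To show how these two values are used, here is a minimal sketch of building a Custom Search JSON API request URL. It assumes the `.env` values have already been loaded into the process environment. The helper name `build_search_url` is hypothetical and is not taken from `scraper.py`.

```python
import os
from urllib.parse import urlencode

# Public endpoint of the Custom Search JSON API.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key=None, cx=None):
    """Return the request URL for one search query (hypothetical helper)."""
    # Fall back to the environment variables defined in .env.
    api_key = api_key or os.environ.get("GOOGLE_API_KEY")
    cx = cx or os.environ.get("GOOGLE_CX")
    if not api_key or not cx:
        raise RuntimeError("GOOGLE_API_KEY and GOOGLE_CX must both be set")
    # key, cx and q are the required query parameters of the API.
    params = {"key": api_key, "cx": cx, "q": query}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("ORB Energy contact", api_key="demo_key", cx="demo_cx"))
```

If either value is missing, the sketch fails early with a clear error instead of sending a request that the API would reject.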
# Basic usage
python scraper.py -i companies.txt
# Custom priority & notes
python scraper.py -i clients.txt --priority 75 --notes "Q4 Leads"
# Force re-scrape existing companies
python scraper.py -i list.txt --force
# Custom output directory
python scraper.py -i input.txt -o ./custom_reports
Options:
-i, --input Input file with company names (required)
-o, --output Output directory (default: reports)
-p, --priority Default priority percentage (60-100)
--force Force re-scrape of existing companies
--notes Additional notes for all entries
--remarks Custom remarks column content
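For readers who want to extend the CLI, the options above could be declared with `argparse` roughly as follows. This is a sketch, not the actual parser in `scraper.py`: the default priority of 60 and the empty defaults for `--notes` and `--remarks` are assumptions, and `build_parser` and `priority_value` are hypothetical names.

```python
import argparse


def priority_value(text):
    """Parse a priority and enforce the documented 60-100 range."""
    value = int(text)
    if not 60 <= value <= 100:
        raise argparse.ArgumentTypeError("priority must be between 60 and 100")
    return value


def build_parser():
    parser = argparse.ArgumentParser(prog="scraper.py")
    parser.add_argument("-i", "--input", required=True,
                        help="Input file with company names")
    parser.add_argument("-o", "--output", default="reports",
                        help="Output directory")
    # Default of 60 is an assumption; the README only gives the range.
    parser.add_argument("-p", "--priority", type=priority_value, default=60,
                        help="Default priority percentage (60-100)")
    parser.add_argument("--force", action="store_true",
                        help="Force re-scrape of existing companies")
    parser.add_argument("--notes", default="",
                        help="Additional notes for all entries")
    parser.add_argument("--remarks", default="",
                        help="Custom remarks column content")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Validating the range in a `type` callable lets `argparse` report an out-of-range priority as a normal usage error.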
Sample CSV Output:
Client Name,Position,Client Company,Contact Details,Email,Priority,Notes,Found in,Remarks
,,ORB Energy,"+91 80 4123 4567;080-41234567",contact@orb.com,60%,Q4 Leads,https://orb.com/contact,
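A report in this layout can be read back with the standard `csv` module. The sketch below assumes that multiple phone numbers in `Contact Details` are separated by `;`, as the sample row suggests. `read_contacts` is a hypothetical helper and not part of the project.

```python
import csv
import io

# The sample report shown above, embedded so the sketch runs on its own.
SAMPLE = (
    "Client Name,Position,Client Company,Contact Details,Email,"
    "Priority,Notes,Found in,Remarks\n"
    ',,ORB Energy,"+91 80 4123 4567;080-41234567",contact@orb.com,'
    "60%,Q4 Leads,https://orb.com/contact,\n"
)


def read_contacts(handle):
    """Read report rows and split the phone column into a list."""
    rows = []
    for row in csv.DictReader(handle):
        # Assumption: ';' separates multiple numbers within one cell.
        phones = row["Contact Details"].split(";")
        row["Contact Details"] = [p.strip() for p in phones if p.strip()]
        rows.append(row)
    return rows


if __name__ == "__main__":
    for contact in read_contacts(io.StringIO(SAMPLE)):
        print(contact["Client Company"], contact["Contact Details"])
```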
File Organization:
reports/
├── 2023-10-05_14-30-22/
│ ├── contacts.csv
│ └── contacts.json
└── scraped_history.json
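The layout above, together with the skip-unless-`--force` behaviour, could be implemented along these lines. This is only a sketch: the README does not document the format of `scraped_history.json`, so a JSON list of lowercased company names is assumed here, and all function names are hypothetical.

```python
import json
from datetime import datetime
from pathlib import Path

HISTORY_FILE = "scraped_history.json"


def load_history(output_dir):
    """Load previously scraped names (assumed format: JSON list of strings)."""
    path = Path(output_dir) / HISTORY_FILE
    if path.exists():
        return set(json.loads(path.read_text(encoding="utf-8")))
    return set()


def save_history(output_dir, history):
    """Persist the history next to the timestamped report folders."""
    path = Path(output_dir) / HISTORY_FILE
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(sorted(history), indent=2), encoding="utf-8")


def pending_companies(names, history, force=False):
    """Skip names already in the history unless force is set."""
    if force:
        return list(names)
    return [n for n in names if n.strip().lower() not in history]


def new_run_dir(output_dir, now=None):
    """Create a folder named like 2023-10-05_14-30-22 for this run."""
    now = now or datetime.now()
    run_dir = Path(output_dir) / now.strftime("%Y-%m-%d_%H-%M-%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

Keeping the history file outside the per-run folders is what lets deduplication work across runs.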
- ✔️ Complies with Google API Terms of Service
- ✔️ Respects website `robots.txt` directives
- ✔️ Rate limited to 1 request/second
- ✔️ Data validation for accuracy
- ❌ Not for scraping protected/personal data
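As an illustration of the two pacing rules above, the following sketch enforces a minimum interval between requests and checks a URL against `robots.txt` rules using the standard library. It is not the project's actual code. `RateLimiter` and `allowed_by_robots` are hypothetical names, and the robots rules are passed in as lines so the example runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser


class RateLimiter:
    """Block so that consecutive calls are at least min_interval apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def allowed_by_robots(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    limiter = RateLimiter(min_interval=1.0)
    rules = ["User-agent: *", "Disallow: /private"]
    for page in ("https://example.com/contact", "https://example.com/private/x"):
        limiter.wait()  # at most one iteration per second
        print(page, allowed_by_robots(rules, page))
```

In a real run, `RobotFileParser.set_url()` followed by `read()` would fetch the site's live `robots.txt` instead of using hard-coded lines.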
Q: SSL certificate errors?
A: Run `pip install --upgrade certifi` and ensure system certificates are updated.
Q: No results found?
A: Check Google API quota and custom search engine configuration.
Q: How to reset history?
A: Delete reports/scraped_history.json
Q: Customize phone validation?
A: Modify `validate_indian_phones()` in `scraper.py`
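As a starting point for customizing validation, here is a simplified sketch of normalizing Indian numbers. It is not the project's `validate_indian_phones()`, whose signature and rules are not shown in this README. The sketch only assumes that a national number has 10 digits, optionally preceded by the `+91` country code or a `0` trunk prefix. It does not implement the full numbering plan, and `normalize_indian_number` is a hypothetical name.

```python
import re


def normalize_indian_number(raw):
    """Return the number as +91XXXXXXXXXX, or None if it does not fit."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, '+', brackets
    if len(digits) == 12 and digits.startswith("91"):
        digits = digits[2:]  # strip country code
    elif len(digits) == 11 and digits.startswith("0"):
        digits = digits[1:]  # strip trunk prefix
    # Simplified rule: 10 digits that do not start with 0.
    if len(digits) == 10 and digits[0] != "0":
        return "+91" + digits
    return None


def unique_valid_numbers(candidates):
    """Normalize candidates and drop invalid entries and duplicates."""
    seen = []
    for raw in candidates:
        number = normalize_indian_number(raw)
        if number and number not in seen:
            seen.append(number)
    return seen


if __name__ == "__main__":
    print(unique_valid_numbers(["+91 80 4123 4567", "080-41234567", "12345"]))
```

Normalizing before comparison means the two spellings in the sample CSV row are recognized as the same number.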
This project is licensed under the MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Disclaimer: Use this tool responsibly. Always verify scraping legality for target websites and respect data privacy regulations.