A highly efficient and multithreaded web scraper to collect and process nurse data from a targeted nurse registry website.
- Introduction
- Features
- Installation
- Usage
- Project Structure
- Detailed Overview of uc_scraper.py
- Detailed Overview of cloud_scraper.py
- Detailed Overview of ptm_format.py
- Configuration
## Introduction

The project is designed to scrape data from a specific nurse registry website. It breaks the work into smaller parallel tasks using multithreading and stores the scraped data in a CSV file for efficient, repeatable collection.
## Features

- Multithreaded Scraping: Leverages Python's `concurrent.futures.ThreadPoolExecutor` for efficient multithreading.
- Data Storage: Stores scraped data in CSV files for easy access and analysis.
- Dynamic User-Agent and Sessions: Rotates the User-Agent (via cloudscraper) and uses temporary directories for unique web scraping sessions.
- Keyword Generation: Automatically generates keyword combinations to search through the nurse registry.
- Retry Mechanism: Implements retry logic in case of failures during data scraping.
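The keyword generation and multithreading features can be sketched together: enumerate short search keywords, then fan them out across a thread pool. This is a minimal sketch, not the project's actual code; `search_registry` is a stub standing in for a real search against the registry, and the two-letter keyword scheme is an assumption.

```python
import string
from concurrent.futures import ThreadPoolExecutor

def generate_keywords():
    """Generate two-letter keyword combinations (aa, ab, ..., zz)
    to enumerate searches against the registry."""
    letters = string.ascii_lowercase
    return [a + b for a in letters for b in letters]

def search_registry(keyword):
    """Stub for one search task; the real scraper would submit the
    keyword to the registry's search form and parse the results."""
    return keyword, len(keyword)

def run_searches(max_workers=8):
    keywords = generate_keywords()
    # executor.map preserves input order while running tasks concurrently
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(search_registry, keywords))
```

With 26 × 26 keywords and 8 workers, each thread handles roughly 85 searches; `max_workers` trades speed against load on the target site.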
## Installation

Make sure you have Python 3.7+ and pip installed, then install the required dependencies:

```bash
pip install -r requirements.txt
```

## Usage

Run the main.py script to start the scraping process:

```bash
python main.py
```

You can adjust the number of threads and other settings within the main.py script to fit your requirements.
## Project Structure

```
nurse-web-scraper/
│
├── uc_scraper.py
├── cloud_scraper.py
├── main.py
├── ptm_format.py
├── requirements.txt
└── README.md
```

- main.py: Main script that orchestrates the entire scraping process.
- uc_scraper.py: Scraping with the undetected Chromedriver, focused on extracting data from the registry tables.
- cloud_scraper.py: Scraping with a cloud-based scraper (cloudscraper).
- ptm_format.py: Utility functions for name splitting and phone formatting.
- requirements.txt: All required Python packages.
- README.md: Documentation.
## Detailed Overview of uc_scraper.py

The uc_scraper.py file contains the `WebScraper` class, which uses the undetected Chromedriver to scrape nurse data. Here's a breakdown of its functionality:

- `__init__`: Sets up the `WebScraper` instance with a WebDriver and WebDriverWait.
- `initialize_driver`: Configures and initializes the undetected Chromedriver with specific options.
- `scrape_table`: Extracts data from the HTML table on the page.
- `scrape_website`: Navigates through the pages of the website to perform searches and collect data, implementing retry logic in case of failures.
- `close_driver`: Closes the WebDriver instance.
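The retry behaviour in `scrape_website` can be illustrated with a generic wrapper. This is only a sketch of the pattern: the attempt count, delay, and the `with_retries` helper itself are illustrative, and the real method retries browser navigation rather than an arbitrary callable.

```python
import time

def with_retries(task, max_attempts=3, delay=0.0):
    """Call task() until it succeeds or max_attempts is exhausted,
    re-raising the last failure. Mirrors the retry-on-failure idea
    in scrape_website in a generic form."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay)  # back off before the next attempt
    raise last_error
```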
## Detailed Overview of cloud_scraper.py

The cloud_scraper.py file contains the `Cloud_Scraper` class, which uses cloudscraper to bypass anti-bot measures and scrape nurse data. Here's a breakdown of its functionality:

- `__init__`: Sets up the `Cloud_Scraper` instance with a cloudscraper instance and a URL. The cloudscraper instance dynamically changes the User-Agent to avoid detection.
- `get_last_seq_number`: Reads the last non-empty line of the output file to determine the current sequence number for appending data. Uses the nested helper functions `read_last_non_empty_line` and `extract_seq_from_line` to read and parse the sequence number.
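The sequence-number bookkeeping might look like the sketch below. The CSV layout is an assumption (the sequence number is taken to be the first comma-separated field), so treat this as an illustration of the nested-helper structure rather than the project's exact code.

```python
import os

def get_last_seq_number(path):
    """Return the sequence number from the last non-empty line of a
    CSV file, or 0 if the file is missing or empty."""

    def read_last_non_empty_line(f):
        last = None
        for line in f:
            if line.strip():
                last = line
        return last

    def extract_seq_from_line(line):
        # Assumes the sequence number is the first comma-separated field
        return int(line.split(",", 1)[0])

    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        line = read_last_non_empty_line(f)
    return extract_seq_from_line(line) if line else 0
```

Restart-safety is the point: on the next run, appending resumes from the last written sequence number instead of starting over.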
- `scrape_data`: Scrapes data from the specified URL and processes the HTML content with BeautifulSoup to extract the relevant data fields, including name, specialty certificate, status, and employment details:
  - Extracts table data.
  - Scrapes name, specialty, and status information.
  - Extracts employment details with address, phone number, and dates.
  - Writes the data to scrape_data.csv.
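Writing each record as a row of scrape_data.csv could be done with `csv.DictWriter`. The field names below are guesses based on the data points listed above, not the project's actual column set.

```python
import csv
import os

# Hypothetical column set; the real scraper's fields may differ
FIELDS = ["seq", "name", "specialty", "status", "address", "phone"]

def append_record(record, path="scrape_data.csv"):
    """Append one scraped record to the CSV, writing a header row
    only when the file is new."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(record)
```

Opening in append mode is what makes the restart behaviour (and the warning about stale files at the end of this README) matter.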
- `run`: Initiates the scraping process, wrapped with a lock for thread-safety.
- `close_scraper`: Closes the cloudscraper instance.
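Why the lock? Multiple worker threads append to the same output, so `run` serializes the shared write. A minimal sketch of the pattern (the names here are hypothetical, and a shared list stands in for the CSV file):

```python
import threading

write_lock = threading.Lock()
results = []  # stands in for the shared scrape_data.csv

def run(record):
    """Each worker scrapes concurrently, then holds the lock only
    for the shared append, mirroring Cloud_Scraper.run."""
    with write_lock:
        results.append(record)
```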
## Detailed Overview of ptm_format.py

The ptm_format.py file contains utility functions for processing names and phone numbers. Here's a breakdown of its functionality:
- Function `splitfullname(fullname)`:
  - Splits the provided full name into first name, last name, and suffix.
  - Handles various prefixes (e.g., "Dr.", "Mr.") and suffixes (e.g., "Jr", "III").
  - Accounts for multi-word last names (e.g., "Van", "De La").
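A simplified version of `splitfullname` might look like this. The prefix, suffix, and surname-particle lists are assumptions for illustration; the project's implementation may recognize a different set.

```python
# Illustrative lists only; the real function may cover more cases
PREFIXES = {"dr.", "dr", "mr.", "mr", "mrs.", "mrs", "ms.", "ms"}
SUFFIXES = {"jr", "jr.", "sr", "sr.", "ii", "iii", "iv"}

def splitfullname(fullname):
    """Split a full name into (first, last, suffix)."""
    parts = fullname.split()
    # Drop a leading courtesy/professional prefix
    if parts and parts[0].lower() in PREFIXES:
        parts = parts[1:]
    # Peel off a trailing generational suffix
    suffix = ""
    if parts and parts[-1].lower().strip(",") in SUFFIXES:
        suffix = parts.pop()
    first = parts[0] if parts else ""
    # Everything after the first name is the surname, which keeps
    # multi-word particles ("Van", "De La") attached to it
    last = " ".join(parts[1:])
    return first, last, suffix
```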
- Function `fit_phone(Phone)`:
  - Normalizes and formats phone numbers to a consistent digit-only format.
  - Removes extraneous characters and normalizes country-code prefixes (e.g., "+1", "1").
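The phone normalization could be sketched as follows, assuming 10-digit US numbers; the real `fit_phone` may handle additional edge cases.

```python
import re

def fit_phone(phone):
    """Reduce a phone number to its 10 significant digits, stripping
    punctuation and any leading country-code 1 / +1 prefixes."""
    digits = re.sub(r"\D", "", phone)  # keep digits only
    # Strip leading country-code 1s down to a 10-digit number
    while len(digits) > 10 and digits.startswith("1"):
        digits = digits[1:]
    return digits
```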
## Configuration

- per_thread: Number of pages a single thread will process.
- max_workers: Maximum number of worker threads for website scraping.
- max_scrapeworkers: Maximum number of worker threads for data scraping.
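These settings live in main.py; the values below are illustrative only and should be tuned to your machine and the target site's tolerance.

```python
# Illustrative values only; adjust in main.py to fit your requirements
per_thread = 5          # pages handled by each worker thread
max_workers = 8         # worker threads for website scraping
max_scrapeworkers = 4   # worker threads for data scraping
```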
Logs are generated to provide insight into the scraping process, errors, and overall progress.
Scraped data is saved to a file named scrape_data.csv. Delete or rename this file before starting a fresh scrape; otherwise new rows are appended to the existing data.