This repository contains two Python scripts for web scraping, each designed to extract product information from a website. The scripts cover the two common kinds of content rendering: static HTML and dynamic, JavaScript-rendered pages. The scraped data, including product names, sale prices, and regular prices, is saved to a CSV file for further analysis or use.
## Table of Contents

- Overview
- Setup and Installation
- Usage
- Output
- Error Handling & Logging
- Contributing
- License
- Acknowledgments
## Overview

### Static Web Scraper

This script uses BeautifulSoup to parse a locally stored HTML file (`Accessories And Gadgets.html`). It extracts the product names, sale prices, and regular prices from the HTML content and stores the data in a CSV file.
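As a rough sketch of how such a static scraper can be put together — note that the `div.product`, `.product-name`, `.sale-price`, and `.regular-price` selectors are placeholders for illustration, not the repository's actual markup:

```python
import csv
from bs4 import BeautifulSoup


def extract_products(html):
    """Parse product cards out of an HTML document.

    The tag and class names below are illustrative placeholders --
    adjust them to match the real page structure.
    """
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select("div.product"):
        name = card.select_one(".product-name")
        sale = card.select_one(".sale-price")
        regular = card.select_one(".regular-price")
        products.append({
            "Product Name": name.get_text(strip=True) if name else "",
            "Sale Price": sale.get_text(strip=True) if sale else "",
            "Regular Price": regular.get_text(strip=True) if regular else "",
        })
    return products


def save_to_csv(products, path="scraped_products.csv"):
    """Write the extracted rows using the column layout described below."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["Product Name", "Sale Price", "Regular Price"]
        )
        writer.writeheader()
        writer.writerows(products)
```

Missing sale or regular prices simply become empty cells, so one malformed card cannot abort the whole run.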
### Dynamic Web Scraper

The dynamic scraper uses Selenium WebDriver to interact with a live website. It is designed to handle JavaScript-rendered content by waiting for elements to load, then extracting the relevant product information and saving it to a CSV file.
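A minimal sketch of that flow, assuming Microsoft Edge and the same placeholder CSS selectors as above. Selenium is imported inside the function so the text-extraction helper can be exercised without a browser installed:

```python
CSS = "css selector"  # locator-strategy string that Selenium's By.CSS_SELECTOR maps to


def text_or_blank(element, selector):
    """Text of the first matching descendant, or '' when it is absent."""
    try:
        return element.find_element(CSS, selector).text.strip()
    except Exception:
        return ""


def scrape_products(url, timeout=10):
    """Open a JavaScript-rendered page, wait for the products, extract rows.

    `webdriver.Edge()` assumes the Edge WebDriver is on PATH; the
    `div.product` / price selectors are placeholders for the real markup.
    """
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Edge()
    try:
        driver.get(url)
        # Block until at least one product card has been rendered.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((CSS, "div.product"))
        )
        return [
            {
                "Product Name": text_or_blank(card, ".product-name"),
                "Sale Price": text_or_blank(card, ".sale-price"),
                "Regular Price": text_or_blank(card, ".regular-price"),
            }
            for card in driver.find_elements(CSS, "div.product")
        ]
    finally:
        driver.quit()  # always release the browser process
```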
## Setup and Installation

### Prerequisites

- Python 3.x
- pip (Python package installer)
### Dependencies

For the static web scraper:

```bash
pip install beautifulsoup4 pandas
```

For the dynamic web scraper:

```bash
pip install selenium pandas
```

The dynamic scraper also requires a browser-specific WebDriver. For instance, if you use Microsoft Edge, you need the Edge WebDriver. Ensure the WebDriver executable is available on your system's `PATH`.
Example for Edge:

```bash
pip install msedge-selenium-tools
```

Ensure the WebDriver executable is compatible with the installed browser version.
## Usage

### Static Scraper

- Place the HTML file (`Accessories And Gadgets.html`) in the same directory as the script.
- Run the script to parse the HTML file and save the extracted data to a CSV file:

```bash
python static_scraper.py
```

### Dynamic Scraper

- Ensure the WebDriver is correctly set up and accessible.
- Update the URL in the script to the desired webpage.
- Run the script to navigate to the webpage, extract the product data, and save it to a CSV file:

```bash
python dynamic_scraper.py
```

## Output

Both scripts generate a CSV file (`scraped_products.csv`) containing the scraped product data with the following columns:
- Product Name
- Sale Price
- Regular Price
## Error Handling & Logging

### Static Scraper

- The script includes try-except blocks to handle cases where specific elements, such as sale prices or regular prices, may not be present.
- Errors during processing are logged to the console for troubleshooting.
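The per-element pattern described above can be sketched like this. The selectors are hypothetical, and the card object only needs a BeautifulSoup-style `select_one` method:

```python
def safe_extract(card):
    """Extract one product row, tolerating missing price elements.

    `card` is any object exposing BeautifulSoup's `select_one`; the
    `.product-name` / `.sale-price` / `.regular-price` class names are
    illustrative placeholders.
    """
    try:
        name = card.select_one(".product-name").get_text(strip=True)
    except AttributeError:
        # No product name at all: log it to the console and skip the card.
        print("Skipping card with no product name")
        return None
    try:
        sale = card.select_one(".sale-price").get_text(strip=True)
    except AttributeError:
        sale = ""  # not every product is on sale
    try:
        regular = card.select_one(".regular-price").get_text(strip=True)
    except AttributeError:
        regular = ""
    return {"Product Name": name, "Sale Price": sale, "Regular Price": regular}
```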
### Dynamic Scraper

- The script waits for specific elements to load using `WebDriverWait`, ensuring dynamic content is fully rendered before extraction.
- Errors encountered during scraping are caught and printed to the console, and the browser is closed properly in the `finally` block.
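That cleanup pattern can be sketched independently of any particular browser. The driver factory is injected here purely for illustration (e.g. `lambda: webdriver.Edge()`), which also makes the logic easy to exercise with a stub:

```python
def run_scrape(make_driver, url, extract):
    """Run one scraping session, guaranteeing the browser is closed.

    make_driver: zero-argument callable returning a WebDriver-like object.
    extract:     callable taking the driver and returning the scraped rows.
    """
    driver = make_driver()
    try:
        driver.get(url)
        return extract(driver)
    except Exception as exc:
        # Surface the failure on the console for troubleshooting.
        print(f"Error while scraping {url}: {exc}")
        return []
    finally:
        # Runs on success *and* failure, so no browser process is leaked.
        driver.quit()
```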
## Contributing

Contributions are welcome! If you have suggestions for improvements or find a bug, please fork the repository and open a pull request with your changes. For major changes, please open an issue first to discuss what you would like to change.
## License

This project is open source and available under the MIT License.
## Acknowledgments

- BeautifulSoup: For providing a powerful HTML/XML parsing library.
- Selenium: For enabling browser automation and web scraping of dynamically rendered content.
- Pandas: For easy data manipulation and storage in CSV format.