Web Scraping with Python: Houzz.com Scraper

Introduction

Welcome to the Houzz.com Scraper project. This repository contains a Python script that extracts business data from professional listings on www.houzz.com. The scraper is built on Scrapy and BeautifulSoup and saves the collected data to a CSV file for further analysis or other use.

Output

Business Name: The name of the business.
Location: The location of the business.
Phone Number: The contact phone number of the business.
Website URL: The website URL of the business.
Email: The email address of the business, if one is available on its website.
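
The Email field depends on what each business site exposes. As a rough illustration of how addresses could be pulled from page HTML, here is a minimal sketch that assumes a simple regular expression is enough. The helper `extract_emails` and the pattern are hypothetical and may differ from what the spider in this repository actually does:

```python
import re
from typing import List

# Deliberately simple pattern (an assumption, not the project's own):
# it misses obfuscated addresses such as "info [at] example.com".
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html: str) -> List[str]:
    """Return unique, lowercased email addresses in order of appearance."""
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(html):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails


sample = '<a href="mailto:info@example.com">info@example.com</a> <p>Sales@Example.com</p>'
print(extract_emails(sample))
```

Lowercasing before de-duplicating keeps `Sales@Example.com` and `sales@example.com` from appearing as two separate rows.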

Installation

Follow these steps to get started with the Houzz.com Scraper:

Prerequisites

Python 3.x
Pip (Python Package Installer)

Instructions

Clone this repository to your local machine using Git:

git clone https://github.com/adil6572/houzz-scraper.git

Navigate to the project directory:
```
cd houzz-scraper
```
Install the required Python packages:
```
pip install scrapy beautifulsoup4
```

Usage

Before running the Houzz.com Scraper, set the start_urls and custom_settings in the spider as described below.

Changing `start_urls`:

Open the houzz_scraper/spiders/houzz_spider.py file in your project directory.
Locate the start_urls variable, which is defined as a list of URLs. Replace the existing URL with the one you want to scrape. The new URL should follow the same pattern as this example:

start_urls = ["https://www.houzz.com/professionals/interior-designer/carter-lake-ia-us-probr0-bo~t_11785~r_4850531"]
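
Because the spider expects a listing page shaped like the example, a quick check before editing start_urls can catch a mistyped URL. This is only a sketch: it assumes valid listing URLs use HTTPS, live on www.houzz.com, and have a path starting with /professionals/, as the example does. The helper `looks_like_houzz_listing` is hypothetical and not part of the repository:

```python
from urllib.parse import urlparse


def looks_like_houzz_listing(url: str) -> bool:
    """Heuristic check based only on the example URL's shape (an assumption)."""
    parts = urlparse(url)
    return (
        parts.scheme == "https"
        and parts.netloc == "www.houzz.com"
        and parts.path.startswith("/professionals/")
    )


example = (
    "https://www.houzz.com/professionals/interior-designer/"
    "carter-lake-ia-us-probr0-bo~t_11785~r_4850531"
)
print(looks_like_houzz_listing(example))
```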

Changing `custom_settings`:

In the same houzz_scraper/spiders/houzz_spider.py file, find the custom_settings dictionary.

Within the custom_settings dictionary, you can customize various settings related to the scraper's behavior. To change the output file format and overwrite behavior, modify the values accordingly.

To change the output file format to JSON:

'FEEDS': {
    'output.json': {
        'format': 'json',
        'overwrite': True,  # Set to True to overwrite the file if it already exists
    },
}

To set the scraper to append data to the existing file instead of overwriting:

'FEEDS': {
    'output.csv': {
        'format': 'csv',
        'overwrite': False,  # Set to False to append data to the existing file
    },
}
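
The two FEEDS variants above differ only in file name, format, and the overwrite flag, so they can be generated from one small function. The following is a sketch, not code from this repository: `build_feeds` is a hypothetical helper, and the set of accepted formats is limited to a few of Scrapy's built-in feed formats.

```python
# A few of Scrapy's built-in feed export formats.
SUPPORTED_FORMATS = {"csv", "json", "jsonlines", "xml"}


def build_feeds(path: str, fmt: str = "csv", overwrite: bool = True) -> dict:
    """Build a FEEDS mapping for a single output file."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported feed format: {fmt!r}")
    return {path: {"format": fmt, "overwrite": overwrite}}


# This dictionary has the same shape as the custom_settings shown above.
custom_settings = {
    "FEEDS": build_feeds("output.json", fmt="json", overwrite=True),
}
print(custom_settings)
```

Appending (overwrite set to False) suits line-oriented formats such as CSV or JSON Lines. Appending to a plain JSON file generally leaves invalid JSON, so overwriting is the safer choice for that format.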

Save the houzz_scraper/spiders/houzz_spider.py file with your changes.

Now, your scraper will start with the modified start_urls and follow the settings you've configured in the custom_settings dictionary.

Start the scraper using the following command:
```
scrapy crawl houzz_scraper
```
The scraper will begin extracting information from Houzz.com business listings and store it in the output file configured under FEEDS (a CSV file in the example above).

You can now use this data for your intended purposes, such as analysis, data processing, or any other creative project.
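
As a starting point for analysis, the CSV output can be loaded with Python's standard csv module. The sketch below is illustrative only: the column names (name, location, phone, website, email) and the sample rows are assumptions, so check the header row of your own output file. `read_rows` and `count_by_column` are hypothetical helpers.

```python
import csv
from collections import Counter
from typing import Dict, Iterable, List


def read_rows(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse CSV lines (an open file or a list of strings) into dicts."""
    return list(csv.DictReader(lines))


def count_by_column(rows: List[Dict[str, str]], column: str) -> Counter:
    """Count how often each value appears in the given column."""
    return Counter(row.get(column, "") for row in rows)


# Made-up sample data with assumed column names.
sample = [
    "name,location,phone,website,email",
    "Acme Interiors,Omaha NE,555-0100,https://acme.example,info@acme.example",
    "Beta Design,Omaha NE,555-0101,https://beta.example,",
]
rows = read_rows(sample)
print(count_by_column(rows, "location"))
```

To read a real file, pass an open file object instead of the sample list, for example `with open("output.csv", newline="", encoding="utf-8") as f: rows = read_rows(f)`.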

Contributing

If you'd like to contribute to this project, please follow these steps:

Fork the repository to your own GitHub account.
Clone the forked repository to your local machine.
Create a new branch with a descriptive name for your feature or bug fix.
Make your changes and commit them.
Push your branch to your GitHub repository.
Create a pull request to the main repository, explaining your changes and improvements.

We welcome your contributions and ideas to make this project even better!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Thank you for using the Houzz.com Scraper! Happy web scraping and data extraction! If you have any questions or need assistance, feel free to open an issue or contact the maintainers.
