|
| 1 | +#### This example scrapes book titles and prices from the sample website `http://books.toscrape.com/`, uses a regex to extract the numeric prices, and saves the data to a CSV file. It is beginner-friendly, and each step is explained below. |
| 2 | + |
| 3 | +```x-python |
| 4 | +
|
| 5 | +```python |
| 6 | +import requests |
| 7 | +from bs4 import BeautifulSoup |
| 8 | +import re |
| 9 | +import csv |
| 10 | +
|
| 11 | +# Step 1: Fetch the web page |
| 12 | +url = "http://books.toscrape.com/" |
| 13 | +response = requests.get(url, timeout=10)  # timeout avoids hanging on a stalled connection |
| 14 | +
|
| 15 | +# Check if the request was successful |
| 16 | +if response.status_code != 200: |
| 17 | +    print("Failed to fetch page") |
| 18 | +    raise SystemExit(1)  # stop with a non-zero exit code; needs no extra import |
| 19 | +
|
| 20 | +# Step 2: Parse the HTML |
| 21 | +soup = BeautifulSoup(response.text, "html.parser") |
| 22 | +
|
| 23 | +# Step 3: Extract book titles and prices |
| 24 | +books = soup.find_all("article", class_="product_pod") |
| 25 | +data = [] |
| 26 | +
|
| 27 | +for book in books: |
| 28 | +    # Get book title |
| 29 | +    title = book.find("h3").find("a")["title"] |
| 30 | + |
| 31 | +    # Get price and use regex to extract the numerical part (e.g., 51.77 from £51.77) |
| 32 | +    price_text = book.find("p", class_="price_color").text |
| 33 | +    price_match = re.search(r"£(\d+\.\d{2})", price_text) |
| 34 | +    price = price_match.group(1) if price_match else "N/A" |
| 35 | + |
| 36 | +    data.append([title, price]) |
| 37 | +
|
| 38 | +# Step 4: Save to CSV |
| 39 | +with open("scraped_books.csv", "w", newline="", encoding="utf-8") as file: |
| 40 | +    writer = csv.writer(file) |
| 41 | +    writer.writerow(["Title", "Price (£)"])  # Header |
| 42 | +    writer.writerows(data)  # Data |
| 43 | +
|
| 44 | +print("Data saved to scraped_books.csv") |
| 45 | +
|
| 46 | +# Step 5: Print a few results |
| 47 | +print("\nSample Results:") |
| 48 | +for title, price in data[:3]: # Show first 3 books |
| 49 | +    print(f"Book: {title}, Price: £{price}") |
| 50 | +``` |
| 51 | + |
| 52 | +``` |
| 53 | +
|
| 54 | +### Explanation |
| 55 | +1. **Fetching the Page**: |
| 56 | +   - Uses `requests.get()` to download the HTML from `http://books.toscrape.com/`. |
| 57 | +   - Checks the response status to ensure the page was fetched successfully. |
| 58 | +
|
| 59 | +2. **Parsing with BeautifulSoup**: |
| 60 | +   - `BeautifulSoup` parses the HTML to make it easy to navigate. |
| 61 | +   - Finds all `<article>` tags with class `product_pod`, which contain book information. |
| 62 | +
|
| 63 | +3. **Extracting Data**: |
| 64 | +   - **Title**: Extracts the book title from the `<a>` tag’s `title` attribute within each `<h3>`. |
| 65 | +   - **Price**: Finds the price in the `<p>` tag with class `price_color`. Uses the regex pattern `r"£(\d+\.\d{2})"` to extract the numerical part (e.g., `51.77` from `£51.77`). |
| 66 | +     - `£`: Matches the pound symbol. |
| 67 | +     - `(\d+\.\d{2})`: Captures one or more digits followed by a decimal point and exactly two digits. |
| 68 | +     - `group(1)`: Retrieves the captured numerical part. |
| 69 | +
|
| 70 | +4. **Saving to CSV**: |
| 71 | +   - Writes the titles and prices to a CSV file named `scraped_books.csv`, with a header row. |
| 72 | +   - Uses `utf-8` encoding so that non-ASCII characters, such as the `£` in the header, are written correctly. |
| 73 | +
|
| 74 | +5. **Output**: |
| 75 | +   - Prints a confirmation message and displays the first three scraped books for verification. |
| 76 | +
|
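The extraction and CSV steps described above can be tried without any network access. The following is a minimal sketch, not part of the original script: the helper name `extract_price`, the sample rows, and the in-memory buffer are assumptions made for illustration. It uses the same regex pattern as the script and only the standard library.

```python
import csv
import io
import re

# Same pattern as the script: a pound sign, then digits, a dot, and two digits.
PRICE_PATTERN = re.compile(r"£(\d+\.\d{2})")


def extract_price(price_text):
    """Return the numeric part of a price string, or "N/A" if nothing matches."""
    match = PRICE_PATTERN.search(price_text)
    return match.group(1) if match else "N/A"


# Hypothetical sample rows standing in for scraped (title, price text) pairs.
samples = [
    ("A Light in the Attic", "£51.77"),
    ("Tipping the Velvet", "Â£53.74"),  # a stray leading character does not block the match
    ("Untitled", "Free"),                # no price present, so the fallback is used
]

rows = [[title, extract_price(text)] for title, text in samples]

# Write to an in-memory buffer instead of a file so the sketch has no side effects.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Title", "Price (£)"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because `re.search` scans the whole string, the pattern matches wherever the price appears in the text; a string with no `£` price falls back to `"N/A"`, mirroring the conditional in the main script.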
| 77 | +### Expected Output |
| 78 | +The script creates a file `scraped_books.csv` with content like: |
| 79 | +```csv |
| 80 | +Title,Price (£) |
| 81 | +A Light in the Attic,51.77 |
| 82 | +Tipping the Velvet,53.74 |
| 83 | +Soumission,50.10 |
| 84 | +... |
| 85 | +``` |
| 86 | + |
| 87 | +Console output (sample): |
| 88 | +``` |
| 89 | +Data saved to scraped_books.csv |
| 90 | +
|
| 91 | +Sample Results: |
| 92 | +Book: A Light in the Attic, Price: £51.77 |
| 93 | +Book: Tipping the Velvet, Price: £53.74 |
| 94 | +Book: Soumission, Price: £50.10 |
| 95 | +``` |
| 96 | + |
| 97 | +### Prerequisites |
| 98 | +Install the required libraries: |
| 99 | +```bash |
| 100 | +pip install requests beautifulsoup4 |
| 101 | +``` |
| 102 | + |
| 103 | +### Notes |
| 104 | +- **Website**: The example uses `http://books.toscrape.com/`, a site designed for scraping practice. |
| 105 | +- **Ethics**: Always check a website’s `robots.txt` and terms of service before scraping. |
| 106 | +- **Customization**: You can modify the regex pattern (e.g., `r"\$(\d+\.\d{2})"` for USD) or extract other data like book ratings. |
| 107 | +- **Chart Option**: The scraped prices can also be visualized, for example as a histogram of prices. |
| 108 | + |
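The `robots.txt` check mentioned in the notes can be automated with the standard library's `urllib.robotparser`. The sketch below is illustrative only: the helper name `is_allowed`, the sample rules, and the `example.com` URLs are assumptions, and the rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would use the target site's own file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "http://example.com/catalogue/page-1.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "http://example.com/private/data.html"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads and parses the file instead of `parse()`. A passing check covers only `robots.txt`; the site's terms of service still need to be read separately.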