Commit 561f282

Add example for web scraping book titles and prices with CSV output
1 parent eadedf0 commit 561f282

File tree

2 files changed

+173
-0
lines changed


23 Day abstract/Example.ipynb

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "669f4d79",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Washing apples...\n",
      "Cutting apples...\n",
      "Squeezing apples...\n",
      "Here is your apple juice!\n"
     ]
    }
   ],
   "source": [
    "class JuiceMachine:\n",
    "    def make_apple_juice(self):\n",
    "        # hidden steps inside\n",
    "        self.__wash_apples()\n",
    "        self.__cut_apples()\n",
    "        self.__squeeze_apples()\n",
    "        print(\"Here is your apple juice!\")\n",
    "\n",
    "    def __wash_apples(self):\n",
    "        print(\"Washing apples...\")\n",
    "\n",
    "    def __cut_apples(self):\n",
    "        print(\"Cutting apples...\")\n",
    "\n",
    "    def __squeeze_apples(self):\n",
    "        print(\"Squeezing apples...\")\n",
    "\n",
    "# User just presses the button\n",
    "machine = JuiceMachine()\n",
    "machine.make_apple_juice()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
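A note on the notebook above: the double-underscore methods rely on Python's name mangling, which hides the internal steps from callers but does not make them truly private. A minimal sketch of that behaviour (class trimmed to a single hidden step):

```python
class JuiceMachine:
    def make_apple_juice(self):
        self.__squeeze_apples()  # resolved inside the class via name mangling

    def __squeeze_apples(self):
        print("Squeezing apples...")

machine = JuiceMachine()
machine.make_apple_juice()  # prints "Squeezing apples..."

# Outside the class the original name does not exist...
try:
    machine.__squeeze_apples()
except AttributeError:
    print("not reachable as __squeeze_apples")

# ...because it was stored under the mangled name _ClassName__name:
machine._JuiceMachine__squeeze_apples()  # prints "Squeezing apples..."
```

This is why the notebook's user-facing code only ever calls `make_apple_juice()`: the `__`-prefixed helpers are an implementation detail.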

58 Day Web Scraping/Example01.md

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
#### This example will scrape book titles and prices from the sample website `http://books.toscrape.com/`, use regex to extract the prices, and save the data to a CSV file. It’s beginner-friendly and includes clear explanations.

```python
import requests
from bs4 import BeautifulSoup
import re
import csv

# Step 1: Fetch the web page
url = "http://books.toscrape.com/"
response = requests.get(url)

# Check if the request was successful
if response.status_code != 200:
    print("Failed to fetch page")
    exit()

# Step 2: Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Step 3: Extract book titles and prices
books = soup.find_all("article", class_="product_pod")
data = []

for book in books:
    # Get the book title
    title = book.find("h3").find("a")["title"]

    # Get the price and use regex to extract the numerical part (e.g., 51.77 from £51.77)
    price_text = book.find("p", class_="price_color").text
    price_match = re.search(r"£(\d+\.\d{2})", price_text)
    price = price_match.group(1) if price_match else "N/A"

    data.append([title, price])

# Step 4: Save to CSV
with open("scraped_books.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Price (£)"])  # Header row
    writer.writerows(data)  # Data rows

print("Data saved to scraped_books.csv")

# Step 5: Print a few results
print("\nSample Results:")
for title, price in data[:3]:  # Show the first 3 books
    print(f"Book: {title}, Price: £{price}")
```

### Explanation
1. **Fetching the Page**:
   - Uses `requests.get()` to download the HTML from `http://books.toscrape.com/`.
   - Checks the response status code to ensure the page was fetched successfully.

2. **Parsing with BeautifulSoup**:
   - `BeautifulSoup` parses the HTML to make it easy to navigate.
   - Finds all `<article>` tags with class `product_pod`, which contain the book information.

3. **Extracting Data**:
   - **Title**: Extracts the book title from the `<a>` tag’s `title` attribute within each `<h3>`.
   - **Price**: Finds the price in the `<p>` tag with class `price_color`. Uses the regex pattern `r"£(\d+\.\d{2})"` to extract the numerical part (e.g., `51.77` from `£51.77`).
     - `£`: Matches the pound symbol.
     - `(\d+\.\d{2})`: Captures one or more digits, a decimal point, and exactly two digits.
     - `group(1)`: Retrieves the captured numerical part.

4. **Saving to CSV**:
   - Writes the titles and prices to a CSV file named `scraped_books.csv` with a header row.
   - Uses `utf-8` encoding to handle special characters such as `£`.

5. **Output**:
   - Prints a confirmation message and displays the first three scraped books for verification.

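The regex step above can be tried in isolation, without any network access; a minimal sketch using the price strings shown in this example:

```python
import re

# Extract the numeric part of a pound price, as in the scraper
price_text = "£51.77"
match = re.search(r"£(\d+\.\d{2})", price_text)
price = match.group(1) if match else "N/A"
print(price)  # 51.77

# The same idea works for other currencies; note that $ must be
# escaped because it is a regex metacharacter (end-of-string anchor):
usd_match = re.search(r"\$(\d+\.\d{2})", "$19.99")
print(usd_match.group(1))  # 19.99
```

The `if match else "N/A"` guard mirrors the scraper: `re.search` returns `None` when nothing matches, so calling `.group(1)` unconditionally would raise an `AttributeError` on malformed input.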
### Expected Output
The script creates a file `scraped_books.csv` with content like:
```csv
Title,Price (£)
A Light in the Attic,51.77
Tipping the Velvet,53.74
Soumission,50.10
...
```

Console output (sample):
```
Data saved to scraped_books.csv

Sample Results:
Book: A Light in the Attic, Price: £51.77
Book: Tipping the Velvet, Price: £53.74
Book: Soumission, Price: £50.10
```

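The CSV structure can be verified by reading the file back with the same `csv` module; a small sketch that writes two of the sample rows above the way the script does, then checks what comes back:

```python
import csv

# Write a couple of sample rows in the same format as the scraper
rows = [["A Light in the Attic", "51.77"], ["Tipping the Velvet", "53.74"]]
with open("scraped_books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Price (£)"])
    writer.writerows(rows)

# Read the file back and inspect the structure
with open("scraped_books.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)   # first row is the header
    data = list(reader)     # remaining rows are the data

print(header)     # ['Title', 'Price (£)']
print(len(data))  # 2
```

Passing `newline=""` on both the write and read side is the form the `csv` documentation recommends, so the module controls line endings itself.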
### Prerequisites
Install the required libraries:
```bash
pip install requests beautifulsoup4
```

### Notes
- **Website**: The example uses `http://books.toscrape.com/`, a site designed for scraping practice.
- **Ethics**: Always check a website’s `robots.txt` and terms of service before scraping.
- **Customization**: You can modify the regex pattern (e.g., `r"\$(\d+\.\d{2})"` for USD prices) or extract other data such as book ratings.
- **Chart Option**: The scraped prices can also be visualized, for example as a histogram.
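The `robots.txt` check mentioned in the notes can be automated with the standard library's `urllib.robotparser`. A minimal offline sketch, parsing hypothetical rules (the `Disallow: /private/` line is illustrative, not the real file for `books.toscrape.com`):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(robots_lines)

print(rp.can_fetch("*", "http://books.toscrape.com/"))           # True
print(rp.can_fetch("*", "http://books.toscrape.com/private/x"))  # False
```

Against a live site you would instead call `rp.set_url("http://books.toscrape.com/robots.txt")` followed by `rp.read()`, then consult `can_fetch()` before each request.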
