Commit cd3008a

Add notes on web scraping
1 parent 5a65a71 commit cd3008a

2 files changed: +93 -0 lines changed

README.md

Lines changed: 67 additions & 0 deletions
@@ -2705,3 +2705,70 @@ with open("cat.json", "w") as file:
contents = file.read()
unfrozen = jsonpickle.decode(contents)
```

## Web Scraping

Web scraping involves programmatically getting data from a web page.

There are three steps: download, extract data, PROFIT!

Why scrape?

- There's data on a site that you want to store or analyze.
- You can't get it by other means (such as an API).
- You want to grab the data programmatically (instead of lots of manual copy/pasting).

Is scraping... ok?

- Some websites don't want people scraping them
- Best practice is to consult the site's `robots.txt` file
- If making many requests, space them out with a pause between each
- If you're too aggressive, your IP can be blocked
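
The points above can be sketched with the standard library alone. This is a minimal illustration, not a complete crawler: the `robots.txt` text, the `notes-scraper` user agent, the example URLs, and the `load_rules` / `polite_fetch` helpers are all made up for this example. It parses rules that are assumed to have been downloaded already, skips disallowed paths, and pauses between requests.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at <site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "notes-scraper"  # made-up name for this sketch


def load_rules(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_fetch(urls, fetch, rules, delay=None):
    """Call fetch(url) for each allowed URL, pausing between requests."""
    if delay is None:
        # Use the site's Crawl-delay when it declares one, else 1 second
        delay = rules.crawl_delay(USER_AGENT) or 1.0
    results = {}
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # the site asked crawlers to stay out of this path
        results[url] = fetch(url)
        time.sleep(delay)
    return results


rules = load_rules(ROBOTS_TXT)
urls = [
    "https://example.com/",
    "https://example.com/private/data",
    "https://example.com/cats",
]
allowed = [url for url in urls if rules.can_fetch(USER_AGENT, url)]
print(allowed)
```

Here `fetch` is whatever function actually downloads a page; passing it in keeps the rule checking and the pause logic separate from the download step.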

### BeautifulSoup

BeautifulSoup lets you navigate through HTML with Python.

BeautifulSoup does not download HTML; to fetch a page, use a separate library such as `requests`.

`BeautifulSoup(html_string, "html.parser")` - parses HTML
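
A rough sketch of how the download step and the parse step fit together, assuming the `requests` library is installed. The helper names `fetch_html` and `page_title` are invented for this example, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def page_title(html_string):
    """Parse an HTML string and return the text of its <title>, or None."""
    soup = BeautifulSoup(html_string, "html.parser")
    return soup.title.get_text() if soup.title else None


# Parsing works on any HTML string, downloaded or not
sample = "<html><head><title>First HTML Page</title></head><body></body></html>"
print(page_title(sample))
```

With a real page, the two steps would be chained as `page_title(fetch_html(url))`, where `url` is whatever address you are allowed to scrape.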

Once parsed, there are several ways to navigate:

- By tag name
- Using `find` - returns one matching tag
- Using `find_all` - returns a list of matching tags
- Using CSS selectors

```python
from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>First HTML Page</title>
</head>
<body>
  <div id="first">
    <h3 data-example="yes">hi</h3>
    <p>more text.</p>
  </div>
  <ol>
    <li class="special">This list item is special.</li>
    <li class="special">This list item is also special.</li>
    <li>This list item is not special.</li>
  </ol>
  <div data-example="yes">bye</div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.body.div)  # by tag name: the first div inside body
a = soup.find(id="first")  # one tag, matched by id
b = soup.find_all(class_="special")  # every tag with class "special"
c = soup.find_all(attrs={"data-example": "yes"})  # match on any attribute
d = soup.select("[data-example]")  # CSS selector: tags that have the attribute
print(d)
```
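
Once a tag has been found, the "extract data" step usually means reading its text or attributes. The following is a small sketch on a trimmed copy of the HTML above; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<div id="first">
  <h3 data-example="yes">hi</h3>
  <p>more text.</p>
</div>
<ol>
  <li class="special">This list item is special.</li>
  <li class="special">This list item is also special.</li>
  <li>This list item is not special.</li>
</ol>
<div data-example="yes">bye</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns a single tag, so child tags and text are read from it directly
first = soup.find(id="first")
heading_text = first.h3.get_text()

# find_all returns a list-like collection, so it can be looped over
special_items = [li.get_text() for li in soup.find_all(class_="special")]

# Attribute values are read with dictionary-style access
tagged = [
    (tag.name, tag["data-example"], tag.get_text())
    for tag in soup.select("[data-example]")
]

# find returns None when nothing matches, so check before using the result
missing = soup.find(id="does-not-exist")

print(heading_text)
print(special_items)
print(tagged)
print(missing)
```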

web_scraping/bs_basics.py

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
from bs4 import BeautifulSoup
html = """
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>First HTML Page</title>
</head>
<body>
  <div id="first">
    <h3 data-example="yes">hi</h3>
    <p>more text.</p>
  </div>
  <ol>
    <li class="special">This list item is special.</li>
    <li class="special">This list item is also special.</li>
    <li>This list item is not special.</li>
  </ol>
  <div data-example="yes">bye</div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
d = soup.select("[data-example]")
print(d)
