with open("cat.json", "r") as file:
    contents = file.read()
    unfrozen = jsonpickle.decode(contents)
```

## Web Scraping

Web scraping means programmatically extracting data from a web page.

There are three steps: download the page, extract the data, PROFIT!

Why scrape?

- There's data on a site that you want to store or analyze.
- You can't get the data by other means (e.g. the site doesn't offer an API).
- You want to grab the data programmatically instead of doing lots of manual copy/pasting.

Is scraping... OK?

- Some websites don't want people scraping them.
- Best practice is to check the site's `robots.txt` file first.
- If you're making many requests, space them out with a delay (see the sketch after this list).
- If you're too aggressive, your IP can be blocked.
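
A minimal sketch of spacing out requests, assuming a hypothetical list of URLs and the third-party `requests` library:

```python
import time

import requests

# Hypothetical URLs, purely for illustration.
urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(2)  # pause between requests so the server isn't hammered
```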
### BeautifulSoup

BeautifulSoup lets you navigate through HTML with Python.

BeautifulSoup does not download HTML; to fetch a page you need something like the `requests` module.

`BeautifulSoup(html_string, "html.parser")` - parses an HTML string
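
For example, a minimal sketch that downloads a page with `requests` and hands the HTML to BeautifulSoup (the URL is just a placeholder):

```python
import requests
from bs4 import BeautifulSoup

# Download the HTML first (BeautifulSoup only parses), then parse it.
response = requests.get("https://example.com")  # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title)
```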

Once parsed, there are several ways to navigate:

- By tag name
- Using `find` - returns the first matching tag
- Using `find_all` - returns a list of all matching tags
- Using CSS selectors with `select`

```python
from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>First HTML Page</title>
</head>
<body>
    <div id="first">
        <h3 data-example="yes">hi</h3>
        <p>more text.</p>
    </div>
    <ol>
        <li class="special">This list item is special.</li>
        <li class="special">This list item is also special.</li>
        <li>This list item is not special.</li>
    </ol>
    <div data-example="yes">bye</div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.body.div)  # navigate by tag name: the first div in the body

a = soup.find(id="first")                         # first tag with id="first"
b = soup.find_all(class_="special")               # all tags with class "special"
c = soup.find_all(attrs={"data-example": "yes"})  # all tags where data-example="yes"
d = soup.select("[data-example]")                 # CSS selector: tags with a data-example attribute
print(d)
```