Commit 5d764af

style: use WE for the lesson, YOU for exercises (and a few consistency fixes)
1 parent 45da62e commit 5d764af

11 files changed

+101
-85
lines changed

sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md

Lines changed: 17 additions & 17 deletions
Original file line number | Diff line number | Diff line change
@@ -1,31 +1,31 @@
11
---
22
title: Inspecting web pages with browser DevTools
33
sidebar_label: "DevTools: Inspecting"
4-
description: Lesson about using the browser tools for developers to inspect and manipulate the structure of an e-commerce website.
4+
description: Lesson about using the browser tools for developers to inspect and manipulate the structure of a website.
55
slug: /scraping-basics-python/devtools-inspecting
66
---
77

88
import Exercises from './_exercises.mdx';
99

10-
**In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of an e-commerce website.**
10+
**In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website.**
1111

1212
---
1313

1414
A browser is the most complete tool for navigating websites. Scrapers are like automated browsers—and sometimes, they actually are automated browsers. The key difference? There's no user to decide where to go or eyes to see what's displayed. Everything has to be pre-programmed.
1515

16-
All modern browsers provide developer tools, or DevTools, for website developers to debug their work. We'll use them to understand how websites are structured and identify the behavior our scraper needs to mimic. Here's the typical workflow for creating a scraper:
16+
All modern browsers provide developer tools, or _DevTools_, for website developers to debug their work. We'll use them to understand how websites are structured and identify the behavior our scraper needs to mimic. Here's the typical workflow for creating a scraper:
1717

1818
1. Inspect the target website in DevTools to understand its structure and determine how to extract the required data.
1919
1. Translate those findings into code.
2020
1. If the scraper fails due to overlooked edge cases or, over time, due to website changes, go back to step 1.
2121

22-
Now let's spend some time figuring out what the detective work from step 1 is about.
22+
Now let's spend some time figuring out what the detective work in step 1 is about.
2323

2424
## Opening DevTools
2525

26-
Google Chrome is currently the most popular browser, and many others use the same core. That's why we'll focus on [Chrome DevTools](https://developer.chrome.com/docs/devtools) here. However, the steps are similar in other browsers like Safari ([Web Inspector](https://developer.apple.com/documentation/safari-developer-tools/web-inspector)) or Firefox ([DevTools](https://firefox-source-docs.mozilla.org/devtools-user/)).
26+
Google Chrome is currently the most popular browser, and many others use the same core. That's why we'll focus on [Chrome DevTools](https://developer.chrome.com/docs/devtools) here. However, the steps are similar in other browsers: Safari has its [Web Inspector](https://developer.apple.com/documentation/safari-developer-tools/web-inspector) and Firefox has its own [DevTools](https://firefox-source-docs.mozilla.org/devtools-user/).
2727

28-
Let's peek behind the scenes of a real-world website—say, Wikipedia. Open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Press **F12**, or right-click anywhere on the page and select **Inspect**.
28+
Now let's peek behind the scenes of a real-world website—say, Wikipedia. We'll open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Then, let's press **F12**, or right-click anywhere on the page and select **Inspect**.
2929

3030
![Wikipedia with Chrome DevTools open](./images/devtools-wikipedia.png)
3131

@@ -35,11 +35,11 @@ Websites are built with three main technologies: HTML, CSS, and JavaScript. In t
3535

3636
:::warning Screen adaptations
3737

38-
On smaller or low-resolution screens, DevTools might look different. For example, the CSS styles section might appear below the HTML elements instead of in the right pane.
38+
DevTools may appear differently depending on your screen size. For instance, on smaller screens, the CSS panel might move below the HTML elements panel instead of appearing in the right pane.
3939

4040
:::
4141

42-
Think of [HTML](https://developer.mozilla.org/en-US/docs/Learn/HTML) as the frame that defines a page's structure. A basic HTML element includes an opening tag, a closing tag, and attributes. Here's an `article` element with an `id` attribute. It wraps `h1` and `p` elements, both containing text. Some text is emphasized using `em`.
42+
Think of [HTML](https://developer.mozilla.org/en-US/docs/Learn/HTML) elements as the frame that defines a page's structure. A basic HTML element includes an opening tag, a closing tag, and attributes. Here's an `article` element with an `id` attribute. It wraps `h1` and `p` elements, both containing text. Some text is emphasized using `em`.
4343

4444
```html
4545
<article id="article-123">
@@ -59,17 +59,17 @@ HTML, a markup language, describes how everything on a page is organized, how el
5959

6060
While HTML and CSS describe what the browser should display, [JavaScript](https://developer.mozilla.org/en-US/docs/Learn/JavaScript) is a general-purpose programming language that adds interaction to the page.
6161

62-
In DevTools, the **Console** tab allows ad-hoc experimenting with JavaScript. If you don't see it, press **ESC** to toggle the Console. Running commands in the Console lets you manipulate the loaded page—we’ll try this shortly.
62+
In DevTools, the **Console** tab allows ad-hoc experimenting with JavaScript. If you don't see it, press **ESC** to toggle the Console. Running commands in the Console lets us manipulate the loaded page—we’ll try this shortly.
6363

6464
![Console in Chrome DevTools](./images/devtools-console.png)
6565

6666
## Selecting an element
6767

68-
In the top-left corner of DevTools, find the icon with an arrow pointing to a square.
68+
In the top-left corner of DevTools, let's find the icon with an arrow pointing to a square.
6969

7070
![Chrome DevTools element selection tool](./images/devtools-element-selection.png)
7171

72-
Click the icon and hover your cursor over Wikipedia's subtitle, **The Free Encyclopedia**. As you move your cursor, DevTools will display information about the HTML element under it. Click on the subtitle. In the **Elements** tab, DevTools will highlight the HTML element that represents the subtitle.
72+
We'll click the icon and hover our cursor over Wikipedia's subtitle, **The Free Encyclopedia**. As we move our cursor, DevTools will display information about the HTML element under it. We'll click on the subtitle. In the **Elements** tab, DevTools will highlight the HTML element that represents the subtitle.
7373

7474
![Chrome DevTools element hover](./images/devtools-hover.png)
7575

@@ -105,35 +105,35 @@ Encyclopedia
105105

106106
We won't be creating Python scrapers just yet. Let's first get familiar with what we can do in the JavaScript console and how we can further interact with HTML elements on the page.
107107

108-
In the **Elements** tab, with the subtitle element highlighted, right-click the element to open the context menu. There, choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready.
108+
In the **Elements** tab, with the subtitle element highlighted, let's right-click the element to open the context menu. There, we'll choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready.
109109

110110
![Global variable in Chrome DevTools Console](./images/devtools-console-variable.png)
111111

112112
The Console allows us to run JavaScript in the context of the loaded page, similar to Python's [interactive REPL](https://realpython.com/interacting-with-python/). We can use it to play around with elements.
113113

114-
For a start, let's access some of the subtitle's properties. One such property is `textContent`, which contains the text inside the HTML element. The last line in the Console is where your cursor is. Type the following and hit **Enter**:
114+
For a start, let's access some of the subtitle's properties. One such property is `textContent`, which contains the text inside the HTML element. The last line in the Console is where our cursor is. We'll type the following and hit **Enter**:
115115

116116
```js
117117
temp1.textContent;
118118
```
119119

120-
The result should be `'The Free Encyclopedia'`. Now try this:
120+
The result should be `'The Free Encyclopedia'`. Now let's try this:
121121

122122
```js
123123
temp1.outerHTML;
124124
```
125125

126-
This should return the element's HTML tag as a string. Finally, run the next line to change the text of the element:
126+
This should return the element's HTML tag as a string. Finally, we'll run the next line to change the text of the element:
127127

128128
```js
129129
temp1.textContent = 'Hello World!';
130130
```
131131

132-
When you change elements in the Console, those changes reflect immediately on the page!
132+
When we change elements in the Console, those changes reflect immediately on the page!
133133

134134
![Changing textContent in Chrome DevTools Console](./images/devtools-console-textcontent.png)
135135

136-
But don't worry—you haven't hacked Wikipedia. The change only happens in your browser. If you reload the page, your change will disappear. This, however, is an easy way to craft a screenshot with fake content—so screenshots shouldn't be trusted as evidence.
136+
But don't worry—we haven't hacked Wikipedia. The change only happens in our browser. If we reload the page, the change will disappear. This, however, is an easy way to craft a screenshot with fake content. That's why screenshots shouldn't be trusted as evidence.
137137

138138
We're not here for playing around with elements, though—we want to create a scraper for an e-commerce website to watch prices. In the next lesson, we'll examine the website and use CSS selectors to locate HTML elements containing the data we need.
139139

sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md

Lines changed: 11 additions & 11 deletions
Original file line number | Diff line number | Diff line change
@@ -21,7 +21,7 @@ Instead of artificial scraping playgrounds or sandboxes, we'll scrape a real e-c
2121

2222
Live sites like Amazon are complex, loaded with promotions, frequently changing, and equipped with anti-scraping measures. While those challenges are manageable, they're advanced topics. For this beginner course, we're sticking to a lightweight, stable environment.
2323

24-
That said, we designed all the exercises to work with live websites. This means occasional updates might be needed, but we think it's worth it for a more authentic learning experience.
24+
That said, we designed all the additional exercises to work with live websites. This means occasional updates might be needed, but we think it's worth it for a more authentic learning experience.
2525

2626
:::
2727

@@ -31,13 +31,13 @@ As mentioned in the previous lesson, before building a scraper, we need to under
3131

3232
![Warehouse store with DevTools open](./images/devtools-warehouse.png)
3333

34-
The page displays a grid of product cards, each showing a product's title and picture. Open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. Highlight it in the **Elements** tab by clicking on it.
34+
The page displays a grid of product cards, each showing a product's title and picture. Let's open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. We'll highlight it in the **Elements** tab by clicking on it.
3535

3636
![Selecting an element with DevTools](./images/devtools-product-title.png)
3737

3838
Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more.
3939

40-
In the **Elements** tab, move your cursor up from the `a` element containing the subwoofer's title. On the way, hover over each element until you highlight the entire product card. Alternatively, use the arrow-up key. The `div` element you land on is the **parent element**, and all nested elements are its **child elements**.
40+
In the **Elements** tab, we'll move our cursor up from the `a` element containing the subwoofer's title. On the way, we'll hover over each element until we highlight the entire product card. Alternatively, we can use the arrow-up key. The `div` element we land on is the **parent element**, and all nested elements are its **child elements**.
4141

4242
![Selecting an element with hover](./images/devtools-hover-product.png)
4343

@@ -55,9 +55,9 @@ The `class` attribute can hold multiple values separated by whitespace. This par
5555

5656
## Programmatically locating a product card
5757

58-
Let's jump into the **Console** and write some JavaScript. Don't worry—you don't need to know the language, and yes, this is a helpful step on our journey to creating a scraper in Python.
58+
Let's jump into the **Console** and write some JavaScript. Don't worry—we don't need to know the language, and yes, this is a helpful step on our journey to creating a scraper in Python.
5959

60-
In browsers, JavaScript represents the current page as the [`Document`](https://developer.mozilla.org/en-US/docs/Web/API/Document) object, accessible via `document`. This object offers many useful methods, including [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector). This method takes a CSS selector as a string and returns the first HTML element that matches. Try typing this into the **Console**:
60+
In browsers, JavaScript represents the current page as the [`Document`](https://developer.mozilla.org/en-US/docs/Web/API/Document) object, accessible via `document`. This object offers many useful methods, including [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector). This method takes a CSS selector as a string and returns the first HTML element that matches. We'll try typing this into the **Console**:
6161

6262
```js
6363
document.querySelector('.product-item');
@@ -109,33 +109,33 @@ How did we know `.product-item` selects a product card? By inspecting the markup
109109

110110
## Choosing good selectors
111111

112-
Multiple approaches often exist for creating a CSS selector that targets the element you want. Pick selectors that are simple, readable, unique, and semantically tied to the data. These are **resilient selectors**. They're the most reliable and likely to survive website updates. Avoid randomly generated attributes like `class="F4jsL8"`, as they tend to change without warning.
112+
Multiple approaches often exist for creating a CSS selector that targets the element we want. We should pick selectors that are simple, readable, unique, and semantically tied to the data. These are **resilient selectors**. They're the most reliable and likely to survive website updates. We'd better avoid randomly generated attributes like `class="F4jsL8"`, as they tend to change without warning.
113113

114114
The product card has four classes: `product-item`, `product-item--vertical`, `1/3--tablet-and-up`, and `1/4--desk`. Only the first one checks all the boxes. A product card *is* a product item, after all. The others seem more about styling—defining how the element looks on the screen—and are probably tied to CSS rules.
115115

116-
This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, you can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.
116+
This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.
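
The idea of picking one meaningful class out of several can be sketched in Python, the language the course later uses for scraping. This is only an illustration, not part of the commit: the `SAMPLE_HTML` markup and the `ClassCollector` helper are hypothetical, and the sketch relies solely on the standard library's `html.parser`. It splits each `class` attribute on whitespace and keeps the elements whose class list contains `product-item`.

```python
from html.parser import HTMLParser

# Hypothetical markup that mirrors the four classes named in the lesson
SAMPLE_HTML = """
<div class="product-list">
  <div class="product-item product-item--vertical 1/3--tablet-and-up 1/4--desk">First card</div>
  <div class="product-item product-item--vertical 1/3--tablet-and-up 1/4--desk">Second card</div>
</div>
"""


class ClassCollector(HTMLParser):
    """Records the list of class names of every start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.class_lists = []

    def handle_starttag(self, tag, attrs):
        # The class attribute holds values separated by whitespace
        class_value = dict(attrs).get("class") or ""
        self.class_lists.append(class_value.split())


collector = ClassCollector()
collector.feed(SAMPLE_HTML)

# Keep only elements that carry the semantic class, ignoring the styling ones
product_cards = [classes for classes in collector.class_lists if "product-item" in classes]
print(len(product_cards))
```

Matching on the single semantic class means the check keeps working even if the styling classes around it change.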
117117

118118
![Overview of all the product cards in DevTools](./images/devtools-product-list.png)
119119

120120
## Locating all product cards
121121

122-
In the **Console**, hovering your cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list.
122+
In the **Console**, hovering our cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list.
123123

124124
![Highlighting a querySelector() result](./images/devtools-hover-queryselector.png)
125125

126-
But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Type this into the **Console**:
126+
But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Let's type this into the **Console**:
127127

128128
```js
129129
document.querySelectorAll('.product-item');
130130
```
131131

132132
The returned value is a [`NodeList`](https://developer.mozilla.org/en-US/docs/Web/API/NodeList), a collection of nodes. Browsers understand an HTML document as a tree of nodes. Most nodes are HTML elements, but there are also text nodes for plain text, and others.
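
To make the distinction between element nodes and text nodes more concrete, here is a small Python sketch. It is an assumption-laden illustration rather than anything from the lesson: the `SAMPLE_HTML` string and the `NodeLister` class are made up, and the standard library's `html.parser` only reports parsing events, so the flat list below approximates the tree a browser would build.

```python
from html.parser import HTMLParser

# Hypothetical markup: an article wrapping a heading and a paragraph
SAMPLE_HTML = '<article id="article-123"><h1>Title</h1><p>Some <em>emphasized</em> text.</p></article>'


class NodeLister(HTMLParser):
    """Collects element and text nodes in the order they appear in the document."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append(("element", tag))

    def handle_data(self, data):
        self.nodes.append(("text", data))


lister = NodeLister()
lister.feed(SAMPLE_HTML)
lister.close()

for kind, value in lister.nodes:
    print(kind, repr(value))
```

The plain text between tags shows up as separate entries next to the elements, which is why a collection of nodes can contain more than just HTML elements.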
133133

134-
Expand the result by clicking the small arrow, then hover your cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer!
134+
We'll expand the result by clicking the small arrow, then hover our cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer!
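
The same zero-based indexing applies to Python lists. A minimal sketch, assuming a hypothetical `products` list that stands in for the product cards (the second entry is a placeholder, since the lesson doesn't name it):

```python
# Hypothetical stand-ins for the first three product cards on the page
products = [
    "JBL Flip speaker",
    "second product (placeholder)",
    "Sony SACS9 Active Subwoofer",
]

# Indexing starts at 0, so the third item lives at index 2
subwoofer = products[2]
print(subwoofer)
```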
135135

136136
![Highlighting a querySelectorAll() result](./images/devtools-hover-queryselectorall.png)
137137

138-
To save the subwoofer in a variable for further inspection, use index access with brackets, just like in Python lists (or JavaScript arrays):
138+
To save the subwoofer in a variable for further inspection, we can use index access with brackets, just like in Python lists (or JavaScript arrays):
139139

140140
```js
141141
products = document.querySelectorAll('.product-item');

sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 4 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -15,7 +15,7 @@ Using browser tools for developers is crucial for understanding the structure of
1515

1616
## Starting a Python project
1717

18-
Before we start coding, we need to set up a Python project. Create new directory with a virtual environment, then inside the directory and with the environment activated, install the HTTPX library:
18+
Before we start coding, we need to set up a Python project. Let's create a new directory with a virtual environment. Inside the directory and with the environment activated, we'll install the HTTPX library:
1919

2020
```text
2121
$ pip install httpx
@@ -29,15 +29,15 @@ Being comfortable around Python project setup and installing packages is a prere
2929

3030
:::
3131

32-
Now let's test that all works. Inside the project directory create a new file called `main.py` with the following code:
32+
Now let's test that it all works. Inside the project directory, we'll create a new file called `main.py` with the following code:
3333

3434
```py
3535
import httpx
3636

3737
print("OK")
3838
```
3939

40-
Running it as a Python program will verify that your setup is okay and you've installed HTTPX:
40+
Running it as a Python program will verify that our setup is okay and we've installed HTTPX:
4141

4242
```text
4343
$ python main.py
@@ -62,7 +62,7 @@ response = httpx.get(url)
6262
print(response.text)
6363
```
6464

65-
If you run the program now, it should print the downloaded HTML:
65+
If we run the program now, it should print the downloaded HTML:
6666

6767
```text
6868
$ python main.py
