sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md
17 additions & 17 deletions
@@ -1,31 +1,31 @@
---
title: Inspecting web pages with browser DevTools
sidebar_label: "DevTools: Inspecting"
-description: Lesson about using the browser tools for developers to inspect and manipulate the structure of an e-commerce website.
+description: Lesson about using the browser tools for developers to inspect and manipulate the structure of a website.
slug: /scraping-basics-python/devtools-inspecting
---
import Exercises from './_exercises.mdx';
-**In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of an e-commerce website.**
+**In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website.**
---
A browser is the most complete tool for navigating websites. Scrapers are like automated browsers—and sometimes, they actually are automated browsers. The key difference? There's no user to decide where to go or eyes to see what's displayed. Everything has to be pre-programmed.
-All modern browsers provide developer tools, or DevTools, for website developers to debug their work. We'll use them to understand how websites are structured and identify the behavior our scraper needs to mimic. Here's the typical workflow for creating a scraper:
+All modern browsers provide developer tools, or _DevTools_, for website developers to debug their work. We'll use them to understand how websites are structured and identify the behavior our scraper needs to mimic. Here's the typical workflow for creating a scraper:
1. Inspect the target website in DevTools to understand its structure and determine how to extract the required data.
1. Translate those findings into code.
1. If the scraper fails due to overlooked edge cases or, over time, due to website changes, go back to step 1.
-Now let's spend some time figuring out what the detective work from step 1 is about.
+Now let's spend some time figuring out what the detective work in step 1 is about.
## Opening DevTools
-Google Chrome is currently the most popular browser, and many others use the same core. That's why we'll focus on [Chrome DevTools](https://developer.chrome.com/docs/devtools) here. However, the steps are similar in other browsers like Safari ([Web Inspector](https://developer.apple.com/documentation/safari-developer-tools/web-inspector)) or Firefox ([DevTools](https://firefox-source-docs.mozilla.org/devtools-user/)).
+Google Chrome is currently the most popular browser, and many others use the same core. That's why we'll focus on [Chrome DevTools](https://developer.chrome.com/docs/devtools) here. However, the steps are similar in other browsers, as Safari has its [Web Inspector](https://developer.apple.com/documentation/safari-developer-tools/web-inspector) and Firefox also has [DevTools](https://firefox-source-docs.mozilla.org/devtools-user/).
-Let's peek behind the scenes of a real-world website—say, Wikipedia. Open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Press **F12**, or right-click anywhere on the page and select **Inspect**.
+Now let's peek behind the scenes of a real-world website—say, Wikipedia. We'll open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Then, let's press **F12**, or right-click anywhere on the page and select **Inspect**.

@@ -35,11 +35,11 @@ Websites are built with three main technologies: HTML, CSS, and JavaScript. In t
:::warning Screen adaptations
-On smaller or low-resolution screens, DevTools might look different. For example, the CSS styles section might appear below the HTML elements instead of in the right pane.
+DevTools may appear differently depending on your screen size. For instance, on smaller screens, the CSS panel might move below the HTML elements panel instead of appearing in the right pane.
:::
-Think of [HTML](https://developer.mozilla.org/en-US/docs/Learn/HTML) as the frame that defines a page's structure. A basic HTML element includes an opening tag, a closing tag, and attributes. Here's an `article` element with an `id` attribute. It wraps `h1` and `p` elements, both containing text. Some text is emphasized using `em`.
+Think of [HTML](https://developer.mozilla.org/en-US/docs/Learn/HTML) elements as the frame that defines a page's structure. A basic HTML element includes an opening tag, a closing tag, and attributes. Here's an `article` element with an `id` attribute. It wraps `h1` and `p` elements, both containing text. Some text is emphasized using `em`.
```html
<article id="article-123">
@@ -59,17 +59,17 @@ HTML, a markup language, describes how everything on a page is organized, how el
While HTML and CSS describe what the browser should display, [JavaScript](https://developer.mozilla.org/en-US/docs/Learn/JavaScript) is a general-purpose programming language that adds interaction to the page.
-In DevTools, the **Console** tab allows ad-hoc experimenting with JavaScript. If you don't see it, press **ESC** to toggle the Console. Running commands in the Console lets you manipulate the loaded page—we'll try this shortly.
+In DevTools, the **Console** tab allows ad-hoc experimenting with JavaScript. If you don't see it, press **ESC** to toggle the Console. Running commands in the Console lets us manipulate the loaded page—we'll try this shortly.

## Selecting an element
-In the top-left corner of DevTools, find the icon with an arrow pointing to a square.
+In the top-left corner of DevTools, let's find the icon with an arrow pointing to a square.

-Click the icon and hover your cursor over Wikipedia's subtitle, **The Free Encyclopedia**. As you move your cursor, DevTools will display information about the HTML element under it. Click on the subtitle. In the **Elements** tab, DevTools will highlight the HTML element that represents the subtitle.
+We'll click the icon and hover our cursor over Wikipedia's subtitle, **The Free Encyclopedia**. As we move our cursor, DevTools will display information about the HTML element under it. We'll click on the subtitle. In the **Elements** tab, DevTools will highlight the HTML element that represents the subtitle.

@@ -105,35 +105,35 @@ Encyclopedia
We won't be creating Python scrapers just yet. Let's first get familiar with what we can do in the JavaScript console and how we can further interact with HTML elements on the page.
-In the **Elements** tab, with the subtitle element highlighted, right-click the element to open the context menu. There, choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready.
+In the **Elements** tab, with the subtitle element highlighted, let's right-click the element to open the context menu. There, we'll choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready.

The Console allows us to run JavaScript in the context of the loaded page, similar to Python's [interactive REPL](https://realpython.com/interacting-with-python/). We can use it to play around with elements.
-For a start, let's access some of the subtitle's properties. One such property is `textContent`, which contains the text inside the HTML element. The last line in the Console is where your cursor is. Type the following and hit **Enter**:
+For a start, let's access some of the subtitle's properties. One such property is `textContent`, which contains the text inside the HTML element. The last line in the Console is where your cursor is. We'll type the following and hit **Enter**:
```js
temp1.textContent;
```
-The result should be `'The Free Encyclopedia'`. Now try this:
+The result should be `'The Free Encyclopedia'`. Now let's try this:
```js
temp1.outerHTML;
```
-This should return the element's HTML tag as a string. Finally, run the next line to change the text of the element:
+This should return the element's HTML tag as a string. Finally, we'll run the next line to change the text of the element:
```js
temp1.textContent = 'Hello World!';
```
-When you change elements in the Console, those changes reflect immediately on the page!
+When we change elements in the Console, those changes reflect immediately on the page!

-But don't worry—you haven't hacked Wikipedia. The change only happens in your browser. If you reload the page, your change will disappear. This, however, is an easy way to craft a screenshot with fake content—so screenshots shouldn't be trusted as evidence.
+But don't worry—we haven't hacked Wikipedia. The change only happens in our browser. If we reload the page, the change will disappear. This, however, is an easy way to craft a screenshot with fake content. That's why screenshots shouldn't be trusted as evidence.
We're not here for playing around with elements, though—we want to create a scraper for an e-commerce website to watch prices. In the next lesson, we'll examine the website and use CSS selectors to locate HTML elements containing the data we need.
sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md
11 additions & 11 deletions
@@ -21,7 +21,7 @@ Instead of artificial scraping playgrounds or sandboxes, we'll scrape a real e-c
Live sites like Amazon are complex, loaded with promotions, frequently changing, and equipped with anti-scraping measures. While those challenges are manageable, they're advanced topics. For this beginner course, we're sticking to a lightweight, stable environment.
-That said, we designed all the exercises to work with live websites. This means occasional updates might be needed, but we think it's worth it for a more authentic learning experience.
+That said, we designed all the additional exercises to work with live websites. This means occasional updates might be needed, but we think it's worth it for a more authentic learning experience.
:::
@@ -31,13 +31,13 @@ As mentioned in the previous lesson, before building a scraper, we need to under

-The page displays a grid of product cards, each showing a product's title and picture. Open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. Highlight it in the **Elements** tab by clicking on it.
+The page displays a grid of product cards, each showing a product's title and picture. Let's open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. We'll highlight it in the **Elements** tab by clicking on it.

Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more.
-In the **Elements** tab, move your cursor up from the `a` element containing the subwoofer's title. On the way, hover over each element until you highlight the entire product card. Alternatively, use the arrow-up key. The `div` element you land on is the **parent element**, and all nested elements are its **child elements**.
+In the **Elements** tab, we'll move our cursor up from the `a` element containing the subwoofer's title. On the way, we'll hover over each element until we highlight the entire product card. Alternatively, we can use the arrow-up key. The `div` element we land on is the **parent element**, and all nested elements are its **child elements**.
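The same parent and child relationship can also be explored from the **Console**. A minimal sketch, assuming the `.product-item` class that is introduced a little further below (the variable name is illustrative):

```js
// Grab one product card and look at its relatives in the DOM tree.
const card = document.querySelector('.product-item');
card.parentElement; // the element that wraps the card (its parent)
card.children;      // the card's direct child elements
```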

@@ -55,9 +55,9 @@ The `class` attribute can hold multiple values separated by whitespace. This par
## Programmatically locating a product card
-Let's jump into the **Console** and write some JavaScript. Don't worry—you don't need to know the language, and yes, this is a helpful step on our journey to creating a scraper in Python.
+Let's jump into the **Console** and write some JavaScript. Don't worry—we don't need to know the language, and yes, this is a helpful step on our journey to creating a scraper in Python.
-In browsers, JavaScript represents the current page as the [`Document`](https://developer.mozilla.org/en-US/docs/Web/API/Document) object, accessible via `document`. This object offers many useful methods, including [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector). This method takes a CSS selector as a string and returns the first HTML element that matches. Try typing this into the **Console**:
+In browsers, JavaScript represents the current page as the [`Document`](https://developer.mozilla.org/en-US/docs/Web/API/Document) object, accessible via `document`. This object offers many useful methods, including [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector). This method takes a CSS selector as a string and returns the first HTML element that matches. We'll try typing this into the **Console**:
```js
document.querySelector('.product-item');
@@ -109,33 +109,33 @@ How did we know `.product-item` selects a product card? By inspecting the markup
## Choosing good selectors
-Multiple approaches often exist for creating a CSS selector that targets the element you want. Pick selectors that are simple, readable, unique, and semantically tied to the data. These are **resilient selectors**. They're the most reliable and likely to survive website updates. Avoid randomly generated attributes like `class="F4jsL8"`, as they tend to change without warning.
+Multiple approaches often exist for creating a CSS selector that targets the element we want. We should pick selectors that are simple, readable, unique, and semantically tied to the data. These are **resilient selectors**. They're the most reliable and likely to survive website updates. We'd better avoid randomly generated attributes like `class="F4jsL8"`, as they tend to change without warning.
The product card has four classes: `product-item`, `product-item--vertical`, `1/3--tablet-and-up`, and `1/4--desk`. Only the first one checks all the boxes. A product card *is* a product item, after all. The others seem more about styling—defining how the element looks on the screen—and are probably tied to CSS rules.
-This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, you can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.
+This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.
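As a quick illustration of what a resilient selector buys us in the Console, here's a hedged sketch; the fragile class below is the made-up example from the text, not a real class on the page:

```js
// Resilient: tied to what the element is (a product item inside the product list).
document.querySelector('.product-list .product-item');

// Fragile: tied to a randomly generated styling class that can change without warning.
document.querySelector('.F4jsL8');
```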

## Locating all product cards
-In the **Console**, hovering your cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list.
+In the **Console**, hovering our cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list.

-But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Type this into the **Console**:
+But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Let's type this into the **Console**:
```js
document.querySelectorAll('.product-item');
```
The returned value is a [`NodeList`](https://developer.mozilla.org/en-US/docs/Web/API/NodeList), a collection of nodes. Browsers understand an HTML document as a tree of nodes. Most nodes are HTML elements, but there are also text nodes for plain text, and others.
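A `NodeList` is array-like, so it has a length and can be looped over. A minimal sketch of what that could look like in the Console (the variable name is illustrative):

```js
// Collect all product cards and peek at each one.
const cards = document.querySelectorAll('.product-item');
cards.length; // how many product cards the selector matched
for (const card of cards) {
  console.log(card.textContent.trim().slice(0, 40)); // first 40 characters of each card's text
}
```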
-Expand the result by clicking the small arrow, then hover your cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer!
+We'll expand the result by clicking the small arrow, then hover our cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer!

-To save the subwoofer in a variable for further inspection, use index access with brackets, just like in Python lists (or JavaScript arrays):
+To save the subwoofer in a variable for further inspection, we can use index access with brackets, just like in Python lists (or JavaScript arrays):
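Index access on the `NodeList` would look roughly like the following sketch; the variable name is illustrative, not necessarily the one the lesson uses:

```js
// The subwoofer is the third product card, i.e. index 2.
const subwoofer = document.querySelectorAll('.product-item')[2];
subwoofer.textContent; // inspect it the same way as `temp1` earlier
```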
sources/academy/webscraping/scraping_basics_python/04_downloading_html.md
4 additions & 4 deletions
@@ -15,7 +15,7 @@ Using browser tools for developers is crucial for understanding the structure of
## Starting a Python project
-Before we start coding, we need to set up a Python project. Create new directory with a virtual environment, then inside the directory and with the environment activated, install the HTTPX library:
+Before we start coding, we need to set up a Python project. Let's create a new directory with a virtual environment. Inside the directory and with the environment activated, we'll install the HTTPX library:
```text
$ pip install httpx
@@ -29,15 +29,15 @@ Being comfortable around Python project setup and installing packages is a prere
:::
-Now let's test that all works. Inside the project directory create a new file called `main.py` with the following code:
+Now let's test that all works. Inside the project directory we'll create a new file called `main.py` with the following code:
```py
import httpx
print("OK")
```
-Running it as a Python program will verify that your setup is okay and you've installed HTTPX:
+Running it as a Python program will verify that our setup is okay and we've installed HTTPX:
```text
$ python main.py
@@ -62,7 +62,7 @@ response = httpx.get(url)
print(response.text)
```
-If you run the program now, it should print the downloaded HTML:
+If we run the program now, it should print the downloaded HTML: