This repository was archived by the owner on Apr 23, 2025. It is now read-only.

Commit 3358637

docs(web-scraping): added documentation for Content trackers utility
1 parent 7c603b2 commit 3358637

5 files changed

+249
-0
lines changed

docs/guides/web_scraping/content.md

Lines changed: 249 additions & 0 deletions
@@ -0,0 +1,249 @@
1+
---
2+
sidebar_position: 2
3+
sidebar_label: Content Trackers
4+
---
5+
6+
# What is a web page content tracker?
7+
8+
The web page content tracker is a utility that lets developers detect and monitor changes in the content of any web page. Alongside [web page resources trackers](./resources.md), it falls under the category of [synthetic monitoring](https://en.wikipedia.org/wiki/Synthetic_monitoring) tools, but it covers a broader set of use cases. These range from ensuring that the deployed application loads only the intended content throughout its lifecycle to tracking changes in arbitrary web content when the application lacks native tracking capabilities. When a change occurs, whether it's caused by a broken deployment or a legitimate content modification, the tracker promptly notifies the user.
9+
10+
On this page, you can find guides on creating and using web page content trackers.
11+
12+
:::note
13+
The `Content extractor` script allows you to extract almost anything as long as it can be considered [**valid markdown-style content**](https://eui.elastic.co/#/editors-syntax/markdown-format#kitchen-sink) and doesn't exceed **200KB** in size. For instance, you can include text, links, images, or even JSON.
14+
:::
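
To stay under the size limit mentioned in the note above, an extractor can measure its output before returning it. The following is a minimal sketch, not part of Secutils.dev: `byteLength` and `clampContent` are hypothetical helpers, and treating 200KB as `200 * 1024` UTF-8 bytes is an assumption.

```javascript
// Assumption: the limit is counted in UTF-8 bytes and 1KB = 1024 bytes.
const MAX_BYTES = 200 * 1024;

// Hypothetical helper: size of a string once encoded as UTF-8.
function byteLength(content) {
  return new TextEncoder().encode(content).length;
}

// Hypothetical helper: returns the content unchanged if it fits,
// otherwise shortens it step by step until it does.
function clampContent(content, maxBytes = MAX_BYTES) {
  if (byteLength(content) <= maxBytes) {
    return content;
  }
  // A string never has more characters than bytes, so start from maxBytes.
  let end = Math.min(content.length, maxBytes);
  while (end > 0 && byteLength(content.slice(0, end)) > maxBytes) {
    end = Math.floor(end * 0.9);
  }
  return content.slice(0, end);
}
```

In a content extractor script, this could be used as `return clampContent(document.body.innerText);`.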
15+
16+
## Create a web page content tracker
17+
18+
In this guide, you'll create a simple content tracker for the top post on [Hacker News](https://news.ycombinator.com/):
19+
20+
1. Navigate to [Web Scraping → Content trackers](https://secutils.dev/ws/web_scraping__content) and click the **Track content** button
21+
2. Configure a new tracker with the following values:
22+
23+
<table class="su-table">
24+
<tbody>
25+
<tr>
26+
<td><b>Name</b></td>
27+
<td>
28+
```
29+
Hacker News Top Post
30+
```
31+
</td>
32+
</tr>
33+
<tr>
34+
<td><b>URL</b></td>
35+
<td>
36+
```
37+
https://news.ycombinator.com
38+
```
39+
</td>
40+
</tr>
41+
<tr>
42+
<td><b>Frequency</b></td>
43+
<td>
44+
```
45+
Manually
46+
```
47+
</td>
48+
</tr>
49+
<tr>
50+
<td><b>Content extractor</b></td>
51+
<td>
52+
```javascript
53+
return document.querySelector('.titleline')?.textContent ?? 'Uh oh!';
54+
```
55+
</td>
56+
</tr>
57+
</tbody>
58+
</table>
59+
60+
3. Click the **Save** button to save the tracker
61+
4. Once the tracker is set up, it will appear in the trackers grid
62+
5. Expand the tracker's row and click the **Update** button to make the first snapshot of the web page content
63+
64+
After a few seconds, the tracker will fetch the content of the top post on Hacker News and display it below the tracker's row. The content includes only the title of the post. However, as noted at the beginning of this guide, the content extractor script allows you to return almost anything, even the entire HTML of the post.
65+
66+
Watch the video demo below to see all the steps mentioned earlier in action:
67+
68+
<video controls preload="metadata" width="100%">
69+
<source src="../../video/guides/web_scraping_content_tracker.webm" type="video/webm" />
70+
<source src="../../video/guides/web_scraping_content_tracker.mp4" type="video/mp4" />
71+
</video>
72+
73+
## Detect changes with a web page content tracker
74+
75+
In this guide, you'll create a web page content tracker and test it with changing content:
76+
77+
1. Navigate to [Web Scraping → Content trackers](https://secutils.dev/ws/web_scraping__content) and click the **Track content** button
78+
2. Configure a new tracker with the following values:
79+
80+
<table class="su-table">
81+
<tbody>
82+
<tr>
83+
<td><b>Name</b></td>
84+
<td>
85+
```
86+
World Clock
87+
```
88+
</td>
89+
</tr>
90+
<tr>
91+
<td><b>URL</b></td>
92+
<td>
93+
```
94+
https://www.timeanddate.com/worldclock/germany/berlin
95+
```
96+
</td>
97+
</tr>
98+
<tr>
99+
<td><b>Delay</b></td>
100+
<td>
101+
```
102+
0
103+
```
104+
</td>
105+
</tr>
106+
<tr>
107+
<td><b>Frequency</b></td>
108+
<td>
109+
```
110+
Hourly
111+
```
112+
</td>
113+
</tr>
114+
<tr>
115+
<td><b>Content extractor</b></td>
116+
<td>
117+
```javascript
118+
const time = document.querySelector('#qlook #ct')?.textContent;
119+
return time
120+
? `Berlin time is [**${time}**](https://www.timeanddate.com/worldclock/germany/berlin)`
121+
: 'Uh oh!';
122+
```
123+
</td>
124+
</tr>
125+
</tbody>
126+
</table>
127+
128+
3. Click the **Save** button to save the tracker
129+
4. Once the tracker is set up, it will appear in the trackers grid with bell and timer icons, indicating that the tracker is configured to regularly check content and send notifications when changes are detected
130+
5. Expand the tracker's row and click the **Update** button to make the first snapshot of the web page content
131+
6. After a few seconds, the tracker will fetch the current Berlin time and render it as markdown with a link to a world clock website:
132+
133+
:::note EXAMPLE
134+
Berlin time is [**01:02:03**](https://www.timeanddate.com/worldclock/germany/berlin)
135+
:::
136+
137+
7. With this configuration, the tracker will check the content of the web page every hour and notify you if any changes are detected.
138+
139+
:::caution NOTE
140+
Normally, Secutils.dev caches web page content for **10 minutes**. This means that even if you click the **Update** button repeatedly, you won't see any changes in web content until the cache expires. If you're testing the content tracker and wish to see changes sooner, you can slightly modify the **Headers** or **Content extractor** script to invalidate the cache. Please note that, unlike changing **Headers** or **Content extractor**, changing the **URL** will completely clear your content history.
141+
:::
142+
143+
Watch the video demo below to see all the steps mentioned earlier in action:
144+
145+
<video controls preload="metadata" width="100%">
146+
<source src="../../video/guides/web_scraping_content_tracker_diff.webm" type="video/webm" />
147+
<source src="../../video/guides/web_scraping_content_tracker_diff.mp4" type="video/mp4" />
148+
</video>
149+
150+
## Annex: Content extractor script examples
151+
152+
In this section, you can find examples of content extractor scripts that extract various content from web pages. Essentially, the script defines a function executed once the web page fully loads, receiving a single `context` argument. The returned value can be anything as long as it can be serialized to a [JSON string](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify#description), including any [valid markdown-style content](https://eui.elastic.co/#/editors-syntax/markdown-format#kitchen-sink).
153+
154+
The `context` argument has the following interface:
155+
156+
```typescript
157+
export interface Context<T = unknown> {
158+
// The content extracted during the previous execution, if available.
159+
previous?: T;
160+
// HTTP response headers returned for the loaded web page.
161+
responseHeaders: Record<string, string>;
162+
}
163+
```
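
As an illustration of the `responseHeaders` property described above, the sketch below turns a few headers into a markdown list. `describeHeaders` is a hypothetical helper, and the casing of header names in `responseHeaders` is an assumption, which is why the keys are normalized to lowercase first.

```javascript
// Hypothetical helper: summarizes selected response headers as a markdown list.
function describeHeaders(context, names = ['content-type', 'last-modified']) {
  const headers = context.responseHeaders ?? {};
  // Assumption: header-name casing is not guaranteed, so normalize it.
  const normalized = Object.fromEntries(
    Object.entries(headers).map(([name, value]) => [name.toLowerCase(), value]),
  );
  return names
    .map((name) => `* **${name}**: ${normalized[name] ?? 'n/a'}`)
    .join('\n');
}
```

In a content extractor script, this could be used as `return describeHeaders(context);`.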
164+
165+
### Track markdown-style content
166+
The script can return any [**valid markdown-style content**](https://eui.elastic.co/#/editors-syntax/markdown-format#kitchen-sink) that Secutils.dev will happily render in preview mode.
167+
168+
```javascript
169+
return `
170+
## Text
171+
### h3 Heading
172+
#### h4 Heading
173+
174+
**This is bold text**
175+
176+
*This is italic text*
177+
178+
~~Strikethrough~~
179+
180+
## Lists
181+
182+
* Item 1
183+
* Item 2
184+
* Item 2a
185+
186+
## Code
187+
188+
\`\`\` js
189+
const foo = (bar) => {
190+
return bar++;
191+
};
192+
193+
console.log(foo(5));
194+
\`\`\`
195+
196+
## Tables
197+
198+
| Option | Description |
199+
| -------- | ------------- |
200+
| Option#1 | Description#1 |
201+
| Option#2 | Description#2 |
202+
203+
## Links
204+
205+
[Link Text](https://secutils.dev)
206+
207+
## Emojis
208+
209+
:wink: :cry: :laughing: :yum:
210+
`;
211+
```
212+
213+
### Track API response
214+
You can use a content tracker to track API responses as well (until a dedicated [`API tracker` utility](https://github.com/secutils-dev/secutils/issues/32) is released). For instance, you can track the response of the [JSONPlaceholder](https://jsonplaceholder.typicode.com/) API:
215+
216+
```javascript
217+
const { url, method, headers, body } = {
218+
url: 'https://jsonplaceholder.typicode.com/posts',
219+
method: 'POST',
220+
headers: { 'Content-Type': 'application/json; charset=UTF-8' },
221+
body: JSON.stringify({ title: 'foo', body: 'bar', userId: 1 }),
222+
};
223+
const response = await fetch(url, { method, headers, body });
224+
return {
225+
status: response.status,
226+
headers: Object.fromEntries(response.headers.entries()),
227+
body: (await response.text()) ?? '',
228+
};
229+
```
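
Returning every response header may cause a change to be reported on each check if the API sends values that vary per request, such as `date`. One possible way to reduce that noise is to keep only the fields you care about. The sketch below is an assumption-laden illustration: `stableSnapshot` is a hypothetical helper, and which headers count as stable depends on the API being tracked.

```javascript
// Hypothetical helper: keeps only selected headers and parses JSON bodies,
// so that per-request values do not show up as content changes.
function stableSnapshot({ status, headers, body }, keepHeaders = ['content-type']) {
  const kept = Object.fromEntries(
    Object.entries(headers).filter(([name]) => keepHeaders.includes(name.toLowerCase())),
  );
  let parsedBody;
  try {
    parsedBody = JSON.parse(body);
  } catch {
    // Not JSON: keep the raw text.
    parsedBody = body;
  }
  return { status, headers: kept, body: parsedBody };
}
```

In the extractor above, the final `return { ... }` could be wrapped as `return stableSnapshot({ ... });`.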
230+
231+
### Use previous content
232+
233+
In the content extractor script, you can use the `context.previous` property to access the content extracted during the previous execution:
234+
235+
```javascript
236+
// Update counter based on the previous content.
237+
return (context.previous ?? 0) + 1;
238+
```
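
The same property can carry structured content between executions. As a rough sketch, the hypothetical `nextSnapshot` helper below stores a numeric value together with its change since the previous execution; the `{ value, delta }` shape is an assumption chosen for illustration.

```javascript
// Hypothetical helper: records the current value and how much it changed
// compared with the previously extracted content.
function nextSnapshot(context, currentValue) {
  const previousValue = context.previous?.value;
  // On the first execution there is no previous content, so the delta is 0.
  const delta = typeof previousValue === 'number' ? currentValue - previousValue : 0;
  return { value: currentValue, delta };
}
```

In a content extractor script, `currentValue` would typically come from the page, for example a number parsed from an element's text.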
239+
240+
### Use external content extractor script
241+
Sometimes, your content extractor script can become large and complicated, making it hard to edit in the Secutils.dev UI. In such cases, you can develop and deploy the script separately in any development environment you prefer. Once the script is deployed, you can use a dynamic `import()` call to asynchronously load it:
242+
243+
```javascript
244+
// This code assumes your script exports a function named `run`.
245+
return import('https://secutils-dev.github.io/secutils-sandbox/markdown-table/markdown-table.js')
246+
.then((module) => module.run(context));
247+
```
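
For orientation, an externally hosted script might look roughly like the sketch below. This is not the content of the `markdown-table.js` file referenced above: `renderTable` and the table rows are hypothetical, and in a deployed ES module `run` would be declared with the `export` keyword.

```javascript
// Hypothetical helper: renders [name, value] pairs as a markdown table.
function renderTable(rows) {
  const header = '| Name | Value |\n| --- | --- |';
  const body = rows.map(([name, value]) => `| ${name} | ${value} |`).join('\n');
  return `${header}\n${body}`;
}

// Entry point called by the tracker-side snippet as `module.run(context)`.
// In the deployed module file this would be `export function run(context)`.
function run(context) {
  return renderTable([
    ['Content type', context.responseHeaders?.['content-type'] ?? 'unknown'],
    ['Has previous content', context.previous === undefined ? 'no' : 'yes'],
  ]);
}
```

Keeping the entry point as a single function that receives `context` lets the snippet in the Secutils.dev UI stay a one-liner while the logic lives in version control.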
248+
249+
You can find more examples of content extractor scripts at the [Secutils.dev Sandbox](https://github.com/secutils-dev/secutils-sandbox/tree/main/content-extractor-scripts) repository.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
