docs: perfect agency part

coder-hxl committed Apr 21, 2024
1 parent 8049d0a commit 080cf53
Showing 2 changed files with 90 additions and 46 deletions.
35 changes: 28 additions & 7 deletions docs/cn/guide/proxy.md

Combined with failed retries, proxies are automatically rotated for crawl targets based on a customizable error count and HTTP status codes.

```js{8,9,10,11,12,13,14,15,16}
import { createCrawl } from 'x-crawl'

const crawlApp = createCrawl()

crawlApp
  .crawlPage({
    url: 'https://www.example.com',
    maxRetry: 10,
    proxy: {
      urls: [
        'https://www.example.com/proxy-1',
        'https://www.example.com/proxy-2'
      ],
      switchByHttpStatus: [401, 403],
      switchByErrorCount: 3
    }
  })
  .then((res) => {})
```

In the example above, `switchByErrorCount` gives each proxy 3 chances; once those chances are used up, the next proxy is switched in automatically. If `switchByHttpStatus` is provided, proxies are switched based on the HTTP status code first.

::: tip
This feature must be used together with failed retries (maxRetry), and maxRetry must be greater than the sum of switchByErrorCount across all of the target's proxies, because maxRetry controls how many times that target is retried.
:::

**It can be set in three places: when creating the crawler application instance, in advanced usage, and in the detailed targets.**

Take crawlPage as an example:

```js{13,17,18,19,20,21,22,23,26,28,29,30,31,32,33,34,35,36}
import { createCrawl } from 'x-crawl'

const crawlApp = createCrawl()

crawlApp
  .crawlPage({
    targets: [
      'https://www.example.com/page-1',
      'https://www.example.com/page-2',
      'https://www.example.com/page-3',
      'https://www.example.com/page-4',
      // Cancel the proxy for this target
      { url: 'https://www.example.com/page-6', proxy: null },
      // Set up a separate proxy for this target
      {
        url: 'https://www.example.com/page-6',
        proxy: {
          urls: [
            'https://www.example.com/proxy-4',
            'https://www.example.com/proxy-5'
          ],
          switchByErrorCount: 3
        }
      }
    ],
    maxRetry: 10,
    // Set the proxy uniformly for this target
    proxy: {
      urls: [
        'https://www.example.com/proxy-1',
        'https://www.example.com/proxy-2',
        'https://www.example.com/proxy-3'
      ],
      switchByErrorCount: 3,
      switchByHttpStatus: [401, 403]
    }
  })
  .then((res) => {})
```

101 changes: 62 additions & 39 deletions docs/guide/proxy.md

Combined with failed retries, proxies are automatically rotated for crawl targets based on a customizable error count and HTTP status codes.

```js{8,9,10,11,12,13,14,15,16}
import { createCrawl } from 'x-crawl'

const crawlApp = createCrawl()

crawlApp
  .crawlPage({
    url: 'https://www.example.com',
    maxRetry: 10,
    proxy: {
      urls: [
        'https://www.example.com/proxy-1',
        'https://www.example.com/proxy-2'
      ],
      switchByHttpStatus: [401, 403],
      switchByErrorCount: 3
    }
  })
  .then((res) => {})
```

In the example above, `switchByErrorCount` gives each proxy 3 chances; once those chances are used up, the next proxy is switched in automatically. If `switchByHttpStatus` is provided, proxies are switched based on the HTTP status code first.

::: tip
This feature must be used together with failed retries (maxRetry), and maxRetry must be greater than the sum of switchByErrorCount across all of the target's proxies, because maxRetry controls how many times that target is retried.
:::
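As a quick sanity check on that rule, a small helper (hypothetical, not part of x-crawl's API) can compute the smallest maxRetry a target's proxy options allow:

```javascript
// Hypothetical helper (not part of x-crawl): given a target's proxy options,
// compute the smallest maxRetry that still lets every proxy use up its
// switchByErrorCount chances, per the tip above.
function minMaxRetry({ urls, switchByErrorCount }) {
  // Each proxy URL gets `switchByErrorCount` chances, so the total number of
  // proxy-consuming failures is urls.length * switchByErrorCount; maxRetry
  // must be strictly greater than that sum.
  return urls.length * switchByErrorCount + 1
}

// The simple example above: 2 proxies with 3 chances each means the sum is 6,
// so any maxRetry of at least 7 works and maxRetry: 10 is safe.
const proxy = {
  urls: [
    'https://www.example.com/proxy-1',
    'https://www.example.com/proxy-2'
  ],
  switchByErrorCount: 3
}
console.log(minMaxRetry(proxy)) // 7
```
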

**It can be set in three places: when creating the crawler application instance, in advanced usage, and in the detailed targets.**

Take crawlPage as an example:

```js{13,17,18,19,20,21,22,23,26,28,29,30,31,32,33,34,35,36}
import { createCrawl } from 'x-crawl'

const crawlApp = createCrawl()

crawlApp
  .crawlPage({
    targets: [
      'https://www.example.com/page-1',
      'https://www.example.com/page-2',
      'https://www.example.com/page-3',
      'https://www.example.com/page-4',
      // Cancel the proxy for this target
      { url: 'https://www.example.com/page-6', proxy: null },
      // Set up a separate proxy for this target
      {
        url: 'https://www.example.com/page-6',
        proxy: {
          urls: [
            'https://www.example.com/proxy-4',
            'https://www.example.com/proxy-5'
          ],
          switchByErrorCount: 3
        }
      }
    ],
    maxRetry: 10,
    // Set the proxy uniformly for this target
    proxy: {
      urls: [
        'https://www.example.com/proxy-1',
        'https://www.example.com/proxy-2',
        'https://www.example.com/proxy-3'
      ],
      switchByErrorCount: 3,
      switchByHttpStatus: [401, 403]
    }
  })
  .then((res) => {})
```
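The rotation policy described above can be sketched as a small state machine. This is a simplified illustration of the behavior, not x-crawl's actual implementation; the `ProxyRotator` class and its `report` method are hypothetical names:

```javascript
// Simplified sketch of the rotation policy described above (not x-crawl's
// real implementation). It tracks one error counter per active proxy,
// rotates when a proxy exhausts its switchByErrorCount chances, and rotates
// immediately when the response status is listed in switchByHttpStatus.
class ProxyRotator {
  constructor({ urls, switchByErrorCount = 0, switchByHttpStatus = [] }) {
    this.urls = urls
    this.maxErrors = switchByErrorCount
    this.badStatuses = new Set(switchByHttpStatus)
    this.index = 0
    this.errors = 0
  }

  current() {
    return this.urls[this.index]
  }

  // Report the outcome of one attempt; returns the proxy for the next try.
  report({ error = false, status = null } = {}) {
    // A listed status code switches proxies immediately, taking priority
    // over the error counter.
    if (status !== null && this.badStatuses.has(status)) {
      return this.rotate()
    }
    // Other failures consume one of this proxy's chances.
    if (error) {
      this.errors += 1
      if (this.errors >= this.maxErrors) return this.rotate()
    }
    return this.current()
  }

  rotate() {
    this.index = (this.index + 1) % this.urls.length
    this.errors = 0
    return this.current()
  }
}

const rotator = new ProxyRotator({
  urls: ['https://www.example.com/proxy-1', 'https://www.example.com/proxy-2'],
  switchByErrorCount: 3,
  switchByHttpStatus: [401, 403]
})

rotator.report({ error: true }) // chance 1 of 3, still proxy-1
rotator.report({ status: 403 }) // listed status: switch immediately
console.log(rotator.current()) // https://www.example.com/proxy-2
```

The status-code check runs before the error counter, matching the note that `switchByHttpStatus` takes priority when both are configured.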
