diff --git a/docs/cn/guide/proxy.md b/docs/cn/guide/proxy.md
index 8516c06..4f2d9dc 100644
--- a/docs/cn/guide/proxy.md
+++ b/docs/cn/guide/proxy.md
@@ -2,11 +2,36 @@
 配合失败重试,自定义错误次数以及 HTTP 状态码为爬取目标自动轮换代理。
 
-可以在 创建爬虫应用实例、进阶用法、详细目标 这三个地方设置。
+```js{8,9,10,11,12,13,14,15,16}
+import { createCrawl } from 'x-crawl'
+
+const crawlApp = createCrawl()
+
+crawlApp
+  .crawlPage({
+    url: 'https://www.example.com',
+    maxRetry: 10,
+    proxy: {
+      urls: [
+        'https://www.example.com/proxy-1',
+        'https://www.example.com/proxy-2'
+      ],
+      switchByHttpStatus: [401, 403],
+      switchByErrorCount: 3
+    }
+  })
+  .then((res) => {})
+```
 
-以 crawlPage 为例:
+上面的示例中我们使用 `switchByErrorCount` 为每个代理设置了 3 次机会,当 3 次机会用完后就会自动切换到下一个代理。如果提供 `switchByHttpStatus`,那么就会优先根据状态码自动切换代理。
 
-```js
+::: tip
+需要配合 maxRetry 失败重试才能使用,并且 maxRetry 必须大于该目标所有代理的 switchByErrorCount 总和,因为 maxRetry 控制该目标的重试次数。
+:::
+
+**可以在 创建爬虫应用实例、进阶用法、详细目标 这三个地方设置。**
+
+```js{13,17,18,19,20,21,22,23,26,28,29,30,31,32,33,34,35,36}
 import { createCrawl } from 'x-crawl'
 
 const crawlApp = createCrawl()
@@ -46,7 +71,3 @@ crawlApp
   .then((res) => {})
 ```
-
-::: tip
-该功能需要配合失败重试才能正常使用。
-:::
diff --git a/docs/guide/proxy.md b/docs/guide/proxy.md
index bfe14ed..d31ab9e 100644
--- a/docs/guide/proxy.md
+++ b/docs/guide/proxy.md
@@ -2,51 +2,74 @@
 In conjunction with failed retries, customized error times and HTTP status codes automatically rotate agents for crawling targets.
 
-It can be set in three places: Create crawler application instance, advanced usage, and detailed goals.
-
-Take crawlPage as an example:
-
-```js
+```js{8,9,10,11,12,13,14,15,16}
 import { createCrawl } from 'x-crawl'
 
 const crawlApp = createCrawl()
 
 crawlApp
-  .crawlPage({
-    targets: [
-      'https://www.example.com/page-1',
-      'https://www.example.com/page-2',
-      'https://www.example.com/page-3',
-      'https://www.example.com/page-4',
-      // Cancel the proxy for this target
-      { url: 'https://www.example.com/page-6', proxy: null },
-      // Set up a separate proxy for this target
-      {
-        url: 'https://www.example.com/page-6',
-        proxy: {
-          urls: [
-            'https://www.example.com/proxy-4',
-            'https://www.example.com/proxy-5'
-          ],
-          switchByErrorCount: 3
-        }
-      }
-    ],
-    maxRetry: 10,
-    // Set the proxy uniformly for this target
-    proxy: {
-      urls: [
-        'https://www.example.com/proxy-1',
-        'https://www.example.com/proxy-2',
-        'https://www.example.com/proxy-3'
-      ],
-      switchByErrorCount: 3,
-      switchByHttpStatus: [401, 403]
-    }
-  })
-  .then((res) => {})
+  .crawlPage({
+    url: 'https://www.example.com',
+    maxRetry: 10,
+    proxy: {
+      urls: [
+        'https://www.example.com/proxy-1',
+        'https://www.example.com/proxy-2'
+      ],
+      switchByHttpStatus: [401, 403],
+      switchByErrorCount: 3
+    }
+  })
+  .then((res) => {})
 ```
 
+In the above example, we use `switchByErrorCount` to give each proxy 3 chances. When those 3 chances are used up, the crawler automatically switches to the next proxy. If `switchByHttpStatus` is provided, the proxy is switched based on the response status code first.
+
 ::: tip
-This function needs to be retried upon failure to function properly.
+This must be used together with maxRetry (failed retry), and maxRetry must be greater than the sum of the switchByErrorCount values of all the target's proxies, because maxRetry controls the number of retries for that target.
 :::
+
+**It can be set in three places: when creating the crawler application instance, in advanced usage, and in detailed targets.**
+
+Take crawlPage as an example:
+
+```js{13,17,18,19,20,21,22,23,26,28,29,30,31,32,33,34,35,36}
+import { createCrawl } from 'x-crawl'
+
+const crawlApp = createCrawl()
+
+crawlApp
+  .crawlPage({
+    targets: [
+      'https://www.example.com/page-1',
+      'https://www.example.com/page-2',
+      'https://www.example.com/page-3',
+      'https://www.example.com/page-4',
+      // Cancel the proxy for this target
+      { url: 'https://www.example.com/page-5', proxy: null },
+      // Set up a separate proxy for this target
+      {
+        url: 'https://www.example.com/page-6',
+        proxy: {
+          urls: [
+            'https://www.example.com/proxy-4',
+            'https://www.example.com/proxy-5'
+          ],
+          switchByErrorCount: 3
+        }
+      }
+    ],
+    maxRetry: 10,
+    // Set the proxy uniformly for this target
+    proxy: {
+      urls: [
+        'https://www.example.com/proxy-1',
+        'https://www.example.com/proxy-2',
+        'https://www.example.com/proxy-3'
+      ],
+      switchByErrorCount: 3,
+      switchByHttpStatus: [401, 403]
+    }
+  })
+  .then((res) => {})
+```
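The rotation rules this change documents (each proxy gets `switchByErrorCount` chances, and a status code listed in `switchByHttpStatus` forces an immediate switch) can be sketched as follows. This is only an illustration of the documented behavior, not x-crawl's internal implementation; `createProxyRotator` is a hypothetical helper:

```javascript
// Minimal sketch of the documented proxy-rotation rules (not x-crawl code).
function createProxyRotator({ urls, switchByErrorCount = 3, switchByHttpStatus = [] }) {
  let index = 0  // current proxy
  let errors = 0 // failures recorded against the current proxy
  return {
    current() {
      return urls[index]
    },
    // Report the outcome of one attempt; returns the proxy to use next.
    report({ error = false, status = null } = {}) {
      if (status !== null && switchByHttpStatus.includes(status)) {
        // Status-code switching takes priority: rotate immediately.
        index = (index + 1) % urls.length
        errors = 0
      } else if (error) {
        errors += 1
        if (errors >= switchByErrorCount) {
          // This proxy's chances are used up: move to the next one.
          index = (index + 1) % urls.length
          errors = 0
        }
      }
      return urls[index]
    }
  }
}

const rotator = createProxyRotator({
  urls: ['https://www.example.com/proxy-1', 'https://www.example.com/proxy-2'],
  switchByErrorCount: 3,
  switchByHttpStatus: [401, 403]
})

rotator.report({ error: true })              // 1st failure: still proxy-1
rotator.report({ error: true })              // 2nd failure: still proxy-1
console.log(rotator.report({ error: true })) // 3rd failure: switches to proxy-2
console.log(rotator.report({ status: 403 })) // 403: switches immediately
```

This also makes the tip's constraint concrete: each proxy can absorb up to `switchByErrorCount` failures before rotating, so `maxRetry` must exceed the sum of those budgets for the target to ever reach the last proxy.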