Open
Labels: t-tooling (issues with this label are in the ownership of the tooling team)
Description
The following snippet works with Crawlee v3 but breaks on the current v4:
import { CheerioCrawler } from "@crawlee/cheerio";

const crawler = new CheerioCrawler({
    requestHandler: async () => {
        // pass
    },
});

await crawler.run([{
    url: 'http://example.com',
    skipNavigation: true,
}]);

INFO CheerioCrawler: Starting the crawler.
WARN CheerioCrawler: Reclaiming failed request back to the list or queue. The `contentType` property is not available - `skipNavigation` was used
at get contentType (file:///home/jindrichbar/Desktop/apify/crawlee/packages/http-crawler/dist/internals/http-crawler.js:207:27) {"id":"8OamqXBCpPHxyH9","url":"http://example.com","retryCount":1}
ERROR CheerioCrawler: An exception occurred during handling of failed request. This places the crawler and its underlying storages into an unknown state and crawling will be terminated.
The `request.loadedUrl` property is not available - `skipNavigation` was used
at Object.get (file:///home/jindrichbar/Desktop/apify/crawlee/packages/http-crawler/dist/internals/http-crawler.js:177:35)
at Function.entries (<anonymous>)
at _ObjectValidator.handleIgnoreStrategy (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:2089:41)
at _ObjectValidator.handlePassthroughStrategy (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:2170:25)
at _ObjectValidator.handleStrategy (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:1982:47)
at _ObjectValidator.handle (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:2081:17)
at _ObjectValidator.parse (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:964:90)
at RequestQueueClient.updateRequest (file:///home/jindrichbar/Desktop/apify/crawlee/packages/memory-storage/dist/resource-clients/request-queue.js:366:22)
at RequestQueue.reclaimRequest (file:///home/jindrichbar/Desktop/apify/crawlee/packages/core/dist/storages/request_provider.js:386:35)
at RequestQueue.reclaimRequest (file:///home/jindrichbar/Desktop/apify/crawlee/packages/core/dist/storages/request_queue_v2.js:219:33)
The crawler gets a double whammy. First, CheerioCrawler's parseContent accesses crawlingContext.contentType:
    const isXml = crawlingContext.contentType.type.includes('xml');
Then, while the crawler is handling the error above, Shapeshift's validation in updateRequest accesses request.loadedUrl:
    requestShape.parse(request);
Both failures are caused by the addition of the validation Proxy on CrawlingContext and Request in HttpCrawler (link and link).
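To illustrate the mechanism behind the second failure, here is a minimal sketch that does not use Crawlee at all. The `guardSkippedNavigation` helper, its error message, and the sample object are hypothetical stand-ins, not the actual HttpCrawler code. The sketch assumes the real Proxy behaves similarly: it throws from its `get` trap for properties that are unavailable after `skipNavigation`. Any code that enumerates the object, as `Object.entries` does in the stack trace above, then trips over the guarded property even though it never asked for it by name.

```javascript
// Hypothetical stand-in for the validation Proxy; NOT Crawlee's actual implementation.
function guardSkippedNavigation(target, guardedProps) {
    return new Proxy(target, {
        get(obj, prop, receiver) {
            // Assumption: guarded properties throw on any read.
            if (guardedProps.includes(prop)) {
                throw new Error(`The \`${String(prop)}\` property is not available - \`skipNavigation\` was used`);
            }
            return Reflect.get(obj, prop, receiver);
        },
    });
}

// Sample object loosely shaped like a request; field values are made up.
const request = guardSkippedNavigation(
    { url: 'http://example.com', loadedUrl: undefined, retryCount: 1 },
    ['loadedUrl'],
);

// Reading an unguarded property works as usual.
const url = request.url;

// Object.entries reads every own enumerable property through the `get` trap,
// so it hits the guarded property and throws.
let enumerationError;
try {
    Object.entries(request);
} catch (err) {
    enumerationError = err;
}
```

In this sketch, validators or serializers that walk all keys of the proxied object fail in the same way, which matches the `Function.entries` frame in the trace.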