ContextPipeline changes break skipNavigation with CheerioCrawler #3304

@barjin

Description

The following snippet works with Crawlee v3 but breaks on the current v4:

import { CheerioCrawler } from "@crawlee/cheerio";

const crawler = new CheerioCrawler({
    requestHandler: async () => {
        // pass
    },
});

await crawler.run([{
    url: 'http://example.com',
    skipNavigation: true,
}]);

Output:

INFO  CheerioCrawler: Starting the crawler.
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. The `contentType` property is not available - `skipNavigation` was used
    at get contentType (file:///home/jindrichbar/Desktop/apify/crawlee/packages/http-crawler/dist/internals/http-crawler.js:207:27) {"id":"8OamqXBCpPHxyH9","url":"http://example.com","retryCount":1}
ERROR CheerioCrawler: An exception occurred during handling of failed request. This places the crawler and its underlying storages into an unknown state and crawling will be terminated. 
  The `request.loadedUrl` property is not available - `skipNavigation` was used
      at Object.get (file:///home/jindrichbar/Desktop/apify/crawlee/packages/http-crawler/dist/internals/http-crawler.js:177:35)
      at Function.entries (<anonymous>)
      at _ObjectValidator.handleIgnoreStrategy (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:2089:41)
      at _ObjectValidator.handlePassthroughStrategy (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:2170:25)
      at _ObjectValidator.handleStrategy (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:1982:47)
      at _ObjectValidator.handle (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:2081:17)
      at _ObjectValidator.parse (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:964:90)
      at RequestQueueClient.updateRequest (file:///home/jindrichbar/Desktop/apify/crawlee/packages/memory-storage/dist/resource-clients/request-queue.js:366:22)
      at RequestQueue.reclaimRequest (file:///home/jindrichbar/Desktop/apify/crawlee/packages/core/dist/storages/request_provider.js:386:35)
      at RequestQueue.reclaimRequest (file:///home/jindrichbar/Desktop/apify/crawlee/packages/core/dist/storages/request_queue_v2.js:219:33)

The crawler gets hit twice. The first error comes from CheerioCrawler's parseContent, which accesses crawlingContext.contentType:

const isXml = crawlingContext.contentType.type.includes('xml');

The second error comes from Shapeshift's validation in updateRequest, which runs while the crawler is handling the first error and accesses request.loadedUrl (see the stack trace above).

Both errors are caused by the validation Proxy that was added on CrawlingContext and Request in HttpCrawler (link and link).
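
To illustrate why the second error fires even though no user code reads loadedUrl, here is a minimal, self-contained sketch. It is not Crawlee's actual implementation: FakeRequest and guardRequest are hypothetical names, and the sketch assumes that loadedUrl exists as an own enumerable property on the wrapped object (which the `Function.entries` frame in the stack trace suggests). Enumerating a Proxy with Object.entries invokes the `get` trap for every such property, so a throwing trap breaks any validator that walks the object.

```typescript
// Hypothetical stand-in for a request object; NOT Crawlee's real Request class.
interface FakeRequest {
    url: string;
    skipNavigation: boolean;
    loadedUrl: string | undefined;
}

// Wrap the request in a Proxy whose `get` trap throws for a navigation-only
// property when `skipNavigation` is set (assumed to resemble the v4 behaviour).
function guardRequest(request: FakeRequest): FakeRequest {
    return new Proxy(request, {
        get(target, prop, receiver) {
            if (prop === 'loadedUrl' && target.skipNavigation) {
                throw new Error('The `request.loadedUrl` property is not available - `skipNavigation` was used');
            }
            return Reflect.get(target, prop, receiver);
        },
    });
}

const guarded = guardRequest({
    url: 'http://example.com',
    skipNavigation: true,
    loadedUrl: undefined,
});

// Reading an unrelated property passes through the trap untouched.
console.log(guarded.url);

// Object.entries reads every own enumerable property, including `loadedUrl`,
// so the trap throws although nothing asked for `loadedUrl` explicitly.
let enumerationError: string | undefined;
try {
    Object.entries(guarded);
} catch (err) {
    enumerationError = (err as Error).message;
}
console.log(enumerationError);
```

Under these assumptions, any code path that enumerates or validates the whole request object (as Shapeshift's object validator appears to do in the trace) would trip the same trap.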

Labels

t-tooling: Issues with this label are in the ownership of the tooling team.