Closed
Labels: bug, t-tooling
Description
Request deduplication does not always work in the Apify-Scrapy integration.
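For reference, the expected behavior can be sketched as a seen-set keyed by a normalized URL. This is only an illustration of what "deduplicated" means here, not the Apify SDK's or Scrapy's actual implementation; `unique_key` and `SeenFilter` are hypothetical names, and the normalization rules (lowercased scheme and host, empty path treated as `/`, fragment dropped) are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def unique_key(url: str) -> str:
    """Hypothetical normalization used only for this sketch."""
    parts = urlsplit(url)
    # Treat "https://host" and "https://host/" as the same resource.
    path = parts.path or "/"
    # Drop the fragment; it never reaches the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class SeenFilter:
    """Remembers normalized URLs and rejects ones it has already seen."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_enqueue(self, url: str) -> bool:
        key = unique_key(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


seen = SeenFilter()
for candidate in ["https://crawlee.dev/", "https://crawlee.dev", "https://crawlee.dev/docs/examples"]:
    print(candidate, seen.should_enqueue(candidate))
```

With a filter like this in front of the request queue, each distinct page would be scheduled at most once per run.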
Reproduction
- Use the sample code from the Scrapy guide or the Scrapy template.
- Input:
{
  "allowedDomains": [
    "crawlee.dev"
  ],
  "proxyConfiguration": {
    "useApifyProxy": false
  },
  "startUrls": [
    {
      "url": "https://crawlee.dev/",
      "method": "GET"
    }
  ]
}
Observed behavior
- The start URL "https://crawlee.dev/" was crawled five times (five "is parsing" lines in the logs below).
- The URL "https://crawlee.dev/docs/examples" was crawled twice.
Logs:
2025-02-10T17:26:24.729Z ACTOR: Pulling Docker image of build 0VUi8LhZspd5TTEGF from repository.
2025-02-10T17:26:31.307Z ACTOR: Creating Docker container.
2025-02-10T17:26:32.118Z ACTOR: Starting Docker container.
2025-02-10T17:26:35.619Z [apify] INFO  Initializing Actor...
2025-02-10T17:26:35.622Z [apify] INFO  Initializing Actor... ({"message": "Initializing Actor..."})
2025-02-10T17:26:35.625Z [apify] INFO  System info ({"apify_sdk_version": "2.2.2", "apify_client_version": "1.9.1", "crawlee_version": "0.5.4", "python_version": "3.12.8", "os": "linux"})
2025-02-10T17:26:35.628Z [apify] INFO  System info ({"apify_sdk_version": "2.2.2", "apify_client_version": "1.9.1", "crawlee_version": "0.5.4", "python_version": "3.12.8", "os": "linux", "message": "System info"})
2025-02-10T17:26:35.731Z [scrapy.addons] INFO  Enabled addons:
2025-02-10T17:26:35.734Z [] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:35.841Z [scrapy.middleware] INFO  Enabled extensions:
2025-02-10T17:26:35.844Z ['scrapy.extensions.corestats.CoreStats',
2025-02-10T17:26:35.846Z  'scrapy.extensions.memusage.MemoryUsage',
2025-02-10T17:26:35.848Z  'scrapy.extensions.logstats.LogStats'] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:35.851Z [scrapy.crawler] INFO  Overridden settings:
2025-02-10T17:26:35.854Z {'BOT_NAME': 'titlebot',
2025-02-10T17:26:35.856Z  'DEPTH_LIMIT': 1,
2025-02-10T17:26:35.859Z  'LOG_LEVEL': 'INFO',
2025-02-10T17:26:35.862Z  'NEWSPIDER_MODULE': 'src.spiders',
2025-02-10T17:26:35.864Z  'ROBOTSTXT_OBEY': True,
2025-02-10T17:26:35.867Z  'SCHEDULER': 'apify.scrapy.scheduler.ApifyScheduler',
2025-02-10T17:26:35.869Z  'SPIDER_MODULES': ['src.spiders'],
2025-02-10T17:26:35.872Z  'TELNETCONSOLE_ENABLED': False,
2025-02-10T17:26:35.874Z  'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2025-02-10T17:26:36.176Z [apify] INFO  ApifyHttpProxyMiddleware is not going to be used. Actor input field "proxyConfiguration.useApifyProxy" is set to False.
2025-02-10T17:26:36.179Z [apify] INFO  ApifyHttpProxyMiddleware is not going to be used. Actor input field "proxyConfiguration.useApifyProxy" is set to False. ({"message": "ApifyHttpProxyMiddleware is not going to be used. Actor input field \"proxyConfiguration.useApifyProxy\" is set to False."})
2025-02-10T17:26:36.182Z [scrapy.middleware] INFO  Enabled downloader middlewares:
2025-02-10T17:26:36.185Z ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
2025-02-10T17:26:36.188Z  'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
2025-02-10T17:26:36.191Z  'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
2025-02-10T17:26:36.193Z  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
2025-02-10T17:26:36.195Z  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
2025-02-10T17:26:36.198Z  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
2025-02-10T17:26:36.200Z  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
2025-02-10T17:26:36.203Z  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
2025-02-10T17:26:36.205Z  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
2025-02-10T17:26:36.208Z  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
2025-02-10T17:26:36.211Z  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
2025-02-10T17:26:36.213Z  'scrapy.downloadermiddlewares.stats.DownloaderStats'] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:36.215Z [scrapy.middleware] INFO  Enabled spider middlewares:
2025-02-10T17:26:36.218Z ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
2025-02-10T17:26:36.220Z  'scrapy.spidermiddlewares.referer.RefererMiddleware',
2025-02-10T17:26:36.223Z  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
2025-02-10T17:26:36.225Z  'scrapy.spidermiddlewares.depth.DepthMiddleware'] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:36.227Z [scrapy.middleware] INFO  Enabled item pipelines:
2025-02-10T17:26:36.230Z ['apify.scrapy.pipelines.ActorDatasetPushPipeline'] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:36.232Z [scrapy.core.engine] INFO  Spider opened ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:36.343Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:36.832Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:38.548Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:39.136Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/examples>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:39.909Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/blog>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:40.332Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:40.527Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:40.539Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/python>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.042Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/next/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.311Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/core/changelog>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.331Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.663Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.676Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.11/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.897Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/core>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.931Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.10/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.207Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.227Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.9/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.456Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.478Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.8/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.762Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.7/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.972Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.6/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:43.216Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.5/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:43.416Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.4/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:43.700Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.3/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:43.923Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.2/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:44.189Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.1/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:44.423Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.0/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:44.638Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/introduction>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:44.851Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/javascript-rendering>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.061Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/typescript-project>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.288Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/avoid-blocking>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.504Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/cheerio-crawler-guide>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.710Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/jsdom-crawler-guide>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.936Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/javascript-rendering>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:46.127Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/core/class/AutoscaledPool>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:46.372Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/proxy-management>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:46.618Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/result-storage>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:46.787Z [scrapy.spidermiddlewares.urllength] INFO  Ignoring link (url length > 2083): https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcbiAgICBcImNvZGVcIjogXCJpbXBvcnQgeyBQbGF5d3JpZ2h0Q3Jhd2xlciB9IGZyb20gJ2NyYXdsZWUnO1xcblxcbi8vIEltcG9ydCB0aGUgYEFjdG9yYCBjbGFzcyBmcm9tIHRoZSBBcGlmeSBTREsuXFxuaW1wb3J0IHsgQWN0b3IgfSBmcm9tICdhcGlmeSc7XFxuXFxuLy8gU2V0IHVwIHRoZSBpbnRlZ3JhdGlvbiB0byBBcGlmeS5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuLy8gQ3Jhd2xlciBzZXR1cCBmcm9tIHRoZSBwcmV2aW91cyBleGFtcGxlLlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUGxheXdyaWdodENyYXdsZXIoe1xcbiAgICAvLyBVc2UgdGhlIHJlcXVlc3RIYW5kbGVyIHRvIHByb2Nlc3MgZWFjaCBvZiB0aGUgY3Jhd2xlZCBwYWdlcy5cXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlLCBlbnF1ZXVlTGlua3MsIHB1c2hEYXRhLCBsb2cgfSkge1xcbiAgICAgICAgY29uc3QgdGl0bGUgPSBhd2FpdCBwYWdlLnRpdGxlKCk7XFxuICAgICAgICBsb2cuaW5mbyhgVGl0bGUgb2YgJHtyZXF1ZXN0LmxvYWRlZFVybH0gaXMgJyR7dGl0bGV9J2ApO1xcblxcbiAgICAgICAgLy8gU2F2ZSByZXN1bHRzIGFz... [line-too-long]
2025-02-10T17:26:46.851Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/request-storage>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.052Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/utils/namespace/social>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.287Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/utils>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.503Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.776Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.970Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/deployment/aws-cheerio>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.029Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/deployment/gcp-cheerio>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.081Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.114Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/examples>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.426Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/upgrading/upgrading-to-v3>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.500Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/blog>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.530Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/core>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:49.277Z [scrapy.core.engine] INFO  Closing spider (finished) ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:49.281Z [scrapy.statscollectors] INFO  Dumping Scrapy stats:
2025-02-10T17:26:49.283Z {'downloader/request_bytes': 13234,
2025-02-10T17:26:49.286Z  'downloader/request_count': 49,
2025-02-10T17:26:49.288Z  'downloader/request_method_count/GET': 49,
2025-02-10T17:26:49.291Z  'downloader/response_bytes': 1384307,
2025-02-10T17:26:49.293Z  'downloader/response_count': 49,
2025-02-10T17:26:49.296Z  'downloader/response_status_count/200': 49,
2025-02-10T17:26:49.298Z  'elapsed_time_seconds': 12.935337,
2025-02-10T17:26:49.301Z  'finish_reason': 'finished',
2025-02-10T17:26:49.303Z  'finish_time': datetime.datetime(2025, 2, 10, 17, 26, 49, 277550, tzinfo=datetime.timezone.utc),
2025-02-10T17:26:49.306Z  'httpcompression/response_bytes': 8749039,
2025-02-10T17:26:49.308Z  'httpcompression/response_count': 49,
2025-02-10T17:26:49.310Z  'item_scraped_count': 48,
2025-02-10T17:26:49.313Z  'items_per_minute': None,
2025-02-10T17:26:49.316Z  'log_count/INFO': 58,
2025-02-10T17:26:49.318Z  'memusage/max': 105684992,
2025-02-10T17:26:49.320Z  'memusage/startup': 105684992,
2025-02-10T17:26:49.322Z  'offsite/domains': 10,
2025-02-10T17:26:49.324Z  'offsite/filtered': 20,
2025-02-10T17:26:49.327Z  'request_depth_max': 1,
2025-02-10T17:26:49.329Z  'response_received_count': 49,
2025-02-10T17:26:49.332Z  'responses_per_minute': None,
2025-02-10T17:26:49.335Z  'robotstxt/request_count': 1,
2025-02-10T17:26:49.338Z  'robotstxt/response_count': 1,
2025-02-10T17:26:49.340Z  'robotstxt/response_status_count/200': 1,
2025-02-10T17:26:49.343Z  'start_time': datetime.datetime(2025, 2, 10, 17, 26, 36, 342213, tzinfo=datetime.timezone.utc),
2025-02-10T17:26:49.345Z  'urllength/request_ignored_count': 1} ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:49.348Z [scrapy.core.engine] INFO  Spider closed (finished) ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:49.351Z [apify] INFO  Exiting Actor ({"exit_code": 0})
2025-02-10T17:26:49.353Z [apify] INFO  Exiting Actor ({"exit_code": 0, "message": "Exiting Actor"})
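To check a log like the one above for repeated pages, a short script can tally the URLs in the `TitleSpider is parsing <200 ...>` lines. This is a minimal sketch that assumes the log line format shown in this issue; `count_parsed_urls` and `duplicates` are helper names made up for the example, and the sample lines are abbreviated copies of lines from the log.

```python
import re
from collections import Counter

# Captures the URL inside "is parsing <200 URL>..." (format assumed from the log above).
PARSE_LINE = re.compile(r"is parsing <\d{3} ([^>\s]+)>")


def count_parsed_urls(log_text: str) -> Counter:
    """Count how often each URL appears in the spider's 'is parsing' lines."""
    return Counter(PARSE_LINE.findall(log_text))


def duplicates(log_text: str) -> dict[str, int]:
    """Return only the URLs that were parsed more than once."""
    return {url: n for url, n in count_parsed_urls(log_text).items() if n > 1}


# Abbreviated sample taken from the first few parsing lines of the log.
sample = """\
2025-02-10T17:26:36.832Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>...
2025-02-10T17:26:38.548Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>...
2025-02-10T17:26:39.136Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/examples>...
2025-02-10T17:26:39.909Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/blog>...
"""

for url, n in sorted(duplicates(sample).items(), key=lambda kv: -kv[1]):
    print(f"{n}x {url}")  # prints "2x https://crawlee.dev/" for the sample
```

Passing the full run log as `log_text` lists every URL that was parsed more than once in the run.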