
RobotsFile.isAllowed returns false for allowed routes #2437

@Sajito

Description

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/utils

Issue description

Load a robots.txt that contains only Disallow rules and check any URL that does not match them. isAllowed should return true, but it returns false.

This happens because the underlying robots-parser package returns undefined for URLs that are not covered by any rule in the robots.txt. The RobotsFile class converts that undefined to false, which is wrong: robots.txt defines exclusion rules, so URLs that match no rule should be allowed.

Either undefined should be converted to true, or the wrapping method should also return undefined.
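
A minimal sketch of the first option (treating undefined as allowed), assuming RobotsFile delegates to a parsed robots-parser instance stored in a field called robots here for illustration; this is not the actual crawlee source:

isAllowed(url: string, userAgent = '*'): boolean {
    // robots-parser returns undefined for URLs that match no rule;
    // treat that as allowed instead of coercing it to false.
    return this.robots.isAllowed(url, userAgent) ?? true;
}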

Code sample

import { RobotsFile } from '@crawlee/utils';

const robotsTxt = `
User-agent: *
Disallow: /private
`;
const robots = RobotsFile.from('https://example.com', robotsTxt);

robots.isAllowed('https://example.com/allowed'); // returns false, should return true

Package version

3.9.2

Node.js version

v21.7.3

Operating system

No response

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response
