
Conversation

@koechkevin
Contributor

@koechkevin koechkevin commented Feb 3, 2026

Description

This PR adds environment-specific robots.txt configuration for TrustLab, allowing different crawler policies for various environments.

  • Ensure robots.txt content can be dynamically updated by users
  • Add CMS-driven robots.txt management for TrustLab via Payload globals and an App Router route
  • Set the default robots.txt to disallow all crawling
  • Add robots.txt parsing utilities, including support for mixed text/object input, additional directives, and stricter lint compliance

Fixes #1394

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Screenshots

(two screenshots attached)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

@koechkevin koechkevin self-assigned this Feb 3, 2026
@koechkevin koechkevin requested a review from a team February 3, 2026 12:16
@kilemensi
Member

I get where you're going with this @koechkevin, but isn't there a way we can upload/set robots.txt via the CMS? I don't believe we should be making code changes every time we want to tweak robots.txt in an app with a full-blown CMS.

... and whether DEV or PROD, the default should always be to block everything.

@koechkevin koechkevin changed the title from "feat: add production robots.txt and conditional copy in Dockerfile" to "feat: add dynamic robots.txt" Feb 4, 2026
@koechkevin koechkevin requested a review from kilemensi February 6, 2026 11:52
@kilemensi
Member

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ca1a4c122


Member

@kilemensi kilemensi left a comment

PR Review: feat: add dynamic robots.txt

Overview

This PR implements CMS-driven robots.txt management for TrustLab, replacing the static file with dynamic generation that supports both plain text and structured object formats.

Summary:

  • ✅ Removes static robots.txt and adds dynamic generation via App Router
  • ✅ Adds Payload CMS field for robots.txt configuration with dual input modes
  • ✅ Implements robots.txt parsing utilities for text-to-metadata conversion
  • ⚠️ Several issues need addressing before merge

Critical Issues

1. Console.log in production code 🔴

File: apps/trustlab/src/lib/data/common/seo.js:114

console.log(JSON.stringify(sanitizedRules, null, 2));
return { rules: sanitizedRules };

Issue: Debug logging should be removed before merging.


2. Incorrect fallback logic 🔴

File: apps/trustlab/src/lib/data/common/seo.js:118-128

export function processRobotsTxtContent(robotsTxt) {
  if (robotsTxt?.format === "object") {
    return (
      {
        rules: robotsTxt.objectContent.ruleSet?.map((entry) =>
          formatRuleSet(entry.rule),
        ),
        sitemap: robotsTxt.objectContent.sitemap ?? null,
        host: robotsTxt.objectContent.host ?? null,
      } || []  // ❌ This || [] is wrong
    );
  }
  return parseRobotsToMetadata(robotsTxt?.textContent || "");
}

Issue: The || [] at line 127 will never be reached because an object literal is always truthy. If the intent was to provide a fallback, this should be handled differently:

const result = {
  rules: robotsTxt.objectContent.ruleSet?.map((entry) =>
    formatRuleSet(entry.rule),
  ) ?? [],
  sitemap: robotsTxt.objectContent.sitemap ?? null,
  host: robotsTxt.objectContent.host ?? null,
};
return result;

3. Missing null/undefined handling 🟡

File: apps/trustlab/src/lib/data/common/seo.js:63-76

const formatRuleSet = (ruleSet) => {
  return {
    userAgent: (ruleSet?.userAgent || "*")
      .split(",")
      .map((agent) => `${agent}`.trim()),
    allow: (ruleSet.allow || "").split(",").map((path) => `${path}`.trim()),  // ❌ ruleSet.allow not ruleSet?.allow
    disallow: (ruleSet.disallow || "")  // ❌ Same here
      .split(",")
      .map((path) => `${path}`.trim()),
    crawlDelay: ruleSet.crawlDelay ?? null,  // ❌ ruleSet?.crawlDelay
  };
};

Issue: Inconsistent use of optional chaining. Lines 68 and 69 use ruleSet.allow and ruleSet.disallow but line 64 uses ruleSet?.userAgent.
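
A consistent version might look like the sketch below (illustrative only; it keeps the PR's output shape and also folds in the empty-value filtering suggested under issue 5):

const splitList = (value) =>
  (value ?? "")
    .split(",")
    .map((item) => item.trim())
    .filter(Boolean); // drop empty entries from blank input or trailing commas

const formatRuleSet = (ruleSet) => {
  const userAgent = splitList(ruleSet?.userAgent);
  return {
    userAgent: userAgent.length ? userAgent : ["*"], // default to all crawlers
    allow: splitList(ruleSet?.allow),
    disallow: splitList(ruleSet?.disallow),
    crawlDelay: ruleSet?.crawlDelay ?? null,
  };
};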


Medium Priority Issues

4. Unclear purpose of legacy_robots.txt.js 🟡

File: apps/trustlab/src/pages/legacy_robots.txt.js

export const getServerSideProps = async (context) => {
  return getPageServerSideProps({
    ...context,
    params: { slugs: ["robots.txt"] },
  });
};

export default () => null;

Question: This seems to route /legacy_robots.txt through the normal page system with slug ["robots.txt"]. Is this intentional? Typically, robots.txt should be served at /robots.txt, which the App Router file (src/app/robots.js) already handles. What's the purpose of this legacy file?


5. Potential empty string handling 🟡

When splitting comma-separated values, empty strings might create unwanted entries:

// If allow = "", this creates [""]
allow: (ruleSet.allow || "").split(",").map((path) => `${path}`.trim())

Suggestion: Filter out empty values:

allow: (ruleSet.allow || "").split(",").map((path) => `${path}`.trim()).filter(Boolean)

6. Missing test coverage 🟡

Complex parsing logic without tests:

  • parseRobotsToMetadata() - Multi-line parsing with state machine
  • formatRuleSet() - Data transformation
  • processRobotsTxtContent() - Format branching logic

Recommendation: Add unit tests for these utility functions.
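
A starting point could be something like the sketch below (assuming Jest or a compatible runner; the import path and expected shapes are assumptions about this PR's code, not verified against it):

import {
  parseRobotsToMetadata,
  processRobotsTxtContent,
} from "@/lib/data/common/seo"; // assumed alias for apps/trustlab/src/lib/data/common/seo.js

describe("parseRobotsToMetadata", () => {
  it("parses a simple disallow-all file", () => {
    const result = parseRobotsToMetadata("User-agent: *\nDisallow: /");
    expect(result.rules).toBeDefined();
  });

  it("handles empty input without throwing", () => {
    expect(() => parseRobotsToMetadata("")).not.toThrow();
  });
});

describe("processRobotsTxtContent", () => {
  it("falls back to text parsing when format is not object", () => {
    const result = processRobotsTxtContent({
      format: "text",
      textContent: "User-agent: *\nDisallow: /",
    });
    expect(result.rules).toBeDefined();
  });
});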


Minor Issues

7. Typo in PR description

"varioust environments" → "various environments"


8. Template string unnecessary 🟢

File: apps/trustlab/src/lib/data/common/seo.js:66

userAgent: (ruleSet?.userAgent || "*")
  .split(",")
  .map((agent) => `${agent}`.trim()),  // ${agent} is already a string

The template literal wrapper ${agent} is redundant since agent is already a string from split; it can be simplified to .map((agent) => agent.trim()).


Questions

  1. Default behavior: With format defaulting to "text" and default content "User-agent: *\nDisallow: /", all crawling is blocked by default. Is this intentional for all environments?

  2. Sitemap handling: The structured format supports sitemap URLs, but how should comma-separated sitemaps be rendered in the final robots.txt output?

  3. Host directive: The Host directive is non-standard and only supported by Yandex. Is this intentional?


Positive Aspects ✅

  1. Good separation of concerns - Parser logic separated from CMS field definitions
  2. Flexible input formats - Supports both text and structured input
  3. Comprehensive regex patterns - Handles various robots.txt directives (including cache-delay, visit-time)
  4. User-friendly CMS interface - Clear field descriptions and helpful placeholders
  5. App Router migration - Uses modern Next.js patterns

Recommendation

Request Changes - The PR needs the following before merge:

  1. Remove console.log (critical)
  2. Fix the || [] logic error (critical)
  3. Add consistent optional chaining in formatRuleSet() (important)
  4. Clarify purpose of legacy_robots.txt.js or remove if unnecessary
  5. Filter empty strings from comma-separated lists
  6. Fix typo in PR description
  7. Consider adding unit tests for parsing logic

🤖 Generated with Claude Code

@kilemensi
Member

For context @koechkevin, the above is not my review. It's from @claude.

@koechkevin koechkevin requested a review from kilemensi February 10, 2026 03:51
Member

@kilemensi kilemensi left a comment

👍🏽 @koechkevin


  1. We shouldn't support "Structured object (Next.js RobotsFile)". I don't think any content editor will even know what that is.
  2. Support all robots.txt fields supported by the Google crawler, e.g. Sitemap is currently missing. If we can support more fields, all good with me, but the core ones must be there.
  3. Validate the robots.txt file when saving in the CMS so that content editors can correct any errors, e.g. the one shown in the screenshot below shouldn't be accepted because boom and bam are not valid directives (see the sketch after the screenshot).
  4. Use your tools! claude, opencode, etc. still show the code can be improved a bit.
(screenshot: CMS robots.txt entry containing invalid boom and bam directives)
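
On point 3, a minimal sketch of what that validation might look like as a Payload field-level validate function (the directive whitelist and wiring are assumptions for illustration, not what this PR implements):

const ALLOWED_DIRECTIVES = new Set([
  "user-agent",
  "allow",
  "disallow",
  "sitemap",
  "crawl-delay",
  "host",
]);

// Returns true when valid; Payload displays a returned string as the validation error.
const validateRobotsTxt = (value) => {
  const lines = String(value ?? "").split(/\r?\n/);
  for (const [index, rawLine] of lines.entries()) {
    const line = rawLine.replace(/#.*$/, "").trim(); // ignore comments and blank lines
    if (!line) continue;
    const sep = line.indexOf(":");
    const directive = (sep === -1 ? line : line.slice(0, sep)).trim().toLowerCase();
    if (!ALLOWED_DIRECTIVES.has(directive)) {
      return `Line ${index + 1}: "${directive}" is not a recognised robots.txt directive.`;
    }
  }
  return true;
};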

@kilemensi
Member

PS: Default robots.txt should always be:

User-agent: *
Disallow: /

Nothing more, nothing less.
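
For reference, that default expressed as a Next.js App Router robots file would be roughly the sketch below (file path assumed from this PR's src/app/robots.js):

// src/app/robots.js: sketch of the default disallow-all policy as a Next.js robots object.
export default function robots() {
  return {
    rules: {
      userAgent: "*", // every crawler
      disallow: "/", // block the whole site
    },
  };
}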

};
}

export async function getServerSideProps(context) {
Member

Is this being used anywhere?

admin: {
language: "plaintext",
rows: 14,
description: "Paste the exact robots.txt text to serve.",
Member

Suggested change
description: "Paste the exact robots.txt text to serve.",
description: "Enter the exact robots.txt text to serve.",

Why paste?

Member

Again, from the bots:


Looking at your code and the Next.js robots object interface, there's a mismatch. The Next.js robots metadata object expects a specific shape like:

type Robots = {
  rules: Rule | Rule[]
  sitemap?: string | string[]
  host?: string
}

type Rule = {
  userAgent?: string | string[]
  allow?: string | string[]
  disallow?: string | string[]
  crawlDelay?: number
}

Your parser is fairly close but has some unnecessary complexity for this use case. A few observations:

  1. cacheDelay, visitTime, cleanParam — Next.js doesn't use these, so they're dead weight if your only goal is producing a Next.js-compatible object.
  2. The regex-per-directive approach is overkill when you're already doing keyLower comparison — just compare strings directly.
  3. appendDirectiveValue / appendUserAgent add complexity to handle the "single value vs array" duality, but you could simplify by always collecting into arrays and flattening at the end.
  4. The autoUserAgent / ensureRule pattern is a bit convoluted for handling implicit * groups.

Here's a leaner version focused specifically on producing a Next.js-compatible robots object:

/**
 * Parses a robots.txt string into a Next.js-compatible robots metadata object.
 * @see https://nextjs.org/docs/app/api-reference/file-conventions/metadata/robots#robots-object
 *
 * @param {string} raw - robots.txt content
 * @returns {{ rules: Rule | Rule[], sitemap?: string | string[], host?: string }}
 */
export default function parseRobotsToNextJs(raw = "") {
  const lines = String(raw).replace(/\r\n?/g, "\n").split("\n");
  const groups = []; // { userAgent: string[], allow: string[], disallow: string[], crawlDelay?: number }
  const sitemaps = [];
  let host = null;
  let cur = null;
  let curHasDirectives = false;

  const ensureGroup = () => {
    if (!cur) {
      cur = { userAgent: ["*"], allow: [], disallow: [] };
      groups.push(cur);
      curHasDirectives = false;
    }
    return cur;
  };

  for (const rawLine of lines) {
    const line = rawLine.replace(/#.*$/, "").trim();
    if (!line) continue;

    const sep = line.indexOf(":");
    if (sep === -1) continue;

    const key = line.slice(0, sep).trim().toLowerCase();
    const val = line.slice(sep + 1).trim();

    switch (key) {
      case "sitemap":
        if (val) sitemaps.push(val);
        break;
      case "host":
        if (val) host = val;
        break;
      case "user-agent":
        if (!cur || curHasDirectives) {
          cur = { userAgent: [val || "*"], allow: [], disallow: [] };
          groups.push(cur);
          curHasDirectives = false;
        } else {
          cur.userAgent.push(val || "*");
        }
        break;
      case "allow":
        ensureGroup().allow.push(val);
        curHasDirectives = true;
        break;
      case "disallow":
        ensureGroup().disallow.push(val);
        curHasDirectives = true;
        break;
      case "crawl-delay": {
        const n = Number(val);
        if (Number.isFinite(n)) ensureGroup().crawlDelay = n;
        curHasDirectives = true;
        break;
      }
      // Ignore unknown directives silently
    }
  }

  // Collapse single-element arrays to plain values for Next.js compat
  const unwrap = (arr) => (arr.length === 1 ? arr[0] : arr.length ? arr : undefined);

  const rules = groups.map((g) => {
    const rule = { userAgent: unwrap(g.userAgent) };
    const allow = unwrap(g.allow);
    const disallow = unwrap(g.disallow);
    if (allow !== undefined) rule.allow = allow;
    if (disallow !== undefined) rule.disallow = disallow;
    if (g.crawlDelay !== undefined) rule.crawlDelay = g.crawlDelay;
    return rule;
  });

  const result = { rules: rules.length === 1 ? rules[0] : rules };
  const sm = unwrap(sitemaps);
  if (sm !== undefined) result.sitemap = sm;
  if (host) result.host = host;
  return result;
}
Key improvements over your original:

  • ~60% less code by dropping unused directives and regex objects in favor of a simple switch on the lowercased key string.
  • Always collects into arrays first, then unwraps at the end — no juggling between scalar/array mid-parse.
  • Strips inline comments (# ...) from lines, which your original didn't handle.
  • Output matches the Next.js shape exactly: rules is a single Rule when there's one group, otherwise Rule[]; same for sitemap.
  • No temporary markers like autoUserAgent that need cleanup.

If you also need diagnostics for debugging malformed files, you could add that back as an opt-in, but for a "parse robots.txt → Next.js metadata" utility, this is all you need.
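
For illustration, a quick usage example of the parser sketched above (module path and sample input are made up):

import parseRobotsToNextJs from "./parseRobotsToNextJs"; // assumed module path

const robots = parseRobotsToNextJs(
  [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
    "Sitemap: https://example.com/sitemap.xml",
  ].join("\n"),
);

// robots:
// {
//   rules: { userAgent: "*", allow: "/", disallow: "/private/" },
//   sitemap: "https://example.com/sitemap.xml",
// }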

};
}

export async function getRobotsTxtContent() {
Member

Since we're parsing the content into a Robots object, shouldn't this be:

Suggested change
export async function getRobotsTxtContent() {
export async function getRobots() {

},
{
name: "robotsTxt",
label: "robots.txt content",
Member

Suggested change
label: "robots.txt content",
label: "robots.txt",


Development

Successfully merging this pull request may close these issues.

TrustLab - Allow Search bots on TrustLab live site : Googlebot, Bingbot, Applebot
