
Conversation

@koechkevin
Contributor

@koechkevin koechkevin commented Feb 3, 2026

Description

This PR adds environment-specific robots.txt configuration for TrustLab, allowing different crawler policies for various environments.

  • Ensure robots.txt content can be dynamically updated by users
  • Add CMS-driven robots.txt management for TrustLab via Payload globals and an App Router route
  • Set the default robots.txt to disallow all crawling
  • Add robots.txt parsing utilities, including support for mixed text/object input, additional directives, and stricter lint compliance

Fixes #1394

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Screenshots

(two screenshots attached)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

@koechkevin koechkevin self-assigned this Feb 3, 2026
@koechkevin koechkevin requested a review from a team February 3, 2026 12:16
@kilemensi
Member

I get where you're going with this @koechkevin, but isn't there a way we can upload/set robots.txt via the CMS? I don't believe we should be making code changes every time we want to tweak robots.txt in an app with a full-blown CMS.

... and whether DEV or PROD, the default should always be to block everything.

@koechkevin koechkevin changed the title from "feat: add production robots.txt and conditional copy in Dockerfile" to "feat: add dynamic robots.txt" Feb 4, 2026
@koechkevin koechkevin requested a review from kilemensi February 6, 2026 11:52
@kilemensi
Member

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ca1a4c122


Member

@kilemensi kilemensi left a comment

PR Review: feat: add dynamic robots.txt

Overview

This PR implements CMS-driven robots.txt management for TrustLab, replacing the static file with dynamic generation that supports both plain text and structured object formats.

Summary:

  • ✅ Removes static robots.txt and adds dynamic generation via App Router
  • ✅ Adds Payload CMS field for robots.txt configuration with dual input modes
  • ✅ Implements robots.txt parsing utilities for text-to-metadata conversion
  • ⚠️ Several issues need addressing before merge

Critical Issues

1. Console.log in production code 🔴

File: apps/trustlab/src/lib/data/common/seo.js:114

console.log(JSON.stringify(sanitizedRules, null, 2));
return { rules: sanitizedRules };

Issue: Debug logging should be removed before merging.


2. Incorrect fallback logic 🔴

File: apps/trustlab/src/lib/data/common/seo.js:118-128

export function processRobotsTxtContent(robotsTxt) {
  if (robotsTxt?.format === "object") {
    return (
      {
        rules: robotsTxt.objectContent.ruleSet?.map((entry) =>
          formatRuleSet(entry.rule),
        ),
        sitemap: robotsTxt.objectContent.sitemap ?? null,
        host: robotsTxt.objectContent.host ?? null,
      } || []  // ❌ This || [] is wrong
    );
  }
  return parseRobotsToMetadata(robotsTxt?.textContent || "");
}

Issue: The || [] at line 127 will never be reached because an object literal is always truthy. If the intent was to provide a fallback, this should be handled differently:

const result = {
  rules: robotsTxt.objectContent.ruleSet?.map((entry) =>
    formatRuleSet(entry.rule),
  ) ?? [],
  sitemap: robotsTxt.objectContent.sitemap ?? null,
  host: robotsTxt.objectContent.host ?? null,
};
return result;

3. Missing null/undefined handling 🟡

File: apps/trustlab/src/lib/data/common/seo.js:63-76

const formatRuleSet = (ruleSet) => {
  return {
    userAgent: (ruleSet?.userAgent || "*")
      .split(",")
      .map((agent) => `${agent}`.trim()),
    allow: (ruleSet.allow || "").split(",").map((path) => `${path}`.trim()),  // ❌ ruleSet.allow not ruleSet?.allow
    disallow: (ruleSet.disallow || "")  // ❌ Same here
      .split(",")
      .map((path) => `${path}`.trim()),
    crawlDelay: ruleSet.crawlDelay ?? null,  // ❌ ruleSet?.crawlDelay
  };
};

Issue: Inconsistent use of optional chaining. Lines 68 and 69 use ruleSet.allow and ruleSet.disallow but line 64 uses ruleSet?.userAgent.
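
A consistent version might look like the sketch below (illustrative only; it keeps the PR's output shape and also folds in the empty-value filtering suggested under issue 5):

const splitList = (value) =>
  (value ?? "")
    .split(",")
    .map((item) => item.trim())
    .filter(Boolean); // drop empty entries from blank input or trailing commas

const formatRuleSet = (ruleSet) => {
  const userAgent = splitList(ruleSet?.userAgent);
  return {
    userAgent: userAgent.length ? userAgent : ["*"], // default to all crawlers
    allow: splitList(ruleSet?.allow),
    disallow: splitList(ruleSet?.disallow),
    crawlDelay: ruleSet?.crawlDelay ?? null,
  };
};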


Medium Priority Issues

4. Unclear purpose of legacy_robots.txt.js 🟡

File: apps/trustlab/src/pages/legacy_robots.txt.js

export const getServerSideProps = async (context) => {
  return getPageServerSideProps({
    ...context,
    params: { slugs: ["robots.txt"] },
  });
};

export default () => null;

Question: This seems to route /legacy_robots.txt through the normal page system with slug ["robots.txt"]. Is this intentional? Typically, robots.txt should be served at /robots.txt, which the App Router file (src/app/robots.js) already handles. What's the purpose of this legacy file?


5. Potential empty string handling 🟡

When splitting comma-separated values, empty strings might create unwanted entries:

// If allow = "", this creates [""]
allow: (ruleSet.allow || "").split(",").map((path) => `${path}`.trim())

Suggestion: Filter out empty values:

allow: (ruleSet.allow || "").split(",").map((path) => `${path}`.trim()).filter(Boolean)

6. Missing test coverage 🟡

Complex parsing logic without tests:

  • parseRobotsToMetadata() - Multi-line parsing with state machine
  • formatRuleSet() - Data transformation
  • processRobotsTxtContent() - Format branching logic

Recommendation: Add unit tests for these utility functions.
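
A starting point could be something like the sketch below (assuming Jest or a compatible runner; the import path and expected shapes are assumptions about this PR's code, not verified against it):

import {
  parseRobotsToMetadata,
  processRobotsTxtContent,
} from "@/lib/data/common/seo"; // assumed alias for apps/trustlab/src/lib/data/common/seo.js

describe("parseRobotsToMetadata", () => {
  it("parses a simple disallow-all file", () => {
    const result = parseRobotsToMetadata("User-agent: *\nDisallow: /");
    expect(result.rules).toBeDefined();
  });

  it("handles empty input without throwing", () => {
    expect(() => parseRobotsToMetadata("")).not.toThrow();
  });
});

describe("processRobotsTxtContent", () => {
  it("falls back to text parsing when format is not object", () => {
    const result = processRobotsTxtContent({
      format: "text",
      textContent: "User-agent: *\nDisallow: /",
    });
    expect(result.rules).toBeDefined();
  });
});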


Minor Issues

7. Typo in PR description

"varioust environments" → "various environments"


8. Template string unnecessary 🟢

File: apps/trustlab/src/lib/data/common/seo.js:66

userAgent: (ruleSet?.userAgent || "*")
  .split(",")
  .map((agent) => `${agent}`.trim()),  // ${agent} is already a string

The template literal wrapper ${agent} is redundant since agent is already a string from split; it can be simplified to .map((agent) => agent.trim()).


Questions

  1. Default behavior: With format defaulting to "text" and default content "User-agent: *\nDisallow: /", all crawling is blocked by default. Is this intentional for all environments?

  2. Sitemap handling: The structured format supports sitemap URLs, but how should comma-separated sitemaps be rendered in the final robots.txt output?

  3. Host directive: The Host directive is non-standard and only supported by Yandex. Is this intentional?


Positive Aspects ✅

  1. Good separation of concerns - Parser logic separated from CMS field definitions
  2. Flexible input formats - Supports both text and structured input
  3. Comprehensive regex patterns - Handles various robots.txt directives (including cache-delay, visit-time)
  4. User-friendly CMS interface - Clear field descriptions and helpful placeholders
  5. App Router migration - Uses modern Next.js patterns

Recommendation

Request Changes - The PR needs the following before merge:

  1. Remove console.log (critical)
  2. Fix the || [] logic error (critical)
  3. Add consistent optional chaining in formatRuleSet() (important)
  4. Clarify purpose of legacy_robots.txt.js or remove if unnecessary
  5. Filter empty strings from comma-separated lists
  6. Fix typo in PR description
  7. Consider adding unit tests for parsing logic

🤖 Generated with Claude Code

@kilemensi
Member

For context @koechkevin, the above is not my review. It's from @claude.

@koechkevin koechkevin requested a review from kilemensi February 10, 2026 03:51
Member

@kilemensi kilemensi left a comment

👍🏽 @koechkevin


  1. We shouldn't support "Structured object (Next.js RobotsFile)". I don't think any content editor will even know what that is.
  2. Support all robots.txt fields supported by the Google crawler, e.g. Sitemap is currently missing. If we can support more fields, all good with me, but the core ones must be there.
  3. Validate the robots.txt file when saving in the CMS so that content editors can correct any errors, e.g. the one shown in the screenshot below shouldn't be accepted because boom and bam are not valid directives (see the sketch after the screenshot).
  4. Use your tools! claude, opencode, etc. still show the code can be improved a bit.
(screenshot: CMS robots.txt entry containing invalid boom and bam directives)
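
On point 3, a minimal sketch of what that validation might look like as a Payload field-level validate function (the directive whitelist and wiring are assumptions for illustration, not what this PR implements):

const ALLOWED_DIRECTIVES = new Set([
  "user-agent",
  "allow",
  "disallow",
  "sitemap",
  "crawl-delay",
  "host",
]);

// Returns true when valid; Payload displays a returned string as the validation error.
const validateRobotsTxt = (value) => {
  const lines = String(value ?? "").split(/\r?\n/);
  for (const [index, rawLine] of lines.entries()) {
    const line = rawLine.replace(/#.*$/, "").trim(); // ignore comments and blank lines
    if (!line) continue;
    const sep = line.indexOf(":");
    const directive = (sep === -1 ? line : line.slice(0, sep)).trim().toLowerCase();
    if (!ALLOWED_DIRECTIVES.has(directive)) {
      return `Line ${index + 1}: "${directive}" is not a recognised robots.txt directive.`;
    }
  }
  return true;
};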

@kilemensi
Member

PS: Default robots.txt should always be:

User-agent: *
Disallow: /

Nothing more, nothing less.
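
For reference, that default expressed as a Next.js App Router robots file would be roughly the sketch below (file path assumed from this PR's src/app/robots.js):

// src/app/robots.js: sketch of the default disallow-all policy as a Next.js robots object.
export default function robots() {
  return {
    rules: {
      userAgent: "*", // every crawler
      disallow: "/", // block the whole site
    },
  };
}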

};
}

export async function getServerSideProps(context) {
Member

Is this being used anywhere?

admin: {
language: "plaintext",
rows: 14,
description: "Paste the exact robots.txt text to serve.",
Member

Suggested change
description: "Paste the exact robots.txt text to serve.",
description: "Enter the exact robots.txt text to serve.",

Why paste?

Member

Again, from the bots:


Looking at your code and the Next.js robots object interface, there's a mismatch. The Next.js robots metadata object expects a specific shape like:

type Robots = {
  rules: Rule | Rule[]
  sitemap?: string | string[]
  host?: string
}

type Rule = {
  userAgent?: string | string[]
  allow?: string | string[]
  disallow?: string | string[]
  crawlDelay?: number
}

Your parser is fairly close but has some unnecessary complexity for this use case. A few observations:

  1. cacheDelay, visitTime, cleanParam — Next.js doesn't use these, so they're dead weight if your only goal is producing a Next.js-compatible object.
  2. The regex-per-directive approach is overkill when you're already doing keyLower comparison — just compare strings directly.
  3. appendDirectiveValue / appendUserAgent add complexity to handle the "single value vs array" duality, but you could simplify by always collecting into arrays and flattening at the end.
  4. The autoUserAgent / ensureRule pattern is a bit convoluted for handling implicit * groups.

Here's a leaner version focused specifically on producing a Next.js-compatible robots object:

/**
 * Parses a robots.txt string into a Next.js-compatible robots metadata object.
 * @see https://nextjs.org/docs/app/api-reference/file-conventions/metadata/robots#robots-object
 *
 * @param {string} raw - robots.txt content
 * @returns {{ rules: Rule | Rule[], sitemap?: string | string[], host?: string }}
 */
export default function parseRobotsToNextJs(raw = "") {
  const lines = String(raw).replace(/\r\n?/g, "\n").split("\n");
  const groups = []; // { userAgent: string[], allow: string[], disallow: string[], crawlDelay?: number }
  const sitemaps = [];
  let host = null;
  let cur = null;
  let curHasDirectives = false;

  const ensureGroup = () => {
    if (!cur) {
      cur = { userAgent: ["*"], allow: [], disallow: [] };
      groups.push(cur);
      curHasDirectives = false;
    }
    return cur;
  };

  for (const rawLine of lines) {
    const line = rawLine.replace(/#.*$/, "").trim();
    if (!line) continue;

    const sep = line.indexOf(":");
    if (sep === -1) continue;

    const key = line.slice(0, sep).trim().toLowerCase();
    const val = line.slice(sep + 1).trim();

    switch (key) {
      case "sitemap":
        if (val) sitemaps.push(val);
        break;
      case "host":
        if (val) host = val;
        break;
      case "user-agent":
        if (!cur || curHasDirectives) {
          cur = { userAgent: [val || "*"], allow: [], disallow: [] };
          groups.push(cur);
          curHasDirectives = false;
        } else {
          cur.userAgent.push(val || "*");
        }
        break;
      case "allow":
        ensureGroup().allow.push(val);
        curHasDirectives = true;
        break;
      case "disallow":
        ensureGroup().disallow.push(val);
        curHasDirectives = true;
        break;
      case "crawl-delay": {
        const n = Number(val);
        if (Number.isFinite(n)) ensureGroup().crawlDelay = n;
        curHasDirectives = true;
        break;
      }
      // Ignore unknown directives silently
    }
  }

  // Collapse single-element arrays to plain values for Next.js compat
  const unwrap = (arr) => (arr.length === 1 ? arr[0] : arr.length ? arr : undefined);

  const rules = groups.map((g) => {
    const rule = { userAgent: unwrap(g.userAgent) };
    const allow = unwrap(g.allow);
    const disallow = unwrap(g.disallow);
    if (allow !== undefined) rule.allow = allow;
    if (disallow !== undefined) rule.disallow = disallow;
    if (g.crawlDelay !== undefined) rule.crawlDelay = g.crawlDelay;
    return rule;
  });

  const result = { rules: rules.length === 1 ? rules[0] : rules };
  const sm = unwrap(sitemaps);
  if (sm !== undefined) result.sitemap = sm;
  if (host) result.host = host;
  return result;
}
Key improvements over your original:

  • ~60% less code by dropping unused directives and regex objects in favor of a simple switch on the lowercased key string.
  • Always collects into arrays first, then unwraps at the end — no juggling between scalar/array mid-parse.
  • Strips inline comments (# ...) from lines, which your original didn't handle.
  • Output matches the Next.js shape exactly: rules is a single Rule when there's one group, otherwise Rule[]; same for sitemap.
  • No temporary markers like autoUserAgent that need cleanup.

If you also need diagnostics for debugging malformed files, you could add that back as an opt-in, but for a "parse robots.txt → Next.js metadata" utility, this is all you need.
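
For illustration, a quick usage example of the parser sketched above (module path and sample input are made up):

import parseRobotsToNextJs from "./parseRobotsToNextJs"; // assumed module path

const robots = parseRobotsToNextJs(
  [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
    "Sitemap: https://example.com/sitemap.xml",
  ].join("\n"),
);

// robots:
// {
//   rules: { userAgent: "*", allow: "/", disallow: "/private/" },
//   sitemap: "https://example.com/sitemap.xml",
// }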

};
}

export async function getRobotsTxtContent() {
Member

Since we're parsing the content into a Robots object, shouldn't this be:

Suggested change
export async function getRobotsTxtContent() {
export async function getRobots() {

},
{
name: "robotsTxt",
label: "robots.txt content",
Member

Suggested change
label: "robots.txt content",
label: "robots.txt",


Development

Successfully merging this pull request may close these issues.

TrustLab - Allow Search bots on TrustLab live site : Googlebot, Bingbot, Applebot
