feat: add dynamic robots.txt #1395
base: main
Conversation
I get where you're going with this @koechkevin but isn't there a way where we can upload/set ... and whether
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ca1a4c122
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
kilemensi
left a comment
PR Review: feat: add dynamic robots.txt
Overview
This PR implements CMS-driven robots.txt management for TrustLab, replacing the static file with dynamic generation that supports both plain text and structured object formats.
Summary:
- ✅ Removes static robots.txt and adds dynamic generation via App Router
- ✅ Adds Payload CMS field for robots.txt configuration with dual input modes
- ✅ Implements robots.txt parsing utilities for text-to-metadata conversion
- ⚠️ Several issues need addressing before merge
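For orientation, the App Router side of this typically looks like the minimal sketch below. The helper name getRobotsTxtContent comes from this PR, but the import path, the destructured return shape, and the blocking fallback are assumptions made for illustration:
// app/robots.js — sketch of the Next.js App Router metadata convention used here
import { getRobotsTxtContent } from "@/lib/data/common/seo"; // path assumed

export default async function robots() {
  // Pull the CMS-configured settings and return them in the
  // Next.js Robots metadata shape: { rules, sitemap?, host? }.
  const { rules, sitemap, host } = await getRobotsTxtContent();
  return {
    rules: rules ?? { userAgent: "*", disallow: "/" }, // blocking fallback, mirroring the PR's default
    ...(sitemap ? { sitemap } : {}),
    ...(host ? { host } : {}),
  };
}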
Critical Issues
1. Console.log in production code 🔴
File: apps/trustlab/src/lib/data/common/seo.js:114
console.log(JSON.stringify(sanitizedRules, null, 2));
return { rules: sanitizedRules };
Issue: Debug logging should be removed before merging.
2. Incorrect fallback logic 🔴
File: apps/trustlab/src/lib/data/common/seo.js:118-128
export function processRobotsTxtContent(robotsTxt) {
if (robotsTxt?.format === "object") {
return (
{
rules: robotsTxt.objectContent.ruleSet?.map((entry) =>
formatRuleSet(entry.rule),
),
sitemap: robotsTxt.objectContent.sitemap ?? null,
host: robotsTxt.objectContent.host ?? null,
} || [] // ❌ This || [] is wrong
);
}
return parseRobotsToMetadata(robotsTxt?.textContent || "");
}
Issue: The || [] at line 127 will never be reached because an object literal is always truthy. If the intent was to provide a fallback, this should be handled differently:
const result = {
rules: robotsTxt.objectContent.ruleSet?.map((entry) =>
formatRuleSet(entry.rule),
) ?? [],
sitemap: robotsTxt.objectContent.sitemap ?? null,
host: robotsTxt.objectContent.host ?? null,
};
return result;
3. Missing null/undefined handling 🟡
File: apps/trustlab/src/lib/data/common/seo.js:63-76
const formatRuleSet = (ruleSet) => {
return {
userAgent: (ruleSet?.userAgent || "*")
.split(",")
.map((agent) => `${agent}`.trim()),
allow: (ruleSet.allow || "").split(",").map((path) => `${path}`.trim()), // ❌ ruleSet.allow not ruleSet?.allow
disallow: (ruleSet.disallow || "") // ❌ Same here
.split(",")
.map((path) => `${path}`.trim()),
crawlDelay: ruleSet.crawlDelay ?? null, // ❌ ruleSet?.crawlDelay
};
};
Issue: Inconsistent use of optional chaining. Lines 68 and 69 use ruleSet.allow and ruleSet.disallow but line 64 uses ruleSet?.userAgent.
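For reference, a version of formatRuleSet with consistent optional chaining might look like this (a sketch only; it keeps the PR's comma-separated input convention and also drops the redundant template literals flagged in issue 8):
const formatRuleSet = (ruleSet) => ({
  // Optional chaining applied uniformly so a missing ruleSet can't throw
  userAgent: (ruleSet?.userAgent || "*").split(",").map((agent) => agent.trim()),
  allow: (ruleSet?.allow || "").split(",").map((path) => path.trim()),
  disallow: (ruleSet?.disallow || "").split(",").map((path) => path.trim()),
  crawlDelay: ruleSet?.crawlDelay ?? null,
});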
Medium Priority Issues
4. Unclear purpose of legacy_robots.txt.js 🟡
File: apps/trustlab/src/pages/legacy_robots.txt.js
export const getServerSideProps = async (context) => {
return getPageServerSideProps({
...context,
params: { slugs: ["robots.txt"] },
});
};
export default () => null;
Question: This seems to route /legacy_robots.txt through the normal page system with slug ["robots.txt"]. Is this intentional? Typically, robots.txt should be served at /robots.txt, which the App Router file (src/app/robots.js) already handles. What's the purpose of this legacy file?
5. Potential empty string handling 🟡
When splitting comma-separated values, empty strings might create unwanted entries:
// If allow = "", this creates [""]
allow: (ruleSet.allow || "").split(",").map((path) => `${path}`.trim())
Suggestion: Filter out empty values:
allow: (ruleSet.allow || "").split(",").map((path) => `${path}`.trim()).filter(Boolean)
6. Missing test coverage 🟡
Complex parsing logic without tests:
- parseRobotsToMetadata() - Multi-line parsing with state machine
- formatRuleSet() - Data transformation
- processRobotsTxtContent() - Format branching logic
Recommendation: Add unit tests for these utility functions.
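A minimal Jest-style test could look like the sketch below; the import path is assumed, and the assertions are deliberately loose because the parser's exact output shape (string vs. array fields) isn't pinned down in this PR:
import { parseRobotsToMetadata } from "@/lib/data/common/seo"; // path assumed

describe("parseRobotsToMetadata", () => {
  it("extracts the sitemap directive", () => {
    const result = parseRobotsToMetadata(
      "User-agent: *\nDisallow: /admin\nSitemap: https://example.com/sitemap.xml",
    );
    // toContain passes whether sitemap comes back as a string or an array of strings
    expect(result.sitemap).toContain("https://example.com/sitemap.xml");
  });

  it("handles empty input without throwing", () => {
    expect(() => parseRobotsToMetadata("")).not.toThrow();
  });
});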
Minor Issues
7. Typo in PR description
"varioust environments" → "various environments"
8. Template string unnecessary 🟢
File: apps/trustlab/src/lib/data/common/seo.js:66
userAgent: (ruleSet?.userAgent || "*")
.split(",")
.map((agent) => `${agent}`.trim()), // ${agent} is already a string
The template literal wrapper ${agent} is redundant since agent is already a string from split. Can be simplified to .map((agent) => agent.trim())
Questions
- Default behavior: With format defaulting to "text" and default content "User-agent: *\nDisallow: /", all crawling is blocked by default. Is this intentional for all environments? (See the sketch after this list.)
- Sitemap handling: The structured format supports sitemap URLs, but how should comma-separated sitemaps be rendered in the final robots.txt output?
- Host directive: The Host directive is non-standard and only supported by Yandex. Is this intentional?
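To make the first question concrete, here are the two metadata shapes at stake (illustrative only; the permissive variant is a hypothetical alternative, not something this PR ships):
// What the shipped default "User-agent: *\nDisallow: /" amounts to — every crawler blocked everywhere:
const blockingDefault = { rules: { userAgent: "*", disallow: "/" } };

// A hypothetical production default that allows crawling instead:
const permissiveDefault = { rules: { userAgent: "*", allow: "/" } };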
Positive Aspects ✅
- Good separation of concerns - Parser logic separated from CMS field definitions
- Flexible input formats - Supports both text and structured input
- Comprehensive regex patterns - Handles various robots.txt directives (including cache-delay, visit-time)
- User-friendly CMS interface - Clear field descriptions and helpful placeholders
- App Router migration - Uses modern Next.js patterns
Recommendation
Request Changes - The PR needs the following before merge:
- Remove console.log (critical)
- Fix the || [] logic error (critical)
- Add consistent optional chaining in formatRuleSet() (important)
- Clarify purpose of legacy_robots.txt.js or remove if unnecessary
- Filter empty strings from comma-separated lists
- Fix typo in PR description
- Consider adding unit tests for parsing logic
🤖 Generated with Claude Code
For context @koechkevin, the above is not my review. It's from @claude.
…emove legacy file
kilemensi
left a comment
👍🏽 @koechkevin
- We shouldn't support "Structured object (Next.js RobotsFile)". I don't think any content editor will even know what that is.
- Support all robots.txt fields supported by the Google crawler, e.g. sitemap is missing. If we can support more fields, all good with me, but the core ones must be there.
- Validate the robots.txt file when saving in the CMS so that if it has any errors, content editors can correct them, e.g. the one shown in the screenshot below shouldn't be accepted because boom and bam are not valid fields. (See the sketch after this list.)
- Use your tools! claude, opencode, etc. still show the code can be improved a bit.
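On the validation point, a field-level validate function along these lines could reject unknown directives at save time (a sketch only; the accepted directive list and error wording are assumptions, and the field it attaches to would depend on the PR's CMS config):
// Directives the CMS will accept; anything else fails validation (list is an assumption)
const KNOWN_DIRECTIVES = ["user-agent", "allow", "disallow", "crawl-delay", "sitemap", "host"];

const validateRobotsTxt = (value) => {
  const unknown = String(value || "")
    .split("\n")
    .map((line) => line.replace(/#.*$/, "").trim()) // strip comments and whitespace
    .filter((line) => line.includes(":"))
    .map((line) => line.slice(0, line.indexOf(":")).trim().toLowerCase())
    .filter((key) => key && !KNOWN_DIRECTIVES.includes(key));
  // Payload field validation: return true when valid, or an error string shown to the editor
  return unknown.length === 0 || `Unknown robots.txt directive(s): ${[...new Set(unknown)].join(", ")}`;
};

// e.g. attach it to the text field: { name: "textContent", type: "textarea", validate: validateRobotsTxt }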
PS: Default robots.txt should always be: Nothing more, nothing less.
};
}

export async function getServerSideProps(context) {
Is this being used anywhere?
admin: {
  language: "plaintext",
  rows: 14,
  description: "Paste the exact robots.txt text to serve.",
Suggested change:
- description: "Paste the exact robots.txt text to serve.",
+ description: "Enter the exact robots.txt text to serve.",
Why paste?
Again, from the bots:
Looking at your code and the Next.js robots object interface, there's a mismatch. The Next.js robots metadata object expects a specific shape like:
type Robots = {
rules: Rule | Rule[]
sitemap?: string | string[]
host?: string
}
type Rule = {
userAgent?: string | string[]
allow?: string | string[]
disallow?: string | string[]
crawlDelay?: number
}
Your parser is fairly close but has some unnecessary complexity for this use case. A few observations:
- cacheDelay, visitTime, cleanParam — Next.js doesn't use these, so they're dead weight if your only goal is producing a Next.js-compatible object.
- The regex-per-directive approach is overkill when you're already doing a keyLower comparison — just compare strings directly.
- appendDirectiveValue / appendUserAgent add complexity to handle the "single value vs array" duality, but you could simplify by always collecting into arrays and flattening at the end.
- The autoUserAgent / ensureRule pattern is a bit convoluted for handling implicit * groups.
Here's a leaner version focused specifically on producing a Next.js-compatible robots object:
/**
* Parses a robots.txt string into a Next.js-compatible robots metadata object.
* @see https://nextjs.org/docs/app/api-reference/file-conventions/metadata/robots#robots-object
*
* @param {string} raw - robots.txt content
* @returns {{ rules: Rule | Rule[], sitemap?: string | string[], host?: string }}
*/
export default function parseRobotsToNextJs(raw = "") {
const lines = String(raw).replace(/\r\n?/g, "\n").split("\n");
const groups = []; // { userAgent: string[], allow: string[], disallow: string[], crawlDelay?: number }
const sitemaps = [];
let host = null;
let cur = null;
let curHasDirectives = false;
const ensureGroup = () => {
if (!cur) {
cur = { userAgent: ["*"], allow: [], disallow: [] };
groups.push(cur);
curHasDirectives = false;
}
return cur;
};
for (const rawLine of lines) {
const line = rawLine.replace(/#.*$/, "").trim();
if (!line) continue;
const sep = line.indexOf(":");
if (sep === -1) continue;
const key = line.slice(0, sep).trim().toLowerCase();
const val = line.slice(sep + 1).trim();
switch (key) {
case "sitemap":
if (val) sitemaps.push(val);
break;
case "host":
if (val) host = val;
break;
case "user-agent":
if (!cur || curHasDirectives) {
cur = { userAgent: [val || "*"], allow: [], disallow: [] };
groups.push(cur);
curHasDirectives = false;
} else {
cur.userAgent.push(val || "*");
}
break;
case "allow":
ensureGroup().allow.push(val);
curHasDirectives = true;
break;
case "disallow":
ensureGroup().disallow.push(val);
curHasDirectives = true;
break;
case "crawl-delay": {
const n = Number(val);
if (Number.isFinite(n)) ensureGroup().crawlDelay = n;
curHasDirectives = true;
break;
}
// Ignore unknown directives silently
}
}
// Collapse single-element arrays to plain values for Next.js compat
const unwrap = (arr) => (arr.length === 1 ? arr[0] : arr.length ? arr : undefined);
const rules = groups.map((g) => {
const rule = { userAgent: unwrap(g.userAgent) };
const allow = unwrap(g.allow);
const disallow = unwrap(g.disallow);
if (allow !== undefined) rule.allow = allow;
if (disallow !== undefined) rule.disallow = disallow;
if (g.crawlDelay !== undefined) rule.crawlDelay = g.crawlDelay;
return rule;
});
const result = { rules: rules.length === 1 ? rules[0] : rules };
const sm = unwrap(sitemaps);
if (sm !== undefined) result.sitemap = sm;
if (host) result.host = host;
return result;
}
Key improvements over your original:
- ~60% less code by dropping unused directives and regex objects in favor of a simple switch on the lowercased key string.
- Always collects into arrays first, then unwraps at the end — no juggling between scalar/array mid-parse.
- Strips inline comments (# ...) from lines, which your original didn't handle.
- Output matches the Next.js shape exactly — rules is a single Rule when there's one group, otherwise Rule[]; same for sitemap.
- No temporary markers like autoUserAgent that need cleanup.
If you also need diagnostics for debugging malformed files, you could add that back as an opt-in, but for a "parse robots.txt → Next.js metadata" utility, this is all you need.
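For illustration, feeding the sketch above a small robots.txt produces output like this (the values shown are simply what that code returns for this input):
const robots = parseRobotsToNextJs(
  "User-agent: *\nAllow: /\nCrawl-delay: 5\n\nSitemap: https://example.com/sitemap.xml",
);
// => {
//   rules: { userAgent: "*", allow: "/", crawlDelay: 5 },
//   sitemap: "https://example.com/sitemap.xml",
// }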
};
}

export async function getRobotsTxtContent() {
Since we're parsing content to a Robots object, shouldn't this be:
Suggested change:
- export async function getRobotsTxtContent() {
+ export async function getRobots() {
},
{
  name: "robotsTxt",
  label: "robots.txt content",
Suggested change:
- label: "robots.txt content",
+ label: "robots.txt",
Description
This PR adds environment-specific robots.txt configuration for TrustLab, allowing different crawler policies for various environments.
Fixes #1394
Type of change
Please delete options that are not relevant.
Screenshots
Checklist: