Skip to content

Conversation

jcobis
Copy link
Collaborator

@jcobis jcobis commented Sep 11, 2025

Description

Implement the mongosh script generation engine that creates an executable script based on the faker schema. The script will generate and upload mock data.

Checklist

  • New tests and/or benchmarks are included
  • Documentation is changed or added
  • If this change updates the UI, screenshots/videos are added and a design review is requested
  • I have signed the MongoDB Contributor License Agreement (https://www.mongodb.com/legal/contributor-agreement)

Types of changes

  • Backport Needed
  • Patch (non-breaking change which fixes an issue)
  • Minor (non-breaking change which adds functionality)
  • Major (fix or feature that would cause existing functionality to change)

@Copilot Copilot AI review requested due to automatic review settings September 11, 2025 14:35
@jcobis jcobis requested a review from a team as a code owner September 11, 2025 14:35
@jcobis jcobis requested a review from johnjackweir September 11, 2025 14:35
@github-actions github-actions bot added the feat label Sep 11, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a mongosh script generation engine that creates executable scripts for generating and uploading mock data based on faker schemas. The engine supports complex nested structures, arrays, probabilistic fields, and configurable array lengths.

  • Core script generation functionality with support for nested objects, arrays, and multi-dimensional arrays
  • Faker.js integration with argument handling and fallback methods for unrecognized field types
  • Probabilistic field generation for optional data modeling

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts Implements the complete script generation engine with type definitions, path parsing, document structure building, and code generation
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.spec.ts Comprehensive test suite covering all functionality including edge cases, array handling, probability features, and faker argument processing

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@jcobis jcobis added the no release notes Fix or feature not for release notes label Sep 11, 2025
@jcobis jcobis requested review from kpamaran and ncarbon September 11, 2025 16:02
@mongodb-js mongodb-js deleted a comment from Copilot AI Sep 11, 2025
@mongodb-js mongodb-js deleted a comment from Copilot AI Sep 11, 2025
const { faker } = require('@faker-js/faker');

// Connect to database
use('${escapedDbName}');
Copy link
Collaborator

@kpamaran kpamaran Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if in addition the database name should supplied within the connection string

const RUN_SCRIPT_COMMAND = `
mongosh "mongodb+srv://<your-cluster>.mongodb.net/<your-database>" \\
  --username <your-username> \\
  --password "<your-password>" \\
  mockdatascript.js
`;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, if the user is providing the database name in the connection string, there is no need to call use('${escapedDbName}')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do include a use() call regardless of specifying the database in the connection string 🙂 It's quite easy (and intended to be easy) for shell users to end up using a connection string that was copied from some external source instead of the default one

const { faker } = require('@faker-js/faker');

// Connect to database
use('${escapedDbName}');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to CodeQL's comments above, if you want to safely insert a string into JS code, always just use JSON.stringify(), i.e.

Suggested change
use('${escapedDbName}');
use(${JSON.stringify(escapedDbName)});

(for objects, you can use JSON.stringify() as well if you can ensure that none of the keys can be '__proto__', otherwise you need something like ... (JSON.parse(${JSON.stringify(JSON.stringify(obj))})) ... – doesn't seem to matter here, though)

}

/**
* Converts faker arguments to JavaScript code
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this docstring more descriptive? It's not clear to be how or why the faker arguments are converted to JS code when the return type is a string.

/**
* Gets default faker method for unrecognized fields based on MongoDB type
*/
export function getDefaultFakerMethod(mongoType: string): string {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, thanks for doing this in this PR. I assume this will be the default type -> faker method mapping we'll be using?

Wouldn't hurt to include a link to the faker documentation in the docstring for reference.

// Generated for collection: ${escapedDbName}.${escapedCollectionName}
// Document count: ${options.documentCount}

const { faker } = require('@faker-js/faker');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably got an error about importing this CommonJS module as ESM, but you can add a TS ignore annotation and import it the ESM way:
import { faker } from '@faker-js/faker/locale/en';

See this comment in #7295 for more details.

Also, notice we're only accessing the en locale, as suggested here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is in the script, so not our codebase. So I don't think we're really importing this in this file actually, unless I'm misunderstanding you

}

// Insert documents into collection
db.getCollection('${escapedCollectionName}').insertMany(documents);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remind me if we have a limit on documentCount? I assume this will be validated in the user input, but just curious. The insertMany will batch the inserts if they're over 100k, but we should still ensure we place a limit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, a benefit of doing our own batching is showing a progress message to the user.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we could validate that on the document output specification screen

Copy link
Collaborator

@kpamaran kpamaran Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite comprehensive so far, I've reviewed all the tests 👍🏼

Tests I don't see:

  • database / collection name escaping (handling single quotes, backticks)

It'd be excellent to assert if every script fakerjs object within the script is a runnable expression, eval(...) is a cheap way to do so and should be fine in tests

}
});

it('should use default faker method for unrecognized number fields', () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • could use a for-loop to assert the same for int32 and int64
  • likewise for the decimal128 on the other test as well

expect(result.success).to.equal(true);
if (result.success) {
const expectedReturnBlock = `return {
matrix: Array.from({length: 3}, () => Array.from({length: 3}, () => faker.number.int()))
Copy link
Collaborator

@kpamaran kpamaran Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice ✨

if (probability < 1.0) {
// Use Math.random for conditional field inclusion
rootLevelFields.push(
`${fieldIndent}...(Math.random() < ${probability} ? { ${fieldName}: ${fakerCall} } : {})`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why include the empty object? vs simply leaving rootLevelFields as is

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this isn't the most readable but it's conditional rendering using the spread operator ..., so if the probability condition is not met, nothing gets rendered (i.e no empty object). I can add a comment, or potentially refactor this to make it clearer. The advantage of this notation is that it's one line, and there may be many of these

/**
* Gets default faker method for unrecognized fields based on MongoDB type
*/
export function getDefaultFakerMethod(mongoType: string): string {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to cover any more of the allowed datatypes?

some of these would be easy like null, Long, timestamp

if ('mongoType' in elementType) {
// Array of primitives
const fakerCall = generateFakerCall(elementType as FieldMapping);
return `Array.from({length: ${arrayLength}}, () => ${fakerCall})`;

Check warning

Code scanning / CodeQL

Improper code sanitization Medium

Code construction depends on an
improperly sanitized value
.

Copilot Autofix

AI about 21 hours ago

To fix the vulnerability, we need to additionally sanitize string arguments beyond simply calling JSON.stringify. The recommended approach is to escape potentially dangerous characters in the string output before including them in the generated code. Specifically, after stringifying the argument, we should replace any characters that can break out of string or script contexts—such as <, >, /, \, as well as special whitespace and control characters—with their safe, escaped equivalents, as shown in the example code above using a character map and a replacement function. The best fix is to introduce a helper function (escapeUnsafeChars) in script-generation-utils.ts and use it whenever we stringify string arguments in formatFakerArgs, by calling escapeUnsafeChars(JSON.stringify(arg)) instead of just JSON.stringify(arg).

We should insert the helper function (e.g., after the other utility functions), and update the relevant line(s) in formatFakerArgs to apply the escaping. No imports are needed as the escaping logic is internal.


Suggested changeset 1
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts b/packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
--- a/packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
+++ b/packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
@@ -505,7 +505,7 @@
     const arg = fakerArgs[i];
 
     if (typeof arg === 'string') {
-      stringifiedArgs.push(JSON.stringify(arg));
+      stringifiedArgs.push(escapeUnsafeChars(JSON.stringify(arg)));
     } else if (typeof arg === 'number') {
       if (!Number.isFinite(arg)) {
         throw new Error(
@@ -528,3 +528,25 @@
 
   return stringifiedArgs.join(', ');
 }
+
+/**
+ * Escape potentially dangerous characters when generating code
+ */
+const charMap: Record<string, string> = {
+  '<': '\\u003C',
+  '>': '\\u003E',
+  '/': '\\u002F',
+  '\\': '\\\\',
+  '\b': '\\b',
+  '\f': '\\f',
+  '\n': '\\n',
+  '\r': '\\r',
+  '\t': '\\t',
+  '\0': '\\0',
+  '\u2028': '\\u2028',
+  '\u2029': '\\u2029'
+};
+
+function escapeUnsafeChars(str: string): string {
+  return str.replace(/[<>\b\f\n\r\t\0\u2028\u2029/\\]/g, (x) => charMap[x] || x);
+}
EOF
@@ -505,7 +505,7 @@
const arg = fakerArgs[i];

if (typeof arg === 'string') {
stringifiedArgs.push(JSON.stringify(arg));
stringifiedArgs.push(escapeUnsafeChars(JSON.stringify(arg)));
} else if (typeof arg === 'number') {
if (!Number.isFinite(arg)) {
throw new Error(
@@ -528,3 +528,25 @@

return stringifiedArgs.join(', ');
}

/**
* Escape potentially dangerous characters when generating code
*/
const charMap: Record<string, string> = {
'<': '\\u003C',
'>': '\\u003E',
'/': '\\u002F',
'\\': '\\\\',
'\b': '\\b',
'\f': '\\f',
'\n': '\\n',
'\r': '\\r',
'\t': '\\t',
'\0': '\\0',
'\u2028': '\\u2028',
'\u2029': '\\u2029'
};

function escapeUnsafeChars(str: string): string {
return str.replace(/[<>\b\f\n\r\t\0\u2028\u2029/\\]/g, (x) => charMap[x] || x);
}
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
arrayLengthMap,
dimensionIndex + 1 // Next dimension
);
return `Array.from({length: ${arrayLength}}, () => ${nestedArrayCode})`;

Check warning

Code scanning / CodeQL

Improper code sanitization Medium

Code construction depends on an
improperly sanitized value
.

Copilot Autofix

AI about 21 hours ago

The best way to mitigate the risk is to apply additional character escaping to the output of JSON.stringify for all string arguments before the generated string is inserted into JavaScript code. This involves writing a utility function to escape problematic JavaScript characters such as <, >, /, \u2028, and \u2029 (and others specified in the background). This function should be used in formatFakerArgs on any argument of type string (i.e., after calling JSON.stringify) and also on any code that is the result of user-provided or dynamic input in paths leading to the code generation sink. The update can be made entirely within script-generation-utils.ts: first, by adding the escaping utility, and second, by modifying formatFakerArgs to escape unsafe characters in string arguments prior to code emission.


Suggested changeset 1
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts b/packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
--- a/packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
+++ b/packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
@@ -498,6 +498,26 @@
  * formatFakerArgs(['male', 25, true]) // Returns: '"male", 25, true'
  * formatFakerArgs([{json: '{"min": 1}'}]) // Returns: '{"min": 1}'
  */
+// Escapes unsafe characters in a JS string to prevent code injection when constructing JS code.
+const unsafeJsCharMap: Record<string, string> = {
+  '<': '\\u003C',
+  '>': '\\u003E',
+  '/': '\\u002F',
+  '\\': '\\\\',
+  '\b': '\\b',
+  '\f': '\\f',
+  '\n': '\\n',
+  '\r': '\\r',
+  '\t': '\\t',
+  '\0': '\\0',
+  '\u2028': '\\u2028',
+  '\u2029': '\\u2029'
+};
+
+function escapeUnsafeJsChars(str: string): string {
+  return str.replace(/[<>\b\f\n\r\t\0\u2028\u2029/\\]/g, (x) => unsafeJsCharMap[x] || x);
+}
+
 export function formatFakerArgs(fakerArgs: FakerArg[]): string {
   const stringifiedArgs: string[] = [];
 
@@ -505,7 +525,8 @@
     const arg = fakerArgs[i];
 
     if (typeof arg === 'string') {
-      stringifiedArgs.push(JSON.stringify(arg));
+      // Stringify AND then escape unsafe JS chars (see security recommendation).
+      stringifiedArgs.push(escapeUnsafeJsChars(JSON.stringify(arg)));
     } else if (typeof arg === 'number') {
       if (!Number.isFinite(arg)) {
         throw new Error(
EOF
@@ -498,6 +498,26 @@
* formatFakerArgs(['male', 25, true]) // Returns: '"male", 25, true'
* formatFakerArgs([{json: '{"min": 1}'}]) // Returns: '{"min": 1}'
*/
// Escapes unsafe characters in a JS string to prevent code injection when constructing JS code.
const unsafeJsCharMap: Record<string, string> = {
'<': '\\u003C',
'>': '\\u003E',
'/': '\\u002F',
'\\': '\\\\',
'\b': '\\b',
'\f': '\\f',
'\n': '\\n',
'\r': '\\r',
'\t': '\\t',
'\0': '\\0',
'\u2028': '\\u2028',
'\u2029': '\\u2029'
};

function escapeUnsafeJsChars(str: string): string {
return str.replace(/[<>\b\f\n\r\t\0\u2028\u2029/\\]/g, (x) => unsafeJsCharMap[x] || x);
}

export function formatFakerArgs(fakerArgs: FakerArg[]): string {
const stringifiedArgs: string[] = [];

@@ -505,7 +525,8 @@
const arg = fakerArgs[i];

if (typeof arg === 'string') {
stringifiedArgs.push(JSON.stringify(arg));
// Stringify AND then escape unsafe JS chars (see security recommendation).
stringifiedArgs.push(escapeUnsafeJsChars(JSON.stringify(arg)));
} else if (typeof arg === 'number') {
if (!Number.isFinite(arg)) {
throw new Error(
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
indent,
nestedArrayLengthMap
);
return `Array.from({length: ${arrayLength}}, () => (${objectCode}))`;

Check warning

Code scanning / CodeQL

Improper code sanitization Medium

Code construction depends on an
improperly sanitized value
.

Copilot Autofix

AI about 21 hours ago

General fix: Any string that is incorporated into generated JavaScript as code (such as as object property names or string arguments) must be escaped to avoid breaking out of literal or code context, even after JSON.stringify. This can be done with a function that replaces dangerous characters (<, >, /, etc.) with their Unicode escape equivalents.

Best detailed fix:

  • Implement a JavaScript/TypeScript utility function (e.g., escapeUnsafeJsChars) that escapes all relevant characters as per the REMEDIATION pattern.
  • Ensure that everywhere JSON.stringify(arg) is used on a string in user input, the result is passed through this function before being embedded in generated code.
  • Specifically, in formatFakerArgs, call the escape function on the JSON-stringified value of string arguments.
  • Add tests/validation for the escaping to guarantee codegen output remains valid and safe.
  • These changes should only be applied within this file and function—on the use of stringifiedArgs.push(JSON.stringify(arg));—to minimize any breakage elsewhere.

What is needed:

  • A utility escape function (copied from the example, possibly tweaked for TS).
  • The call site(s) in formatFakerArgs (line 508) should wrap the JSON.stringify(arg) result in the escape function.
  • Add relevant imports/definitions in the file.
Suggested changeset 1
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts b/packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
--- a/packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
+++ b/packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
@@ -1,5 +1,25 @@
 export type FakerArg = string | number | boolean | { json: string };
 
+// Escape dangerous JS characters in generated code
+const charMap: { [key: string]: string } = {
+  '<': '\\u003C',
+  '>': '\\u003E',
+  '/': '\\u002F',
+  '\\': '\\\\',
+  '\b': '\\b',
+  '\f': '\\f',
+  '\n': '\\n',
+  '\r': '\\r',
+  '\t': '\\t',
+  '\0': '\\0',
+  '\u2028': '\\u2028',
+  '\u2029': '\\u2029'
+};
+
+function escapeUnsafeJsChars(str: string): string {
+  return str.replace(/[<>\b\f\n\r\t\0\u2028\u2029/\\]/g, (x) => charMap[x] || x);
+}
+
 const DEFAULT_ARRAY_LENGTH = 3;
 const INDENT_SIZE = 2;
 
@@ -505,7 +525,7 @@
     const arg = fakerArgs[i];
 
     if (typeof arg === 'string') {
-      stringifiedArgs.push(JSON.stringify(arg));
+      stringifiedArgs.push(escapeUnsafeJsChars(JSON.stringify(arg)));
     } else if (typeof arg === 'number') {
       if (!Number.isFinite(arg)) {
         throw new Error(
EOF
@@ -1,5 +1,25 @@
export type FakerArg = string | number | boolean | { json: string };

// Escape dangerous JS characters in generated code
const charMap: { [key: string]: string } = {
'<': '\\u003C',
'>': '\\u003E',
'/': '\\u002F',
'\\': '\\\\',
'\b': '\\b',
'\f': '\\f',
'\n': '\\n',
'\r': '\\r',
'\t': '\\t',
'\0': '\\0',
'\u2028': '\\u2028',
'\u2029': '\\u2029'
};

function escapeUnsafeJsChars(str: string): string {
return str.replace(/[<>\b\f\n\r\t\0\u2028\u2029/\\]/g, (x) => charMap[x] || x);
}

const DEFAULT_ARRAY_LENGTH = 3;
const INDENT_SIZE = 2;

@@ -505,7 +525,7 @@
const arg = fakerArgs[i];

if (typeof arg === 'string') {
stringifiedArgs.push(JSON.stringify(arg));
stringifiedArgs.push(escapeUnsafeJsChars(JSON.stringify(arg)));
} else if (typeof arg === 'number') {
if (!Number.isFinite(arg)) {
throw new Error(
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat no release notes Fix or feature not for release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants