-
Notifications
You must be signed in to change notification settings - Fork 229
feat(compass-collection): Mongosh Script Generation Engine - CLOUDP-333860 #7313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements a mongosh script generation engine that creates executable scripts for generating and uploading mock data based on faker schemas. The engine supports complex nested structures, arrays, probabilistic fields, and configurable array lengths.
- Core script generation functionality with support for nested objects, arrays, and multi-dimensional arrays
- Faker.js integration with argument handling and fallback methods for unrecognized field types
- Probabilistic field generation for optional data modeling
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts | Implements the complete script generation engine with type definitions, path parsing, document structure building, and code generation |
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.spec.ts | Comprehensive test suite covering all functionality including edge cases, array handling, probability features, and faker argument processing |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Fixed
Show fixed
Hide fixed
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Fixed
Show fixed
Hide fixed
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Fixed
Show fixed
Hide fixed
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Fixed
Show fixed
Hide fixed
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Fixed
Show fixed
Hide fixed
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Show resolved
Hide resolved
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Outdated
Show resolved
Hide resolved
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Outdated
Show resolved
Hide resolved
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Show resolved
Hide resolved
const { faker } = require('@faker-js/faker'); | ||
|
||
// Connect to database | ||
use('${escapedDbName}'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if in addition the database name should supplied within the connection string
const RUN_SCRIPT_COMMAND = `
mongosh "mongodb+srv://<your-cluster>.mongodb.net/<your-database>" \\
--username <your-username> \\
--password "<your-password>" \\
mockdatascript.js
`;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, if the user is providing the database name in the connection string, there is no need to call use('${escapedDbName}')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do include a use()
call regardless of specifying the database in the connection string 🙂 It's quite easy (and intended to be easy) for shell users to end up using a connection string that was copied from some external source instead of the default one
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Outdated
Show resolved
Hide resolved
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Outdated
Show resolved
Hide resolved
const { faker } = require('@faker-js/faker'); | ||
|
||
// Connect to database | ||
use('${escapedDbName}'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to CodeQL's comments above, if you want to safely insert a string into JS code, always just use JSON.stringify()
, i.e.
use('${escapedDbName}'); | |
use(${JSON.stringify(escapedDbName)}); |
(for objects, you can use JSON.stringify()
as well if you can ensure that none of the keys can be '__proto__'
, otherwise you need something like ... (JSON.parse(${JSON.stringify(JSON.stringify(obj))})) ...
– doesn't seem to matter here, though)
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Outdated
Show resolved
Hide resolved
} | ||
|
||
/** | ||
* Converts faker arguments to JavaScript code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this docstring more descriptive? It's not clear to be how or why the faker arguments are converted to JS code when the return type is a string.
/** | ||
* Gets default faker method for unrecognized fields based on MongoDB type | ||
*/ | ||
export function getDefaultFakerMethod(mongoType: string): string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, thanks for doing this in this PR. I assume this will be the default type -> faker method mapping we'll be using?
Wouldn't hurt to include a link to the faker documentation in the docstring for reference.
// Generated for collection: ${escapedDbName}.${escapedCollectionName} | ||
// Document count: ${options.documentCount} | ||
|
||
const { faker } = require('@faker-js/faker'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably got an error about importing this CommonJS module as ESM, but you can add a TS ignore annotation and import it the ESM way:
import { faker } from '@faker-js/faker/locale/en';
See this comment in #7295 for more details.
Also, notice we're only accessing the en
locale, as suggested here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is in the script, so not our codebase. So I don't think we're really importing this in this file actually, unless I'm misunderstanding you
} | ||
|
||
// Insert documents into collection | ||
db.getCollection('${escapedCollectionName}').insertMany(documents); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remind me if we have a limit on documentCount
? I assume this will be validated in the user input, but just curious. The insertMany
will batch the inserts if they're over 100k, but we should still ensure we place a limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although, a benefit of doing our own batching is showing a progress message to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking we could validate that on the document output specification screen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite comprehensive so far, I've reviewed all the tests 👍🏼
Tests I don't see:
- database / collection name escaping (handling single quotes, backticks)
It'd be excellent to assert if every script fakerjs object within the script is a runnable expression, eval(...) is a cheap way to do so and should be fine in tests
} | ||
}); | ||
|
||
it('should use default faker method for unrecognized number fields', () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- could use a for-loop to assert the same for int32 and int64
- likewise for the decimal128 on the other test as well
expect(result.success).to.equal(true); | ||
if (result.success) { | ||
const expectedReturnBlock = `return { | ||
matrix: Array.from({length: 3}, () => Array.from({length: 3}, () => faker.number.int())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice ✨
if (probability < 1.0) { | ||
// Use Math.random for conditional field inclusion | ||
rootLevelFields.push( | ||
`${fieldIndent}...(Math.random() < ${probability} ? { ${fieldName}: ${fakerCall} } : {})` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why include the empty object? vs simply leaving rootLevelFields as is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this isn't the most readable but it's conditional rendering using the spread operator ...
, so if the probability condition is not met, nothing gets rendered (i.e no empty object). I can add a comment, or potentially refactor this to make it clearer. The advantage of this notation is that it's one line, and there may be many of these
/** | ||
* Gets default faker method for unrecognized fields based on MongoDB type | ||
*/ | ||
export function getDefaultFakerMethod(mongoType: string): string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to cover any more of the allowed datatypes?
some of these would be easy like null, Long, timestamp
if ('mongoType' in elementType) { | ||
// Array of primitives | ||
const fakerCall = generateFakerCall(elementType as FieldMapping); | ||
return `Array.from({length: ${arrayLength}}, () => ${fakerCall})`; |
Check warning
Code scanning / CodeQL
Improper code sanitization Medium
improperly sanitized value
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI about 21 hours ago
To fix the vulnerability, we need to additionally sanitize string arguments beyond simply calling JSON.stringify
. The recommended approach is to escape potentially dangerous characters in the string output before including them in the generated code. Specifically, after stringifying the argument, we should replace any characters that can break out of string or script contexts—such as <
, >
, /
, \
, as well as special whitespace and control characters—with their safe, escaped equivalents, as shown in the example code above using a character map and a replacement function. The best fix is to introduce a helper function (escapeUnsafeChars
) in script-generation-utils.ts
and use it whenever we stringify string arguments in formatFakerArgs
, by calling escapeUnsafeChars(JSON.stringify(arg))
instead of just JSON.stringify(arg)
.
We should insert the helper function (e.g., after the other utility functions), and update the relevant line(s) in formatFakerArgs
to apply the escaping. No imports are needed as the escaping logic is internal.
-
Copy modified line R508 -
Copy modified lines R531-R552
@@ -505,7 +505,7 @@ | ||
const arg = fakerArgs[i]; | ||
|
||
if (typeof arg === 'string') { | ||
stringifiedArgs.push(JSON.stringify(arg)); | ||
stringifiedArgs.push(escapeUnsafeChars(JSON.stringify(arg))); | ||
} else if (typeof arg === 'number') { | ||
if (!Number.isFinite(arg)) { | ||
throw new Error( | ||
@@ -528,3 +528,25 @@ | ||
|
||
return stringifiedArgs.join(', '); | ||
} | ||
|
||
/** | ||
* Escape potentially dangerous characters when generating code | ||
*/ | ||
const charMap: Record<string, string> = { | ||
'<': '\\u003C', | ||
'>': '\\u003E', | ||
'/': '\\u002F', | ||
'\\': '\\\\', | ||
'\b': '\\b', | ||
'\f': '\\f', | ||
'\n': '\\n', | ||
'\r': '\\r', | ||
'\t': '\\t', | ||
'\0': '\\0', | ||
'\u2028': '\\u2028', | ||
'\u2029': '\\u2029' | ||
}; | ||
|
||
function escapeUnsafeChars(str: string): string { | ||
return str.replace(/[<>\b\f\n\r\t\0\u2028\u2029/\\]/g, (x) => charMap[x] || x); | ||
} |
arrayLengthMap, | ||
dimensionIndex + 1 // Next dimension | ||
); | ||
return `Array.from({length: ${arrayLength}}, () => ${nestedArrayCode})`; |
Check warning
Code scanning / CodeQL
Improper code sanitization Medium
improperly sanitized value
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI about 21 hours ago
The best way to mitigate the risk is to apply additional character escaping to the output of JSON.stringify
for all string arguments before the generated string is inserted into JavaScript code. This involves writing a utility function to escape problematic JavaScript characters such as <
, >
, /
, \u2028
, and \u2029
(and others specified in the background). This function should be used in formatFakerArgs
on any argument of type string
(i.e., after calling JSON.stringify
) and also on any code that is the result of user-provided or dynamic input in paths leading to the code generation sink. The update can be made entirely within script-generation-utils.ts
: first, by adding the escaping utility, and second, by modifying formatFakerArgs
to escape unsafe characters in string arguments prior to code emission.
-
Copy modified lines R501-R520 -
Copy modified lines R528-R529
@@ -498,6 +498,26 @@ | ||
* formatFakerArgs(['male', 25, true]) // Returns: '"male", 25, true' | ||
* formatFakerArgs([{json: '{"min": 1}'}]) // Returns: '{"min": 1}' | ||
*/ | ||
// Escapes unsafe characters in a JS string to prevent code injection when constructing JS code. | ||
const unsafeJsCharMap: Record<string, string> = { | ||
'<': '\\u003C', | ||
'>': '\\u003E', | ||
'/': '\\u002F', | ||
'\\': '\\\\', | ||
'\b': '\\b', | ||
'\f': '\\f', | ||
'\n': '\\n', | ||
'\r': '\\r', | ||
'\t': '\\t', | ||
'\0': '\\0', | ||
'\u2028': '\\u2028', | ||
'\u2029': '\\u2029' | ||
}; | ||
|
||
function escapeUnsafeJsChars(str: string): string { | ||
return str.replace(/[<>\b\f\n\r\t\0\u2028\u2029/\\]/g, (x) => unsafeJsCharMap[x] || x); | ||
} | ||
|
||
export function formatFakerArgs(fakerArgs: FakerArg[]): string { | ||
const stringifiedArgs: string[] = []; | ||
|
||
@@ -505,7 +525,8 @@ | ||
const arg = fakerArgs[i]; | ||
|
||
if (typeof arg === 'string') { | ||
stringifiedArgs.push(JSON.stringify(arg)); | ||
// Stringify AND then escape unsafe JS chars (see security recommendation). | ||
stringifiedArgs.push(escapeUnsafeJsChars(JSON.stringify(arg))); | ||
} else if (typeof arg === 'number') { | ||
if (!Number.isFinite(arg)) { | ||
throw new Error( |
packages/compass-collection/src/components/mock-data-generator-modal/script-generation-utils.ts
Fixed
Show fixed
Hide fixed
indent, | ||
nestedArrayLengthMap | ||
); | ||
return `Array.from({length: ${arrayLength}}, () => (${objectCode}))`; |
Check warning
Code scanning / CodeQL
Improper code sanitization Medium
improperly sanitized value
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI about 21 hours ago
General fix: Any string that is incorporated into generated JavaScript as code (such as as object property names or string arguments) must be escaped to avoid breaking out of literal or code context, even after JSON.stringify
. This can be done with a function that replaces dangerous characters (<
, >
, /
, etc.) with their Unicode escape equivalents.
Best detailed fix:
- Implement a JavaScript/TypeScript utility function (e.g.,
escapeUnsafeJsChars
) that escapes all relevant characters as per the REMEDIATION pattern. - Ensure that everywhere
JSON.stringify(arg)
is used on a string in user input, the result is passed through this function before being embedded in generated code. - Specifically, in
formatFakerArgs
, call the escape function on the JSON-stringified value of string arguments. - Add tests/validation for the escaping to guarantee codegen output remains valid and safe.
- These changes should only be applied within this file and function—on the use of
stringifiedArgs.push(JSON.stringify(arg));
—to minimize any breakage elsewhere.
What is needed:
- A utility escape function (copied from the example, possibly tweaked for TS).
- The call site(s) in
formatFakerArgs
(line 508) should wrap theJSON.stringify(arg)
result in the escape function. - Add relevant imports/definitions in the file.
-
Copy modified lines R3-R22 -
Copy modified line R528
@@ -1,5 +1,25 @@ | ||
export type FakerArg = string | number | boolean | { json: string }; | ||
|
||
// Escape dangerous JS characters in generated code | ||
const charMap: { [key: string]: string } = { | ||
'<': '\\u003C', | ||
'>': '\\u003E', | ||
'/': '\\u002F', | ||
'\\': '\\\\', | ||
'\b': '\\b', | ||
'\f': '\\f', | ||
'\n': '\\n', | ||
'\r': '\\r', | ||
'\t': '\\t', | ||
'\0': '\\0', | ||
'\u2028': '\\u2028', | ||
'\u2029': '\\u2029' | ||
}; | ||
|
||
function escapeUnsafeJsChars(str: string): string { | ||
return str.replace(/[<>\b\f\n\r\t\0\u2028\u2029/\\]/g, (x) => charMap[x] || x); | ||
} | ||
|
||
const DEFAULT_ARRAY_LENGTH = 3; | ||
const INDENT_SIZE = 2; | ||
|
||
@@ -505,7 +525,7 @@ | ||
const arg = fakerArgs[i]; | ||
|
||
if (typeof arg === 'string') { | ||
stringifiedArgs.push(JSON.stringify(arg)); | ||
stringifiedArgs.push(escapeUnsafeJsChars(JSON.stringify(arg))); | ||
} else if (typeof arg === 'number') { | ||
if (!Number.isFinite(arg)) { | ||
throw new Error( |
Description
Implement the mongosh script generation engine that creates an executable script based on the faker schema. The script will generate and upload mock data.
Checklist
Types of changes