Skip to content

[CDRIVER-6017] Tweak visitor behavior when finding corrupt BSON #2019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

vector-of-bool
Copy link
Contributor

This is a minimal fix for some edge cases around BSON iteration/visitation and text validation. A larger change could be made to make the APIs more complete and easier to use, but this is enough to fix a bug related to UTF-8 validation.

We now invoke visit_corrupt in more cases where bson_visit_all cannot properly decode elements, rather than just stopping with the boolean error code.

This also adds an error code to bson_validate for encountering corruption. Previously, it would just set error->code = 0 and the message to corrupt BSON, which is counter-intuitive to those that expect error->code == 0 to indicate success.

See also: CDRIVER-4448

We now invoke `visit_corrupt` in more cases where `bson_visit_all`
cannot properly decode elements, rather than just stopping with the
boolean error code.
@vector-of-bool vector-of-bool requested a review from a team as a code owner May 20, 2025 20:40
@vector-of-bool vector-of-bool requested a review from kevinAlbs May 20, 2025 20:40
@kevinAlbs kevinAlbs requested a review from eramongodb May 21, 2025 18:22
*
* Passing this as a flag has no effect.
*/
BSON_VALIDATE_CORRUPT = (1 << 30),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reuse BSON_VALIDATE_UTF8 instead? It already "has no effect" as a behavior control flag and already describes the very UTF-8 validity checks which BSON_VALIDATE_CORRUPT is meant to describe (appears redundant).

s/BSON_VALIDATE_CORRUPT/BSON_VALIDATE_UTF8/g

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but VISIT_CORRUPT is now (as of this PR) also used for when a subdocument is invalid (previously this was also just return true;, meaning that bson_validate was also missing this case).

I considered adding a .visit_invalid_utf8 callback, which is potentially useful in general, but there's ambiguity in the semantics of "corruption", because the spec technically says that BSON strings are UTF-8, so a non-utf-8 string could arguably be considered corruption.

@@ -2025,6 +2025,7 @@ bson_iter_visit_all (bson_iter_t *iter, /* INOUT */
while (_bson_iter_next_internal (iter, 0, &key, &bson_type, &unsupported)) {
if (*key && !bson_utf8_validate (key, strlen (key), false)) {
iter->err_off = iter->off;
VISIT_CORRUPT (iter, data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we take this opportunity to extend test suite coverage of the .visit_corrupt callback function to account for these new checks? atm there only seems to be a single test case. Perhaps the VALIDATE_TEST macro can be extended to also check .visit_corrupt callback behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is a good idea.

Copy link
Collaborator

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change results in rejecting inserts that previously succeeded:

bson_t to_insert = BSON_INITIALIZER;
ASSERT (bson_append_utf8 (&to_insert, "v", 1, "a\xFF\b", 3));
bool ok = mongoc_collection_insert_one (coll, &to_insert, NULL, NULL, &error);
ASSERT_OR_PRINT (ok, error);
// Before PR : OK
// After PR  : invalid document for insert: corrupt BSON

Write helpers validate UTF-8 by default. However, this validation does not appear documented, and appears to have always been broken.

I would rather remove the UTF-8 validation in write helpers. I expect that would be a performance improvement, and remove the (low?) risk of breaking applications inserting invalid UTF-8.

Notably: UTF-8 key validation appeared to work (and is tested in PHP mongodb/mongo-php-driver#1819 (comment)).

It seems like most drivers do not validate UTF-8 on write. See survey + slack discussion.

Proposal:

  • Do not validate UTF-8 values (but validate keys) if BSON_VALIDATE_UTF8 is not set.
  • Remove BSON_VALIDATE_UTF8 from the default write helper flags (since UTF-8 validation never worked and the write helpers did not document expecting UTF-8 validation).
  • Do validate UTF-8 in bson_validate if BSON_VALIDATE_UTF8 is set (to match documented behavior of bson_validate).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants