Skip to content

Conversation

@ActuallyaDeviloper
Copy link

Hello, when writing my own parser (reader/writer) support, I noticed that the library incorrectly assumes that "string_view"s are nullterminated. They aren't (sometimes). Also the string length is sometimes recomputed when it's not really necessary.

I was gonna report an issue but then I decided to just fix it myself. Only problem is that I cannot test the other backends (which I do not intend to use) locally right now. Unsurprisingly, the installation scripts fail deep inside some third party dependencies. I do not want to debug that.

@gemini-code-assist
Copy link

Summary of Changes

Hello @ActuallyaDeviloper, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves critical issues across multiple backend serialization/deserialization libraries (Avro, Cap'n Proto, JSON, FlexBuf, YAML, XML) where std::string_view was incorrectly assumed to be null-terminated. The changes ensure robust and correct string handling by explicitly managing string lengths or converting to null-terminated strings when interacting with C-style APIs, thereby preventing potential runtime errors and improving data integrity.

Highlights

  • Avro Writer Null-Termination Fix: Ensured that std::string_view arguments passed as names to avro_value_add and avro_value_get_by_name are explicitly converted to std::string to guarantee null-termination, as required by the Avro C API.
  • Cap'n Proto Reader String View Handling: Modified the Cap'n Proto reader to correctly construct std::string_view for map keys using both cStr() and size() from capnp::Text, preventing reliance on implicit null-termination.
  • Cap'n Proto Writer String View Handling: Updated the Cap'n Proto writer to pass std::string_view arguments as kj::StringPtr (using {data(), size()}) to _parent->val_.set and _parent->val_.init methods, ensuring explicit length handling.
  • JSON Reader String Length Correction: Adjusted the JSON reader to use yyjson_get_len along with yyjson_get_str when creating std::string_view for map keys and std::string objects, ensuring accurate string length is always provided.
  • JSON Writer String Length Correction: Replaced yyjson_mut_strcpy with yyjson_mut_strncpy in the JSON writer when adding object fields and converting basic types, explicitly providing string length to prevent null-termination assumptions.
  • XML Writer String Copy Optimization: Optimized the to_string method in include/rfl/xml/Writer.hpp for std::string types to return a reference, avoiding an unnecessary and expensive string copy.
  • XML Writer Direct String Handling: Removed explicit c_str() calls in src/rfl/xml/Writer.cpp when interacting with pugi::xml_node methods, allowing std::string or std::string_view to be passed directly, leveraging potential overloads that handle string views efficiently.
  • YAML Writer String View Integration: Modified the YAML writer to pass std::string_view objects directly as keys to the YAML output stream, allowing the YAML library to handle string views correctly without assuming null-termination via data().
  • FlexBuf Writer Null-Termination Enforcement: Ensured std::string_view arguments for names in FlexBuf writer methods (Null, StartVector, StartMap) are explicitly converted to std::string to guarantee null-termination, as required by the underlying FlatBuffers C++ API.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request does a good job of addressing the incorrect assumption that std::string_view is null-terminated across various serialization backends. The changes correctly use size-aware constructors or create temporary std::string objects to ensure null-termination where C-style strings are required. This improves the robustness and correctness of the library.

However, I've found a few issues that need to be addressed:

  • In include/rfl/capnproto/Writer.hpp, there are a couple of bugs: one is a copy-paste error that uses _name instead of _var, and another is a change that will cause a compilation error.
  • In src/rfl/flexbuf/Writer.cpp, one of the function calls was missed in the refactoring and still incorrectly uses .data() on a string_view.
  • A UTF-8 BOM was added to include/rfl/json/Reader.hpp, which is generally discouraged and should be removed.

Once these issues are fixed, this will be a great improvement to the library.

@ActuallyaDeviloper ActuallyaDeviloper force-pushed the nulltermination branch 2 times, most recently from 86c4234 to 4693ccd Compare November 2, 2025 19:45
@liuzicheng1987
Copy link
Contributor

liuzicheng1987 commented Nov 2, 2025

@ActuallyaDeviloper, in practice, the string are always derived from here:

https://github.com/getml/reflect-cpp/blob/main/include/rfl/internal/StringLiteral.hpp

So actually, it is OK to assume that they are null-terminated. But where you can increase safety without generating unnecessary copies, I think it is a good idea.

@liuzicheng1987
Copy link
Contributor

@ActuallyaDeviloper , some of the tests are failing to compile. Could you have a look?

@liuzicheng1987
Copy link
Contributor

@ActuallyaDeviloper , out of curiosity: For which format did you want to write your own parser? Would you be willing to contribute it to the library?

@ActuallyaDeviloper
Copy link
Author

Hm, when I implemented my own parser this was not necessarily true anymore, as I was pumping my own "string_view" like data into the writer/reader.

IMHO, if the API says it takes a "string_view" but has some special undocumented preconditions then that is very brittle and bad API design.

At the very least it would need to be documented.

The (nice) alternative I see is to use a different type that does guarantee null termination, however I am not hyped about that because then I need to copy my strings in my parser for every backend.

I think the better alternative I is to accept the data copies for now and petition at the libraries of "avro" and "flexbuf" to add "string_view" APIs.

That said, I am all for removing copies. In fact, there is a lot of copies already anyway. Many backends seem to actually copy the string again into their internal data structure even though e.g. the JSON library would provide a non-copy API. Furthermore when trying to input some string "string_view" as value, it too is copied.

@ActuallyaDeviloper
Copy link
Author

@ActuallyaDeviloper , out of curiosity: For which format did you want to write your own parser? Would you be willing to contribute it to the library?

By writing a parser, I meant specializing the "Parser" class. :)
It's for serializing our internal data structures which do not/can not use std-containers, so I don't think I can contribute that.

@ActuallyaDeviloper
Copy link
Author

ActuallyaDeviloper commented Nov 2, 2025

I fixed the issue hopefully. This might take a couple rounds because unfortunately I can't compile/test most backends locally.

@ActuallyaDeviloper
Copy link
Author

ActuallyaDeviloper commented Nov 2, 2025

A crap, YAML is also failing because the API I am using is unreleased:
https://github.com/jbeder/yaml-cpp/blob/a83cd31548b19d50f3f983b069dceb4f4d50756d/include/yaml-cpp/emitter.h#L210.

The last release 0.8 is from 2023 and does not contain that.

@liuzicheng1987
Copy link
Contributor

@ActuallyaDeviloper , I agree, it is brittle design.

For Avro I don't really care that your changes introduce overhead. This format is slow AF anyway.

But flexbuf is one of the two fastest formats supported (the other one being msgpack). Once the code fully compiles, we should take a look at the benchmarks to see how much of an impact this makes. Then, we can make a decision how to proceed here.

@liuzicheng1987
Copy link
Contributor

A crap, YAML is also failing because the API I am using is unreleased: https://github.com/jbeder/yaml-cpp/blob/a83cd31548b19d50f3f983b069dceb4f4d50756d/include/yaml-cpp/emitter.h#L210.

The last release 0.8 is from 2023 and does not contain that.

Yeah, YAML is fine as well. This is the slowest format we have and no one who uses YAML really cares about speed.

…t is really not.

Also avoid recomputing the string length for APIs that allow that.
@ActuallyaDeviloper
Copy link
Author

A crap, YAML is also failing because the API I am using is unreleased: https://github.com/jbeder/yaml-cpp/blob/a83cd31548b19d50f3f983b069dceb4f4d50756d/include/yaml-cpp/emitter.h#L210.
The last release 0.8 is from 2023 and does not contain that.

Yeah, YAML is fine as well. This is the slowest format we have and no one who uses YAML really cares about speed.

Can you restart the testing?

@ZXShady
Copy link
Contributor

ZXShady commented Nov 3, 2025

Related to #444

honestly my solution is to adopt something like C++29 zstring_view it is 0 cost string view wrapper

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants