Skip to content

Add support for large, multi-part config messages #11

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Apr 15, 2025

Conversation

jagerman
Copy link
Member

This implements multi-part config messages, which have been stubbed in for a while but still needed an implementation.

Such messages are split after post-compression, post-padding, but before encryption (i.e. each piece gets encrypted separately), and all but the last piece will be a maximum size message (76800 bytes).

The Config object itself keeps track of partially received messages, joining them up as they arrive and processing them once it has a full set (or else expiring them if they are too old). It also tracks (without storing) recently fully processed multipart messages so that they can be discarded without pointlessly storing them if they happen to arrive again for some reason.

All of this requires some (breaking) external API changes to implement:

  • merge(...) now returns an ordered_set of config messages that failed to merge rather than a vector; we couldn't keep track of the order anymore, because of the next point.
  • merge(...) now returns all the hashes that make up a multipart set when one gets merged, even if they weren't passed in in the current merge call. For instance if you have a three-chunk multipart config and merge parts 1 and 2, but not 3, then 1 and 2 will be preserved internally. When you later provide 3, then all of hash1, hash2, and hash3 get returned to indicate that all of those were processed as a part of this merge.
  • confirm_pushed() previously took a vector and implied the order matters (although we never previously returned more than one). It now takes an unordered_set to be consistent with merge() and current_hashes(), and because the order is irrelevant.
  • The C version of confirm_pushed (config_confirm_pushed) previously only accepted a single message hash + size (and would convert this into a one-element vector for confirm_pushed()). Its API changes quite considerably as it now has to pass an array of hashes and sizes and the size of those arrays. As with the C++ version, the order of elements here is unimportant.
  • The C++ current_hashes() previously only ever returned a 0- or 1-length vector; it now returns an unordered_set, and can easily return more than one (i.e. when the current config is a multipart config message).
  • Added a new active_hashes() method that returns the current hashes plus the hashes of any partially loaded multi-part configs for which we are awaiting remaining parts. When used for "do we want to keep this alive?" this is the more appropriate method to use compared to current_hashes().

Other multi-part related changes (which don't affect the API, but are worth noting):

  • multipart messages do not support (and must not be) protobuf wrapped, because multipart messages already will not be understood by clients not using this new code, and so anything supporting multipart should also have been fixed to not apply extra protobuf wrapping. We also are returning maximum size message parts, and the protobuf overhead will push it over the storage limit in a way that isn't trivial for libsession-util to determine and compensate for.

Other changes not directly related to multipart, but included as part of this:

  • Started converting some use of ustring_view to span<const unsigned char> (alias: ucspan), adding some helper functions and changing some functions to take spans instead. For the most part this conversion can happen piecemeal as, luckily, ustring_views are implicitly convertible to ucspan's, and so most places passing a ustring_view (or ustring) will not break. (The other way, i.e. ucspan to ustring_view, does not work, and so the conversion needs to happen from innermost functions out; this initial change is focussed on some of those inner functions).

  • fixed a bug where a duplicate config (multipart or otherwise) that gets merge()d would result in that config's hash ending up in both current hashes and old hashes, but it should not have gone into old hashes.

  • started converting some string concatenation and std::to_string usage to fmt and/or "..."_format instead now that we have those available (via oxen-logging).

@Bilb
Copy link
Collaborator

Bilb commented Dec 19, 2024

LGTM. I made a fix for the clang-16 build issues at #15

Copy link
Collaborator

@mpretty-cyro mpretty-cyro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it looks like this PR doesn't play nicely with the libQuic changes due to the updated oxen-encoding version (oxenc::bt_list_consumer btlc(data); is no longer valid) - it seems we might need to wait until oxen-io/oxen-libquic#152 has been merged?

I worked around this for my testing by modifying btstream.cpp to do this instead:

auto val = oxenc::const_span<std::byte>{req.data(), req.size()};
oxenc::bt_list_consumer btlc(val);

but that resulted in Cannot create a bt_list_consumer with no data errors in the longs (and then I wasn't able to continue testing due to the config_push assertion but I assume I would have been prevented from testing properly due to this hack)

- Update oxen-encoding to latest dev for new span support in bt code.
- bring in oxen-logging so that we can use fmt; for now this only for
  fmt in the test suite, but we will definitely start using it in the
  future in libsession-util (the libquic PR already does).
- added an overload of the test suite "printable" functions that takes
  format+args.
- replace some `"..."_hex` string concatenations that broke when oxenc
  switched to constexpr _hex no longer return strings.
- Fixed some oxen-related test code that broke where it was touching
  oxenc::detail functions.
This implements multi-part config messages, which have been stubbed in
for a while but still needed an implementation.

Such messages are split after post-compression, post-padding, but before
encryption (i.e. each piece gets encrypted separately), and all but the
last piece will be a maximum size message (76800 bytes).

The Config object itself keeps track of partially received messages,
joining them up as they arrive and processing them once it has a full
set (or else expiring them if they are too old).  It also tracks
(without storing) recently fully processed multipart messages so that
they can be discarded without pointlessly storing them if they happen to
arrive again for some reason.

All of this requires some (breaking) external API changes to implement:

- `merge(...)` now returns an ordered_set of config messages that failed
  to merge rather than a vector; we couldn't keep track of the order
  anymore, because of the next point.
- `merge(...)` now returns *all* the hashes that make up a multipart set
  when one gets merged, even if they weren't passed in in the current
  merge call.  For instance if you have a three-chunk multipart config
  and merge parts 1 and 2, but not 3, then 1 and 2 will be preserved
  internally.  When you later provide 3, then all of hash1, hash2, and
  hash3 get returned to indicate that all of those were processed as a
  part of this merge.
- confirm_pushed() previously took a vector and implied the order
  matters (although we never previously returned more than one).  It now
  takes an unordered_set to be consistent with merge() and
  current_hashes(), and because the order is irrelevant.
- The C version of confirm_pushed (config_confirm_pushed) previously
  only accepted a single message hash + size (and would convert this
  into a one-element vector for confirm_pushed()).  Its API changes
  quite considerably as it now has to pass an array of hashes and sizes
  and the size of those arrays.  As with the C++ version, the order of
  elements here is unimportant.
- The C++ `current_hashes()` previously only ever returned a 0- or
  1-length vector; it now returns an unordered_set, and can easily
  return more than one (i.e. when the current config is a multipart
  config message).
- Added a new `active_hashes()` method that returns the current hashes
  *plus* the hashes of any partially loaded multi-part configs for which
  we are awaiting remaining parts.  When used for "do we want to keep
  this alive?" this is the more appropriate method to use compared to
  `current_hashes()`.

Other multi-part related changes (which don't affect the API, but are
worth noting):

- multipart messages *do not* support (and must not be) protobuf
  wrapped, because multipart messages *already* will not be understood
  by clients not using this new code, and so anything supporting
  multipart should also have been fixed to not apply extra protobuf
  wrapping.  We also are returning maximum size message parts, and the
  protobuf overhead will push it over the storage limit in a way that
  isn't trivial for libsession-util to determine and compensate for.

Other changes not directly related to multipart, but included as part of
this:

- Started converting some use of ustring_view to `span<const unsigned
  char>` (alias: `ucspan`), adding some helper functions and changing
  some functions to take spans instead.  For the most part this
  conversion can happen piecemeal as, luckily, ustring_views are
  implicitly convertible to `ucspan`'s, and so most places passing a
  ustring_view (or ustring) will not break.  (The other way, i.e. ucspan
  to ustring_view, does not work, and so the conversion needs to happen
  from innermost functions out; this initial change is focussed on some
  of those inner functions).

- fixed a bug where a duplicate config (multipart or otherwise) that
  gets merge()d would result in that config's hash ending up in *both*
  current hashes and old hashes, but it should not have gone into old
  hashes.

- started converting some string concatenation and std::to_string usage
  to fmt and/or "..."_format instead now that we have those available
  (via oxen-logging).
# Conflicts:
#	external/CMakeLists.txt
#	external/oxen-encoding
#	include/session/config.hpp
#	include/session/config/base.hpp
#	include/session/config/convo_info_volatile.hpp
#	include/session/config/protos.hpp
#	include/session/hash.hpp
#	include/session/session_encrypt.hpp
#	include/session/types.hpp
#	include/session/util.hpp
#	src/CMakeLists.txt
#	src/config/base.cpp
#	src/config/convo_info_volatile.cpp
#	src/config/encrypt.cpp
#	src/config/internal.cpp
#	src/config/internal.hpp
#	src/config/protos.cpp
#	src/hash.cpp
#	src/session_encrypt.cpp
#	tests/CMakeLists.txt
#	tests/swarm-auth-test.cpp
#	tests/test_blinding.cpp
#	tests/test_bugs.cpp
#	tests/test_config_contacts.cpp
#	tests/test_config_convo_info_volatile.cpp
#	tests/test_config_user_groups.cpp
#	tests/test_config_userprofile.cpp
#	tests/test_configdata.cpp
#	tests/test_group_info.cpp
#	tests/test_group_keys.cpp
#	tests/test_group_members.cpp
#	tests/test_multi_encrypt.cpp
#	tests/test_session_encrypt.cpp
#	tests/utils.hpp
mpretty-cyro and others added 6 commits April 14, 2025 14:26
(Jason amended) rewrote the comment describing the struct layout as
well.
This can be very helpful to identify abnormally slow or hanging tests by
trace logging every test as it runs.
More than half the CPU time in the multi-part config tests is being used
generating 12k proper session IDs (via Ed keygen + conversion to X) for
the test.  This replaces the session id generation with random values
which is sufficient for this test and takes almost no time at all.
A timeout would fail all the CHECKs, but would keep going and then throw
a weird json parsing error.  This fixes that by checking the success
condition to a REQUIRE instead of a CHECK so that the json parsing
exception doesn't happen or get reported if the request failed.
We write all held multipart values when dumping with hash as key, but
that won't necessarily be properly sorted with an unordered_map, so make
it sorted instead.
@mpretty-cyro mpretty-cyro merged commit 3f71894 into session-foundation:dev Apr 15, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants