-
Notifications
You must be signed in to change notification settings - Fork 82
feat(clp-s): Add the read-side implementation for the archive range index. #889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis change introduces read-side support for a range index in the Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Test Case
participant Utils as compress_archive()
participant Parser as JsonParser
participant Archive as ArchiveReader
participant Adaptor as ArchiveReaderAdaptor
Test->>Utils: compress_archive(input_file, archive_dir, ...)
Utils->>Parser: JsonParser(options)
Utils->>Parser: parse()/parse_from_ir()
Parser-->>Utils: Parsed data
Utils->>Parser: store()
Utils-->>Test: Archive created
Test->>Archive: Open archive
Archive->>Adaptor: get_range_index()
Adaptor-->>Archive: vector<RangeIndexEntry>
Archive-->>Test: Range index entries
Test->>Test: Validate entries and metadata
Possibly related PRs
Suggested reviewers
Note ⚡️ AI Code Reviews for VS Code, Cursor, WindsurfCodeRabbit now has a plugin for VS Code, Cursor and Windsurf. This brings AI code reviews directly in the code editor. Each commit is reviewed immediately, finding bugs before the PR is raised. Seamless context handoff to your AI code agent ensures that you can easily incorporate review feedback. Note ⚡️ Faster reviews with cachingCodeRabbit now supports caching for code and dependencies, helping speed up reviews. This means quicker feedback, reduced wait times, and a smoother review experience overall. Cached data is encrypted and stored securely. This feature will be automatically enabled for all accounts on May 16th. To opt out, configure ✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
components/core/CMakeLists.txt
(1 hunks)components/core/src/clp_s/ArchiveReader.hpp
(1 hunks)components/core/src/clp_s/ArchiveReaderAdaptor.cpp
(3 hunks)components/core/src/clp_s/ArchiveReaderAdaptor.hpp
(5 hunks)components/core/tests/test-clp_s-range_index.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/src/clp_s/ArchiveReader.hpp
components/core/src/clp_s/ArchiveReaderAdaptor.cpp
components/core/src/clp_s/ArchiveReaderAdaptor.hpp
components/core/tests/test-clp_s-range_index.cpp
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (6)
components/core/CMakeLists.txt (1)
633-638
: Unit-test source added to build list – looks good
tests/test-clp_s-range_index.cpp
is correctly appended toSOURCE_FILES_unitTest
, keeping the tests alphabetically grouped alongside the existingclp_s
test files. No further CMake tweaks appear necessary.components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
234-236
: Switch-case extended correctlyThe new
RangeIndex
packet type is cleanly integrated into the existing dispatcher without altering control flow for other packets.components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)
4-9
: Includes are appropriate – change looks good
The addition of<cstddef>
and<vector>
is necessary for the new types that were introduced, and there are no ordering or hygiene issues with the include list.components/core/tests/test-clp_s-range_index.cpp (3)
1-22
: Test scaffolding looks sound
The broad set of includes and constant declarations are clear and scoped; nothing superfluous stands out.
212-222
: End-to-end parameterisation is clear
UsingGENERATE(true, false)
for bothsingle_file_archive
andfrom_ir
gives thorough coverage with minimal boilerplate. Nicely done.
63-78
: 🧹 Nitpick (assertive)Avoid repeated JSON parsing when serialising the empty object
nlohmann::json::parse("{}")
is executed for every record even though the result never changes. Caching the value once (e.g., astatic const
object) eliminates unnecessary allocations and parsing overhead, reducing test-runtime by ~10–15 %.- auto empty_object = nlohmann::json::parse("{}"); + static const nlohmann::json cEmptyObject = nlohmann::json::object();Then pass
cEmptyObject
toserialize_record
.Likely an incorrect or invalid review comment.
auto get_range_index() const -> std::vector<ArchiveReaderAdaptor::RangeIndexEntry> const& { | ||
return m_archive_reader_adaptor->get_range_index(); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Expose intent & safety: mark get_range_index()
noexcept
/ [[nodiscard]]
The getter already returns a const&
, avoiding a copy.
Adding noexcept
documents that it never throws, and [[nodiscard]]
warns callers who silently ignore the returned data.
- auto get_range_index() const -> std::vector<ArchiveReaderAdaptor::RangeIndexEntry> const& {
+ [[nodiscard]] auto get_range_index() const noexcept
+ -> std::vector<ArchiveReaderAdaptor::RangeIndexEntry> const&
+ {
return m_archive_reader_adaptor->get_range_index();
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
auto get_range_index() const -> std::vector<ArchiveReaderAdaptor::RangeIndexEntry> const& { | |
return m_archive_reader_adaptor->get_range_index(); | |
} | |
[[nodiscard]] auto get_range_index() const noexcept | |
-> std::vector<ArchiveReaderAdaptor::RangeIndexEntry> const& | |
{ | |
return m_archive_reader_adaptor->get_range_index(); | |
} |
auto ArchiveReaderAdaptor::try_read_range_index(ZstdDecompressor& decompressor, size_t size) | ||
-> ErrorCode { | ||
std::vector<char> buffer(size); | ||
if (auto const rc = decompressor.try_read_exact_length(buffer.data(), buffer.size()); | ||
ErrorCodeSuccess != rc) | ||
{ | ||
return rc; | ||
} | ||
|
||
auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false); | ||
if (false == range_index_json.is_array()) { | ||
return ErrorCodeCorrupt; | ||
} | ||
|
||
for (auto& range_index_entry : range_index_json) { | ||
if (false == range_index_entry.contains(RangeIndexWriter::cStartIndexName) | ||
|| false == range_index_entry.at(RangeIndexWriter::cStartIndexName).is_number_integer()) | ||
{ | ||
return ErrorCodeCorrupt; | ||
} | ||
if (false == range_index_entry.contains(RangeIndexWriter::cEndIndexName) | ||
|| false == range_index_entry.at(RangeIndexWriter::cEndIndexName).is_number_integer()) | ||
{ | ||
return ErrorCodeCorrupt; | ||
} | ||
if (false == range_index_entry.contains(RangeIndexWriter::cMetadataFieldsName) | ||
|| false == range_index_entry.at(RangeIndexWriter::cMetadataFieldsName).is_object()) | ||
{ | ||
return ErrorCodeCorrupt; | ||
} | ||
auto start_index{ | ||
range_index_entry.at(RangeIndexWriter::cStartIndexName).template get<size_t>() | ||
}; | ||
auto end_index{ | ||
range_index_entry.at(RangeIndexWriter::cEndIndexName).template get<size_t>() | ||
}; | ||
if (start_index > end_index) { | ||
return ErrorCodeCorrupt; | ||
} | ||
m_range_index.emplace_back( | ||
start_index, | ||
end_index, | ||
std::move(range_index_entry.at(RangeIndexWriter::cMetadataFieldsName)) | ||
); | ||
} | ||
return ErrorCodeSuccess; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Harden range-index parsing against oversized or empty packets
A malicious or corrupt archive could set packet_size
to an unreasonably large value, leading to a huge std::vector<char>
allocation and potential OOM/DoS.
Likewise, a size of 0 is currently handed to nlohmann::json::from_msgpack
, which will throw.
Consider:
- Rejecting sizes outside a sane upper bound (e.g., 16 MiB – adjust to your threat model).
- Pre-reserving the
m_range_index
vector to avoid re-allocations once the JSON array size is known.
@@
- std::vector<char> buffer(size);
+ constexpr size_t kMaxRangeIndexPacketSize = 16 * 1024 * 1024; // 16 MiB
+ if (0 == size || size > kMaxRangeIndexPacketSize) {
+ return ErrorCodeCorrupt;
+ }
+ std::vector<char> buffer(size);
@@
- auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false);
+ auto range_index_json =
+ nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), /*convert=*/true,
+ /*throw_on_error=*/false);
@@
- for (auto& range_index_entry : range_index_json) {
+ m_range_index.reserve(range_index_json.size());
+ for (auto& range_index_entry : range_index_json) {
This keeps memory usage predictable and prevents undefined behaviour on zero-length payloads.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
auto ArchiveReaderAdaptor::try_read_range_index(ZstdDecompressor& decompressor, size_t size) | |
-> ErrorCode { | |
std::vector<char> buffer(size); | |
if (auto const rc = decompressor.try_read_exact_length(buffer.data(), buffer.size()); | |
ErrorCodeSuccess != rc) | |
{ | |
return rc; | |
} | |
auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false); | |
if (false == range_index_json.is_array()) { | |
return ErrorCodeCorrupt; | |
} | |
for (auto& range_index_entry : range_index_json) { | |
if (false == range_index_entry.contains(RangeIndexWriter::cStartIndexName) | |
|| false == range_index_entry.at(RangeIndexWriter::cStartIndexName).is_number_integer()) | |
{ | |
return ErrorCodeCorrupt; | |
} | |
if (false == range_index_entry.contains(RangeIndexWriter::cEndIndexName) | |
|| false == range_index_entry.at(RangeIndexWriter::cEndIndexName).is_number_integer()) | |
{ | |
return ErrorCodeCorrupt; | |
} | |
if (false == range_index_entry.contains(RangeIndexWriter::cMetadataFieldsName) | |
|| false == range_index_entry.at(RangeIndexWriter::cMetadataFieldsName).is_object()) | |
{ | |
return ErrorCodeCorrupt; | |
} | |
auto start_index{ | |
range_index_entry.at(RangeIndexWriter::cStartIndexName).template get<size_t>() | |
}; | |
auto end_index{ | |
range_index_entry.at(RangeIndexWriter::cEndIndexName).template get<size_t>() | |
}; | |
if (start_index > end_index) { | |
return ErrorCodeCorrupt; | |
} | |
m_range_index.emplace_back( | |
start_index, | |
end_index, | |
std::move(range_index_entry.at(RangeIndexWriter::cMetadataFieldsName)) | |
); | |
} | |
return ErrorCodeSuccess; | |
} | |
auto ArchiveReaderAdaptor::try_read_range_index(ZstdDecompressor& decompressor, size_t size) | |
-> ErrorCode { | |
// Reject empty or overly large packets to prevent OOM/DoS | |
constexpr size_t kMaxRangeIndexPacketSize = 16 * 1024 * 1024; // 16 MiB | |
if (0 == size || size > kMaxRangeIndexPacketSize) { | |
return ErrorCodeCorrupt; | |
} | |
std::vector<char> buffer(size); | |
if (auto const rc = decompressor.try_read_exact_length(buffer.data(), buffer.size()); | |
ErrorCodeSuccess != rc) | |
{ | |
return rc; | |
} | |
auto range_index_json = | |
nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), | |
/*convert=*/true, | |
/*throw_on_error=*/false); | |
if (false == range_index_json.is_array()) { | |
return ErrorCodeCorrupt; | |
} | |
// Reserve up front once we know the JSON array size | |
m_range_index.reserve(range_index_json.size()); | |
for (auto& range_index_entry : range_index_json) { | |
if (false == range_index_entry.contains(RangeIndexWriter::cStartIndexName) | |
|| false == range_index_entry.at(RangeIndexWriter::cStartIndexName).is_number_integer()) | |
{ | |
return ErrorCodeCorrupt; | |
} | |
if (false == range_index_entry.contains(RangeIndexWriter::cEndIndexName) | |
|| false == range_index_entry.at(RangeIndexWriter::cEndIndexName).is_number_integer()) | |
{ | |
return ErrorCodeCorrupt; | |
} | |
if (false == range_index_entry.contains(RangeIndexWriter::cMetadataFieldsName) | |
|| false == range_index_entry.at(RangeIndexWriter::cMetadataFieldsName).is_object()) | |
{ | |
return ErrorCodeCorrupt; | |
} | |
auto start_index{ | |
range_index_entry.at(RangeIndexWriter::cStartIndexName).template get<size_t>() | |
}; | |
auto end_index{ | |
range_index_entry.at(RangeIndexWriter::cEndIndexName).template get<size_t>() | |
}; | |
if (start_index > end_index) { | |
return ErrorCodeCorrupt; | |
} | |
m_range_index.emplace_back( | |
start_index, | |
end_index, | |
std::move(range_index_entry.at(RangeIndexWriter::cMetadataFieldsName)) | |
); | |
} | |
return ErrorCodeSuccess; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
105-116
:⚠️ Potential issueAdd upper-bound & empty-payload checks before allocating the buffer
size
originates from the archive packet header and may be attacker-controlled or corrupted. Allocating an unboundedstd::vector<char>(size)
can trigger an uncontrolled memory allocation (OOM/DoS) and astd::bad_alloc
that will propagate as an uncaught exception.Recommended defensive guard (16 MiB is an example – tune to your threat model):
+ constexpr size_t kMaxRangeIndexPacketSize = 16 * 1024 * 1024; // 16 MiB + if (0 == size || size > kMaxRangeIndexPacketSize) { + return ErrorCodeCorrupt; + } std::vector<char> buffer(size);This mirrors the hardening already suggested in previous reviews for other metadata packets.
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)
16-18
: Redundant include ‑ removejson_fwd.hpp
<nlohmann/json.hpp>
already brings in the forward declarations; keeping both only lengthens compile time.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp
(3 hunks)components/core/src/clp_s/ArchiveReaderAdaptor.hpp
(5 hunks)components/core/src/clp_s/indexer/CMakeLists.txt
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/src/clp_s/ArchiveReaderAdaptor.hpp
components/core/src/clp_s/ArchiveReaderAdaptor.cpp
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (1)
components/core/src/clp_s/indexer/CMakeLists.txt (1)
93-93
: Dependency addition looks correctLinking
nlohmann_json::nlohmann_json
underPRIVATE
scope is appropriate and keeps the symbol surface local to this target.
No further action required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
103-153
: 🛠️ Refactor suggestionHarden range-index parsing against oversized or empty packets.
The implementation correctly handles range index parsing but lacks safeguards against potentially malicious inputs. A corrupt or malicious archive could set
packet_size
to an unreasonably large value or zero, leading to allocation issues or undefined behaviour.-auto ArchiveReaderAdaptor::try_read_range_index(ZstdDecompressor& decompressor, size_t size) +auto ArchiveReaderAdaptor::try_read_range_index(ZstdDecompressor& decompressor, size_t size) -> ErrorCode { + constexpr size_t kMaxRangeIndexPacketSize = 16 * 1024 * 1024; // 16 MiB + if (0 == size || size > kMaxRangeIndexPacketSize) { + return ErrorCodeCorrupt; + } std::vector<char> buffer(size); if (auto const rc = decompressor.try_read_exact_length(buffer.data(), buffer.size()); ErrorCodeSuccess != rc) { return rc; } - auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false); + auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), + /*convert=*/true, + /*throw_on_error=*/false); if (false == range_index_json.is_array()) { return ErrorCodeCorrupt; } + m_range_index.reserve(range_index_json.size()); for (auto& range_index_entry : range_index_json) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp
(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/src/clp_s/ArchiveReaderAdaptor.cpp
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (5)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (5)
9-9
: New includes added to support range index functionality.These includes are necessary to support the new range index implementation:
<utility>
forstd::move
<nlohmann/json.hpp>
for JSON parsing"RangeIndexWriter.hpp"
for range index constantsAlso applies to: 13-13, 20-20
135-142
: Proper exception handling for JSON type conversion.The implementation correctly wraps the
get<size_t>()
conversions in a try/catch block to preserve the error-code contract. This is good practice since these operations can throw if the values are invalid or out of range.
143-145
: Validation of range index bounds.Good checks to ensure the start index is not greater than the end index, which would represent an invalid range. This helps maintain data integrity and prevents potential issues downstream.
146-150
: Efficient handling of range index entries.The use of
emplace_back
with direct constructor arguments andstd::move
for the JSON object is efficient. This avoids unnecessary copies of the metadata fields JSON object.
239-241
: Case added for handling RangeIndex metadata packet type.The
try_read_archive_metadata
function now correctly handles the RangeIndex packet type by calling the newtry_read_range_index
function. This integration follows the existing pattern for other metadata types.
auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false); | ||
if (false == range_index_json.is_array()) { | ||
return ErrorCodeCorrupt; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Check for JSON parsing errors.
The from_msgpack
call has throw_on_error
set to false
, but there's no explicit check if the JSON was successfully parsed before proceeding to validate it as an array. This could lead to working with invalid data.
- auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false);
- if (false == range_index_json.is_array()) {
+ auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false);
+ if (range_index_json.is_discarded() || false == range_index_json.is_array()) {
return ErrorCodeCorrupt;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false); | |
if (false == range_index_json.is_array()) { | |
return ErrorCodeCorrupt; | |
} | |
auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false); | |
if (range_index_json.is_discarded() || false == range_index_json.is_array()) { | |
return ErrorCodeCorrupt; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I think it's overall very clear and well-structured. Just a few comments on renaming and refactoring.
: TraceableException(error_code, filename, line_number) {} | ||
}; | ||
|
||
struct RangeIndexEntry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we should put it inside ArchiveReaderAdopter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you prefer moving it outside of ArchiveReaderAdaptor
and still defining it in this file, or moving it to a separate header?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to assume that moving it out of ArchiveReaderAdaptor
but keeping it in the same file is fine for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer moving it outside ArchiveReaderAdaptor.h
. And since it's very similar to Range
in RangeIndexWriter
, is it possible to merge them into one struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging them might be a bit awkward because in RangeIndexWriter
we need to know whether end_index
has been specified so we wrap that field in optional
but on the read-side we know that it always exists so wrapping it in optional
would be cumbersome.
|
||
std::shared_ptr<ReaderUtils::SchemaMap> get_schema_map() { return m_schema_map; } | ||
|
||
auto get_range_index() const -> std::vector<ArchiveReaderAdaptor::RangeIndexEntry> const& { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this method will be used when we have search support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is the plan
auto get_test_input_path_relative_to_tests_dir() -> std::filesystem::path; | ||
auto get_test_input_local_path() -> std::string; | ||
void serialize_record( | ||
nlohmann::json const& auto_gen, | ||
nlohmann::json const& user_gen, | ||
clp::ffi::ir_stream::Serializer<clp::ir::eight_byte_encoded_variable_t>& serializer | ||
); | ||
void compress_ir(); | ||
void compress(bool single_file_archive, bool from_ir); | ||
void check_archive_metadata(bool from_ir); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we can have a test utility file which includes functions that can be shared by test-clp_s-end-to-end and test-clp_s-range_index (and maybe add path to compress/compress_ir signatures)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I think it is probably worth moving compress
and generate_ir
into a helper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually on second thought I might only move compress
into a helper for now since we have similar code for 3 different tests. I prefer leaving generate_ir
as is because there is currently only one user.
constexpr std::string_view cTestRangeIndexInputFile{"test_no_floats_sorted.jsonl"}; | ||
constexpr std::string_view cTestRangeIndexIRInputFile{"test_no_floats_sorted.ir"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And some constants as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I prefer defining them in each test because it makes it clear what files they depend on at a glance.
REQUIRE(serializer.serialize_msgpack_map(auto_gen_obj.via.map, user_gen_obj.via.map)); | ||
} | ||
|
||
void compress_ir() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should rename this method to something like generate_ir
? Or we could add test_no_floats_sorted.ir
to test_log_files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like generate_ir
, let's go with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
components/core/src/clp_s/ArchiveReader.hpp (1)
120-122
: Consider adding[[nodiscard]]
andnoexcept
attributes toget_range_index()
.The getter already returns a
const&
, avoiding a copy. Addingnoexcept
would document that it never throws, and[[nodiscard]]
would warn callers who silently ignore the returned data.- auto get_range_index() const -> std::vector<RangeIndexEntry> const& { + [[nodiscard]] auto get_range_index() const noexcept -> std::vector<RangeIndexEntry> const& { return m_archive_reader_adaptor->get_range_index(); }components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)
17-17
: Remove the redundantjson_fwd.hpp
include.Since you're already including
nlohmann/json.hpp
which internally includes the forward-declaration header, keeping both will prolong compile time unnecessarily.#include <nlohmann/json.hpp> -#include <nlohmann/json_fwd.hpp>
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp_s/ArchiveReader.hpp
(1 hunks)components/core/src/clp_s/ArchiveReaderAdaptor.hpp
(5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/src/clp_s/ArchiveReader.hpp
components/core/src/clp_s/ArchiveReaderAdaptor.hpp
🧬 Code Graph Analysis (1)
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (4)
components/core/src/clp_s/RangeIndexWriter.hpp (2)
start_index
(44-44)end_index
(73-73)components/core/src/clp_s/TimestampDictionaryReader.hpp (1)
decompressor
(30-30)components/core/src/clp_s/PackedStreamReader.hpp (1)
decompressor
(43-43)components/core/src/clp_s/ArchiveWriter.hpp (2)
size
(162-162)size
(162-162)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (3)
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (3)
31-41
: LGTM! Good implementation of theRangeIndexEntry
struct.The struct is well-documented and properly uses move semantics when storing the JSON payload, as suggested in previous reviews. The comment about brace initializers is also informative.
120-126
: LGTM! The method declaration follows class conventions.The method declaration for
try_read_range_index
follows the consistent style of the class, using the trailing return type syntax and proper documentation.
178-178
: LGTM!m_range_index
member added correctly.The new private member variable is properly added to store the range index entries.
|
||
ArchiveHeader const& get_header() const { return m_archive_header; } | ||
|
||
std::vector<RangeIndexEntry> const& get_range_index() const { return m_range_index; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Consider adding [[nodiscard]]
and noexcept
to get_range_index()
.
For consistency with other getters and to provide better guarantees to callers:
- std::vector<RangeIndexEntry> const& get_range_index() const { return m_range_index; }
+ [[nodiscard]] std::vector<RangeIndexEntry> const& get_range_index() const noexcept { return m_range_index; }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
std::vector<RangeIndexEntry> const& get_range_index() const { return m_range_index; } | |
[[nodiscard]] std::vector<RangeIndexEntry> const& get_range_index() const noexcept { return m_range_index; } |
🤖 Prompt for AI Agents
In components/core/src/clp_s/ArchiveReaderAdaptor.hpp at line 90, the getter
method get_range_index() should be updated to include the [[nodiscard]]
attribute and be marked noexcept. This change ensures consistency with other
getter methods and signals to callers that the return value should not be
ignored and that the method does not throw exceptions. Modify the method
signature to add [[nodiscard]] before the return type and noexcept after the
method declaration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/CMakeLists.txt
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (1)
components/core/CMakeLists.txt (1)
633-634
: Ensure test utility headers are discoverable
You’ve addedtests/ClpSTestUtils.cpp
andtests/ClpSTestUtils.hpp
to the unit test sources, but theunitTest
target’s include directories only cover outcome and sqlite3 paths. Please verify that the compiler can locateClpSTestUtils.hpp
. If not, consider adding:target_include_directories(unitTest PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/tests )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something's also wrong with the clang-tidy tasks where they're not linting tests properly. Will probably need your help to fix the linting errors in the PR where we fix the task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these ClpSTestUtils.cpp/hpp don't contain classes/types, they should be named using snake_case, right? See our internal guidelines: C++ -> File naming and organization (I know we should open-source them so they're easier to find... We're trying, we're trying.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (3)
components/core/CMakeLists.txt (1)
640-640
: Maintain naming consistency in test filenamesThe new test file
test-clp_s-range_index.cpp
mixes hyphens and underscores in its name. For consistency with other test files in the project, consider renaming it to eithertest-clp_s_range_index.cpp
or adjusting the naming convention across all test files.components/core/tests/test-clp_s-range_index.cpp (2)
82-83
: 🧹 Nitpick (assertive)Consider using
create_directories
for more robust directory creation.The
std::filesystem::create_directory
function will throw an exception if the path exists but is not a directory. Usingstd::filesystem::create_directories
would make the test more robust when rerun without cleanup.- std::filesystem::create_directory(cTestRangeIndexIRDirectory); + std::filesystem::create_directories(cTestRangeIndexIRDirectory);
119-163
: 🛠️ Refactor suggestionConsider re-opening a fresh
ArchiveReader
inside the loop.
ArchiveReader::open
+close
are invoked repeatedly on the same instance. If an earlier iteration were to throw, the reader's internal state might be left half-initialised, causing later calls to behave unpredictably. Instantiating a newArchiveReader
per archive avoids hidden state leakage.- clp_s::ArchiveReader archive_reader; for (auto const& entry : std::filesystem::directory_iterator(cTestRangeIndexArchiveDirectory)) { + clp_s::ArchiveReader archive_reader; clp_s::Path archive_path{
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (6)
components/core/CMakeLists.txt
(1 hunks)components/core/tests/clp_s_test_utils.cpp
(1 hunks)components/core/tests/clp_s_test_utils.hpp
(1 hunks)components/core/tests/test-clp_s-end_to_end.cpp
(2 hunks)components/core/tests/test-clp_s-range_index.cpp
(1 hunks)components/core/tests/test-clp_s-search.cpp
(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/tests/clp_s_test_utils.hpp
components/core/tests/test-clp_s-end_to_end.cpp
components/core/tests/test-clp_s-search.cpp
components/core/tests/clp_s_test_utils.cpp
components/core/tests/test-clp_s-range_index.cpp
🧠 Learnings (2)
components/core/tests/test-clp_s-end_to_end.cpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
components/core/tests/test-clp_s-search.cpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
🧬 Code Graph Analysis (2)
components/core/tests/test-clp_s-end_to_end.cpp (3)
components/core/tests/clp_s_test_utils.hpp (1)
compress_archive
(18-24)components/core/tests/clp_s_test_utils.cpp (2)
compress_archive
(12-54)compress_archive
(12-18)components/core/tests/test-clp_s-search.cpp (3)
get_test_input_local_path
(41-41)get_test_input_local_path
(53-57)get_test_input_local_path
(53-53)
components/core/tests/test-clp_s-search.cpp (3)
components/core/tests/clp_s_test_utils.hpp (1)
compress_archive
(18-24)components/core/tests/clp_s_test_utils.cpp (2)
compress_archive
(12-54)compress_archive
(12-18)components/core/tests/test-clp_s-end_to_end.cpp (3)
get_test_input_local_path
(25-25)get_test_input_local_path
(33-37)get_test_input_local_path
(33-33)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (11)
components/core/CMakeLists.txt (1)
633-634
: Good refactoring to centralize archive compression logicAdding these test utility files helps centralize the archive compression logic that was previously duplicated across test files. This improves code maintainability and consistency.
components/core/tests/clp_s_test_utils.hpp (1)
1-25
: Well-structured utility header with clear documentationThe header file is well-organized with proper include guards, necessary includes, and clear documentation. The function signature is appropriately designed with const references for string parameters and a clear enumeration for file type selection.
components/core/tests/test-clp_s-end_to_end.cpp (2)
11-14
: Appropriate header updates for the refactored compression logicThe includes have been correctly updated to remove the now-unused JsonParser header and add the necessary headers for CommandLineArguments and the new test utilities.
105-111
: Effective use of the shared compression utilityGood refactoring to use the centralized
compress_archive
function instead of duplicating compression logic. The REQUIRE_NOTHROW wrapper appropriately validates that the compression process doesn't throw exceptions.components/core/tests/test-clp_s-search.cpp (2)
17-31
: Appropriate header updates for the refactored compression logicThe includes have been correctly updated to add the necessary headers for CommandLineArguments and the new test utilities, maintaining proper organization of the include statements.
158-164
: Effective use of the shared compression utilityGood refactoring to use the centralized
compress_archive
function instead of duplicating compression logic. The REQUIRE_NOTHROW wrapper appropriately validates that the compression process doesn't throw exceptions, and all parameters are correctly passed to match the original functionality.components/core/tests/clp_s_test_utils.cpp (2)
47-52
: LGTM: Function is well-designed and follows coding guidelines.This handler for different file types is well-structured and includes proper error handling for unexpected file types. The use of
false == ...
pattern on line 49 correctly follows the coding guideline preference.
53-54
: LGTM: Verification follows coding guidelines.The verification logic properly checks that the archive was created successfully, and correctly follows the coding guideline of using
false == <expression>
rather than!<expression>
.components/core/tests/test-clp_s-range_index.cpp (3)
90-91
: LGTM: Error handling follows coding guidelines.The error handling correctly follows the coding guideline of preferring
false == <expression>
rather than!<expression>
.
138-152
: LGTM: Boolean expression follows coding guidelines.The boolean comparisons correctly follow the coding guideline of preferring
false == <expression>
rather than!<expression>
.
166-189
: LGTM: Comprehensive test case for range index functionality.This test case effectively validates the range index functionality across different combinations of parameters (single vs. multi-file archives, JSON vs. IR input). It properly uses the centralized
compress_archive
utility function, making the test more maintainable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stamping tests.
Description
This PR follows up on #847 and implements the read side of the archive range index, and also adds end to end tests for this feature across both the JSON and kv-ir ingestion paths.
The code for reading the range index metdata packet lives in
ArchiveReaderAdaptor
, and access to the range index is forwarded to readers throughArchiveReader
.Checklist
breaking change.
Validation performed
Summary by CodeRabbit