-
Notifications
You must be signed in to change notification settings - Fork 81
feat(clp-s): Expose archive range index metadata in printed archive statistics; Forward archive-level statistics to c++ writers. #1005
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
## Walkthrough
The changes introduce structured archive statistics reporting by adding an `ArchiveStats` class, updating key methods in `ArchiveWriter` and related classes to return JSON metadata, and propagating this metadata through the archive writing and closing process. Tests and interfaces are updated to handle and verify these new return types.
## Changes
| File(s) | Change Summary |
|-----------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| `components/core/src/clp_s/ArchiveWriter.cpp`, `ArchiveWriter.hpp` | `ArchiveWriter::close`, `write_single_file_archive`, and `write_archive_metadata` now return structured JSON and stats. Added `ArchiveStats` class. Removed `print_archive_stats`. |
| `components/core/src/clp_s/JsonParser.cpp`, `JsonParser.hpp` | `JsonParser::store` now returns `std::vector<ArchiveStats>`. Internal stats vector added. Calls to `close()` updated to handle return value. |
| `components/core/src/clp_s/RangeIndexWriter.cpp`, `RangeIndexWriter.hpp` | `RangeIndexWriter::write` now takes and outputs JSON metadata via an additional parameter. |
| `components/core/src/clp_s/CMakeLists.txt` | `nlohmann_json` linkage for `clp_s_archive_writer` changed from PRIVATE to PUBLIC. |
| `components/core/src/clp_s/clp-s.cpp` | Explicitly ignores the return value of `parser.store()`. |
| `components/core/tests/clp_s_test_utils.cpp`, `clp_s_test_utils.hpp` | `compress_archive` now returns `std::vector<ArchiveStats>`. Declaration and documentation updated. |
| `components/core/tests/test-clp_s-end_to_end.cpp`, `test-clp_s-search.cpp` | Explicitly discards the return value of `compress_archive` in tests. |
| `components/core/tests/test-clp_s-range_index.cpp` | Refactored and extended metadata validation: new helper functions, validation from both archive and stats. |
| `components/job-orchestration/job_orchestration/executor/compress/compression_task.py`| Changed subprocess stdout decoding from ASCII to UTF-8 before JSON parsing. |
## Sequence Diagram(s)
```mermaid
sequenceDiagram
participant User
participant JsonParser
participant ArchiveWriter
participant RangeIndexWriter
User->>JsonParser: store()
JsonParser->>ArchiveWriter: close()
ArchiveWriter->>ArchiveWriter: write_single_file_archive() / write_archive_metadata()
ArchiveWriter->>RangeIndexWriter: write(..., metadata)
RangeIndexWriter-->>ArchiveWriter: JSON metadata
ArchiveWriter-->>JsonParser: ArchiveStats (with JSON metadata)
JsonParser-->>User: std::vector<ArchiveStats> Possibly related PRs
Suggested reviewers
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
🔭 Outside diff range comments (3)
components/core/tests/clp_s_test_utils.cpp (1)
43-57
: 🧹 Nitpick (assertive)Minor simplification & early initialisation
archive_stats
needn’t be default-constructed and later overwritten; we can initialise directly:-std::vector<clp_s::ArchiveStats> archive_stats; -... -REQUIRE_NOTHROW(archive_stats = parser.store()); +auto archive_stats = parser.store(); // already noexceptReduces a move and keeps the variable in the smallest scope.
components/core/src/clp_s/RangeIndexWriter.cpp (1)
46-71
: 🧹 Nitpick (assertive)Double serialisation source – use the already-built
metadata
After constructing
ranges_array
, the code assigns it tometadata
but still serialisesranges_array
. This duplicates one large JSON object. Serialisingmetadata
avoids the extra copy and keeps the two representations in sync:- nlohmann::json ranges_array; + nlohmann::json ranges_array; // unchanged ... - metadata = ranges_array; + metadata = std::move(ranges_array); - nlohmann::json::to_msgpack(ranges_array, serialized_range_index); + nlohmann::json::to_msgpack(metadata, serialized_range_index);No functional change, but a small memory win and removes a potential future divergence bug.
components/core/src/clp_s/JsonParser.cpp (1)
490-503
: 🛠️ Refactor suggestionRepeated
m_archive_writer->close()
error-handling blocks are verbose and may mask lost statisticsThe same 3-line pattern
std::ignore = m_archive_writer->close(); return false;is duplicated ~16 times. Two practical issues:
- Readability / maintenance – fixing the shutdown path later requires touching many sites.
- Lost data – by discarding the returned
ArchiveStats
, any partial work gathered before the failure disappears, although the PR’s main feature is to surface that metadata.Consider refactoring into a small helper that:
[[nodiscard]] bool JsonParser::abort_with_error() { m_archive_stats.emplace_back(m_archive_writer->close()); return false; }and replace each block with
return abort_with_error();
This keeps the stats vector consistent and removes boiler-plate.
Also applies to: 560-562, 581-583, 607-608, 625-627, 645-646, 657-658, 663-664, 954-955, 975-976, 1051-1054, 1070-1073, 1099-1102, 1113-1116, 1131-1134, 1142-1145
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (13)
components/core/src/clp_s/ArchiveWriter.cpp
(6 hunks)components/core/src/clp_s/ArchiveWriter.hpp
(4 hunks)components/core/src/clp_s/CMakeLists.txt
(1 hunks)components/core/src/clp_s/JsonParser.cpp
(16 hunks)components/core/src/clp_s/JsonParser.hpp
(2 hunks)components/core/src/clp_s/RangeIndexWriter.cpp
(2 hunks)components/core/src/clp_s/RangeIndexWriter.hpp
(2 hunks)components/core/src/clp_s/clp-s.cpp
(1 hunks)components/core/tests/clp_s_test_utils.cpp
(3 hunks)components/core/tests/clp_s_test_utils.hpp
(2 hunks)components/core/tests/test-clp_s-end_to_end.cpp
(2 hunks)components/core/tests/test-clp_s-range_index.cpp
(4 hunks)components/core/tests/test-clp_s-search.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/tests/test-clp_s-search.cpp
components/core/src/clp_s/clp-s.cpp
components/core/tests/test-clp_s-end_to_end.cpp
components/core/src/clp_s/JsonParser.cpp
components/core/tests/clp_s_test_utils.cpp
components/core/src/clp_s/JsonParser.hpp
components/core/tests/clp_s_test_utils.hpp
components/core/src/clp_s/RangeIndexWriter.hpp
components/core/src/clp_s/RangeIndexWriter.cpp
components/core/src/clp_s/ArchiveWriter.hpp
components/core/src/clp_s/ArchiveWriter.cpp
components/core/tests/test-clp_s-range_index.cpp
🧠 Learnings (2)
components/core/tests/test-clp_s-search.cpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
components/core/tests/test-clp_s-end_to_end.cpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
🧬 Code Graph Analysis (3)
components/core/tests/test-clp_s-search.cpp (2)
components/core/tests/clp_s_test_utils.cpp (2)
compress_archive
(13-57)compress_archive
(13-19)components/core/tests/test-clp_s-end_to_end.cpp (3)
get_test_input_local_path
(26-26)get_test_input_local_path
(34-38)get_test_input_local_path
(34-34)
components/core/tests/test-clp_s-end_to_end.cpp (3)
components/core/tests/clp_s_test_utils.hpp (1)
compress_archive
(22-28)components/core/tests/clp_s_test_utils.cpp (2)
compress_archive
(13-57)compress_archive
(13-19)components/core/tests/test-clp_s-search.cpp (3)
get_test_input_local_path
(46-46)get_test_input_local_path
(64-68)get_test_input_local_path
(64-64)
components/core/src/clp_s/RangeIndexWriter.cpp (1)
components/core/src/clp_s/RangeIndexWriter.hpp (1)
writer
(85-85)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
🔇 Additional comments (17)
components/core/src/clp_s/clp-s.cpp (1)
123-123
: Explicit discard looks fine – double-check that callers don’t need the stats
std::ignore = parser.store();
makes the intent clear, but if any downstream tooling or metrics rely on the returnedArchiveStats
, they’ll silently lose that information here. Confirm this is the only compression entry-point that drops the stats.components/core/tests/test-clp_s-search.cpp (1)
230-238
: Return value now ignored explicitly – OKThe test was updated consistently with the new signature and already includes
<utility>
, so everything compiles cleanly.components/core/src/clp_s/CMakeLists.txt (1)
231-236
: PUBLIC link of nlohmann_json may broaden transitive depsPromoting
nlohmann_json::nlohmann_json
to PUBLIC will expose its include path and, on some build systems, its compile options to every downstream target ofclp_s::archive_writer
.
If that exposure is unintentional (e.g., only headers inarchive_writer
use nlohmann::json internally), consider keeping it PRIVATE and adding the include directory explicitly where required.components/core/src/clp_s/JsonParser.hpp (1)
232-234
: Member name is plural – ok, but verify lifetime semantics
m_archive_stats
holds all stats accumulated across splits. Ensure:
store()
clears this vector after returning if the sameJsonParser
instance may be reused.- Thread-safety is not a concern (vector is mutated in only one thread).
components/core/tests/test-clp_s-end_to_end.cpp (1)
106-114
: Test updated consistentlyThe explicit
std::ignore
plus added<utility>
header correctly handle the new return type; no further changes needed.components/core/tests/clp_s_test_utils.hpp (1)
22-28
:[[nodiscard]]
may cause noise when tests intentionally ignore the valueMany callers use
std::ignore = compress_archive(...)
or simply drop the return value.
If that is intentional, consider dropping[[nodiscard]]
or wrapping the return with a tiny proxy that suppresses the warning. Otherwise, plan to update all sites to explicitly acknowledge the discard to keep CI warnings clean.components/core/src/clp_s/ArchiveWriter.hpp (4)
11-13
: LGTM!The new includes are appropriate for the JSON handling and metadata constants required by the refactored code.
36-78
: Well-structured archive statistics representation.The
ArchiveStats
struct provides a clean encapsulation of archive metadata with proper initialization and JSON serialization. The comment explaining the avoidance of brace initialization for the metadata field is particularly helpful.
128-132
: Appropriate method signature update.The
close()
method signature properly reflects its new responsibility of returning archive statistics, with the[[nodiscard]]
attribute ensuring callers handle the returned value.
271-273
: Consistent return type updates for metadata propagation.Both methods now appropriately return JSON metadata, enabling the propagation of archive range index information through the call chain.
Also applies to: 281-283
components/core/src/clp_s/ArchiveWriter.cpp (3)
60-136
: Proper implementation of archive statistics collection and return.The
close()
method correctly collects comprehensive archive statistics from various sources and returns them in a structured format. The conditional printing of statistics maintains backward compatibility.
138-157
: Clean implementation with helpful documentation.The method properly collects and returns the archive range index. The comment on line 145 is particularly valuable for preventing potential JSON wrapping issues.
159-209
: Correct metadata propagation implementation.The method properly integrates JSON metadata collection during range index writing while maintaining error handling and existing functionality.
components/core/tests/test-clp_s-range_index.cpp (4)
21-23
: Appropriate test structure updates.The new includes and function declarations properly support the extended test coverage for archive statistics validation.
Also applies to: 44-50
141-166
: Well-structured validation of archive statistics.The function properly validates the JSON metadata structure from
ArchiveStats
objects and correctly delegates to the common validation logic.
127-139
: Excellent refactoring for code reuse.The extraction of validation logic to
check_archive_range_index
eliminates duplication and ensures consistent validation across different data sources.Also applies to: 168-202
220-231
: Comprehensive test coverage for archive statistics.The test properly captures and validates archive statistics from both the archive files and the returned
ArchiveStats
objects, ensuring consistency between the two sources.
* @return Statistics for every archive that was written without encountering an error. | ||
*/ | ||
void store(); | ||
[[nodiscard]] auto store() -> std::vector<ArchiveStats>; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Consider returning by const&
or std::move
to avoid an extra copy
store()
currently returns std::vector<ArchiveStats>
by value.
If callers usually just move the result, returning by value is fine (NRVO/move elision).
If not, returning const std::vector<ArchiveStats>&
or marking the return [[nodiscard]] auto&&
with an explicit std::move
inside could save an extra copy.
🤖 Prompt for AI Agents
In components/core/src/clp_s/JsonParser.hpp around lines 82 to 85, the store()
method returns a std::vector<ArchiveStats> by value, which may cause an
unnecessary copy. To optimize, either return a const reference to the vector if
the caller only needs to read it without taking ownership, or explicitly use
std::move when returning to enable move semantics and avoid copying. Adjust the
return type and implementation accordingly to prevent extra copies based on
typical usage patterns.
#include <vector> | ||
|
||
#include "../src/clp_s/ArchiveWriter.hpp" | ||
#include "../src/clp_s/InputConfig.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Header inclusion could be trimmed to speed up compilation
ArchiveWriter.hpp
is only needed for the clp_s::ArchiveStats
type used in the return-value. A forward declaration would avoid the heavy include and reduce rebuild times in the test-suite:
-#include "../src/clp_s/ArchiveWriter.hpp"
+namespace clp_s { struct ArchiveStats; } // fwd-decl
The full header is already brought in by the implementation file, so this change is safe.
🤖 Prompt for AI Agents
In components/core/tests/clp_s_test_utils.hpp around lines 5 to 8, replace the
#include of "../src/clp_s/ArchiveWriter.hpp" with a forward declaration of the
clp_s::ArchiveStats struct to reduce compilation dependencies and speed up
rebuilds. Keep the #include for "../src/clp_s/InputConfig.hpp" as is. This
change is safe because the full ArchiveWriter.hpp is included in the
implementation file.
m_archive_stats.emplace_back(m_archive_writer->close(true)); | ||
m_archive_options.id = m_generator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Capturing the result of close(true)
would preserve split-boundary stats
In split_archive()
you correctly append the stats from close(true)
, but if close(true)
itself fails (e.g., write error) the exception/return-code is silently ignored. Capture the result and validate, or propagate an exception to avoid silently continuing with corrupt state.
🤖 Prompt for AI Agents
In components/core/src/clp_s/JsonParser.cpp around lines 1156 to 1157, the
result of m_archive_writer->close(true) is added to m_archive_stats but any
failure or error from close(true) is ignored. Modify the code to capture the
return value of close(true), check if it indicates an error or failure, and if
so, either throw an exception or handle the error appropriately to prevent
continuing with corrupt state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/core/src/clp_s/RangeIndexWriter.hpp (1)
12-12
: Remove still-unusedystdlib/error_handling/ErrorCode.hpp
includeThe header itself never references any symbol from
ystdlib::error_handling
; it only uses the project-localErrorCode
from"ErrorCode.hpp"
. Re-introducing this include brings back the redundant dependency flagged in the previous review and risks a name-collision if both headers declare a type namedErrorCode
.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/RangeIndexWriter.hpp
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/src/clp_s/RangeIndexWriter.hpp
🧬 Code Graph Analysis (1)
components/core/src/clp_s/RangeIndexWriter.hpp (2)
components/core/src/clp_s/ArchiveWriter.hpp (4)
nodiscard
(54-68)nodiscard
(226-237)nodiscard
(243-249)writer
(257-257)components/core/src/clp_s/JsonParser.cpp (6)
nodiscard
(48-51)nodiscard
(53-58)nodiscard
(60-66)nodiscard
(68-71)nodiscard
(73-76)nodiscard
(80-80)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: lint-check (macos-latest)
- GitHub Check: lint-check (ubuntu-latest)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for improving archive statistics! Just have a few questions
epochtime_t end_timestamp, | ||
size_t uncompressed_size, | ||
size_t compressed_size, | ||
nlohmann::json metadata, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it's only for range index? Should we call it metadata
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, range_index
is probably a better name.
/** | ||
* Closes the archive writer | ||
* Closes the archive writer. | ||
* @param is_split Whether the last file ingested into the archive is split. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess every archive except the last one is a split? Why do we need this information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it could be helpful for error handling in some scenarios -- e.g. if some file gets split across two archives and compression fails while the second archive is being compressed we can use this flag to tell that the last file in the first archive hasn't been fully ingested and handle the error appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would looking at the printed ArchiveStats
help in this case? As I understand it, regardless of whether an error occurs or not, the last archive is never a split.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Say a user is ingesting files A and B and we produce two archives, where B is split between the first and the second archive. Consider the scenario where the second archive fails compression such that no archive stats get printed for the second archive.
In this scenario, without the is_split flag, the user can see that the first archive contains the files "A" and "B", but has no reliable way of determining that "B" is not fully ingested. However, if they do have the "is_split" flag then they know that they should expect at least one more archive containing the rest of "B".
Right now, since we still close/print archive stats for failed archives it's not as useful, but if we improve the error handling in JsonConstructor
I think it could be pretty useful. The c++ code-path doesn't have this issue since we're only forwarding archive stats for the succesful archives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a direct way to figure out which file is split across two archives if one archive contains multiple files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh the range index is part of the statistics
m_archive_writer->close(); | ||
auto JsonParser::store() -> std::vector<ArchiveStats> { | ||
m_archive_stats.emplace_back(m_archive_writer->close()); | ||
return std::move(m_archive_stats); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually use this return value except for the test case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not right now, but I figure it will be useful when we write c++ spider tasks to do ingestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp_s/ArchiveWriter.hpp
(4 hunks)components/core/tests/test-clp_s-range_index.cpp
(4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/src/clp_s/ArchiveWriter.hpp
components/core/tests/test-clp_s-range_index.cpp
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (1)
components/core/tests/test-clp_s-range_index.cpp (1)
141-166
: Potential narrowing when extracting indices from JSON
start_index
/end_index
are deserialised intosize_t
, which is platform-dependent (32-bit on some targets).
If the archive ever stores >4 Gi log-event IDs on such platforms, the cast will overflow silently.- auto start_index = entry->at(...).template get<size_t>(); - auto end_index = entry->at(...).template get<size_t>(); + auto start_index = entry->at(...).template get<uint64_t>(); + auto end_index = entry->at(...).template get<uint64_t>();Upstream structs (
RangeIndexEntry
) usesize_t
, so this would require propagating the wider type there as well; flagging for consideration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
components/core/src/clp_s/ArchiveWriter.hpp (1)
36-53
: Avoid the extra deep-copy ofrange_index
in the ctor
ArchiveStats
still takesnlohmann::json range_index
by value, causing an unnecessary copy before the move. Switching the parameter tonlohmann::json&&
(orconst nlohmann::json&
if a copy is genuinely required) removes the overhead and was already noted in a prior round of review.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/ArchiveWriter.hpp
(4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/src/clp_s/ArchiveWriter.hpp
🔇 Additional comments (1)
components/core/src/clp_s/ArchiveWriter.hpp (1)
128-133
: Verify all call-sites now consume the returnedArchiveStats
close()
no longer returnsvoid
, so ignoring the return value will silently discard useful metadata. Please grep the codebase to make sure every invocation assigns or forwards the result.#!/bin/bash # List places where ArchiveWriter::close is called and the result is ignored rg --line-number --pretty $'\.close\(' | grep -v '='
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments after skimming the PR.
}; | ||
if (m_print_archive_stats) { | ||
print_archive_stats(); | ||
std::cout << archive_stats.as_string() << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::cout << archive_stats.as_string() << std::endl; | |
std::cout << archive_stats.as_string() << '\n'; |
To resolve clang-tidy violation.
std::string const id; | ||
epochtime_t const begin_timestamp{}; | ||
epochtime_t const end_timestamp{}; | ||
size_t const uncompressed_size{}; | ||
size_t const compressed_size{}; | ||
nlohmann::json const range_index; | ||
bool const is_split{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the reason you're const qualifying these variables is so that they're not independently modified by a caller? clang-tidy understandably has concerns about const-qualified members, so I wonder if we should just convert ArchiveStats
to a full-fledged class so that these members are private? Is any caller actually using these members directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this was a rabbit suggestion. I agree it makes sense to just turn it into a full class at this point though.
#include <vector> | ||
|
||
#include <nlohmann/json.hpp> | ||
#include <ystdlib/error_handling/ErrorCode.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, will remove.
/** | ||
* Writes ranges to a `ZstdCompressor` then clears internal state. | ||
* | ||
* TODO: refactor this method to return the metadata as a `ystdlib::error_handling::Result`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODOs generally get forgotten so we prefer to open issues in the repo. That also functions as a good entry point for new contributors. So can we convert this to an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put it up as #1025.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need the TODO comment?
last_line_decoded = True | ||
else: | ||
stats = json.loads(line.decode("ascii")) | ||
stats = json.loads(line.decode("utf-8")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of interest, what motivated this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rabbit suggested just using utf-8 output instead of using the escape to ascii approach because it makes utf-8 content human readable, which I agree with. Since I'm dumping the range index into this metadata now we can expect it to contain some user-inserted kv-pairs, and that will probably be easier for users to spot-check if unicode content renders as unicode characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/core/src/clp_s/RangeIndexWriter.hpp (1)
84-84
: Out-parameter pattern remains from previous review concerns.This method still uses the out-parameter + error code pattern that was flagged in previous reviews as error-prone. The TODO comment acknowledges this technical debt and references issue #1025 for future refactoring to a
Result
type.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
components/core/src/clp_s/ArchiveWriter.cpp
(6 hunks)components/core/src/clp_s/ArchiveWriter.hpp
(4 hunks)components/core/src/clp_s/RangeIndexWriter.hpp
(1 hunks)components/core/tests/test-clp_s-range_index.cpp
(4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/src/clp_s/RangeIndexWriter.hpp
components/core/src/clp_s/ArchiveWriter.hpp
components/core/tests/test-clp_s-range_index.cpp
components/core/src/clp_s/ArchiveWriter.cpp
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (7)
components/core/src/clp_s/ArchiveWriter.hpp (3)
36-99
: Well-designed ArchiveStats class with proper encapsulation.The
ArchiveStats
class follows good design principles with private data members and public accessor methods. The const correctness and move semantics in the constructor are appropriate for the JSON handling.
154-154
: Breaking change to close() method is well-justified.The method signature change from
void
to returningArchiveStats
aligns perfectly with the PR objectives to expose archive metadata as structured objects. The optionalis_split
parameter provides useful context about archive content continuity.
292-295
: Method signature changes support structured metadata flow.The updates to return
nlohmann::json
instead ofvoid
enable the propagation of range index metadata through the archive writing pipeline, which is essential for the newArchiveStats
functionality.Also applies to: 303-305
components/core/src/clp_s/ArchiveWriter.cpp (2)
60-137
: Excellent implementation of the new ArchiveStats functionality.The
close()
method implementation successfully:
- Captures JSON metadata from write operations
- Constructs comprehensive
ArchiveStats
with all required fields- Maintains backward compatibility for archive statistics printing
- Provides structured return value for programmatic access
The error handling and cleanup logic remain intact while adding the new functionality.
139-158
: Consistent method updates support metadata propagation.The updates to
write_single_file_archive()
andwrite_archive_metadata()
properly capture and return the JSON range index metadata. The implementation correctly handles the flow fromRangeIndexWriter::write()
through to the finalArchiveStats
construction.Also applies to: 160-210
components/core/tests/test-clp_s-range_index.cpp (2)
44-50
: Excellent test refactoring with comprehensive coverage.The refactoring improves test organization by:
- Extracting common validation logic into
check_archive_range_index()
- Adding
check_archive_metadata_from_stats()
forArchiveStats
validation- Providing dual validation (direct archive reading + structured stats)
This ensures the metadata consistency between both access methods.
Also applies to: 141-167, 169-203
221-232
: Comprehensive test integration validates end-to-end functionality.The test correctly captures the
archive_stats
vector returned by the updatedcompress_archive()
utility and validates the metadata through both the traditional archive reading path and the new structured statistics path. This provides excellent coverage for the new archive metadata reporting mechanism.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with the update.
|
||
// Methods | ||
/** | ||
* Dumps the contents of `ArchiveStats` as a JSON object in a string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Seems redundant.
* Dumps the contents of `ArchiveStats` as a JSON object in a string. |
/** | ||
* Writes ranges to a `ZstdCompressor` then clears internal state. | ||
* | ||
* TODO: refactor this method to return the metadata as a `ystdlib::error_handling::Result`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need the TODO comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving changes I requested. Deferring to @wraymo's approval for everything else.
Description
This PR exposes the contents of the archive range index (including filename, etc.) in the printed archive statistics to enable the package and other users to index this kind of archive-level metadata. Additionally, we refactor how archive-level statistics are represented internally and forward this new representation to the write boundaries of the c++ code such that users of our write-side library can work with archive-level statistics without having to hack around with the printed statistics or read newly-written archives.
Besides including the archive-level meta in the archive-level statistics we also add an "is_split" flag which indicates whether the last file ingested into an archive has content continuing into another archive.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Chores