Skip to content

Conversation

gibber9809
Copy link
Contributor

@gibber9809 gibber9809 commented Jun 13, 2025

Description

This PR exposes the contents of the archive range index (including filename, etc.) in the printed archive statistics to enable the package and other users to index this kind of archive-level metadata. Additionally, we refactor how archive-level statistics are represented internally and forward this new representation to the write boundaries of the c++ code such that users of our write-side library can work with archive-level statistics without having to hack around with the printed statistics or read newly-written archives.

Besides including the archive-level meta in the archive-level statistics we also add an "is_split" flag which indicates whether the last file ingested into an archive has content continuing into another archive.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Added test for forwarded archive stats
  • Manually confirmed that printed archive stats contain expected content

Summary by CodeRabbit

  • New Features

    • Archive statistics, including metadata and range index information, are now returned and available after archive creation.
    • Enhanced reporting of archive statistics in a structured JSON format.
  • Bug Fixes

    • Improved handling and validation of archive metadata and range index entries in tests.
  • Refactor

    • Unified archive statistics and metadata handling for consistency across the application.
    • Modularized and extended test logic for archive metadata validation.
  • Chores

    • Updated method signatures and internal logic to support new statistics and metadata features.
    • Made explicit discarding of unused return values to improve code clarity.
    • Adjusted subprocess output decoding to UTF-8 for improved compatibility.

Copy link
Contributor

coderabbitai bot commented Jun 13, 2025

## Walkthrough

The changes introduce structured archive statistics reporting by adding an `ArchiveStats` class, updating key methods in `ArchiveWriter` and related classes to return JSON metadata, and propagating this metadata through the archive writing and closing process. Tests and interfaces are updated to handle and verify these new return types.

## Changes

| File(s)                                                                                 | Change Summary                                                                                                    |
|-----------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| `components/core/src/clp_s/ArchiveWriter.cpp`, `ArchiveWriter.hpp`                      | `ArchiveWriter::close`, `write_single_file_archive`, and `write_archive_metadata` now return structured JSON and stats. Added `ArchiveStats` class. Removed `print_archive_stats`. |
| `components/core/src/clp_s/JsonParser.cpp`, `JsonParser.hpp`                           | `JsonParser::store` now returns `std::vector<ArchiveStats>`. Internal stats vector added. Calls to `close()` updated to handle return value. |
| `components/core/src/clp_s/RangeIndexWriter.cpp`, `RangeIndexWriter.hpp`               | `RangeIndexWriter::write` now takes and outputs JSON metadata via an additional parameter.                       |
| `components/core/src/clp_s/CMakeLists.txt`                                             | `nlohmann_json` linkage for `clp_s_archive_writer` changed from PRIVATE to PUBLIC.                               |
| `components/core/src/clp_s/clp-s.cpp`                                                  | Explicitly ignores the return value of `parser.store()`.                                                         |
| `components/core/tests/clp_s_test_utils.cpp`, `clp_s_test_utils.hpp`                    | `compress_archive` now returns `std::vector<ArchiveStats>`. Declaration and documentation updated.               |
| `components/core/tests/test-clp_s-end_to_end.cpp`, `test-clp_s-search.cpp`             | Explicitly discards the return value of `compress_archive` in tests.                                             |
| `components/core/tests/test-clp_s-range_index.cpp`                                     | Refactored and extended metadata validation: new helper functions, validation from both archive and stats.       |
| `components/job-orchestration/job_orchestration/executor/compress/compression_task.py`| Changed subprocess stdout decoding from ASCII to UTF-8 before JSON parsing.                                      |

## Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant JsonParser
    participant ArchiveWriter
    participant RangeIndexWriter

    User->>JsonParser: store()
    JsonParser->>ArchiveWriter: close()
    ArchiveWriter->>ArchiveWriter: write_single_file_archive() / write_archive_metadata()
    ArchiveWriter->>RangeIndexWriter: write(..., metadata)
    RangeIndexWriter-->>ArchiveWriter: JSON metadata
    ArchiveWriter-->>JsonParser: ArchiveStats (with JSON metadata)
    JsonParser-->>User: std::vector<ArchiveStats>

Possibly related PRs

  • y-scope/clp#847: Implements the write-side logic for the archive range index, including RangeIndexWriter and its integration into ArchiveWriter. The current PR builds directly on this functionality, modifying method signatures to propagate metadata.
  • y-scope/clp#819: Removes metadata database handling from ArchiveWriter and JsonParser and decouples archive metadata updates, complementing this PR’s refactoring of metadata reporting and statistics.
  • y-scope/clp#563: Introduces the initial single-file archive writing logic, including the methods now being updated to return JSON metadata and stats. The current PR extends and refines this work.

Suggested reviewers

  • kirkrodrigues

<!-- walkthrough_end -->


---

<details>
<summary>📜 Recent review details</summary>

**Configuration used: CodeRabbit UI**
**Review profile: ASSERTIVE**
**Plan: Pro**


<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between f8b5f740cfdd40c39dfaaefa33269082d46c952d and 6ccfa038672e4b610630ac28f70c56f7520edc82.

</details>

<details>
<summary>📒 Files selected for processing (2)</summary>

* `components/core/src/clp_s/ArchiveWriter.hpp` (4 hunks)
* `components/core/src/clp_s/RangeIndexWriter.hpp` (1 hunks)

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>📓 Path-based instructions (1)</summary>

<details>
<summary>`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == <expression>` rather than `!<expression>`.</summary>

> `**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == <expression>` rather than `!<expression>`.

- `components/core/src/clp_s/RangeIndexWriter.hpp`
- `components/core/src/clp_s/ArchiveWriter.hpp`

</details>

</details>

</details>

<details>
<summary>⏰ Context from checks skipped due to timeout of 90000ms (6)</summary>

* GitHub Check: centos-stream-9-dynamic-linked-bins
* GitHub Check: centos-stream-9-static-linked-bins
* GitHub Check: ubuntu-jammy-dynamic-linked-bins
* GitHub Check: ubuntu-jammy-static-linked-bins
* GitHub Check: ubuntu-jammy-lint
* GitHub Check: lint-check (ubuntu-latest)

</details>

<details>
<summary>🔇 Additional comments (1)</summary><blockquote>

<details>
<summary>components/core/src/clp_s/ArchiveWriter.hpp (1)</summary>

`149-154`: **LGTM!**

The updated signature properly returns archive statistics and the `[[nodiscard]]` attribute ensures callers handle the return value.

</details>

</blockquote></details>

</details>
<!-- internal state start -->


<!-- DwQgtGAEAqAWCWBnSTIEMB26CuAXA9mAOYCmGJATmriQCaQDG+Ats2bgFyQAOFk+AIwBWJBrngA3EsgEBPRvlqU0AgfFwA6NPEgQAfACgjoCEYDEZyAAUASpETZWaCrKNwSPbABsvkCiQBHbGlcSHFcLzpIACIAMxJqAAoGL24wRABKLgBRAA9ufEQPZwYEKT9MUhQMJVzINlw0Wmo0ap4KeAwaehKyj0RG8QH4BkQAbkgAMXwKAHdnHopSyRIwSKlfAeokcVGw/EYAakPIWY6aCkRoyDlIbCK+InhVSgBOAA4ABlfq3ApFbAMaSQMiwTBAthdZAESCwfCzdBLPprEgbeokRrNRooZBgmqRHo1EH5Qp0DSQADK3FE8FiIzQPlkABowrAPLZiQUitC2QouuxkPhYqzikiVhUMFVOrUWZ0UthaJ0iJA6ZEMGg2Ogif4vNQog0mi1ZVhcLzeJ1uojluUtkNdohyXBUAyvPCeR5uGgGABrNBVTD0fCmyh3B7Qg7Skh1XorFFog1Y1rMGYeEixeJiFZeWQaIwAQVoivE+HVjJZwesdn8sS9BEuIt+lFLfhIvGk7G2Jf4wpjUjjJE2gx2I2QAZVM3mFFoPNQ5AR/jbRS6nZNBwrZ3UHgE+GwNWc8GBQobAGFjgolI6EMhSpVU+qBJFkPdKILhevzqtEPAlJAvM8qC4+ynDM3qnOosBWsi6wDvYQ7DHsir+GI2YsrM4E7qE5B0EqQHMJg8DcN4ertBaUS2sOewzC2TSQHO2ZgBuuA0FgvbAp06DARQvr/Lu9B4Rg5AULmBjuHc3BYsUXiIAcTTTlqMRIAA+og3B/rg1yxLqyowhWrH9oO2zwQ6MBXiqWnVIqDB6sgsxssGfAVrqAwqvAkTVKQAxRBaMksWK5RMF0nTBMg3lakGbJ8KxLK8PgEjfjhsnqPAJYMnyNC5KEKgYZB4oBcxoR5UF6g5kYABqDLfiu44ObyN6SmxGDykoo6FlENADMgsRURWXVzAsUSsbBhn2lqfGYNgqUBXSFB4cWJpgqEFbmvyizWv0cEjXl2jzam+SiJaeXsJe7J2EwzCqQeNngWlHQCHgyVYEQ2DfgOnTAmOtD4MCGBBi2QTwP4kCfQwjgdnNYkSTIJCyCW9C2dQOLoF4/hNPIsn+IgRS0OSACSoSzkGA2jrRJZgAIKPejhdWkMJ+jGOAUBkIGPZ4IQpCCXq9BnZCnDtPwwj7SsMjyEwShUKo6haDodMmFATohaOLGs8QZDKAdLA81wVAIg4TiAbcovKBLmjaLoYCGPTpgGGdBTkFCAD0TD+PbiBLI7qTKfbeZ+SQADqH5CQw3DcBwBjROHBgWJAeY4yrHOWrreGAUe1PSG4vIAAbe2t/ubhQHAcCkpKJBkGfoqaiinGg154qQ9Awv4uDYBQLFYFnPsUoMiBl4IIhiKNJMIo03rvZAGfbvgvhKSpall56VANJQuMFZPkRiKOPtDXaI7VE1OE4wAIuW8BsFs52ICyX4AF7SCyY6tAAUhSADyAByEpSjUUaoQgpSI4IjQ3r0FiP8ZgIJwIhgzoxEgyklSREUqqGBrES49z4FAj8ilWKKQTC0FB5IABC4URRFHLnCOSsxKAeGTIqOkbUDiN2bixMeGBXSwH4hgAuQhpIYB7gLfuC4MbsBwrpTeVB6oWSjMdMe0DYGSngYgzBPsUGkMrj9ecGJGHukgE/N+EjoyxAuKcc4OEcHYjHIgxW9B/DJjipKBsg1EL7RmDmEyHh0GbkUWtbBGJDSNGUQ0MhVcnziU5kBL0QJuCZW0S/d+1ZKGNQ8DCVUvg0Kmg/h4SMdRExA2bsI3kYjP61CMUlOxnluADwYS3GcyAdHv17vtKRbY4o7kQNmewrZnDEQzstXAni+jKS7v4jEgT5jIGsbFOgd8iTqE6ruTMKU1LyFGY2IgVBLShQzkXIoyjbg9LyW47OfRO7UG7rBDodiBHtmXHNKRu5nz0DlF4BUHgjwZ2iBoDQ7tuAuz+AkZgSo+krHtseEsWwoQaFgMHaIZdlnjKkNjSAz8pBUB8OWWqtdgS7lpPIERa10l6NhAGP8djbiVIwDhdQ6AakxJVKA8uvjWiMRwrEOZc1LHtHwJ6IgxFKWmh4kQCCFYtken+ECTGLIyAqGJdpWqoK/iAnBmODChFQgpxYG2NkGAvzlEGuRIywlzCWDzF4C4K5wwNiUCkTpbLuychmJaKihEHwjBBIFcQacDBQAALLDMrl+Ig6om6A2wCE7oXAM4SHwN+aOPtc4XALsK5RgAkwjHmgVmMac4BwTa6bZE8p6IGUhdXAGRdB6AzUcruZd2KbPVSWAUjsUwuzdikbgntDkrDjUvIO3AM6esgD6iu9B/WBubh4ENElaDhsjdG9tUhO35w4DIr8ciYEKOQQMSdHApBiBmMAWdJBJiuRIDjDAXUy0BQGAAMhco+UukAU0ZzTTCfd86C5Lrgauo9gKpCJA3QXbddY90+0PZEE9Z6+RXpvdIEt5taKsPYZw7hVa242zrQ7J2JAm0MC+W22NAcNDdt7d631Q74ABuoKOiGnMp1RvoC+rNi6MFYNMWgRIIG/YB2vVg6BFBL64E3QB3d+72NgfwOeuV16LF3ofU+g49G85vqYz7bxmJcHsfnVx5TPG+MCacRQIDa0ROnrExB3Akmj2ZFLXB/AbDMAcI4FwksyGx6obtrgRADbnau2wy23Dma84EeDkR/tJGWw2LoDRmdeGFMcB6d+mBtpEAl1M85mt500Puc85h7zOGPPyYuIFntYcI6eqtq5+tGGsO5a9tFgrkKQ7FeiJHI1sd2ZqzIo4JO8gU4YsQPmWiJAERWsxmPfdxz3Mwurr8f4tBAR0NdVZbgDhdQ0ByuUFju8nmKjsYNQ+LIBAkCeK3egTMwgnxCBqJbLI5nqoxljAeNs7tkXgDfC+HFan4syVMnoNx8CTwSFgTSfoLL0nEHY2k9jN6oBnpLEyqBhvIBinFZq6AGBiuknwAJih2VJlC4+gtAxzlEGUTCB48AKo3xQO5tKApfgyWibownSphKiXbv5i4ZcEcAHJkCbJzSQIZg6gmMAxcAulZKcIRto2XBuGiW4JTbmNyt/M+64FQtdTA/BIkPVSnmgHPBOmLzQdPItMuIw1FBxkt8vInL43qp5YBR7EYw80G4WYBwsfThZO4mgsiiDyK/cgu9Y4fdIOUyxlBqFKH1EULSA89d6Fy6YRnFhNmEMOaQyr/aLZFxCLsbivon2v51E6J5GiR4Jd2OndjdOIrUTJXuG0y1up/B8Tx3FrBiXlGwpIOFsXLAGwI5Z2yEhqcQpdBm3Ns5CrR2rUL3qkaC57UK74rH2I8gR5o03kyuxnvzVkoZ+/FjaLKG3FBOCfZjYKA1iBNVNbHgd/KjHBtxuHRUQMgNc16OJq1YPXNRWZva1P/W1KMAoCgB1PgJ1P8BgV1cIS6IwKAAsJQLmJyXnJXE5GFa6C9eVHdXjVHdHKiPfb7VNAnP4JUQXQJSvJ/Q/M5HCS5JcIcLsI8XVDaEcXMb1NfePFRUjcjINEgcNfLSgbNYuO9VOfvMBKXaNRNO9GENnCtE5RgfnRIPNHEQtNSSAAAXhVAZCKFLnYP7U4P1FC2HQo38AENqyEMYw8WXX90/XgXXX43/T0wMz6CM3A2wPM1vU51F1pQH0kNoDLgP1wFkGpCAhT3gzs0QycxbCDRNBCJIH0K9UMLbyFxML4PMPZ0sJkWYx8UTDYyPQ03v0Um01gl01wJcJWDcJMw8Kg0yG8NvHELHmr0CKTzCHiLCNT1swEiiJ4RiMYTaOpH0JsF7wmRSLIQyL6FfVi3OV6U70GRLWwKs2aNzHDia1KzACMHK3Q0bRy18w82PC9TQBHgABkdgHRcAMpQ5Viv8Y4452sh1OtnButhQx9a8gZaRhRqE49gQKwM5GgKBSBeliVvRFI/xyZ9xpBOcNYAxOpupM49j4tiiA4ZdnBASbh5AbFL9wi087NFJHN7NOj2E8TM8wSAJ5AQEB8KxbAcZSo8xoBshfxOhfQqhVkdxykdJeQrAABVfBY4nGY8RkjAZkjwVkkNS8eHDFdETAHkBGbErojAYk6IgmBEKAkYNpYEqIXqFzD2AtbjZElkPCSmcHRaMRL8cQDYNGCQbQXUB8RJA4C9UGesI8U0VAf4wE8kV+A4cKEMMfICYEv0F5PgJQakL+RqS6fgNHZuNoF05ARAclYODEYSKOY1U1G1DkjwQAtZYAo8UA+1KIR1bAZ1GAoRd1PrPtT08gKRX0ksNpNAdMbPA4o4xJVEjEQU4UyAOKL8NQRZAeT6EmTKH/SKDAeQXM8A/MyAws6A2A4qIGUQFvM1NoaSZuW/Q2FYkrCADY62WtNzDzSrXYnU+2B+bhKwZwB4QrK4iOZM1rVWNZDrPWZ4kXW8Ms0SX0hcXUW/F0d4pCAqF0c1DOZgREnjc2aQzAtJDOP9DgMjH6QGLQgCvUvOYCpQvQuHZAL4ukKycGfASM+sdiZgbwcQVSD0d84EVJBAHaMeI8ksE8y4SgTnVA7g3neebZUub3Ji1dUBRSAGSPMeAYFMbikPF3eLPBVxdpfiXYEEes/uV0haBsA/K0p5F5a3NxEC7gxGUA6A9QNpKClMKxagCKVkTXRURAKyKcLyc6DSiIFxJApKBZFCBscCusAXO9T3HnewKC0wjJGuBo3wiQ5ooCKgjiBysowDdA9zPQDOH+aPSlNRdABMmoLRDGfC21PnEQs3MeOC5TRLMuMcMlEKTQSkE+VyZwOy8CotISu9GKtAOKuSCsRKk1ZKxNeVEgGQg4f8xErKm4NMFMFsTlMgS/QaHjD0r0+yX8fAJ4BgCVCgf4PgPEWgaVFkKiPKf4XwTSeER8+3REDJCfAEIEGvG4wchc9M2cq1LM0FEAkkMcwMCcos6c0shAkLVI9yvgqjMNJo2jbRY808yw3i/wZNVNdNSijAaih4AuH6pyqzCCwTfTUKxAcKtoNLW2CrHY5tA8wG4GrtILRrBAsrbcpGrzFG1tDzNGr6oSerC8prK8u428h4+821V4kSTOMGigyudiX44mmiigOi6uXEKbA7MgdauuHywKvyoI9o2XWIhKHixwrdZwmG8K8Ury8RaUS3GUxaXkT3QePoqpLeCiWEvgVESgLfPFU0BGZZRiZiMCCuPABbHcfkIneSSgaa8kayuaF0ZkDiOcEiK01bNgZgA7PgK0joKVNxDKrxDq509ooKpw8ouWsuMEKGfm2SebIVVA6KRKvCB8HFA4OELweuEfdaYaHeUiFuN2pMo1A6tMtcXkTMhcnMi6iAzwG6ks+AvtAdQJNIyjcdajN66NJm6TMeAAbQHp+iMpMtoAAF1x6cBSdHLktYNIbZaO4u44b2IEcKLPqObUstjMs9yCbPZ2azz6te1rj1jNjcbtj8afMDybBbwT1ah51zysbKa2tqb7BHjk4Xjes3jQ8y4NbXkb76o76ox51JtULkiwk0dWxMoWJCwbLmxlVrb54NQMRIEWMe43xI7CTIiM8SxL0M5yRfZNV0rFIClIS1LzpgiT9VLKrMYoLFYQQKH5APtnAqAs70rciWgy4DteoPAAqHBIHMYCGiG2rSHTktpS90k3s6zDFsCKAFVL9mGpq0B5B6C88n8fBJGqGNbua6HWRnQlG2H/yOHGh8GRKMZVUlLuDXLXRxrjQ95SlKByc/wr4qoYQfVMYAyTyfQB5H8Ozyd7KAAtDdEFc6O7GYCKsLbaJ8RqUXMu7/VM4Ao6mum1OusAhu1U4st1Fu4jR63gzu0NCLf6mEAB0gIB3IKY6BRIIJ/jEJxcDHa9HjGDMtbIKamYEFJQMuGTdNEp49YvCpj8Kp4J27aQep4pC4FkLB7onBjAa9FjJpyAFp6a9pkgZzNejOHpspkB+G7e3c5Gq+wm+2DZvp/DQjLG0+rc9LHcrLKrPYw52+45gLMmp+lrKm0JROJ4umr+hmtxaBX+0LVmzOI5++5ExQ7m4XLuy0GER5Z5eSRKV23wBB0IJBo3LgVoOJfwBJMJZhCIqZ/E3hVXWiZB+gIx1TExhWg3BeFBvgVAUiL+BPWESePOqhYxxlc4C2o6jOap2gWpsJzmqRFysZVo4I0I6xKJseRZtpxQFZ52hQKa6QW2bbZUYGUGa5B6BQVgdgYXJO+l5qBgW6RJXkT2klhlOeQ3Kloa/gEa30o6tewg0LZJ4AihQGPCC8IwQ1eJ3/M6pJucoAs61JvMq6xuqc5uj1HJ9up6/JidcNIekepAMeye6eg4SprlnlkZmYBpgOeZiVigZZ1Z+i9Z+54FvOLe8+nevZ6rIF4B/DI++9Qe4e2PYyhYBN2TMZgXFN4ZzGdN1tvAyZ+zfE2ZllrN1pnNqVvNsFgtwBh5jnbZ0t3Zy+itwtqtx5zGk+jcs+y5vG7LPe3c1IdIR+645+m8t59+h8+ml8yUtC8MoVT82QtioSPu5zX4x7EZsuFlRqcGM/fICypvONhYClanOShkYIUMSXCC7SmCils8vusYZDMvZmNy8ykWF0SXO9jQaD0x+WdAWgIQe4XAHmIGL6ZAH6AcwxRyMal1RaksP4SeMyNau6UIQ0n43kWlxaCMANbq2q1o+S4D9SkYdQaKQmQKN2t+4OO7KmdVI9PgScclSUUcbca2u5e7QDhSh0d1lMz1rVICAAn106zT/1y6/ga64NrJj1CshIs5tdi5xGi+zDdqHenUxSOzxSe6KSfdy8l5l+492mnrJ8t4t9+ZFDDt3Un2UBgWzU8XJPSXPy8WxhSWqOmW8ovYguWOqRfz8GGKzM0eV0KyXwQO8nW01NTK5XZ00yYVhIkSnL3wW9kmtD2eu9VAS9+bHRgNKnQV5bVjwrsO5XUvGgcvYUcDy/ZT4IckQ9UsOynqVlNV3Kzr/pDqhGCsJmKRcx21Cb99tVmxkYOxrbClDAFVCl5BwxWa+ar8vTEWFGOaaKU8pUYg7mygNlFkUViRuZWJtTiuxJqujMnT2u4UUc9Jycl1ENssqASYSbrsDu4NApydHurmILoSiCpnOxDwnTUGsgxHiTfbP7LwDHyeFkJLjgdjaAeI/ux9dNZ9zGOH6WhH5UJH0olH+2mnvNbHrHxQwmguAnoniG6WqG4APH2Omdjdmz+2OzndwmxzkIZz8QVz051dnGgXst52YX3LMXgYCX1yB0J5g9jzo9hOE9z53z752EBIMWKDbU0Xpzlz9XoLcFyHoCaFn8X44AKGuGtkJoEMMcX4wQ0mq3l343iksBCsJcpYDMgGU7qRB1pgyxtLtVhGu7eLULsQoW1ofwlo2IgYu0wKmNht+N8e190HrAAKpPhexLnU5Lpek5eW6OWB+FuytF1o3VjobXLsZZbVrT3kKPksVy5VnmFcS+akBgbFPJBGSlabhfHeLUg2wCMn+7ViOJ9T3T//aur7lJn7+u8coNgHkzoHqYPPtyvJiHqN6H9VupinzdKn0zS9ZHjgM/hnzHpn3Hkv/Ho9Qn6kYnzP0eptqeltqfk/un8TrVMzS/tf3R6/YceIA5nnj3Z4v9OewVXdLzzL5hUS2cvOdrZxCAi9lIyvXpBbwhQrt1ysvazvLxQEdQheIQMAHsTABMxHO+ARSIt27Tk0bi15eOHeS6x697cbxVOFhzkia5eO/fTKLQwDT4cSuvDLjkB0Ur2Vv+rEXPmty7CyEwO7HfwKBTIr2UbA2QAAIpckcYygxSK/GfjQAAAEjYGfi+xf6Xof4NGV5B2dGA1cZSru0QCkCgu5AjKFQDEBgAfoYAVahgXJaXsMKarMgA4AxgGU1aQg1Ptx08qchv28gd/qZVoD399w6+HbvlAHgFB8oTjWstXnE7nRJOVceXLJ3QDydQginKIEN2kBVkqOy1WjgiDHCO0qIR3Ogr3m2h3AYmDRO+FogsHDANGyKb4qrRJjEgIkNqZwAa3+CzAsAtwGPiMzj7O0q+OuMsPZWAAudiozvI3iGGWR28BogQ/YOUkEFQYWQf4EeG0lJwho0mDYZ8MlVkHQVpWr3BJl6w+7HV5yy/O1AZwLJN1N+RgMzmuTWKWcdm1zRXnZ1IE6kwApDTisXjc4U0tejAmmswJ86sCDeFgw2Hfg6FxCLkt4AlBtnYisQeaCdfPmmFrA6UB4UYZiMgSkQzAyMnQVKO3xQxsgfQiJCPCoT96cUKAd6HvOqDYD0sM4KMWgJghqCKRSgogEEjkVJasYaRHFAGMHmmTU5i6zYDbjAS1LyVKoYOGVLwyRGZJYCb+QVjRD9735aiwuPEU4PWRdB6cntNkKkBDBkiXMFI3kcpgBGZJf0XPZwnj0ra5BsgE+WQOJn/7XpLRxeJnj5TpHIVMOho6kHwBNH+jeozAZAEoEAQEgqUDwFJuRWvgkA+MzgaBidmmQW4xUxBDbLQlzqvhZKio4vMqMuhbcFQVMM0XrXsB9848MBTfOymDAAxOQ+0KICENDGtgmYOEaQZnFpEAxX2WkYSC7SmE18BsCIE0ZsjNFUiWWCCDil3iL6AY4Ba0cbLDXP5FFEsno9sfSNC4WhJ8u1ICDKIkhF4ikG2bUbWHC4D4WChdSiHwnczksTRecayPwGRSQ48Ui4jso41oRaJc8mLd3vESzGgE6xjuAcHJESAOUExgI2oBEwzgUDMkETEPCxgQQHhMxGcDILjzlRyN14O4qMHmOBDqjgwJCDMV7gHjBgsASgSINynahhFuRlIrBO6OAl34wx1pesRVCxAPRhIJ6BsHhFZohBLBRQKhpVxImw8JBwuBPjCEWzPUxwYNWSkniiCPoiuGBR8bgWdoGIfSQXB6BjzApsiOR7I0ieaK8QR5lxIo4lupJHH8ixxLAAZCckSBzETkLIbSdlUBiVd5sW44iAXnFDpi6UuAd3O8XrIYtQggfMVFwEcTIRySzk3kINDZFu8iQGE/JGJKHSsFRgwkJFMoFRSD5esMeWbC3hewGtmW/IjsnRKqiSi74hYBQMigDJ34Nsdk8GOqOPHbxRgxBCFmlLCBsSYQ8I+QNuDSReTDwnxFlugDkl8Ap+DEt1vtUuGadvWJ1b7vcL+5PC4CobbflIMxFMjCm/hEXDyP0kMplCmPL0cKM6aH8VJAYLkcOL5FLTBRRktaW0EwAlhZAyYe4IS1Piehb81aT4ZVm+EkCyBFEyRKc2B479tWkWLmDtPDyjjaRE4m0cX1Z4cA5a84sye5iXFCiVxR0n6MOTOlEdkGKkL0FtRcyzsvhqA4gQMF+GE1/ht4ICc9MxqvSppHA2adXnmlkSLROMq0ZONgEP97Rjov4M6PnFPTcg4Mg6ZDNRHQzTpLSC6fK0Rn898ByA9GZlh+GPSKZQI6XrgM3K3TG090jGWQKKC9BgR9A15jr286f19eokGEVYPPAP4psjXeltwM0oRDf2U4UScEJEErd4SPEkLiqB363BmuMnZULylaqnCUwCggFm4mUFqCNB2QLQToP0GGDjBerfAOSCsD+Bmkjed2tew0ayFxB1s5ZL5MsoG5MY82dZqoPUGaDtBeggwUYMtpwhragk3JHYioiRDFWrXLWlgBCHkt2Bn0b6L9AZCkcbc5HGAlRCqEzUiUOEejqCw6DPiAhps/oiEMRgsdex8gAoXtWTJvcrhFqJftmRX5pM1+GTW6i3TM4Wc8BGWDzEIEEBgAZg3IwnCuHtgbyBAikbeSPj+B7yowogVmBQAbShMRm18upg9Ec7VxvQGgbgLIDoGHswRb9VWWF2fLookR1UrRJaljxFzLGDgAQDFHRylFsoxKJGb8TkYKkW0kggLkLTzAUhjwOMHGEBC5LQBJgYAd4J1R4aQd/2VKQ/J4MMK4g1qFYRFoKQ8BhSPA3UsHoWQgUjNB5dtNsN0AE7JDhOLoeEMyioitByY+AV3nwFIZ68dRL4MwR4A+ydAVUs/CeQNOuHh89Os8gNoZ3X6ZNxpZZV4SvItiyxXU8HWTMrKiDcx2AWsNADrF14GwpW4sNQCbGljmwDAei7mOoE4rTgSG9eQbHQGMngEzYuihmJAHeDpgAArAAEYAAzCEqCWfAwlAAJloDvASAISz4LEAADsnwEJTEqCVoBXgMSmJWErCWfAGAMSgQK8EKUAA2FJa8CCVBLfFji/xTErKVhLaAoS2IAABYYlrwAQO8DKVxKwlrSqyCUs+CtKlAYS2IO8AECtKqlZSgQCEvaWvAyl9AGWPUs+AkAUlJAMpU0DCVoB2l7wBgK0pICZKUlDStHGEqyWvAUlsQPpe8A6VrKBAZSkgB4CWUQAAlJAIJWMrKVBK4laSnJSEtiBDK2laSgQLQAKUMBSl8S6pUMtoBArhlrS2pXoqiUkABAVkVJQIFGVoBqlrSgpb8veDxKUlRylJVsrCULK8lVShgEcsKVwr/FrwOsgwHTDJKglBKglZMviXZKvQBKyJWEoYBlLWlXStAGkrKWfB3gQq4UE8qgCvBQVISh5Z8AqW0AGAzQJoEkpiU8r9lSKsJUCteAFLplTQFZWjk+ACBKVzypVdSqCUTKflKSspaUtiAhK8VrweIN0uaBVLZlQS94H0rQCfAvQaAQJQaqgBjKBAbylJa0sKWxBCwgahgGEteC0AawaALEfkoaWlLrltAVpWUtBWfLFlDivRcmtpVuqwl3So5SQB5VlKkljS91UUsCVpKGAQSspaks+UrLZV1y2pXUueXOLek34AtGHIPAUJ2Rp2fQEAA -->

<!-- internal state end -->
<!-- finishing_touch_checkbox_start -->

<details open="true">
<summary>✨ Finishing Touches</summary>

- [ ] <!-- {"checkboxId": "7962f53c-55bc-4827-bfbf-6a18da830691"} --> 📝 Generate Docstrings

</details>

<!-- finishing_touch_checkbox_end -->
<!-- tips_start -->

---

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

<details>
<summary>❤️ Share</summary>

- [X](https://twitter.com/intent/tweet?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A&url=https%3A//coderabbit.ai)
- [Mastodon](https://mastodon.social/share?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A%20https%3A%2F%2Fcoderabbit.ai)
- [Reddit](https://www.reddit.com/submit?title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&text=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code.%20Check%20it%20out%3A%20https%3A//coderabbit.ai)
- [LinkedIn](https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Fcoderabbit.ai&mini=true&title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&summary=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code)

</details>

<details>
<summary>🪧 Tips</summary>

### Chat

There are 3 ways to chat with [CodeRabbit](https://coderabbit.ai?utm_source=oss&utm_medium=github&utm_campaign=y-scope/clp&utm_content=1005):

- Review comments: Directly reply to a review comment made by CodeRabbit. Example:
  - `I pushed a fix in commit <commit_id>, please review it.`
  - `Explain this complex logic.`
  - `Open a follow-up GitHub issue for this discussion.`
- Files and specific lines of code (under the "Files changed" tab): Tag `@coderabbitai` in a new review comment at the desired location with your query. Examples:
  - `@coderabbitai explain this code block.`
  -	`@coderabbitai modularize this function.`
- PR comments: Tag `@coderabbitai` in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
  - `@coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.`
  - `@coderabbitai read src/utils.ts and explain its main purpose.`
  - `@coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.`
  - `@coderabbitai help me debug CodeRabbit configuration file.`

### Support

Need help? Create a ticket on our [support page](https://www.coderabbit.ai/contact-us/support) for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

### CodeRabbit Commands (Invoked using PR comments)

- `@coderabbitai pause` to pause the reviews on a PR.
- `@coderabbitai resume` to resume the paused reviews.
- `@coderabbitai review` to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
- `@coderabbitai full review` to do a full review from scratch and review all the files again.
- `@coderabbitai summary` to regenerate the summary of the PR.
- `@coderabbitai generate docstrings` to [generate docstrings](https://docs.coderabbit.ai/finishing-touches/docstrings) for this PR.
- `@coderabbitai generate sequence diagram` to generate a sequence diagram of the changes in this PR.
- `@coderabbitai resolve` resolve all the CodeRabbit review comments.
- `@coderabbitai configuration` to show the current CodeRabbit configuration for the repository.
- `@coderabbitai help` to get help.

### Other keywords and placeholders

- Add `@coderabbitai ignore` anywhere in the PR description to prevent this PR from being reviewed.
- Add `@coderabbitai summary` to generate the high-level summary at a specific location in the PR description.
- Add `@coderabbitai` anywhere in the PR title to generate the title automatically.

### CodeRabbit Configuration File (`.coderabbit.yaml`)

- You can programmatically configure CodeRabbit by adding a `.coderabbit.yaml` file to the root of your repository.
- Please see the [configuration documentation](https://docs.coderabbit.ai/guides/configure-coderabbit) for more information.
- If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: `# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json`

### Documentation and Community

- Visit our [Documentation](https://docs.coderabbit.ai) for detailed information on how to use CodeRabbit.
- Join our [Discord Community](http://discord.gg/coderabbit) to get help, request features, and share feedback.
- Follow us on [X/Twitter](https://twitter.com/coderabbitai) for updates and announcements.

</details>

<!-- tips_end -->

@gibber9809 gibber9809 marked this pull request as ready for review June 13, 2025 16:53
@gibber9809 gibber9809 requested review from wraymo and a team as code owners June 13, 2025 16:53
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🔭 Outside diff range comments (3)
components/core/tests/clp_s_test_utils.cpp (1)

43-57: 🧹 Nitpick (assertive)

Minor simplification & early initialisation

archive_stats needn’t be default-constructed and later overwritten; we can initialise directly:

-std::vector<clp_s::ArchiveStats> archive_stats;
-...
-REQUIRE_NOTHROW(archive_stats = parser.store());
+auto archive_stats = parser.store();                // already noexcept

Reduces a move and keeps the variable in the smallest scope.

components/core/src/clp_s/RangeIndexWriter.cpp (1)

46-71: 🧹 Nitpick (assertive)

Double serialisation source – use the already-built metadata

After constructing ranges_array, the code assigns it to metadata but still serialises ranges_array. This duplicates one large JSON object. Serialising metadata avoids the extra copy and keeps the two representations in sync:

-    nlohmann::json ranges_array;
+    nlohmann::json ranges_array;        // unchanged

...
-    metadata = ranges_array;
+    metadata = std::move(ranges_array);

-    nlohmann::json::to_msgpack(ranges_array, serialized_range_index);
+    nlohmann::json::to_msgpack(metadata, serialized_range_index);

No functional change, but a small memory win and removes a potential future divergence bug.

components/core/src/clp_s/JsonParser.cpp (1)

490-503: 🛠️ Refactor suggestion

Repeated m_archive_writer->close() error-handling blocks are verbose and may mask lost statistics

The same 3-line pattern

std::ignore = m_archive_writer->close();
return false;

is duplicated ~16 times. Two practical issues:

  1. Readability / maintenance – fixing the shutdown path later requires touching many sites.
  2. Lost data – by discarding the returned ArchiveStats, any partial work gathered before the failure disappears, although the PR’s main feature is to surface that metadata.

Consider refactoring into a small helper that:

[[nodiscard]] bool JsonParser::abort_with_error() {
    m_archive_stats.emplace_back(m_archive_writer->close());
    return false;
}

and replace each block with return abort_with_error();

This keeps the stats vector consistent and removes boiler-plate.

Also applies to: 560-562, 581-583, 607-608, 625-627, 645-646, 657-658, 663-664, 954-955, 975-976, 1051-1054, 1070-1073, 1099-1102, 1113-1116, 1131-1134, 1142-1145

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 412f6dc and 8ff5131.

📒 Files selected for processing (13)
  • components/core/src/clp_s/ArchiveWriter.cpp (6 hunks)
  • components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
  • components/core/src/clp_s/CMakeLists.txt (1 hunks)
  • components/core/src/clp_s/JsonParser.cpp (16 hunks)
  • components/core/src/clp_s/JsonParser.hpp (2 hunks)
  • components/core/src/clp_s/RangeIndexWriter.cpp (2 hunks)
  • components/core/src/clp_s/RangeIndexWriter.hpp (2 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
  • components/core/tests/clp_s_test_utils.cpp (3 hunks)
  • components/core/tests/clp_s_test_utils.hpp (2 hunks)
  • components/core/tests/test-clp_s-end_to_end.cpp (2 hunks)
  • components/core/tests/test-clp_s-range_index.cpp (4 hunks)
  • components/core/tests/test-clp_s-search.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/tests/test-clp_s-search.cpp
  • components/core/src/clp_s/clp-s.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/tests/clp_s_test_utils.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/tests/clp_s_test_utils.hpp
  • components/core/src/clp_s/RangeIndexWriter.hpp
  • components/core/src/clp_s/RangeIndexWriter.cpp
  • components/core/src/clp_s/ArchiveWriter.hpp
  • components/core/src/clp_s/ArchiveWriter.cpp
  • components/core/tests/test-clp_s-range_index.cpp
🧠 Learnings (2)
components/core/tests/test-clp_s-search.cpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
components/core/tests/test-clp_s-end_to_end.cpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
🧬 Code Graph Analysis (3)
components/core/tests/test-clp_s-search.cpp (2)
components/core/tests/clp_s_test_utils.cpp (2)
  • compress_archive (13-57)
  • compress_archive (13-19)
components/core/tests/test-clp_s-end_to_end.cpp (3)
  • get_test_input_local_path (26-26)
  • get_test_input_local_path (34-38)
  • get_test_input_local_path (34-34)
components/core/tests/test-clp_s-end_to_end.cpp (3)
components/core/tests/clp_s_test_utils.hpp (1)
  • compress_archive (22-28)
components/core/tests/clp_s_test_utils.cpp (2)
  • compress_archive (13-57)
  • compress_archive (13-19)
components/core/tests/test-clp_s-search.cpp (3)
  • get_test_input_local_path (46-46)
  • get_test_input_local_path (64-68)
  • get_test_input_local_path (64-64)
components/core/src/clp_s/RangeIndexWriter.cpp (1)
components/core/src/clp_s/RangeIndexWriter.hpp (1)
  • writer (85-85)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
🔇 Additional comments (17)
components/core/src/clp_s/clp-s.cpp (1)

123-123: Explicit discard looks fine – double-check that callers don’t need the stats

std::ignore = parser.store(); makes the intent clear, but if any downstream tooling or metrics rely on the returned ArchiveStats, they’ll silently lose that information here. Confirm this is the only compression entry-point that drops the stats.

components/core/tests/test-clp_s-search.cpp (1)

230-238: Return value now ignored explicitly – OK

The test was updated consistently with the new signature and already includes <utility>, so everything compiles cleanly.

components/core/src/clp_s/CMakeLists.txt (1)

231-236: PUBLIC link of nlohmann_json may broaden transitive deps

Promoting nlohmann_json::nlohmann_json to PUBLIC will expose its include path and, on some build systems, its compile options to every downstream target of clp_s::archive_writer.
If that exposure is unintentional (e.g., only headers in archive_writer use nlohmann::json internally), consider keeping it PRIVATE and adding the include directory explicitly where required.

components/core/src/clp_s/JsonParser.hpp (1)

232-234: Member name is plural – ok, but verify lifetime semantics

m_archive_stats holds all stats accumulated across splits. Ensure:

  1. store() clears this vector after returning if the same JsonParser instance may be reused.
  2. Thread-safety is not a concern (vector is mutated in only one thread).
components/core/tests/test-clp_s-end_to_end.cpp (1)

106-114: Test updated consistently

The explicit std::ignore plus added <utility> header correctly handle the new return type; no further changes needed.

components/core/tests/clp_s_test_utils.hpp (1)

22-28: [[nodiscard]] may cause noise when tests intentionally ignore the value

Many callers use std::ignore = compress_archive(...) or simply drop the return value.
If that is intentional, consider dropping [[nodiscard]] or wrapping the return with a tiny proxy that suppresses the warning. Otherwise, plan to update all sites to explicitly acknowledge the discard to keep CI warnings clean.

components/core/src/clp_s/ArchiveWriter.hpp (4)

11-13: LGTM!

The new includes are appropriate for the JSON handling and metadata constants required by the refactored code.


36-78: Well-structured archive statistics representation.

The ArchiveStats struct provides a clean encapsulation of archive metadata with proper initialization and JSON serialization. The comment explaining the avoidance of brace initialization for the metadata field is particularly helpful.


128-132: Appropriate method signature update.

The close() method signature properly reflects its new responsibility of returning archive statistics, with the [[nodiscard]] attribute ensuring callers handle the returned value.


271-273: Consistent return type updates for metadata propagation.

Both methods now appropriately return JSON metadata, enabling the propagation of archive range index information through the call chain.

Also applies to: 281-283

components/core/src/clp_s/ArchiveWriter.cpp (3)

60-136: Proper implementation of archive statistics collection and return.

The close() method correctly collects comprehensive archive statistics from various sources and returns them in a structured format. The conditional printing of statistics maintains backward compatibility.


138-157: Clean implementation with helpful documentation.

The method properly collects and returns the archive range index. The comment on line 145 is particularly valuable for preventing potential JSON wrapping issues.


159-209: Correct metadata propagation implementation.

The method properly integrates JSON metadata collection during range index writing while maintaining error handling and existing functionality.

components/core/tests/test-clp_s-range_index.cpp (4)

21-23: Appropriate test structure updates.

The new includes and function declarations properly support the extended test coverage for archive statistics validation.

Also applies to: 44-50


141-166: Well-structured validation of archive statistics.

The function properly validates the JSON metadata structure from ArchiveStats objects and correctly delegates to the common validation logic.


127-139: Excellent refactoring for code reuse.

The extraction of validation logic to check_archive_range_index eliminates duplication and ensures consistent validation across different data sources.

Also applies to: 168-202


220-231: Comprehensive test coverage for archive statistics.

The test properly captures and validates archive statistics from both the archive files and the returned ArchiveStats objects, ensuring consistency between the two sources.

Comment on lines +82 to 85
* @return Statistics for every archive that was written without encountering an error.
*/
void store();
[[nodiscard]] auto store() -> std::vector<ArchiveStats>;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Consider returning by const& or std::move to avoid an extra copy

store() currently returns std::vector<ArchiveStats> by value.
If callers usually just move the result, returning by value is fine (NRVO/move elision).
If not, returning const std::vector<ArchiveStats>& or marking the return [[nodiscard]] auto&& with an explicit std::move inside could save an extra copy.

🤖 Prompt for AI Agents
In components/core/src/clp_s/JsonParser.hpp around lines 82 to 85, the store()
method returns a std::vector<ArchiveStats> by value, which may cause an
unnecessary copy. To optimize, either return a const reference to the vector if
the caller only needs to read it without taking ownership, or explicitly use
std::move when returning to enable move semantics and avoid copying. Adjust the
return type and implementation accordingly to prevent extra copies based on
typical usage patterns.

Comment on lines +5 to 8
#include <vector>

#include "../src/clp_s/ArchiveWriter.hpp"
#include "../src/clp_s/InputConfig.hpp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Header inclusion could be trimmed to speed up compilation

ArchiveWriter.hpp is only needed for the clp_s::ArchiveStats type used in the return-value. A forward declaration would avoid the heavy include and reduce rebuild times in the test-suite:

-#include "../src/clp_s/ArchiveWriter.hpp"
+namespace clp_s { struct ArchiveStats; }  // fwd-decl

The full header is already brought in by the implementation file, so this change is safe.

🤖 Prompt for AI Agents
In components/core/tests/clp_s_test_utils.hpp around lines 5 to 8, replace the
#include of "../src/clp_s/ArchiveWriter.hpp" with a forward declaration of the
clp_s::ArchiveStats struct to reduce compilation dependencies and speed up
rebuilds. Keep the #include for "../src/clp_s/InputConfig.hpp" as is. This
change is safe because the full ArchiveWriter.hpp is included in the
implementation file.

Comment on lines +1156 to 1157
m_archive_stats.emplace_back(m_archive_writer->close(true));
m_archive_options.id = m_generator();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Capturing the result of close(true) would preserve split-boundary stats

In split_archive() you correctly append the stats from close(true), but if close(true) itself fails (e.g., write error) the exception/return-code is silently ignored. Capture the result and validate, or propagate an exception to avoid silently continuing with corrupt state.

🤖 Prompt for AI Agents
In components/core/src/clp_s/JsonParser.cpp around lines 1156 to 1157, the
result of m_archive_writer->close(true) is added to m_archive_stats but any
failure or error from close(true) is ignored. Modify the code to capture the
return value of close(true), check if it indicates an error or failure, and if
so, either throw an exception or handle the error appropriately to prevent
continuing with corrupt state.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/core/src/clp_s/RangeIndexWriter.hpp (1)

12-12: Remove still-unused ystdlib/error_handling/ErrorCode.hpp include

The header itself never references any symbol from ystdlib::error_handling; it only uses the project-local ErrorCode from "ErrorCode.hpp". Re-introducing this include brings back the redundant dependency flagged in the previous review and risks a name-collision if both headers declare a type named ErrorCode.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ff5131 and 263d51f.

📒 Files selected for processing (1)
  • components/core/src/clp_s/RangeIndexWriter.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp_s/RangeIndexWriter.hpp
🧬 Code Graph Analysis (1)
components/core/src/clp_s/RangeIndexWriter.hpp (2)
components/core/src/clp_s/ArchiveWriter.hpp (4)
  • nodiscard (54-68)
  • nodiscard (226-237)
  • nodiscard (243-249)
  • writer (257-257)
components/core/src/clp_s/JsonParser.cpp (6)
  • nodiscard (48-51)
  • nodiscard (53-58)
  • nodiscard (60-66)
  • nodiscard (68-71)
  • nodiscard (73-76)
  • nodiscard (80-80)
⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: lint-check (ubuntu-latest)

Copy link
Member

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for improving archive statistics! Just have a few questions

epochtime_t end_timestamp,
size_t uncompressed_size,
size_t compressed_size,
nlohmann::json metadata,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems it's only for range index? Should we call it metadata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, range_index is probably a better name.

/**
* Closes the archive writer
* Closes the archive writer.
* @param is_split Whether the last file ingested into the archive is split.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess every archive except the last one is a split? Why do we need this information?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could be helpful for error handling in some scenarios -- e.g. if some file gets split across two archives and compression fails while the second archive is being compressed we can use this flag to tell that the last file in the first archive hasn't been fully ingested and handle the error appropriately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would looking at the printed ArchiveStats help in this case? As I understand it, regardless of whether an error occurs or not, the last archive is never a split.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say a user is ingesting files A and B and we produce two archives, where B is split between the first and the second archive. Consider the scenario where the second archive fails compression such that no archive stats get printed for the second archive.

In this scenario, without the is_split flag, the user can see that the first archive contains the files "A" and "B", but has no reliable way of determining that "B" is not fully ingested. However, if they do have the "is_split" flag then they know that they should expect at least one more archive containing the rest of "B".

Right now, since we still close/print archive stats for failed archives it's not as useful, but if we improve the error handling in JsonConstructor I think it could be pretty useful. The c++ code-path doesn't have this issue since we're only forwarding archive stats for the succesful archives.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a direct way to figure out which file is split across two archives if one archive contains multiple files

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh the range index is part of the statistics

m_archive_writer->close();
auto JsonParser::store() -> std::vector<ArchiveStats> {
m_archive_stats.emplace_back(m_archive_writer->close());
return std::move(m_archive_stats);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually use this return value except for the test case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not right now, but I figure it will be useful when we write c++ spider tasks to do ingestion.

@gibber9809 gibber9809 requested a review from wraymo June 17, 2025 19:12
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 263d51f and 20e7e6a.

📒 Files selected for processing (2)
  • components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
  • components/core/tests/test-clp_s-range_index.cpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp_s/ArchiveWriter.hpp
  • components/core/tests/test-clp_s-range_index.cpp
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (1)
components/core/tests/test-clp_s-range_index.cpp (1)

141-166: Potential narrowing when extracting indices from JSON

start_index / end_index are deserialised into size_t, which is platform-dependent (32-bit on some targets).
If the archive ever stores >4 Gi log-event IDs on such platforms, the cast will overflow silently.

-        auto start_index = entry->at(...).template get<size_t>();
-        auto end_index   = entry->at(...).template get<size_t>();
+        auto start_index = entry->at(...).template get<uint64_t>();
+        auto end_index   = entry->at(...).template get<uint64_t>();

Upstream structs (RangeIndexEntry) use size_t, so this would require propagating the wider type there as well; flagging for consideration.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
components/core/src/clp_s/ArchiveWriter.hpp (1)

36-53: Avoid the extra deep-copy of range_index in the ctor

ArchiveStats still takes nlohmann::json range_index by value, causing an unnecessary copy before the move. Switching the parameter to nlohmann::json&& (or const nlohmann::json& if a copy is genuinely required) removes the overhead and was already noted in a prior round of review.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20e7e6a and 8e5f865.

📒 Files selected for processing (1)
  • components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp_s/ArchiveWriter.hpp
🔇 Additional comments (1)
components/core/src/clp_s/ArchiveWriter.hpp (1)

128-133: Verify all call-sites now consume the returned ArchiveStats

close() no longer returns void, so ignoring the return value will silently discard useful metadata. Please grep the codebase to make sure every invocation assigns or forwards the result.

#!/bin/bash
# List places where ArchiveWriter::close is called and the result is ignored
rg --line-number --pretty $'\.close\(' | grep -v '='

wraymo
wraymo previously approved these changes Jun 18, 2025
@kirkrodrigues kirkrodrigues self-requested a review June 20, 2025 18:58
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments after skimming the PR.

};
if (m_print_archive_stats) {
print_archive_stats();
std::cout << archive_stats.as_string() << std::endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::cout << archive_stats.as_string() << std::endl;
std::cout << archive_stats.as_string() << '\n';

To resolve clang-tidy violation.

Comment on lines 70 to 76
std::string const id;
epochtime_t const begin_timestamp{};
epochtime_t const end_timestamp{};
size_t const uncompressed_size{};
size_t const compressed_size{};
nlohmann::json const range_index;
bool const is_split{};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the reason you're const qualifying these variables is so that they're not independently modified by a caller? clang-tidy understandably has concerns about const-qualified members, so I wonder if we should just convert ArchiveStats to a full-fledged class so that these members are private? Is any caller actually using these members directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was a rabbit suggestion. I agree it makes sense to just turn it into a full class at this point though.

#include <vector>

#include <nlohmann/json.hpp>
#include <ystdlib/error_handling/ErrorCode.hpp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, will remove.

/**
* Writes ranges to a `ZstdCompressor` then clears internal state.
*
* TODO: refactor this method to return the metadata as a `ystdlib::error_handling::Result`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODOs generally get forgotten so we prefer to open issues in the repo. That also functions as a good entry point for new contributors. So can we convert this to an issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put it up as #1025.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need the TODO comment?

last_line_decoded = True
else:
stats = json.loads(line.decode("ascii"))
stats = json.loads(line.decode("utf-8"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of interest, what motivated this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rabbit suggested just using utf-8 output instead of using the escape to ascii approach because it makes utf-8 content human readable, which I agree with. Since I'm dumping the range index into this metadata now we can expect it to contain some user-inserted kv-pairs, and that will probably be easier for users to spot-check if unicode content renders as unicode characters.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/core/src/clp_s/RangeIndexWriter.hpp (1)

84-84: Out-parameter pattern remains from previous review concerns.

This method still uses the out-parameter + error code pattern that was flagged in previous reviews as error-prone. The TODO comment acknowledges this technical debt and references issue #1025 for future refactoring to a Result type.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c91ee0 and 249a5b4.

📒 Files selected for processing (4)
  • components/core/src/clp_s/ArchiveWriter.cpp (6 hunks)
  • components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
  • components/core/src/clp_s/RangeIndexWriter.hpp (1 hunks)
  • components/core/tests/test-clp_s-range_index.cpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp_s/RangeIndexWriter.hpp
  • components/core/src/clp_s/ArchiveWriter.hpp
  • components/core/tests/test-clp_s-range_index.cpp
  • components/core/src/clp_s/ArchiveWriter.cpp
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (7)
components/core/src/clp_s/ArchiveWriter.hpp (3)

36-99: Well-designed ArchiveStats class with proper encapsulation.

The ArchiveStats class follows good design principles with private data members and public accessor methods. The const correctness and move semantics in the constructor are appropriate for the JSON handling.


154-154: Breaking change to close() method is well-justified.

The method signature change from void to returning ArchiveStats aligns perfectly with the PR objectives to expose archive metadata as structured objects. The optional is_split parameter provides useful context about archive content continuity.


292-295: Method signature changes support structured metadata flow.

The updates to return nlohmann::json instead of void enable the propagation of range index metadata through the archive writing pipeline, which is essential for the new ArchiveStats functionality.

Also applies to: 303-305

components/core/src/clp_s/ArchiveWriter.cpp (2)

60-137: Excellent implementation of the new ArchiveStats functionality.

The close() method implementation successfully:

  • Captures JSON metadata from write operations
  • Constructs comprehensive ArchiveStats with all required fields
  • Maintains backward compatibility for archive statistics printing
  • Provides structured return value for programmatic access

The error handling and cleanup logic remain intact while adding the new functionality.


139-158: Consistent method updates support metadata propagation.

The updates to write_single_file_archive() and write_archive_metadata() properly capture and return the JSON range index metadata. The implementation correctly handles the flow from RangeIndexWriter::write() through to the final ArchiveStats construction.

Also applies to: 160-210

components/core/tests/test-clp_s-range_index.cpp (2)

44-50: Excellent test refactoring with comprehensive coverage.

The refactoring improves test organization by:

  • Extracting common validation logic into check_archive_range_index()
  • Adding check_archive_metadata_from_stats() for ArchiveStats validation
  • Providing dual validation (direct archive reading + structured stats)

This ensures the metadata consistency between both access methods.

Also applies to: 141-167, 169-203


221-232: Comprehensive test integration validates end-to-end functionality.

The test correctly captures the archive_stats vector returned by the updated compress_archive() utility and validates the metadata through both the traditional archive reading path and the new structured statistics path. This provides excellent coverage for the new archive metadata reporting mechanism.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
wraymo
wraymo previously approved these changes Jun 23, 2025
Copy link
Member

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the update.


// Methods
/**
* Dumps the contents of `ArchiveStats` as a JSON object in a string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Seems redundant.

Suggested change
* Dumps the contents of `ArchiveStats` as a JSON object in a string.

/**
* Writes ranges to a `ZstdCompressor` then clears internal state.
*
* TODO: refactor this method to return the metadata as a `ystdlib::error_handling::Result`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need the TODO comment?

Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving changes I requested. Deferring to @wraymo's approval for everything else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants