docs: Note non-deterministic archive contents across runs #2220

## Bug

CLP produces non-reproducible archives (different sizes and compression ratios) when compressing the same input dataset across multiple runs. There are two independent sources of non-determinism:

### 1. Non-deterministic file listing in the compression scheduler

The compression scheduler lists input files with `Path.rglob("*")`, which yields entries in a filesystem-dependent order (it is backed by `os.scandir()`, which makes no ordering guarantees). As a result, task-to-file assignments differ across runs. The non-determinism propagates through incremental file partitioning: files are buffered and partitioned into compression tasks as they arrive, so different listing orders produce different task contents and, in turn, different dictionaries at the `clp-s` compression level.

### 2. Random archive creator ID in clp-s

Even with a deterministic file sequence, `clp-s` generates a fresh random UUID for each archive via `boost::uuids::random_generator` (`JsonParser.cpp:647`). This `archive_creator_id` is compressed into the archive's range-index metadata for every file, so the encoded data differs between runs. The same issue exists in `clp-text` (`glt/compression.cpp:101`). This means `clp-s`/`clp-text` archives are never bit-for-bit reproducible even when the input file order is fixed.
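The effect of the creator ID can be illustrated with a toy model. The sketch below is not CLP's archive format: `build_archive` is a hypothetical helper, and JSON plus `zlib` stand in for the real serialization and compression. It only shows that embedding a fresh random UUID in otherwise identical input is enough to change the compressed bytes.

```python
import json
import uuid
import zlib


def build_archive(records, creator_id):
    """Toy archive: metadata plus records, serialized as JSON and compressed."""
    payload = {
        "metadata": {"_archive_creator_id": creator_id},
        "records": records,
    }
    # sort_keys keeps the serialization itself deterministic.
    return zlib.compress(json.dumps(payload, sort_keys=True).encode("utf-8"))


records = [{"level": "INFO", "msg": f"request {i} ok"} for i in range(100)]

# Two "runs" over identical records, each with a fresh random creator ID.
run_1 = build_archive(records, str(uuid.uuid4()))
run_2 = build_archive(records, str(uuid.uuid4()))

# Two "runs" with a fixed creator ID (a placeholder value for illustration).
fixed_id = "00000000-0000-0000-0000-000000000000"
fixed_1 = build_archive(records, fixed_id)
fixed_2 = build_archive(records, fixed_id)

print("random IDs identical:", run_1 == run_2)
print("fixed ID identical:", fixed_1 == fixed_2)
```

In this model the only varying input is the ID, which mirrors the claim above that a fixed file order alone does not give bit-for-bit reproducibility.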

**This is not a code bug but a documentation gap.** The current docs do not mention that archive contents may vary between runs on identical input, nor do they explain the root causes (non-deterministic file listing combined with incremental partitioning, and the random archive creator ID). Users who expect bit-for-bit reproducible archives across runs may be surprised by this behaviour.

**What we expected:** The documentation to note that job-level archive partitioning is non-deterministic by design (an acceptable trade-off in a distributed system), and that identical inputs may produce archives with slightly different sizes/ratios across runs, even though `clp-s` itself is deterministic given the same file sequence apart from the random archive creator ID.

## CLP version

`main` @ `3e6ef670` (0.11.1-dev)

## Environment

- **Host OS:** Debian 12 (bookworm/sid)
- **Kernel:** 6.8.0-106-generic
- **Docker:** 28.3.3
- **CLP container base:** Ubuntu 22.04 (Jammy)
- **Python (CLP container):** 3.10.12
- **Filesystem:** ext4 on Samsung SSD 970 EVO Plus 250 GB (NVMe)

## Reproduction steps

1. Prepare a directory with many files (e.g. 200 000 small log files).
2. Run `./sbin/compress.sh /path/to/directory` and record the output (archive size, compression ratio).
3. Stop CLP, clear archives and database, then repeat step 2.
4. Compare the archive sizes/ratios across runs — they will differ despite identical input.
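For step 4, a small helper can make the comparison less manual. The following is a minimal sketch, assuming the archives of each run have been copied to separate directories; `summarize_archives` is a hypothetical helper, not part of CLP. It ignores file names, since archive names may themselves differ between runs, and compares only sizes and content hashes.

```python
import hashlib
import tempfile
from pathlib import Path


def summarize_archives(archive_dir):
    """Return (total_bytes, combined_sha256) for all files under archive_dir."""
    file_digests = []
    total_bytes = 0
    for path in Path(archive_dir).rglob("*"):
        if path.is_file():
            data = path.read_bytes()  # fine for a sketch; read in chunks for large files
            total_bytes += len(data)
            file_digests.append(hashlib.sha256(data).hexdigest())
    # Sort the per-file digests so the summary ignores listing order and file names.
    combined = hashlib.sha256("".join(sorted(file_digests)).encode("ascii"))
    return total_bytes, combined.hexdigest()


# Self-contained demo: two throwaway directories stand in for two runs.
with tempfile.TemporaryDirectory() as run_1, tempfile.TemporaryDirectory() as run_2:
    (Path(run_1) / "archive-a").write_bytes(b"same bytes")
    (Path(run_2) / "archive-b").write_bytes(b"same bytes")
    same = summarize_archives(run_1) == summarize_archives(run_2)
    (Path(run_2) / "archive-c").write_bytes(b"extra")
    different = summarize_archives(run_1) != summarize_archives(run_2)

print(same, different)
```

Against real runs, differing `total_bytes` values would correspond to the size differences described above, while equal sizes with differing hashes would point at metadata-only differences such as the creator ID.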

**Root cause trace — non-deterministic file listing:**
- `compression_scheduler.py:149` — `for internal_path in path.rglob("*"):` yields files in arbitrary order.
- `partition.py:56-57` — files are buffered and partitioned incrementally when the buffer reaches `2 × target_archive_size`.
- `partition.py:180` — `group_files_by_similar_filenames()` sorts by filename *within the current buffer subset*, but the subset composition varies with listing order.
- `compression.py:73` — `files.sort(key=lambda x: x.path.name)` is a stable sort on the buffer subset, not the full input.
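The interaction between listing order and buffered partitioning can be sketched with a toy model. This is not the code in `partition.py`: `partition_incrementally` and the file sizes are invented for illustration, and only the flush threshold (`2 × target_archive_size`) and the per-buffer sort follow the trace above.

```python
def partition_incrementally(files, target_size):
    """Toy model of buffered partitioning; `files` yields (name, size) pairs."""
    buffer, buffered_size, tasks = [], 0, []
    for name, size in files:
        buffer.append((name, size))
        buffered_size += size
        # Flush once the buffer holds at least 2x the target archive size.
        if buffered_size >= 2 * target_size:
            buffer.sort(key=lambda f: f[0])  # sorts only this buffer subset
            tasks.append([n for n, _ in buffer])
            buffer, buffered_size = [], 0
    if buffer:  # flush whatever is left at the end of the listing
        buffer.sort(key=lambda f: f[0])
        tasks.append([n for n, _ in buffer])
    return tasks


files = [(f"app-{i:03d}.log", 10) for i in range(12)]
# A different but equally valid listing order: even indices first, then odd.
reordered = files[::2] + files[1::2]

tasks_a = partition_incrementally(files, target_size=20)
tasks_b = partition_incrementally(reordered, target_size=20)

print(tasks_a[0])
print(tasks_b[0])
```

Both runs cover the same twelve files, but the first task holds `app-000` to `app-003` in one case and `app-000`, `app-002`, `app-004`, `app-006` in the other. Sorting inside each buffer cannot repair this, because the buffer membership has already been decided by arrival order.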

**Root cause trace — random archive creator ID:**
- `JsonParser.hpp:228` — `boost::uuids::random_generator m_generator` is a member of the parser.
- `JsonParser.cpp:647` — `auto archive_creator_id = boost::uuids::to_string(m_generator());` generates a fresh UUID per `ingest()` call.
- `JsonParser.cpp:656-668` — the UUID is passed to `ingest_json()` and `ingest_kvir()`.
- `JsonParser.cpp:758-765` — the UUID is compressed into the range-index metadata as `_archive_creator_id` for every file segment.
- `archive_constants.hpp:44` — `cArchiveCreatorId{"_archive_creator_id"}` defines the metadata key.
- `glt/compression.cpp:101` — same pattern in clp-text: `archive_user_config.creator_id = uuid_generator();`.

**Official Python docs confirming non-determinism:**
- [`Path.rglob()`](https://docs.python.org/3/library/pathlib.html#pathlib.Path.rglob): *"This is like calling `Path.glob()` with `**/` added in front of the given relative pattern."* It therefore inherits the arbitrary ordering of `Path.glob()`.
- [`Path.glob()`](https://docs.python.org/3/library/pathlib.html#pathlib.Path.glob): *"The ordering of the results is arbitrary."*
- [`os.scandir()`](https://docs.python.org/3/library/os.html#os.scandir): *"The entries are yielded in arbitrary order."*
- [`os.listdir()`](https://docs.python.org/3/library/os.html#os.listdir): *"The list is in arbitrary order."*

**Suggested documentation updates:**
1. In the compression user guide, add a note that archive contents are non-deterministic: identical inputs may produce different archive sizes and compression ratios across runs. This is due to two independent factors: (a) filesystem-dependent file listing order in the compression scheduler, and (b) a random archive creator UUID embedded in clp-s/clp-text archive metadata.
2. In the architecture/developer docs, explain the partitioning pipeline (rglob → buffer → incremental partition → task dispatch) and where the non-determinism enters, as well as the creator ID generation in the archive writer.
3. Optionally, note that a `sorted()` wrapper on `Path.rglob("*")` would make file listing deterministic at the cost of a full directory listing before any compression begins, which may not be desirable for very large directories or streamed inputs.
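The `sorted()` option in item 3 could look roughly like the sketch below. `list_files_sorted` is a hypothetical helper, not an existing scheduler function, and the temporary directory is only there to make the example self-contained.

```python
import tempfile
from pathlib import Path


def list_files_sorted(root):
    """Return all regular files under root in a stable, path-sorted order."""
    # sorted() has to consume the whole generator first, so nothing can be
    # dispatched until the full directory tree has been listed.
    return sorted(p for p in Path(root).rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as root:
    for name in ("c.log", "a.log", "sub/b.log"):
        target = Path(root) / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("x")
    listing = [p.relative_to(root).as_posix() for p in list_files_sorted(root)]

print(listing)
```

This only addresses the file listing order. Archives would still differ between runs because of the random archive creator ID, and the up-front listing cost noted in item 3 still applies.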

