Bug
CLP produces non-reproducible archives (different sizes and compression ratios) when compressing the same input dataset across multiple runs. There are two independent sources of non-determinism:
1. Non-deterministic file listing in the compression scheduler
The compression scheduler's file listing via Path.rglob("*") is non-deterministic, leading to different task-to-file assignments across runs. This is caused by filesystem-dependent iteration order in pathlib.rglob() (backed by os.scandir(), which makes no ordering guarantees). The non-determinism propagates through incremental file partitioning: files are buffered and partitioned into compression tasks as they arrive, so different listing orders produce different dictionaries at the clp-s compression level.
2. Random archive creator ID in clp-s
Even with a deterministic file sequence, clp-s generates a fresh random UUID for each archive via boost::uuids::random_generator (JsonParser.cpp:647). This archive_creator_id is compressed into the archive's range-index metadata for every file, so the encoded data differs between runs. The same issue exists in clp-text (glt/compression.cpp:101). This means clp-s/clp-text archives are never bit-for-bit reproducible even when the input file order is fixed.
This is not a code bug but a documentation gap: the current docs do not mention that archive contents may vary between runs on identical input, nor do they explain the root causes (non-deterministic file listing combined with incremental partitioning, and the random archive creator ID). Users who expect bit-for-bit reproducible archives across runs may be surprised by this behaviour.
What we expected: The documentation to note that job-level archive partitioning is non-deterministic by design (an acceptable trade-off in a distributed system), and that identical inputs may produce archives with slightly different sizes/ratios across runs, even though clp-s itself is deterministic given the same file sequence, apart from the random archive creator ID.
CLP version
main @ 3e6ef670 (0.11.1-dev)
Environment
- Host OS: Debian 12 (bookworm/sid)
- Kernel: 6.8.0-106-generic
- Docker: 28.3.3
- CLP container base: Ubuntu 22.04 (Jammy)
- Python (CLP container): 3.10.12
- Filesystem: ext4 on Samsung SSD 970 EVO Plus 250 GB (NVMe)
Reproduction steps
1. Prepare a directory with many files (e.g. 200 000 small log files).
2. Run ./sbin/compress.sh /path/to/directory and record the output (archive size, compression ratio).
3. Stop CLP, clear archives and database, then repeat step 2.
4. Compare the archive sizes/ratios across runs. They will differ despite identical input.
Root cause trace — non-deterministic file listing:
compression_scheduler.py:149 — for internal_path in path.rglob("*"): yields files in arbitrary order.
partition.py:56-57 — files are buffered and partitioned incrementally when the buffer reaches 2 × target_archive_size.
partition.py:180 — group_files_by_similar_filenames() sorts by filename within the current buffer subset, but the subset composition varies with listing order.
compression.py:73 — files.sort(key=lambda x: x.path.name) is a stable sort on the buffer subset, not the full input.
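The effect of the trace above can be reproduced in isolation. The following is a minimal sketch, not the actual partition.py logic: `partition_incrementally` and the `(name, size)` tuples are hypothetical stand-ins, and the real scheduler additionally groups files by similar filenames. It only assumes what the trace states, namely that files are buffered in arrival order, flushed once the buffer reaches 2 × target_archive_size, and sorted by name within each flushed subset.

```python
from typing import Iterable, List, Tuple

# Hypothetical stand-in for a listed file: (file name, size in bytes).
FileEntry = Tuple[str, int]


def partition_incrementally(
    files: Iterable[FileEntry], target_archive_size: int
) -> List[List[str]]:
    """Buffer files in arrival order and flush a task whenever the
    buffered bytes reach 2 x target_archive_size (simplified model)."""
    tasks: List[List[str]] = []
    buffer: List[str] = []
    buffered_bytes = 0
    for name, size in files:
        buffer.append(name)
        buffered_bytes += size
        if buffered_bytes >= 2 * target_archive_size:
            # Sorting happens only within the current buffer subset.
            tasks.append(sorted(buffer))
            buffer, buffered_bytes = [], 0
    if buffer:
        tasks.append(sorted(buffer))
    return tasks


# Same eight files, two different listing orders.
files = [(f"app-{i:02d}.log", 10) for i in range(8)]
listing_a = files
listing_b = files[::2] + files[1::2]  # even-numbered files first, then odd

tasks_a = partition_incrementally(listing_a, target_archive_size=20)
tasks_b = partition_incrementally(listing_b, target_archive_size=20)

print(tasks_a)
print(tasks_b)
```

Both runs cover the same set of files, but the per-task composition differs, so per-task sorting cannot restore a canonical assignment.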
Root cause trace — random archive creator ID:
JsonParser.hpp:228 — boost::uuids::random_generator m_generator is a member of the parser.
JsonParser.cpp:647 — auto archive_creator_id = boost::uuids::to_string(m_generator()); generates a fresh UUID per ingest() call.
JsonParser.cpp:656-668 — the UUID is passed to ingest_json() and ingest_kvir().
JsonParser.cpp:758-765 — the UUID is compressed into the range-index metadata as _archive_creator_id for every file segment.
archive_constants.hpp:44 — cArchiveCreatorId{"_archive_creator_id"} defines the metadata key.
glt/compression.cpp:101 — same pattern in clp-text: archive_user_config.creator_id = uuid_generator();.
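To illustrate why a random creator ID alone prevents bit-for-bit reproducibility, here is a small Python sketch. It does not use the clp-s archive format: `encode_range_index`, the JSON layout, and the `file_name` key are hypothetical, and zlib stands in for whatever compression the archive writer applies. Only the `_archive_creator_id` key name comes from the trace above.

```python
import json
import uuid
import zlib
from typing import List


def encode_range_index(file_names: List[str], creator_id: str) -> bytes:
    """Compress a toy per-file metadata list that embeds the creator ID
    (hypothetical layout, not the real range index)."""
    entries = [
        {"_archive_creator_id": creator_id, "file_name": name}
        for name in file_names
    ]
    return zlib.compress(json.dumps(entries, sort_keys=True).encode("utf-8"))


names = ["a.log", "b.log"]

# Two "runs" over identical input, each with a fresh random UUID.
run_1 = encode_range_index(names, str(uuid.uuid4()))
run_2 = encode_range_index(names, str(uuid.uuid4()))
print(run_1 == run_2)

# With a fixed ID, the encoded bytes are identical across runs.
fixed_id = "00000000-0000-0000-0000-000000000000"
print(encode_range_index(names, fixed_id) == encode_range_index(names, fixed_id))
```

Because the ID is repeated for every file entry, it affects the encoded bytes regardless of how the input files are ordered.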
Official Python docs confirming non-determinism:
pathlib.rglob(): "This is like calling Path.glob() with **/ added in front of the given relative pattern." This inherits the arbitrary ordering from glob().
pathlib.glob(): "The ordering of the results is arbitrary."
os.scandir(): "The entries are yielded in an arbitrary order."
os.listdir(): "The list is in arbitrary order."
Suggested documentation updates:
- In the compression user guide, add a note that archive contents are non-deterministic: identical inputs may produce different archive sizes and compression ratios across runs. This is due to two independent factors: (a) filesystem-dependent file listing order in the compression scheduler, and (b) a random archive creator UUID embedded in clp-s/clp-text archive metadata.
- In the architecture/developer docs, explain the partitioning pipeline (rglob → buffer → incremental partition → task dispatch) and where the non-determinism enters, as well as the creator ID generation in the archive writer.
- Optionally, note that a sorted() wrapper on Path.rglob("*") would make file listing deterministic at the cost of a full directory listing before any compression begins, which may not be desirable for very large directories or streamed inputs.
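For reference, the sorted() wrapper could look roughly like the sketch below. `list_files_sorted` is a hypothetical helper, not an existing CLP function, and the `is_file()` filter is an assumption about what the scheduler wants to keep. Note that sorted() consumes the entire rglob() generator before the first path is yielded, which is the trade-off mentioned above.

```python
from pathlib import Path
from typing import Iterator


def list_files_sorted(root: Path) -> Iterator[Path]:
    """Yield regular files under root in a deterministic (sorted) order."""
    # sorted() materializes the full recursive listing first, so nothing
    # is yielded until the whole directory tree has been walked.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            yield path


if __name__ == "__main__":
    for file_path in list_files_sorted(Path(".")):
        print(file_path)
```

This only removes the listing-order source of non-determinism; the random archive creator ID would still make archives differ between runs.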