
docs: Note non-deterministic archive contents across runs #2220

@junhaoliao


Bug

CLP produces non-reproducible archives (different sizes and compression ratios) when compressing the same input dataset across multiple runs. There are two independent sources of non-determinism:

1. Non-deterministic file listing in the compression scheduler

The compression scheduler lists input files with Path.rglob("*"), whose iteration order is not guaranteed, so task-to-file assignments can differ across runs. The order depends on the filesystem: Path.rglob() is backed by os.scandir(), which makes no ordering guarantees. The non-determinism propagates through incremental file partitioning: files are buffered and partitioned into compression tasks as they arrive, so a different listing order puts different files into each task and archive, and clp-s then builds different dictionaries for those archives.

2. Random archive creator ID in clp-s

Even with a deterministic file sequence, clp-s generates a fresh random UUID for each archive via boost::uuids::random_generator (JsonParser.cpp:647). This archive_creator_id is compressed into the archive's range-index metadata for every file, so the encoded data differs between runs. The same issue exists in clp-text (glt/compression.cpp:101). This means clp-s/clp-text archives are never bit-for-bit reproducible even when the input file order is fixed.
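The effect of a per-run random ID on encoded bytes can be illustrated with a toy model. This is only a sketch: it uses Python's uuid and zlib modules and a made-up JSON record layout (encode_metadata and its fields are hypothetical), not the actual clp-s range-index format or compressor.

```python
import json
import uuid
import zlib


def encode_metadata(creator_id: str, file_names: list[str]) -> bytes:
    """Compress a toy metadata record that repeats the creator ID per file.

    Assumption: this layout is invented for illustration; it is not the
    clp-s range-index encoding.
    """
    records = [{"_archive_creator_id": creator_id, "file": name} for name in file_names]
    return zlib.compress(json.dumps(records, sort_keys=True).encode("utf-8"))


files = ["a.log", "b.log", "c.log"]

# Two "runs" over the same, identically ordered input, each with a fresh random ID.
run_1 = encode_metadata(str(uuid.uuid4()), files)
run_2 = encode_metadata(str(uuid.uuid4()), files)

# The same input with a pinned ID, for comparison.
fixed_id = "00000000-0000-0000-0000-000000000000"
pinned_1 = encode_metadata(fixed_id, files)
pinned_2 = encode_metadata(fixed_id, files)

print(run_1 == run_2)        # random IDs: the encoded bytes differ
print(pinned_1 == pinned_2)  # pinned ID: the encoded bytes are identical
```

In this toy model the only varying input is the ID, which is enough to change the compressed output between runs.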

This is not a code bug but a documentation gap: the current docs do not mention that archive contents may vary between runs on identical input, nor do they explain the root causes (non-deterministic file listing combined with incremental partitioning, and the random archive creator ID). Users who expect bit-for-bit reproducible archives across runs may be surprised by this behaviour.

What we expected: The documentation to note that job-level archive partitioning is non-deterministic by design (an acceptable trade-off in a distributed system), and that identical inputs may produce archives with slightly different sizes/ratios across runs. Apart from the random archive creator ID described above, clp-s itself is deterministic given the same file sequence.

CLP version

main @ 3e6ef670 (0.11.1-dev)

Environment

  • Host OS: Debian 12 (bookworm/sid)
  • Kernel: 6.8.0-106-generic
  • Docker: 28.3.3
  • CLP container base: Ubuntu 22.04 (Jammy)
  • Python (CLP container): 3.10.12
  • Filesystem: ext4 on Samsung SSD 970 EVO Plus 250 GB (NVMe)

Reproduction steps

  1. Prepare a directory with many files (e.g. 200 000 small log files).
  2. Run ./sbin/compress.sh /path/to/directory and record the output (archive size, compression ratio).
  3. Stop CLP, clear archives and database, then repeat step 2.
  4. Compare the archive sizes/ratios across runs — they will differ despite identical input.
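For step 4, a small helper can make the comparison less manual. The sketch below assumes each run's archive output has been copied to its own directory before clearing (the directory names in the usage comment are hypothetical); summarize_archives is not part of CLP.

```python
import hashlib
from pathlib import Path


def summarize_archives(archive_dir: Path) -> tuple[int, str]:
    """Return (total bytes, digest of all file contents) for one run's output."""
    total_bytes = 0
    file_digests = []
    for path in archive_dir.rglob("*"):
        if not path.is_file():
            continue
        # Reads each file fully into memory; fine for a sketch, not for huge archives.
        data = path.read_bytes()
        total_bytes += len(data)
        file_digests.append(hashlib.sha256(data).hexdigest())
    # Sort the per-file digests so the summary ignores file names and listing order.
    combined = hashlib.sha256("".join(sorted(file_digests)).encode("ascii")).hexdigest()
    return total_bytes, combined


# Hypothetical usage, with one saved copy of the archive directory per run:
#   print(summarize_archives(Path("run1-archives")))
#   print(summarize_archives(Path("run2-archives")))
```

Differing totals confirm the size discrepancy. Equal totals with differing digests would indicate that the contents still differ byte for byte.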

Root cause trace — non-deterministic file listing:

  • compression_scheduler.py:149: the loop for internal_path in path.rglob("*") yields files in arbitrary order.
  • partition.py:56-57: files are buffered and partitioned incrementally when the buffer reaches 2 × target_archive_size.
  • partition.py:180: group_files_by_similar_filenames() sorts by filename within the current buffer subset, but the subset composition varies with listing order.
  • compression.py:73: files.sort(key=lambda x: x.path.name) is a stable sort on the buffer subset, not the full input.
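The interaction between listing order and buffer-local sorting can be sketched with a toy model. This is not the code in partition.py: partition_incrementally, the flush threshold handling, and the task-cutting rule are simplified assumptions made for illustration.

```python
from typing import Iterable

File = tuple[str, int]  # (file name, size in bytes)


def partition_incrementally(listing: Iterable[File], target_size: int) -> list[list[str]]:
    """Toy model: buffer files as they arrive, flush once the buffer holds
    2 * target_size bytes, and cut each flushed buffer into name-sorted tasks."""
    tasks: list[list[str]] = []
    buffer: list[File] = []
    buffered_bytes = 0

    def flush() -> None:
        nonlocal buffer, buffered_bytes
        task: list[str] = []
        task_bytes = 0
        # Sorting happens only within the current buffer, not the whole input.
        for name, size in sorted(buffer):
            task.append(name)
            task_bytes += size
            if task_bytes >= target_size:
                tasks.append(task)
                task, task_bytes = [], 0
        if task:
            tasks.append(task)
        buffer, buffered_bytes = [], 0

    for entry in listing:
        buffer.append(entry)
        buffered_bytes += entry[1]
        if buffered_bytes >= 2 * target_size:
            flush()
    if buffer:
        flush()
    return tasks


files = [(f"app-{i}.log", 10) for i in range(8)]
order_a = files
order_b = files[::2] + files[1::2]  # same files, different listing order

tasks_a = partition_incrementally(order_a, target_size=20)
tasks_b = partition_incrementally(order_b, target_size=20)
print(tasks_a)  # first task holds app-0.log and app-1.log
print(tasks_b)  # first task holds app-0.log and app-2.log
```

Both runs cover the same eight files, but the tasks group them differently, because sorting inside a buffer cannot undo which files landed in that buffer.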

Root cause trace — random archive creator ID:

  • JsonParser.hpp:228: boost::uuids::random_generator m_generator is a member of the parser.
  • JsonParser.cpp:647: auto archive_creator_id = boost::uuids::to_string(m_generator()); generates a fresh UUID per ingest() call.
  • JsonParser.cpp:656-668: the UUID is passed to ingest_json() and ingest_kvir().
  • JsonParser.cpp:758-765: the UUID is compressed into the range-index metadata as _archive_creator_id for every file segment.
  • archive_constants.hpp:44: cArchiveCreatorId{"_archive_creator_id"} defines the metadata key.
  • glt/compression.cpp:101: the same pattern appears in clp-text, as archive_user_config.creator_id = uuid_generator();

Official Python docs confirming non-determinism:

  • Path.rglob(): "This is like calling Path.glob() with **/ added in front of the given relative pattern." It therefore inherits the arbitrary ordering of glob().
  • Path.glob(): "The ordering of the results is arbitrary."
  • os.scandir(): "The entries are yielded in arbitrary order."
  • os.listdir(): "The list is in arbitrary order."

Suggested documentation updates:

  1. In the compression user guide, add a note that archive contents are non-deterministic: identical inputs may produce different archive sizes and compression ratios across runs. This is due to two independent factors: (a) filesystem-dependent file listing order in the compression scheduler, and (b) a random archive creator UUID embedded in clp-s/clp-text archive metadata.
  2. In the architecture/developer docs, explain the partitioning pipeline (rglob → buffer → incremental partition → task dispatch) and where the non-determinism enters, as well as the creator ID generation in the archive writer.
  3. Optionally, note that a sorted() wrapper on Path.rglob("*") would make file listing deterministic at the cost of a full directory listing before any compression begins, which may not be desirable for very large directories or streamed inputs.
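The trade-off in item 3 can be seen in a minimal sketch of such a wrapper. The function name list_files_sorted is hypothetical and this is not a proposed patch to the scheduler, only an illustration of the sorted() approach.

```python
from pathlib import Path
from typing import Iterator


def list_files_sorted(root: Path) -> Iterator[Path]:
    """Yield every file under root in a stable, path-sorted order."""
    # sorted() must consume the whole rglob() iterator before yielding anything,
    # so nothing can be dispatched until the full tree has been walked.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            yield path
```

The listing becomes reproducible for a given directory tree, but the entire listing is held in memory and the first file is only available once the walk completes, which is the cost noted above for very large directories or streamed inputs.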

Labels: documentation (Improvements or additions to documentation)
