Skip to content

CLI and library to fingerprint directory contents for caching, sync, and integrity checks

License

Notifications You must be signed in to change notification settings

supermarsx/hash-folderoo

Repository files navigation

hash-folderoo-icon

hash-folderoo

Rust toolkit for hashing folders, comparing snapshots, and automating clean-up tasks.
This crate rebuilds the feature set as an idiomatic, multi-command CLI with strong typing, streaming IO, and cross-platform support. The long-term design and extra context live in spec.md.

Highlights

  • Fast hashing pipeline backed by Rayon workers, reusable buffer pools, and selectable memory modes (stream, balanced, booster).
  • Multiple commands in one binary: build hash maps, diff two states, copy only what changed, delete empty folders, batch-rename files, generate reports, and benchmark algorithms.
  • Built-in map formats (JSON + CSV) with deterministic ordering and metadata headers (root, timestamp, algorithm, etc.).
  • Config layering: system -> user -> project -> env vars -> CLI flags, with support for TOML/YAML/JSON configuration files and HASH_FOLDEROO_* overrides.
  • Batteries-included developer experience: cargo test runs unit + integration smoke tests (see tests/cli_smoke.rs).

Commands at a glance

Command Purpose Handy flags
hashmap Walk a directory, hash every file, and write a map in JSON or CSV. --path, --output, --format, --algorithm, --xof-length, --force-expand, --strip-prefix, --exclude, --threads, --mem-mode, --max-ram, --progress, --dry-run
compare Compare two maps or directories and classify files as identical, changed, moved, missing, or new. --source, --target, --format {json,csv}, --algorithm
copydiff Generate and optionally execute copy ops derived from a comparison. --source, --target, --plan <json>, --execute, --dry-run, --conflict {overwrite,skip,rename}, --preserve-times, --algorithm
removempty Delete now-empty directories (post-order) with glob exclusions. --path, --dry-run, --min-empty-depth, --exclude
renamer Apply a simple old->new replacement across filenames. --path, --pattern, --dry-run
report Summarize a hash map (stats, duplicates, largest files, etc.). --input, --format {json,text}, --include, --top-n
benchmark Benchmark supported algorithms over an in-memory buffer. --algorithm {blake3,shake256,all}, --size <bytes>

If you installed a prebuilt binary, invoke the binary directly for help and to run commands. If you're developing locally, using cargo run remains supported.

# show global help
hash-folderoo --help

# show per-command help (example)
hash-folderoo hashmap --help

Use --alg-list to print the currently compiled hashing algorithms (BLAKE3, BLAKE2b, BLAKE2bp, SHAKE256, TurboSHAKE256, ParallelHash256, XXH3-1024, WyHash-1024, KangarooTwelve).

Note about forcing expansion: algorithms that do not natively support XOF (e.g., BLAKE2b, BLAKE2bp) will reject requests for arbitrarily-long output unless you explicitly opt-in using --force-expand. When used, the tool performs a deterministic, non-standard expansion (chained hashing) to produce the requested number of bytes. This is intended for benchmarking and interoperability testing and is not a cryptographic XOF replacement.

xxh3-1024 and wyhash-1024 are non-cryptographic options that expand fast hashes into 1024-bit digests via deterministic counters, suitable for high-speed comparisons/benchmarks instead of integrity/security guarantees.

Installation

Prerequisites

  • Rust 1.75+ (edition 2021) and Cargo.
  • Windows, macOS, or Linux. The code avoids platform-specific assumptions.

Install prebuilt binaries (recommended)

We publish cross-platform release artifacts on GitHub Releases. Each automatic release uses the commit short SHA as the tag (for example v1a2b3c4) and artifacts are produced per target with names like hash-folderoo-<target>-<release>.tar.gz or .zip.

Examples (replace vSHORT with the release tag you want):

Linux x86_64

curl -Lo hash-folderoo-vSHORT.tar.gz \
  https://github.com/supermarsx/hash-folderoo/releases/download/vSHORT/hash-folderoo-x86_64-unknown-linux-gnu-vSHORT.tar.gz
tar -xzf hash-folderoo-vSHORT.tar.gz
sudo install -m 755 hash-folderoo /usr/local/bin/

macOS (Intel / Apple Silicon)

curl -Lo hash-folderoo-vSHORT.tar.gz \
  https://github.com/supermarsx/hash-folderoo/releases/download/vSHORT/hash-folderoo-x86_64-apple-darwin-vSHORT.tar.gz
tar -xzf hash-folderoo-vSHORT.tar.gz
sudo install -m 755 hash-folderoo /usr/local/bin/
# or to install into homebrew prefix on Apple Silicon:
sudo install -m 755 hash-folderoo /opt/homebrew/bin/

Windows (PowerShell)

$tag = 'vSHORT'
Invoke-WebRequest -Uri "https://github.com/supermarsx/hash-folderoo/releases/download/$tag/hash-folderoo-x86_64-pc-windows-msvc-$tag.zip" -OutFile zipfile.zip
Expand-Archive zipfile.zip -DestinationPath .\dist
# Move the executable somewhere in PATH, e.g. $HOME\bin
Move-Item -Path .\dist\hash-folderoo.exe -Destination $HOME\bin\hash-folderoo.exe

We recommend verifying checksums/signatures before installing. CI can be extended to generate checksums for each artifact and attach them to the release.

Homebrew / Brew (macOS & Linuxbrew)

If the project provides a Homebrew Tap (see packaging/homebrew) you can install directly with:

# tap repo and install (replace tap name if different)
brew tap supermarsx/homebrew-tap
brew install supermarsx/tap/hash-folderoo

Scoop (Windows)

If the project provides a Scoop bucket (see packaging/scoop) you can install with:

# add a bucket and install (example bucket URL)
scoop bucket add hash-folderoo https://github.com/supermarsx/scoop-bucket
scoop install hash-folderoo

Build from source (developer / optional)

If you prefer to build locally (development or debugging), you can still use Cargo:

# Build release binary at target/release/hash-folderoo
cargo build --release

# Install into your cargo bin directory
cargo install --path .

Quick start workflows

1. Hash a directory

hash-folderoo hashmap \
  --path ./sample-data \
  --output snapshots/sample.json \
  --format json \
  --algorithm blake3 \
  --xof-length 64 \
  --exclude "**/.git/**" \
  --progress \
  --mem-mode balanced
  • Paths recorded in the map are relative unless --strip-prefix is used.
  • Streams files through a bounded buffer pool so even booster mode stays bounded by --max-ram.

2. Compare two snapshots (or live folders)

hash-folderoo compare \
  --source snapshots/sample.json \
  --target ./backup-drive \
  --format json
  • Both --source and --target accept JSON/CSV maps or directories. When directories are given, the tool hashes them on the fly with the chosen algorithm.
  • Output as JSON (structured ComparisonReport) or CSV (flattened rows with status columns).

3. Copy only what changed

# Preview (default)
hash-folderoo copydiff --source ./src --target ./dst --algorithm blake3

# Execute plan with conflict handling
hash-folderoo copydiff \
  --source ./src \
  --target ./dst \
  --execute \
  --conflict rename \
  --preserve-times
  • Without --execute the plan is printed (dry-run). Add --execute to copy files.
  • --plan <file> lets you feed an existing JSON plan (matching the CopyPlan schema) instead of computing a diff.

4. Clean up empty directories

hash-folderoo removempty --path ./tmp --min-empty-depth 2 --dry-run

Glob exclusions are relative to the provided root (e.g., --exclude "**/node_modules/**").

5. Batch rename files

hash-folderoo renamer --path ./photos --pattern "-draft->" --dry-run

Patterns follow old->new; omitting -> means "replace with nothing".

6. Generate a report

hash-folderoo report \
  --input snapshots/sample.json \
  --format json \
  --include stats,duplicates,largest \
  --top-n 10

Reports compute totals, duplicate groups, wasted bytes, top extensions, and largest files. Text format prints a human summary; JSON is structured for automation.

7. Benchmark hashing throughput

hash-folderoo benchmark --algorithm all --size 134217728   # 128 MiB buffer

Use this to gauge algorithm speed on your hardware.

8. Discover algorithms at runtime

hash-folderoo --alg-list

Outputs default digest lengths, whether the algorithm is cryptographic, and XOF capabilities.

Configuration & environment

hash-folderoo merges configuration from several locations (lowest to highest precedence):

  1. /etc/hash-folderoo/{config.toml,config.yaml,config.json}
  2. ${XDG_CONFIG_HOME:-~/.config}/hash-folderoo/*
  3. ./config.{toml,yaml,json} in the current working directory
  4. HASH_FOLDEROO_CONFIG=/path/to/config.{toml,yaml,json}
  5. Environment variables (HASH_FOLDEROO_*)
  6. CLI flags

Each file can be TOML, YAML, or JSON. Unsupported keys are ignored.

Example config.toml

[general]
path = "D:/datasets"
output = "snapshots/datasets.json"
format = "json"
threads = 8
exclude = ["**/.git/**", "target/**"]
follow_symlinks = false
progress = true

[algorithm]
name = "shake256"
xof_length = 64

[memory]
mode = "stream"      # stream | balanced | booster
max_ram = 2147483648 # 2 GiB

Configuration keys

Section Keys Notes
[general] path (string), output (string), format (json or csv), threads (u32 > 0), strip_prefix (string), depth (u32 > 0), exclude (array of globs), follow_symlinks (bool), progress (bool), dry_run (bool) Matches CLI flags for hashmap; invalid formats or zero-valued counts are rejected during config validation.
[algorithm] name (string), xof_length (bytes > 0) name must map to a supported algorithm (blake3, blake2b, blake2bp, shake256, turboshake256, k12, …).
[memory] mode (stream, balanced, or booster), max_ram (bytes > 0) Controls the buffer-plan recommender; invalid modes result in a startup error.

Configs loaded from /etc, $XDG_CONFIG_HOME, the project directory, env overrides, and --config all go through the same validator so mistakes are caught early.

Supported environment variables

Variable Meaning
HASH_FOLDEROO_CONFIG Absolute path to a config file to merge last.
HASH_FOLDEROO_PATH, HASH_FOLDEROO_OUTPUT, HASH_FOLDEROO_FORMAT Override the corresponding general settings.
HASH_FOLDEROO_THREADS, HASH_FOLDEROO_DEPTH, HASH_FOLDEROO_STRIP_PREFIX Numeric overrides for CLI-style options.
HASH_FOLDEROO_EXCLUDE Comma-separated glob list (e.g., target/**,**/.git/**).
HASH_FOLDEROO_FOLLOW_SYMLINKS, HASH_FOLDEROO_PROGRESS, HASH_FOLDEROO_DRY_RUN Boolean toggles (true/false, 1/0, on/off).
HASH_FOLDEROO_ALG, HASH_FOLDEROO_XOF_LENGTH Select hashing backend and output length (bytes).
HASH_FOLDEROO_MEMORY_MODE, HASH_FOLDEROO_MAX_RAM Tune memory mode and total buffer budget (bytes).

Map and report formats

Hash maps written by the hashmap command follow this JSON shape:

{
  "version": 1,
  "generated_by": "hash-folderoo",
  "timestamp": "2025-11-23T19:19:42Z",
  "root": "/absolute/or/relative/root",
  "algorithm": {
    "name": "blake3",
    "params": { "xof_length": 64 }
  },
  "entries": [
    { "path": "foo/bar.txt", "hash": "<hex>", "size": 12345, "mtime": 1700000000 },
    { "path": "baz.bin", "hash": "<hex>", "size": 42 }
  ]
}

CSV output contains the same fields (path,hash,size,mtime) and is always sorted by path for deterministic diffs.

compare JSON output matches compare::ComparisonReport with arrays identical, changed, moved, missing, and new. CSV output flattens each row with a status column so it can be consumed by spreadsheets.

copydiff plans are serialized as:

{ "ops": [ { "src": "/src/file.txt", "dst": "/dst/file.txt", "op": "copy" } ] }

report JSON output bundles the requested sections (stats, duplicates, largest) and is safe to post-process.

Performance & memory modes

memory.rs encapsulates the heuristics used by the hashing pipeline:

  • stream - smallest buffers (~64 KiB), half of logical CPUs, minimal RAM footprint for slow disks or low-memory machines.
  • balanced (default) - moderates between throughput and memory: full logical CPUs, ~256 KiB buffers, glob prefetch disabled when RAM is tight.
  • booster - aggressive parallelism (up to 2x logical CPUs) with 1 MiB buffers and directory prefetching; ideal for SSDs and generous RAM. Specify --max-ram to keep it in check.

Use --threads and --max-ram to override the auto plan. The buffer pool enforces the byte budget so multiple commands can run concurrently without starving the system.

Development

  • Format/lint: cargo fmt and cargo clippy --all-targets.
  • Tests: cargo test covers units plus tests/cli_smoke.rs, which exercises the CLI end-to-end (hashing, comparing, copydiff, removempty, renamer, report, benchmark).
  • Logging is powered by env_logger; set RUST_LOG=debug for verbose traces while hacking.
  • See spec.md for the long-term blueprint (extra algorithms, GUI front-ends, richer copy planners, etc.).

CI & Releases ⚙️

This repository includes GitHub Actions workflows to ensure code quality and to publish cross-platform binaries:

  • Continuous Integration (CI) - .github/workflows/ci.yml runs on PRs and pushes to main and contains separate jobs for:

    • format checks (cargo fmt -- --check)
    • linting (cargo clippy --all-targets)
    • unit & integration tests (cargo test)
    • matrixed release-style builds that compile the binary for common targets (Linux, macOS, Windows) and both x86_64 and aarch64 where possible.
  • Automated release workflow - .github/workflows/release.yml triggers on pushes to main and automatically:

    • uses the commit short SHA as the release version (e.g. v1a2b3c4)
    • builds platform-specific binaries for: Linux (x86_64/aarch64), Windows (x86_64/aarch64), macOS (x86_64/aarch64)
    • uploads the compressed artifacts to a GitHub Release created for that commit

Notes:

  • GitHub runner capabilities determine whether every target can be built on the runners provided; cross-compilation is attempted via rust target tooling. If a specific target cannot be produced on a runner, the release workflow will fail for that matrix cell — this ensures each published artifact was actually built on a runner that produced it.
  • If you want to publish stable semver releases instead of short-sha releases, adapt .github/workflows/release.yml to trigger on tags (e.g., push tags) and set a semantic version.

Packaging helpers

This repo includes packaging templates to simplify adding distribution channels and CI automation for them:

  • packaging/scoop - a template Scoop manifest and a config/ README explaining how to wire an entry into a Scoop bucket (Windows user installer flow).
  • packaging/homebrew - a sample Homebrew formula and config/ notes for publishing a tap (macOS & Linuxbrew).

The release workflow generates SHA256 checksums (*.sha256) for each uploaded artifact and an aggregated SHA256SUMS file for the release so packaging manifests can be populated automatically with the correct checksum values. The workflow will also create a branch in this repository containing updated packaging manifests (Homebrew & Scoop) with concrete versions and checksums and open a pull request so you can review and merge them (or adapt the workflow to push directly to your taps/buckets).

Roadmap

The current binary ships with BLAKE3, BLAKE2b, BLAKE2bp, SHAKE256, TurboSHAKE256, ParallelHash256, XXH3-1024, WyHash-1024, and KangarooTwelve hashing backends plus the core CLI workflow. The design document (spec.md) covers upcoming work such as additional algorithms (MeowHash, etc.), richer booster-mode controls, persisted copy plans, and a GUI front-end. Contributions aligning with that plan are welcome - open an issue to discuss larger changes.

License

This project declares a dual license in Cargo.toml: MIT OR Apache-2.0. Choose the license that best fits your needs and consult the repository owner for any redistribution policies.

Releases

No releases published

Sponsor this project