avifenesh/scrump

scrump

Fast, format-aware secret scrubber for binary capture artifacts.

Born from a real incident: a perf profile tarball uploaded to a public GitHub issue leaked a GH_TOKEN, because the environment block of the captured process — including every API key in scope at runtime — sat inside the binary blob, and GitHub's native secret scanning couldn't see through the format.

scrump opens captures using the same on-disk specs the originating tools use, finds dangerous content with a 1,100+ rule pattern engine, zero-fills it in place, and returns the file in its original shape. The redacted perf.data still loads in perf report; the redacted nsys-rep still opens in Nsight Systems; the redacted SQLite still passes sqlite3 ".schema"; the redacted pcap still opens in Wireshark.

Formats

Format       Crate                      What it understands
----------------------------------------------------------------------------------------
passthrough  scrump-format-passthrough  Any file: single raw chunk fallback.
perf         scrump-format-perf         PERFILE2. Header + feature sections (HEADER_CMDLINE, etc.) + data section.
tar          scrump-format-tar          tar / tar.gz / tar.zst / zip. Each member is recursively format-dispatched.
sqlite       scrump-format-sqlite       SQLite format 3. Walks every user table, redacts TEXT/BLOB cells via UPDATE + VACUUM.
nsys         scrump-format-nsys         NVIDIA .nsys-rep / .ncu-rep. tar envelope + format-aware SQLite handling for the inner DB.
elf-core     scrump-format-core         64-bit LE ELF ET_CORE. Walks PT_NOTE for NT_PRPSINFO cmdline + PT_LOAD env pages.
hprof        scrump-format-hprof        Java HPROF. JAVA PROFILE header + record stream; tight UTF8 STRING chunks.
jfr          scrump-format-jfr          Java Flight Recorder. Walks chunks via FLR\0 magic + chunk_size; refuses to touch chunk headers.
pcap         scrump-format-pcap         tcpdump pcap + pcapng. Per-packet payload chunks (Authorization headers, query strings); framing untouched.
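
As a rough illustration of how a handler could be chosen for a file (or for each member during recursive tar dispatch), the sketch below sniffs leading magic bytes. The `Sniffed` enum and `sniff` function are hypothetical names for this example and are not scrump's API; the magic values come from the public format specs, and the classic pcap check covers only the microsecond-resolution magic.

```rust
/// Hypothetical format tags mirroring the table above (not scrump's real types).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Sniffed {
    Perf,
    Sqlite,
    ElfCore,
    Hprof,
    Jfr,
    Pcap,
    Archive,
    Passthrough,
}

/// Guess a handler from leading magic bytes; unknown input falls back to passthrough.
pub fn sniff(bytes: &[u8]) -> Sniffed {
    if bytes.starts_with(b"PERFILE2") {
        return Sniffed::Perf;
    }
    if bytes.starts_with(b"SQLite format 3\0") {
        return Sniffed::Sqlite;
    }
    if bytes.starts_with(b"JAVA PROFILE ") {
        return Sniffed::Hprof;
    }
    if bytes.starts_with(b"FLR\0") {
        return Sniffed::Jfr;
    }
    // 64-bit (EI_CLASS = 2), little-endian (EI_DATA = 1) ELF with e_type = ET_CORE (4).
    if bytes.len() >= 18
        && bytes.starts_with(b"\x7fELF")
        && bytes[4] == 2
        && bytes[5] == 1
        && u16::from_le_bytes([bytes[16], bytes[17]]) == 4
    {
        return Sniffed::ElfCore;
    }
    // Classic pcap magic in either byte order, or the pcapng section header block type.
    const PCAP_MAGICS: [[u8; 4]; 3] = [
        [0xd4, 0xc3, 0xb2, 0xa1],
        [0xa1, 0xb2, 0xc3, 0xd4],
        [0x0a, 0x0d, 0x0d, 0x0a],
    ];
    if PCAP_MAGICS.iter().any(|m| bytes.starts_with(m)) {
        return Sniffed::Pcap;
    }
    // gzip, zstd, zip local-file header, or "ustar" at offset 257 of a tar header.
    let is_tar = bytes.len() >= 262 && &bytes[257..262] == b"ustar";
    if bytes.starts_with(&[0x1f, 0x8b])
        || bytes.starts_with(&[0x28, 0xb5, 0x2f, 0xfd])
        || bytes.starts_with(b"PK\x03\x04")
        || is_tar
    {
        return Sniffed::Archive;
    }
    Sniffed::Passthrough
}
```

Falling through to `Passthrough` matches the table's "any file" fallback: an unrecognized blob is still scanned, just as one raw chunk.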

Install

# From source (Rust 1.75+)
cargo install --path crates/scrump-cli

# Or grab a pre-built binary from the latest release (see the Releases tab
# for the current version; tarballs are signed and shasum'd):
gh release download --repo avifenesh/scrump --pattern '*-x86_64-unknown-linux-gnu.tar.gz'

Supported targets out of the box:

Target                     Tier
------------------------------------------------------------------
x86_64-unknown-linux-gnu   tier-1 (cross-compiled in CI release)
aarch64-unknown-linux-gnu  tier-1 (cross-compiled in CI release)
aarch64-apple-darwin       tier-1 (cross-compiled in CI release)

Use

scrump scan some-file              # dry-run: report findings, never mutate
scrump scrub some-file             # redact in place (atomic tmp+rename)
scrump scrub some-file -o clean    # write clean copy elsewhere
scrump scrub some-file --backup    # also keep the original at *.orig
scrump scrub some-file --format perf    # force a specific format handler
scrump scrub some-file --rules-path my.yaml    # add custom rules
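
The "atomic tmp+rename" behaviour of in-place scrubbing can be pictured with the sketch below. It is a minimal std-only illustration, not scrump's implementation: the function name `replace_atomically` and the `.scrump-tmp` suffix are assumptions made for this example, and only the `.orig` backup suffix comes from the usage lines above.

```rust
use std::fs;
use std::io::Write;
use std::path::{Path, PathBuf};

/// Append a suffix to a path without touching its existing extension.
fn with_suffix(path: &Path, suffix: &str) -> PathBuf {
    let mut name = path.as_os_str().to_owned();
    name.push(suffix);
    PathBuf::from(name)
}

/// Write the redacted bytes next to the target, then rename over it.
/// A same-directory rename replaces the file in one step on POSIX systems,
/// so readers see either the old file or the new one, never a partial write.
pub fn replace_atomically(path: &Path, redacted: &[u8], keep_backup: bool) -> std::io::Result<()> {
    let tmp = with_suffix(path, ".scrump-tmp"); // hypothetical temp suffix

    let mut file = fs::File::create(&tmp)?;
    file.write_all(redacted)?;
    file.sync_all()?; // flush to disk before the rename makes it visible
    drop(file);

    if keep_backup {
        // Mirrors `--backup`: keep the unredacted original at *.orig.
        fs::copy(path, with_suffix(path, ".orig"))?;
    }
    fs::rename(&tmp, path)
}
```

Writing the temporary file in the same directory keeps the rename on one filesystem, which is what makes the swap atomic.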

How detection works

Two rule layers, plus hand-coded detectors:

  1. Curated default rules (crates/scrump-rules/rules/default.yaml) — tightly-scoped patterns for the ML/inference ecosystem: GitHub PATs, HuggingFace, OpenAI, Anthropic, AWS, Slack, NVIDIA NGC, W&B, Stripe.
  2. Auto-extracted TruffleHog mirror (rules/trufflehog.yaml, regenerated by cargo run -p scrump-trufflehog-compat --bin th-extract) — 1,100+ rules covering every detector under pkg/detectors/.
  3. Hand-coded detectors for things regex can't express alone: currently JwtHsAware, which base64-decodes the JWT header and rejects HMAC-signed tokens, mirroring TruffleHog's filtering.
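
To make the third layer concrete, here is a std-only sketch of an HS-aware JWT check. It is not the JwtHsAware source: the function names are hypothetical, the header is searched with a crude string match instead of a JSON parser, and "rejects" is read here as "drops the candidate finding", which is an assumption.

```rust
/// Decode base64url (RFC 4648 section 5), tolerating missing padding.
fn b64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() * 3 / 4);
    let (mut acc, mut bits) = (0u32, 0u32);
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            b'=' => break, // padding marks the end of the data
            _ => return None,
        };
        acc = (acc << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    Some(out)
}

/// Extract the "alg" value from a three-part JWT's header, if present.
fn jwt_alg(token: &str) -> Option<String> {
    let mut parts = token.split('.');
    let header = parts.next()?;
    if parts.count() != 2 {
        return None; // a compact JWT has exactly header.payload.signature
    }
    let json = String::from_utf8(b64url_decode(header)?).ok()?;
    let rest = &json[json.find("\"alg\"")? + 5..];
    let rest = rest.trim_start().strip_prefix(':')?.trim_start();
    let value = rest.strip_prefix('"')?;
    Some(value[..value.find('"')?].to_string())
}

/// Keep a JWT candidate only if its header parses and the algorithm is not HMAC (HS*).
pub fn keep_jwt_finding(token: &str) -> bool {
    match jwt_alg(token) {
        Some(alg) => !alg.starts_with("HS"),
        None => false,
    }
}
```

A regex can match the `eyJ...` shape of a JWT, but it cannot look inside the base64-encoded header, which is why this step sits outside the rule files.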

The engine supports capture_index for keyword-proximity patterns (e.g. W&B's bare 40-hex token near a wandb keyword) and post_filter for semantic constraints beyond regex.
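
The idea behind capture_index and post_filter can be sketched without a regex engine. The code below is a std-only stand-in under stated assumptions: `find_near_keyword`, its parameters, and the `not_all_same` filter are invented for this example, and the real engine presumably expresses the same thing as a regex with a capture group. What it shows is that the keyword only anchors the match, while the reported span (the part that would be zero-filled) is the token alone.

```rust
use std::ops::Range;

/// Example post-filter: reject tokens made of one repeated character.
pub fn not_all_same(token: &[u8]) -> bool {
    token.iter().any(|&b| b != token[0])
}

/// Report hex runs of exactly `token_len` bytes that start within `window`
/// bytes after a case-insensitive `keyword`. Only the token range is
/// returned, which is the role capture_index plays for a regex rule.
pub fn find_near_keyword(
    hay: &[u8],
    keyword: &[u8],
    window: usize,
    token_len: usize,
    post_filter: fn(&[u8]) -> bool,
) -> Vec<Range<usize>> {
    let mut hits = Vec::new();
    if keyword.is_empty() || hay.len() < keyword.len() {
        return hits;
    }
    for start in 0..=hay.len() - keyword.len() {
        if !hay[start..start + keyword.len()].eq_ignore_ascii_case(keyword) {
            continue;
        }
        let from = start + keyword.len();
        let to = (from + window).min(hay.len());
        let mut i = from;
        while i < to {
            if !hay[i].is_ascii_hexdigit() {
                i += 1;
                continue;
            }
            // Consume the whole hex run, even if it extends past the window.
            let run_start = i;
            while i < hay.len() && hay[i].is_ascii_hexdigit() {
                i += 1;
            }
            let range = run_start..i;
            if range.len() == token_len
                && post_filter(&hay[range.clone()])
                && !hits.contains(&range)
            {
                hits.push(range);
            }
        }
    }
    hits
}
```

Without the keyword requirement, a bare 40-hex pattern would also fire on every git commit SHA in a capture.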

Format support — how it stays structure-preserving

Each format crate implements:

pub trait Format: Send {
    fn name(&self) -> &'static str;
    fn chunks<'a>(&'a self) -> Box<dyn Iterator<Item = Chunk<'a>> + 'a>;
    fn apply(&mut self, hits: &[Hit]) -> Result<()>;
    fn to_bytes(&self) -> Result<Vec<u8>>;
}

The format decides which byte ranges are scannable (cmdline strings, TEXT cells, packet payloads, chunk bodies) and which are structural (magic words, length prefixes, varints, checksums). apply refuses to redact structural bytes. The result: scrump can never produce a file its own format parser couldn't parse.
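
To illustrate the scannable/structural split, here is a toy implementation of the trait for a made-up format of `[u32 LE length][payload]` records. Everything except the trait's method signatures is an assumption for this sketch: the `Chunk` and `Hit` field layouts, the `String` error type behind `Result`, and the `LenPrefixed` format itself are not taken from scrump-core.

```rust
use std::ops::Range;

// Stand-ins for scrump-core types; the real definitions may differ.
pub type Result<T, E = String> = std::result::Result<T, E>;
pub struct Chunk<'a> {
    pub offset: usize, // absolute offset of `bytes` in the file
    pub bytes: &'a [u8],
}
pub struct Hit {
    pub range: Range<usize>, // absolute byte range to redact
}

pub trait Format: Send {
    fn name(&self) -> &'static str;
    fn chunks<'a>(&'a self) -> Box<dyn Iterator<Item = Chunk<'a>> + 'a>;
    fn apply(&mut self, hits: &[Hit]) -> Result<()>;
    fn to_bytes(&self) -> Result<Vec<u8>>;
}

/// Toy format: payloads are scannable, length prefixes are structural.
pub struct LenPrefixed {
    data: Vec<u8>,
    payloads: Vec<Range<usize>>,
}

impl LenPrefixed {
    pub fn parse(data: Vec<u8>) -> Result<Self> {
        let mut payloads = Vec::new();
        let mut pos = 0;
        while pos < data.len() {
            let hdr = data.get(pos..pos + 4).ok_or("truncated length prefix")?;
            let len = u32::from_le_bytes([hdr[0], hdr[1], hdr[2], hdr[3]]) as usize;
            let body = pos + 4..pos + 4 + len;
            if body.end > data.len() {
                return Err("truncated payload".to_string());
            }
            pos = body.end;
            payloads.push(body);
        }
        Ok(Self { data, payloads })
    }
}

impl Format for LenPrefixed {
    fn name(&self) -> &'static str {
        "len-prefixed"
    }

    fn chunks<'a>(&'a self) -> Box<dyn Iterator<Item = Chunk<'a>> + 'a> {
        Box::new(self.payloads.iter().map(move |r| Chunk {
            offset: r.start,
            bytes: &self.data[r.clone()],
        }))
    }

    fn apply(&mut self, hits: &[Hit]) -> Result<()> {
        // Validate every hit first so a bad one leaves the data untouched.
        for hit in hits {
            let r = &hit.range;
            let inside = r.start <= r.end
                && self.payloads.iter().any(|p| p.start <= r.start && r.end <= p.end);
            if !inside {
                return Err(format!("hit {:?} touches structural bytes", r));
            }
        }
        for hit in hits {
            self.data[hit.range.clone()].fill(0); // same length, zeroed content
        }
        Ok(())
    }

    fn to_bytes(&self) -> Result<Vec<u8>> {
        Ok(self.data.clone())
    }
}
```

Because `apply` only ever overwrites bytes inside a payload range, the length prefixes that `parse` depends on are unchanged, so the output re-parses to the same record layout.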

Compatibility evidence

scrump is validated against two third-party test corpora. The harnesses live under crates/scrump-{trufflehog,presidio}-compat/.

TruffleHog detectors

cargo run -p scrump-trufflehog-compat --bin trufflehog-compat walks every *_test.go under TruffleHog's pkg/detectors/, parses each parametrized test, and runs scrump against the test input.

Last full run: 2,335 of 2,536 cases pass across 864 providers (92.1%). The remaining 201 are all negative-case false positives, where provider A's no-hit-expected input still trips provider B's auto-extracted PrefixRegex (e.g. a sugester test input fires the tableau rule). In a scrubbing context this is over-detection: no case that TruffleHog's tests expect to hit is missed by scrump. CI gates on SCRUMP_TH_MAX_FAILURES=201; any increase fails the build, and the number may only be lowered alongside the rule fixes that justify it.
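
The failure ceiling works as a ratchet, which can be sketched as follows. This is an assumption-laden illustration and not the harness code: `check_ceiling` and `pass_permille` are hypothetical names, and treating an unset variable as an error is a choice made for this example.

```rust
/// Pass rate in tenths of a percent, rounded to nearest (921 means 92.1%).
pub fn pass_permille(passed: u64, total: u64) -> u64 {
    if total == 0 {
        return 0;
    }
    (passed * 1000 + total / 2) / total
}

/// Fail when the observed failure count exceeds the configured ceiling.
/// `ceiling_env` is the raw value of SCRUMP_TH_MAX_FAILURES, if set.
pub fn check_ceiling(failures: u64, ceiling_env: Option<&str>) -> Result<(), String> {
    let ceiling: u64 = ceiling_env
        .ok_or("SCRUMP_TH_MAX_FAILURES is not set")?
        .trim()
        .parse()
        .map_err(|e| format!("bad SCRUMP_TH_MAX_FAILURES: {e}"))?;
    if failures > ceiling {
        Err(format!("{failures} failures exceeds ceiling of {ceiling}"))
    } else {
        Ok(())
    }
}
```

A caller would pass `std::env::var("SCRUMP_TH_MAX_FAILURES").ok().as_deref()` as the second argument. With the figures above, 2,536 - 2,335 = 201 failures sits exactly at the ceiling, so one new regression is enough to fail the build.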

Microsoft Presidio (PII) cross-format

cargo run -p scrump-presidio-compat --bin presidio-compat takes every Presidio recognizer test (52 recognizers, 671 cases), embeds each case's test text into each of the 8 formats listed in the results table below (7 binary formats + passthrough), and runs the detector against the embedded blob.

Last full run: 617 of 671 cases (92.0%) pass, with an identical pass count across all 8 formats, which indicates that the format wrapper is transparent to detection. All 54 remaining failures come from Presidio patterns that rely on lookbehind or backreferences, which Rust's regex crate does not support (IP recognizer, MAC, Canadian SIN with backref-bound separator).

FORMAT             PASS     FAIL     SKIP    PASS%
--------------------------------------------------------
passthrough         617       54        0    92.0%
tar                 617       54        0    92.0%
perf                617       54        0    92.0%
sqlite              617       54        0    92.0%
elf-core            617       54        0    92.0%
hprof               617       54        0    92.0%
jfr                 617       54        0    92.0%
pcap                617       54        0    92.0%

Local dev loop

If you have just installed:

just check                  # fmt + clippy + tests + docs
just e2e                    # all 8 phase-gate scripts
just compat-trufflehog      # 864-provider parity (clones vendor/trufflehog on first run)
just compat-presidio        # 52 recognizers × 8 formats
just deny                   # cargo-deny supply-chain audit
just ci                     # everything CI runs, in order

Without just, see the recipes in the Justfile for the underlying cargo invocations.

Phase gates

End-to-end gates live under tests/:

  • tests/e2e.sh — phase 0 (passthrough on a planted-token text file)
  • tests/e2e_phase1.sh — perf.data
  • tests/e2e_phase2.sh — tar / tar.gz / tar.zst / zip
  • tests/e2e_phase3.sh — sqlite + nsys-rep
  • tests/e2e_phase4.sh — ELF core
  • tests/e2e_phase5.sh — Java HPROF
  • tests/e2e_phase6.sh — JFR
  • tests/e2e_phase7.sh — pcap
  • tests/e2e_all.sh — master gate; runs all 8

Each gate plants known token shapes, runs scrump scrub, then asserts the file size is preserved, the format's magic / structural fields are untouched, the format's native tooling still parses it, and no token prefix remains in the raw bytes.
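
Three of the four assertions can be expressed as plain byte comparisons. The gates themselves are shell scripts, so the sketch below is only an illustration; `check_gate` and its parameters are hypothetical. The fourth assertion (the format's native tooling still parses the file) needs the external tool and is not modeled here.

```rust
use std::ops::Range;

/// Compare a file before and after scrubbing.
/// `structural` lists byte ranges (magic, length fields) that must be unchanged;
/// `token_prefixes` lists planted token prefixes that must no longer appear.
pub fn check_gate(
    before: &[u8],
    after: &[u8],
    structural: &[Range<usize>],
    token_prefixes: &[&[u8]],
) -> Result<(), String> {
    if before.len() != after.len() {
        return Err(format!("size changed: {} -> {}", before.len(), after.len()));
    }
    for r in structural {
        // `get` returns None for out-of-bounds ranges; that counts as a failure too.
        if before.get(r.clone()).is_none() || before.get(r.clone()) != after.get(r.clone()) {
            return Err(format!("structural bytes {:?} were modified or missing", r));
        }
    }
    for prefix in token_prefixes {
        if !prefix.is_empty() && after.windows(prefix.len()).any(|w| w == *prefix) {
            return Err(format!(
                "token prefix {:?} still present",
                String::from_utf8_lossy(prefix)
            ));
        }
    }
    Ok(())
}
```

Checking for the prefix in the raw bytes, and not through the format parser, catches secrets that survive in regions the parser never visits.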

Project layout

scrump/
├── crates/
│   ├── scrump-core/                # Format trait, Hit, Dispatcher
│   ├── scrump-detect/              # regex + entropy engine
│   ├── scrump-rules/               # curated + auto-extracted rule sets
│   ├── scrump-cli/                 # the `scrump` binary
│   ├── scrump-format-passthrough/  # text-and-anything fallback
│   ├── scrump-format-perf/         # PERFILE2
│   ├── scrump-format-tar/          # tar / zip / gz / zst (recursive)
│   ├── scrump-format-sqlite/       # SQLite3
│   ├── scrump-format-nsys/         # NVIDIA nsys-rep / ncu-rep
│   ├── scrump-format-core/         # ELF core dumps
│   ├── scrump-format-hprof/        # Java HPROF
│   ├── scrump-format-jfr/          # Java Flight Recorder
│   ├── scrump-format-pcap/         # pcap / pcapng
│   ├── scrump-test-fixtures/       # spec-compliant generators
│   ├── scrump-trufflehog-compat/   # 864-provider parity harness
│   └── scrump-presidio-compat/     # 8-format × 52-recognizer harness
├── tests/                          # phase 0..7 e2e gates
└── docs/                           # architecture, threat model

See docs/ARCHITECTURE.md for the internal design, and CONTRIBUTING.md for the format/detector add-a-new-X checklists.

Security

scrump is a security tool — please report vulnerabilities privately via the process in SECURITY.md.

License

Apache-2.0. Inspired by — but does not wrap — TruffleHog and noseyparker (both Apache-2.0).
