Fast, format-aware secret scrubber for binary capture artifacts.
Born from a real incident: a perf profile tarball uploaded to a public
GitHub issue leaked a GH_TOKEN, because the environment block of the
captured process — including every API key in scope at runtime — sat
inside the binary blob, and GitHub's native secret scanning couldn't see
through the format.
scrump opens captures using the same on-disk specs the originating tools
use, finds dangerous content with a 1,100+ rule pattern engine, zero-fills
it in place, and returns the file in its original shape. The redacted
perf.data still loads in perf report; the redacted nsys-rep still
opens in Nsight Systems; the redacted SQLite still passes sqlite3 ".schema"; the redacted pcap still opens in Wireshark.
| Format | Crate | What it understands |
|---|---|---|
| passthrough | scrump-format-passthrough | Any file; single raw chunk fallback |
| perf | scrump-format-perf | PERFILE2. Header + feature sections (HEADER_CMDLINE, etc.) + data section. |
| tar | scrump-format-tar | tar / tar.gz / tar.zst / zip. Each member is recursively format-dispatched. |
| sqlite | scrump-format-sqlite | SQLite format 3. Walks every user table, redacts TEXT/BLOB cells via UPDATE + VACUUM. |
| nsys | scrump-format-nsys | NVIDIA .nsys-rep / .ncu-rep. tar envelope + format-aware SQLite handling for the inner DB. |
| elf-core | scrump-format-core | 64-bit LE ELF ET_CORE. Walks PT_NOTE for NT_PRPSINFO cmdline + PT_LOAD env pages. |
| hprof | scrump-format-hprof | Java HPROF. JAVA PROFILE header + record stream; tight UTF8 STRING chunks. |
| jfr | scrump-format-jfr | Java Flight Recorder. Walks chunks via FLR\0 magic + chunk_size; refuses to touch chunk headers. |
| pcap | scrump-format-pcap | tcpdump pcap + pcapng. Per-packet payload chunks (Authorization headers, query strings); framing untouched. |

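Several of the formats in the table can be recognised from their leading magic bytes. The sketch below shows one way a dispatcher could sniff them using only the standard library. It is an illustration, not scrump's actual dispatcher: the `Sniffed` enum and `sniff` function are hypothetical names, and compressed or zip envelopes (tar.gz, tar.zst, zip, nsys-rep) are left out for brevity.

```rust
/// Hypothetical format tags; the names mirror the table, not scrump's real API.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Sniffed {
    Perf,
    Sqlite,
    ElfCore,
    Hprof,
    Jfr,
    Pcap,
    Tar,
    Passthrough,
}

/// Guess a format from well-known magic bytes; anything unknown falls back
/// to passthrough.
pub fn sniff(bytes: &[u8]) -> Sniffed {
    // Classic pcap magic in both byte orders (microsecond and nanosecond
    // variants), plus the pcapng section header block type.
    const PCAP_MAGICS: [[u8; 4]; 5] = [
        [0xd4, 0xc3, 0xb2, 0xa1],
        [0xa1, 0xb2, 0xc3, 0xd4],
        [0x4d, 0x3c, 0xb2, 0xa1],
        [0xa1, 0xb2, 0x3c, 0x4d],
        [0x0a, 0x0d, 0x0d, 0x0a],
    ];

    if bytes.starts_with(b"PERFILE2") {
        Sniffed::Perf
    } else if bytes.starts_with(b"SQLite format 3\0") {
        Sniffed::Sqlite
    } else if bytes.starts_with(b"\x7fELF") {
        // EI_CLASS == 2 (64-bit), EI_DATA == 1 (little-endian), and the
        // 16-bit e_type field at offset 16 must be ET_CORE (4).
        let is_core = bytes.len() >= 18
            && bytes[4] == 2
            && bytes[5] == 1
            && u16::from_le_bytes([bytes[16], bytes[17]]) == 4;
        if is_core { Sniffed::ElfCore } else { Sniffed::Passthrough }
    } else if bytes.starts_with(b"JAVA PROFILE ") {
        Sniffed::Hprof
    } else if bytes.starts_with(b"FLR\0") {
        Sniffed::Jfr
    } else if PCAP_MAGICS.iter().any(|m| bytes.starts_with(m)) {
        Sniffed::Pcap
    } else if bytes.get(257..262) == Some(b"ustar".as_slice()) {
        // POSIX tar keeps its "ustar" magic at offset 257 of the first header.
        Sniffed::Tar
    } else {
        Sniffed::Passthrough
    }
}
```

Falling back to passthrough rather than returning an error matches the table's description of passthrough as the single-raw-chunk handler for any file.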
# From source (Rust 1.75+)
cargo install --path crates/scrump-cli
# Or grab a pre-built binary from the latest release (see the Releases tab
# for the current version; tarballs are signed and shasum'd):
gh release download --repo avifenesh/scrump --pattern '*-x86_64-unknown-linux-gnu.tar.gz'

Supported targets out of the box:

| Target | Tier |
|---|---|
| x86_64-unknown-linux-gnu | tier-1 (cross-compiled in CI release) |
| aarch64-unknown-linux-gnu | tier-1 (cross-compiled in CI release) |
| aarch64-apple-darwin | tier-1 (cross-compiled in CI release) |

scrump scan some-file # dry-run: report findings, never mutate
scrump scrub some-file # redact in place (atomic tmp+rename)
scrump scrub some-file -o clean # write clean copy elsewhere
scrump scrub some-file --backup # also keep the original at *.orig
scrump scrub some-file --format perf # force a specific format handler
scrump scrub some-file --rules-path my.yaml # add custom rules

Two-layer ruleset:
- Curated default rules (crates/scrump-rules/rules/default.yaml), tightly-scoped patterns for the ML/inference ecosystem: GitHub PATs, HuggingFace, OpenAI, Anthropic, AWS, Slack, NVIDIA NGC, W&B, Stripe.
- Auto-extracted TruffleHog mirror (rules/trufflehog.yaml, regenerated by cargo run -p scrump-trufflehog-compat --bin th-extract): 1,100+ rules covering every detector under pkg/detectors/.
- Hand-coded detectors for things regex can't express alone: currently JwtHsAware, which base64-decodes the JWT header and rejects HMAC-signed tokens, mirroring TruffleHog's filtering.
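The header check described for JwtHsAware could look roughly like the sketch below. It is an assumption-laden illustration, not the real detector: `b64url_decode` and `is_hmac_jwt` are hypothetical names, the base64url decoder is hand-rolled to stay within the standard library, and the `alg` test is a plain substring match rather than a JSON parse.

```rust
/// Decode base64url (RFC 4648 section 5), tolerating missing or trailing
/// padding. Returns None on any character outside the alphabet.
fn b64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() * 3 / 4);
    let mut acc: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of valid bits currently in `acc`
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            b'=' => break,
            _ => return None,
        };
        acc = (acc << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    Some(out)
}

/// True when `candidate` has three dot-separated segments and its decoded
/// header declares an HS* (HMAC) algorithm.
pub fn is_hmac_jwt(candidate: &str) -> bool {
    let mut parts = candidate.split('.');
    let (Some(header), Some(_payload), Some(_sig), None) =
        (parts.next(), parts.next(), parts.next(), parts.next())
    else {
        return false;
    };
    let Some(raw) = b64url_decode(header) else { return false };
    let Ok(json) = String::from_utf8(raw) else { return false };
    // Strip whitespace so `"alg": "HS256"` and `"alg":"HS256"` both match.
    let compact: String = json.chars().filter(|c| !c.is_whitespace()).collect();
    compact.contains("\"alg\":\"HS")
}
```

Under this sketch, a detector would drop any JWT-shaped candidate for which `is_hmac_jwt` returns true and keep the rest.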
The engine supports capture_index for keyword-proximity patterns (e.g.
W&B's bare 40-hex token near a wandb keyword) and post_filter for
semantic constraints beyond regex.
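To illustrate the two ideas, the sketch below accepts a fixed-length hex token only when a keyword appears shortly before it, reports just the token's byte range (the role capture_index plays for a regex match that also covers the keyword), and then applies a post-filter. The real engine is regex-based; this version swaps in a hand-written hex-run scanner to stay within the standard library, and `ProximityRule`, `find_tokens`, and `not_repeated` are hypothetical names.

```rust
use std::ops::Range;

/// Hypothetical rule shape: a keyword that must appear near the token, plus
/// a semantic check applied after the syntactic match.
pub struct ProximityRule {
    pub keyword: &'static [u8],
    pub window: usize, // how many bytes before the token to search
    pub token_len: usize,
    pub post_filter: fn(&[u8]) -> bool,
}

/// Return the byte ranges of accepted tokens only, never the keyword context.
pub fn find_tokens(rule: &ProximityRule, data: &[u8]) -> Vec<Range<usize>> {
    let mut hits = Vec::new();
    let mut i = 0;
    while i < data.len() {
        if !data[i].is_ascii_hexdigit() {
            i += 1;
            continue;
        }
        // Measure the maximal run of hex digits starting at `start`.
        let start = i;
        while i < data.len() && data[i].is_ascii_hexdigit() {
            i += 1;
        }
        if i - start != rule.token_len {
            continue;
        }
        // Keyword proximity: look only at the bytes just before the token.
        let ctx = &data[start.saturating_sub(rule.window)..start];
        let near = !rule.keyword.is_empty()
            && ctx
                .windows(rule.keyword.len())
                .any(|w| w.eq_ignore_ascii_case(rule.keyword));
        if near && (rule.post_filter)(&data[start..i]) {
            hits.push(start..i);
        }
    }
    hits
}

/// Example post-filter: reject degenerate tokens made of one repeated byte.
pub fn not_repeated(token: &[u8]) -> bool {
    token.iter().any(|&b| b != token[0])
}
```

Returning only the token range means the surrounding keyword text survives redaction, which keeps the scrubbed file readable for debugging.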
Each format crate implements:
pub trait Format: Send {
    fn name(&self) -> &'static str;
    fn chunks<'a>(&'a self) -> Box<dyn Iterator<Item = Chunk<'a>> + 'a>;
    fn apply(&mut self, hits: &[Hit]) -> Result<()>;
    fn to_bytes(&self) -> Result<Vec<u8>>;
}

The format decides which byte ranges are scannable (cmdline strings,
TEXT cells, packet payloads, chunk bodies) and which are structural
(magic words, length prefixes, varints, checksums). apply refuses to
redact structural bytes. The result: scrump can never produce a file
its own format parser couldn't parse.
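As a rough illustration of that contract, here is a toy implementation for a made-up layout (4-byte magic, u32 length prefix, payload). Everything except the trait's method signatures is an assumption: the `Chunk` and `Hit` shapes, the `String` error type, and the `Toy` format are hypothetical and exist only to show the scannable versus structural split.

```rust
use std::ops::Range;

// Hypothetical shapes; scrump-core's real Chunk and Hit may carry more fields.
pub struct Chunk<'a> {
    pub offset: usize, // absolute offset of `bytes` within the file
    pub bytes: &'a [u8],
}
pub struct Hit {
    pub range: Range<usize>, // absolute file offsets to zero-fill
}
pub type Result<T> = std::result::Result<T, String>;

pub trait Format: Send {
    fn name(&self) -> &'static str;
    fn chunks<'a>(&'a self) -> Box<dyn Iterator<Item = Chunk<'a>> + 'a>;
    fn apply(&mut self, hits: &[Hit]) -> Result<()>;
    fn to_bytes(&self) -> Result<Vec<u8>>;
}

/// Toy layout: magic "TOY1", u32 little-endian payload length, then payload.
pub struct Toy {
    raw: Vec<u8>,
}

impl Toy {
    const HEADER: usize = 8;

    pub fn parse(raw: Vec<u8>) -> Result<Self> {
        if raw.len() < Self::HEADER || !raw.starts_with(b"TOY1") {
            return Err("bad magic".into());
        }
        let len = u32::from_le_bytes([raw[4], raw[5], raw[6], raw[7]]) as usize;
        if raw.len() != Self::HEADER + len {
            return Err("length mismatch".into());
        }
        Ok(Toy { raw })
    }
}

impl Format for Toy {
    fn name(&self) -> &'static str {
        "toy"
    }

    fn chunks<'a>(&'a self) -> Box<dyn Iterator<Item = Chunk<'a>> + 'a> {
        // Only the payload is scannable; magic and length are structural.
        Box::new(std::iter::once(Chunk {
            offset: Self::HEADER,
            bytes: &self.raw[Self::HEADER..],
        }))
    }

    fn apply(&mut self, hits: &[Hit]) -> Result<()> {
        // Validate every hit first so a rejected batch leaves the file untouched.
        for h in hits {
            let r = &h.range;
            if r.start < Self::HEADER || r.end > self.raw.len() || r.start > r.end {
                return Err(format!("hit {:?} touches structural bytes", r));
            }
        }
        for h in hits {
            self.raw[h.range.clone()].fill(0); // zero-fill, size preserved
        }
        Ok(())
    }

    fn to_bytes(&self) -> Result<Vec<u8>> {
        Ok(self.raw.clone())
    }
}
```

Because `apply` only ever overwrites payload bytes with zeros, the output has the same length and header as the input, so `Toy::parse` accepts it again.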
scrump is validated against two third-party test corpora. The harnesses
live under crates/scrump-{trufflehog,presidio}-compat/.
cargo run -p scrump-trufflehog-compat --bin trufflehog-compat walks
every *_test.go under TruffleHog's pkg/detectors/, parses each
parametrized test, and runs scrump against the test input.
Last full run: 2,335 of 2,536 cases pass across 864 providers
(92.1%). The remaining 201 are negative-case false-positives where
provider A's no-hit-expected input still trips provider B's
auto-extracted PrefixRegex (e.g. a sugester test input fires the
tableau rule). They are over-detection in a scrubbing context —
nothing TruffleHog catches is missed by scrump. CI gates on
SCRUMP_TH_MAX_FAILURES=201; lowering this number must accompany rule
fixes, and any increase fails the build.
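The ratchet can be pictured as a simple comparison against the configured ceiling. How the harness actually reads SCRUMP_TH_MAX_FAILURES is not shown here, so the sketch below is an assumption: `gate` and `pass_rate` are hypothetical names, and treating an unset variable as a ceiling of zero is a choice made for the example.

```rust
/// Percentage of passing cases, e.g. for the summary line of a compat run.
pub fn pass_rate(passed: u32, total: u32) -> f64 {
    if total == 0 {
        return 0.0;
    }
    f64::from(passed) * 100.0 / f64::from(total)
}

/// Fail when the observed failure count exceeds the ceiling taken from the
/// environment. `max_failures_env` is the raw variable value, if set.
pub fn gate(failures: u32, max_failures_env: Option<&str>) -> Result<(), String> {
    let max: u32 = match max_failures_env {
        Some(s) => s
            .trim()
            .parse()
            .map_err(|e| format!("invalid max-failures value {:?}: {}", s, e))?,
        None => 0, // assumption: no ceiling configured means no failures allowed
    };
    if failures > max {
        Err(format!("{} failures exceeds ceiling of {}", failures, max))
    } else {
        Ok(())
    }
}
```

A caller might pass `std::env::var("SCRUMP_TH_MAX_FAILURES").ok().as_deref()` as the second argument; with the figures quoted above, 201 failures passes a ceiling of 201 and 202 does not.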
cargo run -p scrump-presidio-compat --bin presidio-compat takes every
Presidio recognizer test (52 recognizers, 671 cases), then for every
case embeds the test text into every binary format scrump supports
(7 + passthrough = 8) and runs the detector against the embedded blob.
Last full run: 617 of 671 (92.0%) pass — identical pass count across
all 8 formats, proving the format wrapper is transparent to detection.
The remaining 54 failures are entirely Presidio patterns that use
lookbehind / backreferences that Rust's regex crate doesn't support
(IP recognizer, MAC, Canadian SIN with backref-bound separator).
FORMAT PASS FAIL SKIP PASS%
--------------------------------------------------------
passthrough 617 54 0 92.0%
tar 617 54 0 92.0%
perf 617 54 0 92.0%
sqlite 617 54 0 92.0%
elf-core 617 54 0 92.0%
hprof 617 54 0 92.0%
jfr 617 54 0 92.0%
pcap 617 54 0 92.0%
If you have `just` installed:
just check # fmt + clippy + tests + docs
just e2e # all 8 phase-gate scripts
just compat-trufflehog # 864-provider parity (clones vendor/trufflehog on first run)
just compat-presidio # 52 recognizers × 8 formats
just deny # cargo-deny supply-chain audit
just ci # everything CI runs, in order

Without just, see the recipes in the Justfile for the
underlying cargo invocations.
End-to-end gates live under tests/:
- tests/e2e.sh: phase 0 (passthrough on a planted-token text file)
- tests/e2e_phase1.sh: perf.data
- tests/e2e_phase2.sh: tar / tar.gz / tar.zst / zip
- tests/e2e_phase3.sh: sqlite + nsys-rep
- tests/e2e_phase4.sh: ELF core
- tests/e2e_phase5.sh: Java HPROF
- tests/e2e_phase6.sh: JFR
- tests/e2e_phase7.sh: pcap
- tests/e2e_all.sh: master gate; runs all 8

Each gate plants known token shapes, runs scrump scrub, then asserts
the file size is preserved, the format's magic / structural fields are
untouched, the format's native tooling still parses it, and no token
prefix remains in the raw bytes.
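Three of those assertions are byte-level and can be sketched in a few lines; the fourth (native tooling still parses the file) needs the external tool and is not shown. The gates themselves are shell scripts, so this Rust version is only an illustration, and `check_scrubbed` with its parameters is a hypothetical helper.

```rust
/// Compare a file before and after scrubbing.
/// `magic_len` is how many leading bytes count as the format's magic.
pub fn check_scrubbed(
    before: &[u8],
    after: &[u8],
    magic_len: usize,
    token_prefixes: &[&[u8]],
) -> Result<(), String> {
    // 1. Redaction zero-fills in place, so the size must not change.
    if before.len() != after.len() {
        return Err(format!("size changed: {} -> {}", before.len(), after.len()));
    }
    // 2. The leading magic bytes are structural and must be untouched.
    let n = magic_len.min(before.len());
    if before[..n] != after[..n] {
        return Err("magic bytes were modified".into());
    }
    // 3. No planted token prefix may survive anywhere in the raw bytes.
    for p in token_prefixes {
        if !p.is_empty() && after.windows(p.len()).any(|w| w == *p) {
            return Err(format!(
                "token prefix {:?} still present",
                String::from_utf8_lossy(p)
            ));
        }
    }
    Ok(())
}
```

Checking prefixes in the raw bytes, rather than through the format parser, catches tokens that a handler failed to reach.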
scrump/
├── crates/
│ ├── scrump-core/ # Format trait, Hit, Dispatcher
│ ├── scrump-detect/ # regex + entropy engine
│ ├── scrump-rules/ # curated + auto-extracted rule sets
│ ├── scrump-cli/ # the `scrump` binary
│ ├── scrump-format-passthrough/ # text-and-anything fallback
│ ├── scrump-format-perf/ # PERFILE2
│ ├── scrump-format-tar/ # tar / zip / gz / zst (recursive)
│ ├── scrump-format-sqlite/ # SQLite3
│ ├── scrump-format-nsys/ # NVIDIA nsys-rep / ncu-rep
│ ├── scrump-format-core/ # ELF core dumps
│ ├── scrump-format-hprof/ # Java HPROF
│ ├── scrump-format-jfr/ # Java Flight Recorder
│ ├── scrump-format-pcap/ # pcap / pcapng
│ ├── scrump-test-fixtures/ # spec-compliant generators
│ ├── scrump-trufflehog-compat/ # 864-provider parity harness
│ └── scrump-presidio-compat/ # 8-format × 52-recognizer harness
├── tests/ # phase 0..7 e2e gates
└── docs/ # architecture, threat model
See docs/ARCHITECTURE.md for the internal design,
and CONTRIBUTING.md for the format/detector
add-a-new-X checklists.
scrump is a security tool — please report vulnerabilities privately via
the process in SECURITY.md.
Apache-2.0. Inspired by — but does not wrap — TruffleHog and noseyparker (both Apache-2.0).