Smiley Is Suspicious (sis) is a PDF analyser that inventories PDF attack surface, detects suspicious or exploitable constructs, and produces grouped findings with evidence spans. It is designed for interactive speed without trading away parser correctness.
Key goals:
- Viewer-tolerant parsing with recovery scanning for malformed PDFs.
- Evidence spans for raw bytes and decoded artefacts.
- Two-phase analysis: fast triage by default, deeper decoding on demand.
- Deterministic, stable finding IDs with reproducible evidence pointers.
- Viewer-tolerant parsing, document deviation tracking, and deterministic, reproducible findings.
- Stream decoding with cached results, filter recovery, and evidence spans for both raw and decoded bytes.
- Action-chain inference that links triggers (actions, annotations) to payloads (JavaScript, embedded files, font gadgets).
- Content-first pipeline covering JavaScript, vector and raster payloads, metadata/phishing signals, and rich media.
- Font analysis for Type 1, TrueType, OpenType, and variable fonts (see docs/findings.md for CVE coverage).
- Image/decoder scrutiny (JPEG, JPEG2000, PNG, TIFF, JBIG2, CCITT) with the new vector path anomaly detector.
- Filter-chain anomaly, entropy, and decoder budget detection plus embedded file classification.
- Queryable output (JSON/JSONL/SARIF) combined with CLI (sis report, sis explain, sis extract) and optional ML scoring (ONNX).
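The goal of deterministic, stable finding IDs can be illustrated with a minimal sketch. It assumes an ID is derived purely from the finding kind and the location of its evidence span; the function name, the field choice, and the 16-character truncation are hypothetical and are not taken from the sis source.

```python
import hashlib


def finding_id(kind: str, obj: int, gen: int, start: int, length: int) -> str:
    """Derive an ID from the finding kind and its evidence location.

    Hypothetical scheme: the same inputs always hash to the same ID, so
    re-scanning an unchanged file reproduces identical IDs.
    """
    key = f"{kind}|{obj}|{gen}|{start}|{length}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


first = finding_id("vector_graphics_anomaly", 12, 0, 4096, 512)
again = finding_id("vector_graphics_anomaly", 12, 0, 4096, 512)
moved = finding_id("vector_graphics_anomaly", 12, 0, 4097, 512)

print(first == again)  # same evidence, same ID
print(first == moved)  # a shifted span yields a different ID
```

Because nothing time-dependent or random enters the hash, two runs over the same bytes can be compared ID by ID.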
- Stage 0 (Index + Parse) – Build the object graph, page tree, and preliminary indexes. Data is parsed once so detectors can reuse shared views.
- Stage 1 (Fast triage) – Run cheap detectors that do not decode large streams (metadata, actions, structure, table checks). This provides instant feedback inside sis scan.
- Stage 2 (Decoded payloads) – Optional (triggered via --deep or detectors that request it). Streams are decoded, JavaScript is analyzed, embedded files commented, and vector/raster heuristics applied.
- Stage 3 (Correlation & ML) – Build action chains, correlate findings, score with ML models (if configured), and emit enriched reports (sis report, JSON output, SARIF).
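The split between fast triage and decoded payloads can be sketched as a gate on detector cost. This is an illustration only: the Detector type, the needs_decode flag, and both toy detectors are assumptions, and the sketch models only the --deep switch, not detectors that request decoding themselves.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Detector:
    name: str
    needs_decode: bool  # True for detectors that must decode streams
    run: Callable[[bytes], List[str]]


def detect_open_action(raw: bytes) -> List[str]:
    # Cheap: looks at raw bytes only.
    return ["open_action_present"] if b"/OpenAction" in raw else []


def detect_js_eval(raw: bytes) -> List[str]:
    # Stand-in for stream decoding: treat everything after the marker
    # as Flate data and search the decoded bytes.
    _, _, body = raw.partition(b"stream\n")
    try:
        decoded = zlib.decompress(body)
    except zlib.error:
        return []
    return ["js_eval_call"] if b"eval(" in decoded else []


def scan(raw: bytes, detectors: List[Detector], deep: bool = False) -> List[str]:
    findings: List[str] = []
    for det in detectors:
        if det.needs_decode and not deep:
            continue  # skipped during fast triage
        findings.extend(det.run(raw))
    return findings


DETECTORS = [
    Detector("open_action", False, detect_open_action),
    Detector("js_eval", True, detect_js_eval),
]
sample = b"<< /OpenAction 5 0 R >>\nstream\n" + zlib.compress(b"eval(unescape('%41'))")

print(scan(sample, DETECTORS))             # triage findings only
print(scan(sample, DETECTORS, deep=True))  # triage plus decoded-payload findings
```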
- Actions & JavaScript – Detects /OpenAction, /AA, /Launch, /GoToR, /URI, /SubmitForm, script payloads, and obfuscation signals (signature counts, entropy, AST hints).
- Embedded content – Finds embedded files, font gists, XFA submissions, and now vector-heavy streams (vector_graphics_anomaly), combining evidence spans and meta for tracing.
- Images & decoders – Supports JPEG/JPX, PNG, TIFF, JBIG2, CCITT with deferred filter handling so image-analysis detectors own their filters; includes the new vector-path detector for suspicious Illustrator/EPS/SVG content.
- Fonts – Examines Type 1, TrueType, OpenType, and variable fonts for CVEs and stack anomalies, and includes reader-impact reasoning.
- Entropy & resources – Tracks entropy metrics, decoding budgets, and filter-chain anomalies to catch obfuscation or DoS attempts.
- Query & reporting – CLI outputs (Markdown/JSON/SARIF), sis query for structured exploration (pages, js, urls, events, filters), and sis explain for per-finding breakdowns.
- Stream queries & REPL access – sis query sample.pdf stream <obj> [<gen>] accepts --decode, --hexdump or --raw to control decoding, and the interactive REPL exposes stream <obj> <gen> --raw so the raw bytes can be piped or redirected straight into downstream tools.
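The entropy metric mentioned above is commonly computed as Shannon entropy over byte frequencies, measured in bits per byte. The sketch below assumes that definition; how sis itself computes or windows entropy is not stated here, and the function name is illustrative.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over every byte value that occurs.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(shannon_entropy(b"A" * 1024))        # a single repeated byte carries no information
print(shannon_entropy(bytes(range(256))))  # a uniform byte distribution reaches the maximum
```

Compressed or encrypted data sits close to 8 bits per byte, while plain script text sits noticeably lower, which is why a high value on a stream that claims to hold readable content is a useful obfuscation signal.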
All finding definitions live under docs/findings.md. Findings are grouped by weak spot:
- Actions & chains (URI, Launch, SubmitForm, JS chains)
- Embedded payloads (files, scripts, executables, vector anomalies)
- Streams & decoders (filters invalid, entropy, decompression ratio, corrupt data)
- Fonts & typography (Type 1 stack, TrueType VM, CFF/CFF2 tables)
- Metadata & phishing (XFA forms, URI classifications, structure anomalies)
Each finding carries severity, confidence, and impact, making it easy to score, chain, and filter through the CLI or ML outputs.
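The decompression-ratio and decoder-budget checks in the Streams & decoders group can be illustrated with a bounded Flate decode. This is a sketch under assumptions: the function name, the thresholds, and the return shape are hypothetical, and only the standard zlib module is used.

```python
import zlib
from typing import Tuple


def inflate_with_budget(
    data: bytes, max_output: int, max_ratio: float
) -> Tuple[bytes, bool, bool]:
    """Decode Flate data without ever producing more than max_output bytes.

    Returns (output, budget_exceeded, ratio_suspicious).
    """
    decoder = zlib.decompressobj()
    # Ask for one byte more than the budget so an overrun is detectable.
    out = decoder.decompress(data, max_output + 1)
    budget_exceeded = len(out) > max_output
    out = out[:max_output]
    # The ratio is computed on the truncated output, so it is a lower bound.
    ratio = len(out) / max(len(data), 1)
    return out, budget_exceeded, ratio > max_ratio


bomb = zlib.compress(b"\x00" * 10_000_000)
out, exceeded, suspicious = inflate_with_budget(bomb, 1_000_000, 50.0)
print(len(out), exceeded, suspicious)

benign = zlib.compress(b"hello world")
out, exceeded, suspicious = inflate_with_budget(benign, 1_000_000, 50.0)
print(out, exceeded, suspicious)
```

Capping the output inside the decoder, rather than checking the size afterwards, keeps memory use bounded even when the input is a decompression bomb.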
curl -fsSL https://raw.githubusercontent.com/michiel/sis-pdf/main/scripts/install.sh | sh
Change the install destination by setting SIS_INSTALL_DIR=/path/team/bin before running the script:
curl -fsSL https://raw.githubusercontent.com/michiel/sis-pdf/main/scripts/install.sh | SIS_INSTALL_DIR=/opt/bin sh
On Windows (PowerShell), run:
irm https://raw.githubusercontent.com/michiel/sis-pdf/main/scripts/install.ps1 | iex
Binary releases are also available under Releases. Keep the runtime current with:
sis update
Add --include-prerelease when you need nightly builds.
# Fast triage scan
sis scan sample.pdf
# Deep scan with Markdown report
sis report sample.pdf --deep -o report.md
# JSON/SARIF results for automation
sis scan sample.pdf --json
sis report sample.pdf --format=sarif
# Explain an interesting finding
sis explain sample.pdf vector_graphics_anomaly
# Extract JavaScript or embedded files defensively
sis extract js sample.pdf -o payloads/
sis extract embedded sample.pdf -o embedded/
# ML health check
sis ml health --ml-provider auto
# Query specific sections
sis query sample.pdf pages
sis query sample.pdf js --where "entropy > 7.5"
sis query sample.pdf urls --json
sis query sample.pdf filters --where "filter == '/FlateDecode'"
sis query sample.pdf events # interactive REPL
- docs/findings.md – canonical taxonomy, severities, tags, and evidence guidance.
- docs/sis-pdf-spec.md – implementation notes, features, and content-stream parsing.
- docs/query-interface.md – sis query grammar, predicates, and example workflows.
- docs/ml-features.md – exported ML features and normalization.
- README-DEV.md – development setup, cargo commands, and workspace tips.
Also check plans/ for long-lived project agendas (filters, chains, ML signals, etc.).
sis-pdf includes comprehensive font security analysis to detect exploits targeting PDF font renderers. This feature analyzes embedded fonts for known vulnerabilities, suspicious patterns, and exploit techniques.
- Type 1 (PostScript): BLEND exploit detection, dangerous operator analysis, stack depth tracking
- TrueType: Hinting program analysis, table validation, VM instruction budgets
- OpenType/CFF: Variable font validation, CFF2 table checks
- Variable Fonts: gvar/avar/HVAR/MVAR table anomaly detection
The analyzer includes signatures for known font vulnerabilities:
- CVE-2025-27163: hmtx/hhea table length mismatch
- CVE-2025-27164: CFF2/maxp glyph count mismatch
- CVE-2023-26369: EBSC table out-of-bounds
- BLEND Exploit (2015): PostScript Type 1 stack manipulation
CVE signatures are automatically updated weekly via GitHub Actions.
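To make the hmtx/hhea length mismatch concrete, the sketch below checks one consistency rule from the OpenType table layout: hmtx must hold numberOfHMetrics 4-byte records followed by one 2-byte left side bearing for each remaining glyph. It is one plausible form of such a check, not the actual sis signature, and the function names are illustrative.

```python
import struct


def expected_hmtx_length(hhea: bytes, maxp: bytes) -> int:
    """Minimum hmtx size implied by hhea.numberOfHMetrics and maxp.numGlyphs."""
    (num_h_metrics,) = struct.unpack_from(">H", hhea, 34)  # last field of hhea
    (num_glyphs,) = struct.unpack_from(">H", maxp, 4)      # follows the 4-byte version
    trailing_bearings = max(num_glyphs - num_h_metrics, 0)
    return 4 * num_h_metrics + 2 * trailing_bearings


def hmtx_length_mismatch(hhea: bytes, maxp: bytes, hmtx_length: int) -> bool:
    """True when hmtx is too short for the metrics the other tables promise."""
    if len(hhea) < 36 or len(maxp) < 6:
        return True  # truncated headers are themselves anomalous
    return hmtx_length < expected_hmtx_length(hhea, maxp)


# Synthetic tables: 3 full metric records, 5 glyphs in total.
hhea = bytes(34) + struct.pack(">H", 3)
maxp = struct.pack(">IH", 0x00005000, 5)

print(expected_hmtx_length(hhea, maxp))      # 3 * 4 + 2 * 2 = 16
print(hmtx_length_mismatch(hhea, maxp, 16))  # exact fit
print(hmtx_length_mismatch(hhea, maxp, 8))   # too short for the declared metrics
```

A renderer that trusts the declared counts and reads past a short hmtx table is the kind of fault this class of check is meant to surface before the font reaches a viewer.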
Font analysis is enabled by default. Configure via config.toml:
[scan.font_analysis]
enabled = true
dynamic_enabled = true
dynamic_timeout_ms = 5000
max_fonts = 100
# Scan PDF with font analysis
sis scan suspicious.pdf
# View font findings
sis scan suspicious.pdf | grep "^font\."
# Detailed font analysis example
cargo run --example font_analysis suspicious.pdf
For all font finding definitions, see docs/findings.md.
By default, the config is read from the platform user config directory; pass --config=PATH to use a different file.
Linux: ~/.config/sis/config.toml
macOS: ~/.config/sis/config.toml
Windows: %APPDATA%\sis\config.toml
Generate a default config and validate it:
sis config init
sis config verify
Example (TOML):
[logging]
level = "warn"
[scan]
deep = true
parallel = true
ml_provider = "auto" # auto, cpu, cuda, migraphx, rocm, directml, coreml, onednn, openvino
ml_provider_order = ["migraphx", "cuda", "cpu"]
ml_ort_dylib = "/path/to/libonnxruntime.so"
ml_provider_info = true
sis update
Or re-run the install script to pull the latest release.
To include prerelease builds:
sis update --include-prerelease
- Operator scenarios: docs/scenarios.md
- JavaScript detection catalogue: docs/findings.md
- Development notes and workspace details: README-DEV.md