Surface rare, high‑signal log lines and explain why they matter.
Fast streaming novelty scoring + adaptive quantile alerting + transparent token/template explanations.
Logs are overwhelming. You want the needles, not the hay. Elaborlog ranks lines by:
- Token rarity (Shannon self-information from online frequencies)
- Template rarity (structure-aware masking of IDs, timestamps, IPs)
- Severity bonus (small, transparent bump for ERROR/WARN)
It also shows nearest neighbors for instant context.
```
pip install -e ".[dev]"   # for local dev
# or
pipx install elaborlog    # after you publish to PyPI
```

(Optional) colorized output:

```
pip install "elaborlog[color]"
```

Rank a file and print the top 20 most novel lines:

```
elaborlog rank examples/app.log --top 20
```

Write the full novelty-ranked CSV for later triage:

```
elaborlog rank examples/app.log --out ranked.csv
```

Tail live logs and alert on the rarest 0.8 % of lines with neighbor context:

```
elaborlog tail /var/log/app.log --mode triage
```

Escalate only the most exceptional outliers (≈0.5 %) and dedupe repeats:

```
elaborlog tail /var/log/app.log --mode page --dedupe-template
```

Explain why a single line popped, with the top surprisal tokens:

```
elaborlog explain examples/app.log --line "ERROR payment declined code=402 user=9922" --top-tokens 8
```

Show the most common templates:

```
elaborlog cluster examples/app.log --top 30
```

Need a tuned preset? Try `--profile web`, `--profile k8s`, or `--profile auth` to load sensible window/quantile knobs for that source.
Prefer bigrams for rigid formats? Add --with-bigrams to rank, tail, or explain when you know the templates are stable.
Need to rebalance scoring influences? Override weights on any scoring-related command:
```
elaborlog rank production.log --w-token 1.2 --w-template 0.8 --w-level 0.4
```

Stream alerts and also capture structured machine output:

```
elaborlog tail /var/log/app.log --mode triage --jsonl alerts.jsonl
```

Each JSONL alert includes: novelty, raw score (and component bits), template probability, top token contributors (token, bits, probability, frequency), neighbor lines, and quantile metadata.

Emit rolling alert-rate stats (stderr) every 10 s:

```
elaborlog tail /var/log/app.log --stats-interval 10
```

Batch JSON export:

```
elaborlog rank prod.log --json ranked.json
```

Structured single-line explanation:

```
elaborlog explain prod.log --line "ERROR payment declined code=402 user=9922" --json explain.json
```

Need the full token contributor list (not just the top N) for deeper offline analysis? Add `--all-token-contributors` to `rank --json`, `score --json`, `tail --jsonl`, or `explain --json`.

After collecting alerts with `tail --jsonl`, you can produce a compact statistical summary:

```
elaborlog summarize alerts.jsonl --out summary.json --top-templates 15 --top-tokens 15
```

Sample summary JSON keys:
```json
{
  "alerts": 234,
  "quantile": 0.998,
  "novelty_min": 0.742,
  "novelty_max": 0.963,
  "novelty_mean": 0.884,
  "novelty_p50": 0.881,
  "score_mean": 12.37,
  "threshold_mean": 0.902,
  "threshold_last": 0.905,
  "top_templates": [[42, "ERROR payment declined code=<num> user=<num>"], ...],
  "top_tokens": [[123.7, "declined"], ...]
}
```
Use this to feed dashboards or compare regimes across deployments.
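The novelty statistics in that summary can also be reproduced offline. A minimal sketch, assuming only that each JSONL alert carries a `novelty` field (as in the alert schema); the real `summarize` command computes more (templates, tokens, thresholds):

```python
import json
from statistics import mean, median

def summarize_alerts(path: str) -> dict:
    """Compute a compact novelty summary from a tail --jsonl alert file."""
    novelties = []
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            raw = raw.strip()
            if raw:  # skip blank lines between JSONL records
                novelties.append(json.loads(raw)["novelty"])
    return {
        "alerts": len(novelties),
        "novelty_min": min(novelties),
        "novelty_max": max(novelties),
        "novelty_mean": mean(novelties),
        "novelty_p50": median(novelties),
    }
```

This is handy when you want the numbers inside a notebook or dashboard job without shelling out to the CLI.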
An optional script (scripts/bench_emit_intermediate.py, added in dev extras) measures overhead of --emit-intermediate vs baseline tailing. Run it after installing dev deps to quantify impact when tracking many quantiles.
Install server extras and launch the HTTP API:
```
pip install "elaborlog[server]"
elaborlog serve --host 0.0.0.0 --port 8080 --state-out state.json
```

Endpoints:

- `GET /healthz` – liveness
- `POST /observe` – body `{ "line": "..." }` (updates frequencies)
- `POST /score` – body `{ "line": "..." }`, returns scoring fields
- `GET /stats` – model cardinalities & counters
- `GET /metrics` – detailed internal metrics (counts, decay scale, renormalizations, guardrail counters)
You can warm start with --state-in and enable periodic snapshots with --state-out (every 60s by default, adjustable via --interval).
Use the bundled harness to gauge local throughput:
```
python bench/benchmark.py --lines 50000 --warm 5000 --measure 20000
```

Or benchmark a real file:

```
python bench/benchmark.py --file examples/app.log --warm 1000 --measure 5000
```

Output includes lines/sec, current & peak memory, and vocabulary sizes to help tune `max_tokens` / `max_templates`.
Cold starts are optional now. Any scoring command can save and reuse the frequency model:
```
# Warm-start a streaming tail and persist after shutdown
elaborlog tail /var/log/app.log --mode triage --state-out state.json

# Resume later with the saved state (and keep updating it)
elaborlog tail /var/log/app.log --mode triage --state-in state.json --state-out state.json

# Batch scoring can also reuse the same snapshot
elaborlog rank production.log --state-in state.json --state-out state.json
```

Snapshots (version 3) include: config, token/template counts, the decay scale factor (`g`), guardrail counters, and vocabulary sizes. Backward compatibility: older v1/v2 snapshots still load (new counters default to 0).
- Canonicalization: timestamps `<ts>`, IPs `<ip>`, UUIDs `<uuid>`, hex `<hex>`, emails `<email>`, URLs `<url>`, POSIX or Windows paths `<path>`, quoted strings `<str>`, and numbers `<num>`.
- Tokenization: single tokens by default; opt into bigrams with `--with-bigrams` for extra structure when false positives are low.
- Streaming stats: Laplace smoothing $k = 1.0$, lazy exponential decay (O(1) global scale) with a default per-line factor of `0.9999`, and vocab caps (~30k tokens / 10k templates) so memory stays flat.
- Novelty score: average token surprisal $S$ mapped to $[0,1)$ via $1 - e^{-S}$ for an intuitive rarity gauge.
- Dynamic alerting: constant-memory P² quantile estimator by default (no O(W log W) sorts). Supply `--window` to use the legacy rolling-window quantile.
- Modes / Profiles: `--mode triage` ⇒ q=0.992; `--mode page` ⇒ q=0.995. Domain presets `--profile web|k8s|auth` tune both window (when specified) and burn-in.
- Guardrails: line length truncated at `max_line_length` (default 2000 chars); tokens capped at `max_tokens_per_line` (default 400) to prevent pathological lines from skewing the model.
- Operator presets: `--profile web`, `--profile k8s`, `--profile auth` tune windows/quantiles; `--dedupe-template` suppresses repeat spam.
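Canonicalization can be pictured as ordered regex substitution. The sketch below is illustrative, not Elaborlog's exact patterns: only a few mask classes are shown, and pattern order matters (e.g. IPs must be masked before bare numbers):

```python
import re

# Ordered (pattern, mask) pairs: more specific classes run first.
_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "<ts>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip>"),
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"\b\S+@\S+\.\S+\b"), "<email>"),
    (re.compile(r"\bhttps?://\S+"), "<url>"),
    (re.compile(r'"[^"]*"'), "<str>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]

def canonicalize(line: str) -> str:
    """Mask volatile fields so similar lines collapse to one template."""
    for pattern, mask in _PATTERNS:
        line = pattern.sub(mask, line)
    return line
```

With this ordering, `ERROR payment declined code=402 user=9922` collapses to `ERROR payment declined code=<num> user=<num>`, matching the templates shown throughout this README.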
- Canonicalize: mask volatile bits (timestamps, IDs, IPs, emails, URLs, file paths, quoted strings, numbers) so structurally similar lines collapse to the same template.
- Token frequencies: keep a decayed count of every token (plus optional bigrams). Common words like `info` contribute little; rare words like `declined` carry many bits.
- Template rarity: the masked template (e.g. `ERROR payment declined code=<num> user=<num>`) gets the same self-information treatment, rewarding surprising structures.
- Severity tap: add a small, transparent bonus for WARN/ERROR levels.
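The steps above can be sketched as a weighted sum of average token surprisal, template surprisal, and the severity bonus. This is illustrative: it uses the $k = 1.0$ Laplace smoothing described here, but omits decay, caps, and whatever exact normalization the real scorer applies:

```python
import math
from collections import Counter

K = 1.0  # Laplace smoothing constant

def surprisal_bits(count: float, total: float, vocab: int) -> float:
    """Self-information -log2(p) with Laplace smoothing."""
    p = (count + K) / (total + K * vocab)
    return -math.log2(p)

def score_line(tokens, template, token_counts: Counter, template_counts: Counter,
               level_bonus: float = 0.0,
               w_token: float = 1.0, w_template: float = 1.0, w_level: float = 1.0) -> float:
    """Weighted sum of average token bits, template bits, and severity tap."""
    total_tokens = sum(token_counts.values())
    token_bits = sum(
        surprisal_bits(token_counts[t], total_tokens, len(token_counts)) for t in tokens
    ) / max(len(tokens), 1)
    template_bits = surprisal_bits(
        template_counts[template], sum(template_counts.values()), len(template_counts)
    )
    return w_token * token_bits + w_template * template_bits + w_level * level_bonus

def novelty(avg_surprisal: float) -> float:
    """Map average token surprisal S into [0, 1) via 1 - e^{-S}."""
    return 1.0 - math.exp(-avg_surprisal)
```

Common tokens contribute fractions of a bit; rare tokens dominate the average, which is exactly why `declined` pops while `info` does not.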
The average token surprisal $S$ (in bits) is mapped to the $[0,1)$ novelty gauge via $1 - e^{-S}$: zero surprise yields novelty 0, and increasingly rare lines approach 1.
Instead of rescaling every count on each decay step, Elaborlog maintains a global scale factor `g` and stores unscaled counts; the effective count is the stored count multiplied by `g`. This makes decay O(1) per line and snapshot-friendly.
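The lazy-decay bookkeeping can be sketched as follows. This is a minimal illustration of the technique, not Elaborlog's internal class; the renormalization threshold is an assumption:

```python
class LazyDecayCounts:
    """Decay all counts per line in O(1) via a global scale factor g.

    Raw counts are stored undecayed; the effective count is raw * g.
    Incrementing divides by g so the increment is worth exactly 1 at
    the current time. When g gets tiny, counts are renormalized once.
    """

    RENORM_AT = 1e-9  # assumed threshold; the real one may differ

    def __init__(self, decay: float = 0.9999) -> None:
        self.decay = decay
        self.g = 1.0
        self.raw: dict[str, float] = {}
        self.renormalizations = 0

    def observe_line(self, tokens: list[str]) -> None:
        self.g *= self.decay          # decays *every* stored count at once
        if self.g < self.RENORM_AT:   # fold g into raw counts to avoid underflow
            for k in self.raw:
                self.raw[k] *= self.g
            self.g = 1.0
            self.renormalizations += 1
        for t in tokens:
            self.raw[t] = self.raw.get(t, 0.0) + 1.0 / self.g

    def effective(self, token: str) -> float:
        return self.raw.get(token, 0.0) * self.g
```

Note how a token observed one line earlier ends up with a slightly smaller effective count than one observed now, without any vocabulary-wide scan.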
- Pure Python stdlib by default.
- Single pass, streaming-friendly.
- Exponential decay + LRU vocab caps keep the model fresh (no growing memory usage).
- Streaming P² quantile: constant memory, fast convergence after burn-in (see tests for statistical validation).
- Lazy decay: no vocabulary-wide scans; large vocabularies stay cheap.
- Guardrails: extreme line/token explosions capped early (tracked in snapshot counters).
Machine-readable schemas:
- `schemas/alert.schema.json` (tail alerts JSONL)
- `schemas/rank.schema.json` (array output from `rank --json`)
- `schemas/explain.schema.json` (single object from `explain --json`)
Validate with jsonschema (installed via dev extras) or any Draft 2020-12 validator. Informal examples below:
- Alert JSONL (`tail`)
- Rank JSON array (`--json`): a list of similar objects minus threshold fields.
- Explain JSON (`explain --json`): one object plus weights.
The service exposes GET /metrics for observability and health dashboards.
Example (trimmed):
```json
{
  "tokens": 18342,
  "templates": 6123,
  "total_tokens": 512340,
  "total_templates": 176540,
  "seen_lines": 176540,
  "g": 0.2243119,
  "renormalizations": 3,
  "truncated_lines": 12,
  "truncated_tokens": 4,
  "config": {
    "decay": 0.9999,
    "max_tokens": 30000,
    "max_templates": 10000
  }
}
```

Fields:

- `g`: current global decay scale factor (lazy decay internal state)
- `renormalizations`: how many times counts were rescaled to avoid underflow
- `truncated_lines` / `truncated_tokens`: guardrail activations
- `total_*` vs. current `tokens` / `templates`: decayed cumulative totals vs. distinct vocabulary sizes
Use this to alert when `renormalizations` spikes unexpectedly, or when the vocabulary approaches its caps (`max_tokens`, `max_templates`).
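A monitoring hook over the `/metrics` payload might look like this. Field names follow the example above; the 90 % headroom threshold and the helper itself are illustrative assumptions:

```python
def metrics_warnings(m: dict, vocab_headroom: float = 0.9) -> list[str]:
    """Flag concerning conditions in an elaborlog /metrics payload."""
    warnings = []
    cfg = m.get("config", {})
    # Vocabulary approaching its caps means new tokens/templates will evict old ones.
    if m.get("tokens", 0) >= vocab_headroom * cfg.get("max_tokens", float("inf")):
        warnings.append("token vocabulary near cap")
    if m.get("templates", 0) >= vocab_headroom * cfg.get("max_templates", float("inf")):
        warnings.append("template vocabulary near cap")
    # Any truncation means guardrails fired on pathological input lines.
    if m.get("truncated_lines", 0) or m.get("truncated_tokens", 0):
        warnings.append("guardrails active (truncated input)")
    return warnings
```

Detecting a `renormalizations` spike additionally requires comparing against the previous poll, which is left out here for brevity.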
A lightweight benchmark runs in CI (Python 3.11) and uploads bench-result.json. Treat early numbers as baselines; you can diff artifacts across commits to spot performance regressions.
CI enforces a minimum throughput ratio vs. a committed baseline (bench/baseline.json). After each benchmark run we execute:
```
python scripts/check_benchmark.py --current bench-result.json --baseline bench/baseline.json --min-ratio 0.90
```
If current lines_per_sec / baseline_lines_per_sec < 0.90, the job fails, flagging a likely regression. Workflow for updating the baseline after an intentional improvement:
- Run a representative local benchmark (repeat 3–5×, take the median).
- Edit `bench/baseline.json` with the new stable `lines_per_sec`.
- Commit and open a PR (include rationale: hardware, command, median, variance).
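The gate enforced by `scripts/check_benchmark.py` amounts to a single ratio check. A minimal sketch (assuming both JSON files expose a `lines_per_sec` field, and treating a zero baseline as always passing, per the placeholder behavior described below):

```python
import json

def benchmark_passes(current_path: str, baseline_path: str, min_ratio: float = 0.90) -> bool:
    """Fail when current throughput drops below min_ratio of the baseline."""
    with open(current_path) as fh:
        current = json.load(fh)["lines_per_sec"]
    with open(baseline_path) as fh:
        baseline = json.load(fh)["lines_per_sec"]
    if baseline <= 0:  # placeholder baseline: always pass
        return True
    return current / baseline >= min_ratio
```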
Setting an initial baseline: the placeholder ships with 0 (always passes). Replace it once numbers stabilize on your primary CI runner.
Tune sensitivity by adjusting `--min-ratio` (e.g. 0.95 for stricter, 0.85 for looser) in `.github/workflows/ci.yml`.
Mypy runs in CI with a moderately strict profile:
```toml
[tool.mypy]
disallow_untyped_defs = true
disallow_incomplete_defs = true
no_implicit_optional = true
warn_unused_ignores = true
warn_redundant_casts = true
strict_equality = true
```
Local invocation:

```
mypy src
```
Adding new modules? Prefer explicit return types. If a dynamic construct defies precise typing, isolate it and add a narrow `# type: ignore[code]` with justification.
By default tail uses the P² algorithm (Jain & Chlamtac, 1985) to maintain a target high percentile of novelty in constant space (5 markers) and O(1) amortized update time. This avoids keeping a large rolling buffer and performing sorts.
Key properties:
- Memory: constant (5 marker heights, positions, desired positions, increments)
- Update: a few arithmetic ops and conditional parabolic / linear adjustments
- Convergence: fast after the initial 5-sample bootstrap and burn-in; statistically validated in `test_streaming_quantile_alert_rate`.
- Early phase (<5 samples): falls back to exact interpolation of observed values.
Use a fixed rolling window quantile instead by specifying --window N; this keeps the last N scores (memory O(N)) and computes quantiles via a partial sort.
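For reference, the P² update fits in a few dozen lines. This is a textbook sketch of Jain & Chlamtac's algorithm (0-indexed marker positions), not Elaborlog's internal class:

```python
import math

class P2Quantile:
    """Streaming estimate of the p-quantile with five markers (P² algorithm)."""

    def __init__(self, p: float) -> None:
        self.p = p
        self.q: list[float] = []                         # marker heights
        self.n = [0, 1, 2, 3, 4]                         # actual positions
        self.np = [0.0, 2 * p, 4 * p, 2 + 2 * p, 4.0]    # desired positions
        self.dn = [0.0, p / 2, p, (1 + p) / 2, 1.0]      # desired increments

    def observe(self, x: float) -> None:
        if len(self.q) < 5:                              # bootstrap on first 5 samples
            self.q.append(x)
            self.q.sort()
            return
        if x < self.q[0]:                                # clamp the extremes
            self.q[0] = x
            k = 0
        elif x >= self.q[4]:
            self.q[4] = x
            k = 3
        else:                                            # find the cell containing x
            k = next(i for i in range(4) if self.q[i] <= x < self.q[i + 1])
        for i in range(k + 1, 5):
            self.n[i] += 1
        for i in range(5):
            self.np[i] += self.dn[i]
        for i in (1, 2, 3):                              # nudge middle markers back on target
            d = self.np[i] - self.n[i]
            if (d >= 1 and self.n[i + 1] - self.n[i] > 1) or \
               (d <= -1 and self.n[i - 1] - self.n[i] < -1):
                d = int(math.copysign(1, d))
                cand = self._parabolic(i, d)
                if self.q[i - 1] < cand < self.q[i + 1]:
                    self.q[i] = cand                     # parabolic (P²) adjustment
                else:                                    # fall back to linear adjustment
                    self.q[i] += d * (self.q[i + d] - self.q[i]) / (self.n[i + d] - self.n[i])
                self.n[i] += d

    def _parabolic(self, i: int, d: int) -> float:
        n, q = self.n, self.q
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        )

    def estimate(self) -> float:
        if len(self.q) < 5:                              # early phase: exact interpolation
            s = sorted(self.q)
            return s[max(0, min(len(s) - 1, round(self.p * (len(s) - 1))))]
        return self.q[2]                                 # middle marker tracks the p-quantile
```

Only the five heights and a handful of counters persist between updates, which is why the estimator is snapshot-friendly and O(1) per line.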
Supply multiple quantiles to tail for stricter alerting and richer telemetry:
```
elaborlog tail prod.log --quantiles 0.99 0.995 0.998
```

Behavior:
- Maintains one P² estimator per supplied quantile.
- Determines the alert threshold using the highest quantile (strictest).
- Prints all intermediate quantile estimates in alert lines for observability.
- JSONL alerts contain the active quantile (`quantile` field), equal to the highest q.
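The threshold selection described above reduces to "the strictest estimate wins". A minimal sketch with hypothetical helpers, taking per-quantile estimates (such as the q0.990/q0.995/q0.998 values shown in the example alert line) as a plain dict:

```python
def active_threshold(estimates: dict[float, float]) -> tuple[float, float]:
    """Return (quantile, threshold): the highest q gates alerting."""
    q = max(estimates)
    return q, estimates[q]

def should_alert(novelty: float, estimates: dict[float, float]) -> bool:
    _, threshold = active_threshold(estimates)
    return novelty >= threshold
```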
Example live alert line (color disabled, P² mode):
```
2025-10-04T12:00:00Z [ERROR] novelty=0.912 (q0.990=0.845,q0.995=0.872,q0.998=0.901; using>=0.901) score=13.442 ERROR payment declined code=402 user=9922
  -> neighbor (sim=0.67): ERROR payment failed code=500 user=9911
  template=ERROR payment declined code=<num> user=<num> p~0.00042
```
When should you use multi-quantile mode?
- You want to monitor drift across several tail percentiles simultaneously.
- You’re tuning which high percentile best balances volume vs. signal before fixing on one.
Burn-in (--burn-in) still applies; no alerts are emitted until the specified number of lines passes (plus a minimum of 10 observations for numerical stability).
Practical tips:
- For very low-traffic streams, prefer a rolling window (`--window`) so early estimates remain stable.
- If the novelty distribution shifts (deployment, traffic change), P² adapts without clearing history; optionally shorten burn-in if you snapshot & resume.
If you provide both --window and multiple --quantiles, elaborlog now computes windowed quantiles for each q but still alerts only when exceeding the highest (strictest) one. Intermediate quantile estimates are displayed (and in color mode each is listed) so you can watch how the tail compresses.
Use --no-follow to process the existing file from the beginning exactly once and exit (ideal for CI and reproducible benchmarks):
```
elaborlog tail prod.log --quantiles 0.99 0.995 0.998 --no-follow --jsonl alerts.jsonl
```

Without `--no-follow`, tailing starts at end-of-file (like `tail -f`) and waits for new lines.
If you also want every individual quantile estimate serialized for machine analysis, add --emit-intermediate to include a quantile_estimates object in each JSONL alert.
To avoid UnicodeEncodeError on consoles lacking UTF-8 (e.g., some Windows code pages) threshold and neighbor glyphs were replaced with ASCII equivalents (>=, ->). If you depended on the previous symbols, update parsers accordingly.
```json
{
  "timestamp": "2025-10-04T12:00:00Z",
  "level": "ERROR",
  "novelty": 0.912,
  "score": 13.442,
  "token_info_bits": 9.12,
  "template_info_bits": 3.02,
  "level_bonus": 0.70,
  "template": "ERROR payment declined code=<num> user=<num>",
  "template_probability": 0.00042,
  "tokens": ["error", "payment", "declined", "code", "402", "user", "9922"],
  "token_contributors": [{"token": "declined", "prob": 0.05, "bits": 4.30, "freq": 1}],
  "line": "ERROR payment declined code=402 user=9922",
  "threshold": 0.901,
  "quantile": 0.998,
  "quantile_estimates": {"0.990": 0.845, "0.995": 0.872, "0.998": 0.901},
  "neighbors": [{"similarity": 0.67, "line": "ERROR payment failed code=500 user=9911"}]
}
```

| Aspect | Streaming P² | Rolling Window |
|---|---|---|
| Memory | O(1) per quantile | O(W) scores |
| Adaptation | Fast, smooth | Stepwise (window contents) |
| Historical inertia | Retains long-tail influence (decayed) | Strictly last W lines |
| Best for | Long-lived services, shifting distributions | Short batch analyses, low-volume logs |
| Multi-quantiles | Independent P² markers per q | Recompute per q from deque |
Heuristics:
- Prefer P² for continuous streams or when memory is tight.
- Prefer window when early-phase stability matters (small datasets / test fixtures) or you need exact quantiles over a bounded horizon.
Schemas now employ $defs for tokenContributor and a timestamp pattern (subset RFC3339). Downstream generators can rely on stable contributor object shape.
- Colorized/structured terminal output by default (optional rich).
- Windows Event Log reader and journalctl adapter.
- File globbing across rotated logs.
- Streaming clustering & online template evolution stats.
- Multi-source ingestion / metrics exporter.
- Optional approximate nearest-neighbor for large context windows.
Apache-2.0
Changelog: see CHANGELOG.md for versioned feature history.
Alert JSONL example (tail):

```json
{
  "timestamp": "2025-10-01T12:34:56Z",
  "level": "ERROR",
  "novelty": 0.873,
  "score": 12.45,
  "token_info_bits": 9.12,
  "template_info_bits": 3.02,
  "level_bonus": 0.70,
  "template": "ERROR payment declined code=<num> user=<num>",
  "template_probability": 0.00042,
  "tokens": ["error", "payment", "declined", "code", "402", "user", "9922"],
  "token_contributors": [{"token": "declined", "bits": 4.3, "prob": 0.05, "freq": 1}],
  "threshold": 0.861,
  "quantile": 0.992,
  "neighbors": [{"similarity": 0.62, "line": "ERROR payment failed code=500 user=9911"}]
}
```