or4k2l/THUNBIT

THUNBIT

Research prototype. THUNBIT is an experimental detector for demand-regime instability in daily SKU-level demand series. It is not a forecasting model, not production-ready, and has not been formally validated on real-world data.

THUNBIT converts changes in demand behaviour into an operational confidence signal and state sequence (STABLE, DRIFT, SHIFT).

[Figure: thunbit_synthetic_explainer]

THUNBIT is a research prototype for detecting when daily demand has stopped behaving like the pattern an inventory or forecasting workflow implicitly relies on. It combines three statistical evidence channels into a single confidence score and maps it to an operational state signal:

| State | Meaning |
|---|---|
| STABLE | No meaningful evidence of distributional change. |
| DRIFT | Sustained moderate evidence of change; review recommended. |
| SHIFT | Sustained strong evidence of a structural break. |

V4.5 keeps the V4.4 core logic but suppresses short, weak, isolated alert episodes through adaptive episode gating.

[Figure: thunbit_v44_vs_v45_example]


What THUNBIT is

  • A demand-regime change indicator that flags whether demand behaviour has changed or is still consistent with its recent pattern.
  • A research tool for exploring detection trade-offs across simulated demand scenarios.
  • A Python package you can import and run on a demand array in a few lines of code.

What THUNBIT is not

  • A forecasting model. It produces no demand forecasts, safety-stock levels, or reorder-point recommendations.
  • A validated production system. Quantitative benchmarks come from synthetic data; public-dataset plausibility checks (Favorita, Rossmann) are exploratory and unlabeled. Real-world performance is unknown.
  • A solved problem. False alerts on stable series are a known open limitation (see docs/limitations.md).

Quick start

pip install -e .          # install from repo root
python examples/basic_usage.py

Or open the walkthrough notebook:

pip install -e ".[dev]"   # includes jupyter
jupyter notebook notebooks/01_detector_walkthrough.ipynb

Or directly in Python:

import numpy as np
from thunbit import StabilizedDemandDetectorV45

# Your daily demand series.
# The first window_long + window_short = 111 observations are consumed before
# the first output row is produced; provide more data for meaningful detection results.
demand = np.array([...])

det = StabilizedDemandDetectorV45()
result = det.detect_rolling_stabilized(demand)
print(result[["t", "state", "confidence", "action"]].tail(10))

V4.5 is the current refinement release (V4.4 core + adaptive episode gating).
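
If no demand series is at hand, a synthetic one is enough to try the call above. The following is a minimal sketch, not the generator used in examples/basic_usage.py or in the benchmarks: the function name `make_synthetic_demand` and all of its parameters (series length, break day, base level, shift factor, noise level) are hypothetical choices made for illustration. It produces a daily series with a weekly cycle and a level shift, longer than the 111 observations consumed before the first output row.

```python
import math
import random


def make_synthetic_demand(n_days=400, break_day=250, base=20.0, shift=1.6, seed=0):
    """Daily demand with a weekly cycle and a level shift at break_day.

    Illustrative only: every parameter here is a hypothetical choice.
    """
    rng = random.Random(seed)
    series = []
    for t in range(n_days):
        # Weekly seasonality: a sine wave over the day-of-week index.
        weekly = 1.0 + 0.3 * math.sin(2 * math.pi * (t % 7) / 7)
        # Level jumps by the factor `shift` from break_day onwards.
        level = base * (shift if t >= break_day else 1.0)
        noise = rng.gauss(0.0, 2.0)
        # Demand cannot be negative.
        series.append(max(0.0, level * weekly + noise))
    return series


demand = make_synthetic_demand()
```

The resulting list can be wrapped with `np.asarray(demand)` before being passed to `det.detect_rolling_stabilized`.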


Project layout

thunbit/            Python package
  detector.py       Baseline DemandStateDetector (no state machine)
  stabilized.py     Stabilized variants: V4, V4.1, V4.2, V4.3, V4.4, V4.5
  _states.py        State constants

examples/
  basic_usage.py    Runnable end-to-end example (synthetic data)

notebooks/
  01_detector_walkthrough.ipynb    Import, run, and inspect all detector variants
  02_benchmark_iterations.ipynb   Reproducible benchmark across synthetic scenarios
  03_cost_simulation.ipynb        Illustrative cost framing (directional only)

docs/
  methodology.md      How the detector works
  benchmarking.md     Simulation results and iteration history
  limitations.md      Known open problems
  reproducibility.md  How to install, run, and reproduce results

data/               Placeholder for data files (currently synthetic only)
tests/              Lightweight unit tests

Methodology overview

Three statistical evidence channels are combined:

  1. Distribution change – two-sample KS statistic between a long reference window and a short current window.
  2. Volatility change – normalised absolute variance change between windows.
  3. Weekly-pattern instability – change in lag-7 autocorrelation between windows.

A weighted combination of the maximum and mean evidence values forms a raw_confidence score (0–1). The stabilized variants layer a state machine (hysteresis + smoothing + confirmation + cooldown) on top of this score to reduce noisy state flickering.
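
The three channels and their combination can be sketched in plain Python. This is an illustration under stated assumptions, not the code in thunbit/detector.py: the function names, the clipping to [0, 1], the division by 2 in the weekly channel, and the weight `w_max=0.6` are all hypothetical. The KS statistic is computed by hand here to keep the sketch dependency-free; `scipy.stats.ks_2samp` returns the same statistic.

```python
from bisect import bisect_right
from statistics import mean, pvariance


def ks_statistic(ref, cur):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    ref_sorted, cur_sorted = sorted(ref), sorted(cur)
    gap = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in ref_sorted + cur_sorted:
        f_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        f_cur = bisect_right(cur_sorted, x) / len(cur_sorted)
        gap = max(gap, abs(f_ref - f_cur))
    return gap


def variance_change(ref, cur):
    """Absolute variance change relative to the reference, clipped to [0, 1]."""
    v_ref, v_cur = pvariance(ref), pvariance(cur)
    if v_ref == 0:
        return 0.0 if v_cur == 0 else 1.0
    return min(1.0, abs(v_cur - v_ref) / v_ref)


def lag_autocorr(x, lag=7):
    """Sample autocorrelation at the given lag (0.0 for a constant series)."""
    m = mean(x)
    denom = sum((v - m) ** 2 for v in x)
    if denom == 0:
        return 0.0
    return sum((x[i] - m) * (x[i + lag] - m) for i in range(len(x) - lag)) / denom


def weekly_instability(ref, cur):
    """Absolute change in lag-7 autocorrelation, scaled from [0, 2] to [0, 1]."""
    return min(1.0, abs(lag_autocorr(cur) - lag_autocorr(ref)) / 2.0)


def raw_confidence(ref, cur, w_max=0.6):
    """Weighted blend of the maximum and the mean of the three evidence values."""
    evidence = [ks_statistic(ref, cur), variance_change(ref, cur), weekly_instability(ref, cur)]
    return w_max * max(evidence) + (1.0 - w_max) * mean(evidence)
```

Blending the maximum with the mean lets one strong channel raise the score on its own, while agreement across channels raises it further.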

V4.2, V4.3, and V4.4 go further by normalizing the raw score against a rolling baseline of the SKU's own recent confidence values, producing a relative-to-own-noise signal. Benchmarks showed this is the key mechanism for reducing false alerts on stable series. V4.3 introduced the 25th-percentile (lower-quantile) baseline over a longer window plus warmup suppression. V4.4 (the V4.4b calibration) keeps that normalized-score design and applies a stricter state-machine calibration.
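
The idea of scoring relative to the SKU's own noise floor can be sketched as follows. This is one plausible reading of the description above, not the package's implementation: the function name and the parameters `baseline_window`, `warmup`, and `excess_scale` (and their default values) are hypothetical, as is the linear rescaling of the excess.

```python
from statistics import quantiles


def normalize_against_baseline(raw_scores, baseline_window=60, warmup=30, excess_scale=0.3):
    """Score each day by how far raw confidence exceeds the lower quartile
    of its own recent history. All defaults are hypothetical."""
    normalized = []
    for t, score in enumerate(raw_scores):
        # Only past values enter the baseline, so the current day cannot mask itself.
        history = raw_scores[max(0, t - baseline_window):t]
        if len(history) < warmup:
            # Warmup suppression: too little history to estimate a noise floor.
            normalized.append(0.0)
            continue
        q25 = quantiles(history, n=4, method="inclusive")[0]
        excess = max(0.0, score - q25)
        normalized.append(min(1.0, excess / excess_scale))
    return normalized
```

A lower-quantile baseline sits below a median baseline whenever recent scores are spread out, so the same raw score yields a larger excess. That is consistent with V4.3 being described as more responsive than the median-based V4.2.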

V4.5 is an operational refinement layer over true V4.4 episodes: it applies adaptive episode gating to suppress short, weak, isolated alerts (suppress_max_len=3, suppress_max_mean_conf=0.52, suppress_min_prev_gap=14, suppress_min_next_gap=14), with merge disabled by default. V4.5 is not a new state model and not a confidence-remapping step.
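
A minimal sketch of episode gating with the four parameters listed above. How the conditions are combined is an assumption: here an episode is suppressed only if it is short and weak and isolated on both sides, gaps are measured against the original (ungated) episodes, and the distance to the start or end of the series counts as a gap. The function names are hypothetical, and merging is left out because it is disabled by default.

```python
def find_episodes(states):
    """Return (start, end_exclusive) index pairs for each run of non-STABLE days."""
    episodes, start = [], None
    for i, s in enumerate(states):
        if s != "STABLE" and start is None:
            start = i
        elif s == "STABLE" and start is not None:
            episodes.append((start, i))
            start = None
    if start is not None:
        episodes.append((start, len(states)))
    return episodes


def gate_episodes(states, confidence, suppress_max_len=3, suppress_max_mean_conf=0.52,
                  suppress_min_prev_gap=14, suppress_min_next_gap=14):
    """Relabel short, weak, isolated alert episodes as STABLE (assumed AND logic)."""
    gated = list(states)
    episodes = find_episodes(states)
    for k, (start, end) in enumerate(episodes):
        # Gap to the neighbouring episode, or to the series boundary if there is none.
        prev_gap = start - episodes[k - 1][1] if k > 0 else start
        next_gap = episodes[k + 1][0] - end if k + 1 < len(episodes) else len(states) - end
        length = end - start
        short = length <= suppress_max_len
        weak = sum(confidence[start:end]) / length <= suppress_max_mean_conf
        isolated = prev_gap >= suppress_min_prev_gap and next_gap >= suppress_min_next_gap
        if short and weak and isolated:
            gated[start:end] = ["STABLE"] * length
    return gated
```

Because gating only relabels existing episodes, the confidence values are left untouched, in line with V4.5 not being a confidence-remapping step.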

See docs/methodology.md for full details.


Benchmark summary

All results are from synthetic demand simulations across 10 random seeds. They are exploratory findings, not validated performance claims.

Iteration path:

| Version | Key change | Stable mean_alert_days_pct | Notes |
|---|---|---|---|
| Old baseline | No state machine | 32.4% | Fast but noisy |
| V4 | Hysteresis + smoothing + confirmation | 27.5% | Adds detection delay |
| V4.1 | Cooldown, relaxed thresholds | 32.0% | Recovers some speed vs V4 |
| V4.2 | Baseline-normalized scoring (median) | 3.2% | Major improvement; over-damped on breaks |
| V4.3 | Lower-quantile baseline + warmup | ~10.6% | Historical comparison point; more responsive |
| V4.4 (current, V4.4b calibration) | V4.3 normalized score + stricter state calibration | ~7.8% | Current recommended conservative experimental operating point |

V4.1 vs V4.2 – stable-series false alerts:

| Metric | Old baseline | V4.1 | V4.2 |
|---|---|---|---|
| any_alert_rate | 1.00 | 1.00 | 0.60 |
| mean_alert_days_pct | 32.4% | 32.0% | 3.2% |
| mean_fp_clusters | 6.3 | 4.7 | 1.0 |

V4.2 confirmed that baseline-normalized scoring is the right design direction for reducing false-alert burden. However, it roughly tripled detection delay on cycle-break, gradual-drift, and intermittent scenarios.
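
For readers who want to compute comparable stable-series metrics on their own runs, the sketch below shows one way to do it. The definitions are inferred from the metric names and are assumptions, not the definitions in docs/benchmarking.md: an alert day is any non-STABLE day, and a false-positive cluster is a maximal run of consecutive alert days on a series with no true break. The function name is hypothetical.

```python
def stable_series_metrics(state_runs):
    """Summarise false alerts over state sequences from stable (break-free) series.

    state_runs: list of state sequences, one per simulated series or seed.
    """
    any_alert, alert_pct, clusters = [], [], []
    for states in state_runs:
        alerts = [s != "STABLE" for s in states]
        any_alert.append(1.0 if any(alerts) else 0.0)
        alert_pct.append(100.0 * sum(alerts) / len(states))
        # Count cluster starts: an alert day not preceded by another alert day.
        clusters.append(sum(1 for i, a in enumerate(alerts) if a and (i == 0 or not alerts[i - 1])))
    n = len(state_runs)
    return {
        "any_alert_rate": sum(any_alert) / n,
        "mean_alert_days_pct": sum(alert_pct) / n,
        "mean_fp_clusters": sum(clusters) / n,
    }
```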

V4.4 is now the recommended experimental operating point for a quieter / more conservative alerting posture. It keeps V4.3's lower-quantile normalized-score design and uses stricter state-machine calibration (V4.4b): drift_entry=0.42, drift_confirm_days=3, shift_entry=0.68, shift_confirm_days=1.
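
The role of the entry thresholds and confirmation days can be illustrated with a simplified state assignment. This is a sketch, not the V4.4 state machine: smoothing and cooldown are omitted, the single exit threshold `exit_level=0.30` is a hypothetical stand-in for the hysteresis logic, and the function name is invented. Only the four V4.4b values quoted above are taken from this document.

```python
STABLE, DRIFT, SHIFT = "STABLE", "DRIFT", "SHIFT"


def assign_states(scores, drift_entry=0.42, drift_confirm_days=3,
                  shift_entry=0.68, shift_confirm_days=1, exit_level=0.30):
    """Confirmation-based state assignment over a normalized score sequence."""
    state, drift_run, shift_run, states = STABLE, 0, 0, []
    for s in scores:
        # Count consecutive days at or above each entry threshold.
        drift_run = drift_run + 1 if s >= drift_entry else 0
        shift_run = shift_run + 1 if s >= shift_entry else 0
        if shift_run >= shift_confirm_days:
            state = SHIFT
        elif drift_run >= drift_confirm_days and state != SHIFT:
            state = DRIFT
        elif s < exit_level:
            # Hysteresis: an alert state is held until the score falls below exit_level.
            state = STABLE
        states.append(state)
    return states


print(assign_states([0.1, 0.45, 0.45, 0.45, 0.5, 0.35, 0.2, 0.7, 0.5, 0.1]))
```

In this sketch DRIFT is entered only on the third consecutive day at or above 0.42, while a single day at or above 0.68 is enough for SHIFT. Raising the thresholds or the confirmation days trades detection speed for fewer alerts.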

A completed synthetic benchmark comparison (all six scenarios, 10 seeds each) shows that V4.4 is a more conservative operating point, not a universally superior detector:

  • V4.4 reduces stable-series alert burden (~7.8% alert days vs ~10.6% for V4.3) and false-positive clustering (mean 2.4 FP clusters vs 3.6).
  • V4.4 also reduces break-detection sensitivity and increases mean detection delay across synthetic scenarios, especially for cycle_break (detection rate 0.70 vs 0.90; mean days late 47.3 vs 17.8).

V4.3 remains the more responsive historical comparison point and should be preferred when faster synthetic break detection is the priority.
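
The sensitivity side of this trade-off can be measured along the following lines. The definitions are assumptions rather than the ones in docs/benchmarking.md: a break counts as detected if any non-STABLE day occurs on or after the true break day, and mean days late is averaged over detected runs only. The function name and the single shared `break_day` are hypothetical simplifications.

```python
def detection_summary(state_runs, break_day):
    """Detection rate and mean delay over runs that share one true break day."""
    delays = []
    for states in state_runs:
        # First alert day at or after the true break, if any.
        first_alert = next(
            (t for t in range(break_day, len(states)) if states[t] != "STABLE"),
            None,
        )
        if first_alert is not None:
            delays.append(first_alert - break_day)
    return {
        "detection_rate": len(delays) / len(state_runs),
        # None when no run detected the break at all.
        "mean_days_late": sum(delays) / len(delays) if delays else None,
    }
```

Alerts raised before `break_day` are ignored here; on a break scenario they would be counted as false alerts rather than as early detections.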

The V4.4b calibration choice is further supported by exploratory public-dataset plausibility audits across two real-world retail datasets:

  • Favorita (Ecuador grocery, daily store_nbr × family panel): the strongest public external plausibility check run so far. V4.4 again appeared as a quieter / more conservative operating point than V4.3. Alerts were frequently adjacent to transaction-shock windows and often near mapped holiday periods.
  • Rossmann (German drug-store chain, daily): a secondary corroborating retail plausibility check. Alerts clustered near store closure and reopening transitions. V4.4 was again quieter than V4.3.

V4.5 adaptive-gating follow-up checks on these same panels were consistent with V4.5 acting as a refinement over V4.4 episodes rather than as a fundamentally new detector:

  • Favorita: a meaningful share of series changed, alert burden was slightly reduced, and cluster fragmentation was reduced.
  • Rossmann: smaller but still positive improvements in alert burden and fragmentation.

Conclusion: V4.5 should be interpreted as a practical refinement layer over V4.4's core detector output.

These are exploratory event-linked plausibility audits, not gold-labeled real-world validation. Synthetic benchmarks remain the quantitative core and the primary controlled basis for comparing V4.3 and V4.4. False-alert calibration remains an active open problem.

See docs/benchmarking.md for the full iteration history and quantitative tables.


Current status

| Item | Status |
|---|---|
| Core detector and state machine | ✅ implemented |
| V4.4 as recommended experimental detector | ✅ V4.4b calibration on V4.3 normalized-score design |
| V4.5 adaptive episode-gating refinement layer | ✅ implemented over true V4.4 output (merge disabled by default) |
| Synthetic benchmark (old vs. V4 vs. V4.1) | ✅ complete |
| V4.2 score-normalization benchmark | ✅ complete |
| V4.3 + V4.4 synthetic benchmark (direct comparison) | ✅ complete; see docs/benchmarking.md |
| Benchmark and walkthrough notebooks | ✅ committed (notebooks/) |
| Reproducibility documentation | ✅ see docs/reproducibility.md |
| Stable-series false-alert calibration | ❌ open problem (improved but unsolved) |
| Public plausibility audits | ⚠️ exploratory: Favorita (strongest external check) + Rossmann (secondary corroboration); no formal validation |
| Automated parameter tuning | ❌ not started |

Known limitations

  • Stable-series false alerts are reduced by V4.2/V4.3/V4.4 but not eliminated. Score calibration remains the central open challenge.
  • Synthetic benchmarking remains the quantitative core; public-dataset checks (Favorita, Rossmann) are exploratory event-linked plausibility audits only (not gold-labeled real-world validation).
  • Detection delay on gradual-drift and intermittent demand is higher than the baseline at V4+ settings.
  • Parameter calibration was manual; no automated tuning is included.

See docs/limitations.md for the full list.


Roadmap

  • Quantitative V4.4 benchmark across all six scenarios and 10 seeds (complete — see docs/benchmarking.md)
  • Investigate lower-quantile baseline parameter sensitivity (quantile level, window length, excess scale)
  • Automated parameter sweep over stable / drift / shift trade-off space
  • Test on anonymised real SKU demand data
  • Document business-cost simulation methodology

Requirements

  • Python ≥ 3.9
  • numpy ≥ 1.22
  • pandas ≥ 1.4
  • scipy ≥ 1.8

License

Apache 2.0 – see LICENSE.