rtForecastEval: EVALuating Real-Time Probabilistic Forecast

April 9, 2026

Repository: github.com/chikuang/rtForecastEval

Authors

Chi-Kuang Yeh (Georgia State University)

Gregory Rice (University of Waterloo)

Joel A. Dubin (University of Waterloo)

Overview

rtForecastEval implements the methodology in Yeh, Rice, and Dubin (2022) (preprint arXiv:2010.00781, PDF). The paper develops tools for continuously updated probabilistic forecasts: calibration-style summaries, skill (relative performance of two forecasters) via a global delta test on L²[0,1] under squared (Brier) loss, and simple pointwise inference for the mean loss difference over time.

This package ships the statistical core used in the paper’s simulation sections. Full NBA replication (scraped data, calibration surfaces, competitor models) lives in a separate analysis repository and is not bundled here.

Package contents

Topic	Functions
Simulation designs	`df_gen()` — requires the sde package at runtime
Pointwise loss / variance	`calc_L_s2()`, `lin_interp()`
Global delta test	`calc_Z()`, `calc_eig()`, `calc_pval()`
Plotting	`plot_pcb()` (naive pointwise skill band); README below also shows reliability plots and a calibration-difference curve between forecasters

After installation, run help(package = "rtForecastEval") or ?df_gen for full documentation.

Mind map and workflow

The diagrams below use Mermaid. They render on GitHub; in other viewers you may see the source only.

Mind map (package scope)

mindmap
  root((rtForecastEval))
    Paper
      Yeh Rice Dubin 2022
      Real-time probabilistic forecasts
    Inputs
      df_gen simulation
      Your own long-format data
    Global skill test
      calc_Z delta statistic
      calc_eig covariance spectrum
      calc_pval MC p-values
    Pointwise skill
      calc_L_s2 loss and variance
      plot_pcb naive band
    Other
      lin_interp time grid

Analysis workflow

flowchart TD
  subgraph src["Data sources"]
    A["df_gen()<br>(simulation)"]
    B["Your forecasts<br>(long format)"]
  end
  C["Long tibble:<br>game, grid, Y,<br>phat_A, phat_B"]
  D["Centered / non-centered<br>differences"]
  E["calc_Z()"]
  F["calc_eig()"]
  G["calc_pval()"]
  H["calc_L_s2()"]
  P["plot_pcb()"]
  A --> C
  B --> C
  C --> D
  D --> E
  D --> F
  E --> G
  F --> G
  D --> H
  H --> P

Installation

rtForecastEval is distributed as an R package on GitHub (not yet on CRAN). After installation, load library(rtForecastEval).

Recommended (pak handles dependencies cleanly):

install.packages("pak")
pak::pak("chikuang/rtForecastEval")

Alternative: remotes::install_github("chikuang/rtForecastEval") after install.packages("remotes"). (As of devtools 2.5.0, devtools::install_github() is deprecated in favor of pak::pak() or remotes::install_github().)

If you clone the repository and work from the source tree, use devtools::load_all() or pkgload::load_all() from the package root instead of installing from GitHub.

Simulation via df_gen() also needs the suggested sde package (Brownian paths):

install.packages("sde")

Other languages (Python, Julia, Matlab) are not implemented in this repo; only R is supported for now.

Minimal example

After installing the package (and sde for df_gen()), load dplyr, tidyr, ggplot2, and rtForecastEval. The chunks below execute when you render README.qmd (e.g. quarto render README.qmd or Rscript tools/render-readme.R) with the package available — for example devtools::load_all() from a local clone or library(rtForecastEval) after pak::pak("chikuang/rtForecastEval").

library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)
library(rtForecastEval)

set.seed(777)
nsamp <- 40   # interior steps; df_gen() uses a grid of nsamp + 1 points on [0, 1]
ngame <- 1000  # independent replicates (“games”); larger n → more games per calibration bin
D <- 8        # leading eigenvalues for the chi-square approximation
N_MC <- 2000  # Monte Carlo draws for the p-value

df_equ <- df_gen(N = nsamp, Ngame = ngame) |>
  group_by(grid) |>
  mutate(
    p_bar_12 = mean(phat_A - phat_B),
    diff_non_cent = phat_A - phat_B,
    diff_cent = phat_A - phat_B - p_bar_12
  ) |>
  ungroup()

ZZ <- calc_Z(df_equ, "phat_A", "phat_B", "Y", nsamp = nsamp, ngame = ngame)
eig <- calc_eig(df_equ, n_eig = D, ngame = ngame, nsamp = nsamp, cent = FALSE)
out <- calc_pval(ZZ, eig, quan = c(0.9, 0.95, 0.99), n_MC = N_MC)

Ltab <- calc_L_s2(df_equ, "phat_A", "phat_B")
plot_pcb(Ltab)

c(Z = ZZ, p_value = out$p_val, out$quantile)

         Z    p_value        90%        95%        99% 
0.05470115 0.70600000 0.39170427 0.55757635 1.01968217

Pointwise skill plot and a simple calibration (reliability) view

plot_pcb() plots the pointwise mean loss difference (relative skill) with a naive normal band — not a classical calibration diagram.

A complementary check is marginal calibration at one time: bin predicted probabilities at a single grid (here, closest to mid-game, t ≈ 0.5) and compare mean outcome to mean forecast (reliability diagram). Grey vertical segments show a 95% central range for the bin mean of Y under H₀ (perfect calibration in the bin): under the null, outcomes are Bernoulli with probability equal to the bin’s mean forecast p̄, so the bin sum is Binomial(n, p̄) and the segment endpoints use exact quantiles of that binomial (no normal approximation or asymmetric clipping to [0, 1], which can make bands look “anchored” at y = 0). Full calibration surfaces from the paper are in the NBA replication code; this is a minimal ggplot using the same long-format columns (phat_A, phat_B, Y, grid).

# Uses `df_equ` from the chunk above
g_mid <- df_equ |>
  distinct(grid) |>
  slice_min(order_by = abs(grid - 0.5), n = 1) |>
  pull(grid)

slice_t <- df_equ |>
  filter(grid == g_mid) |>
  transmute(game, Y, phat_A, phat_B)

rel_long <- slice_t |>
  pivot_longer(c(phat_A, phat_B), names_to = "forecaster", values_to = "phat") |>
  mutate(forecaster = ifelse(forecaster == "phat_A", "Forecaster A", "Forecaster B"))

rel_binned <- rel_long |>
  group_by(forecaster) |>
  mutate(bin = ntile(phat, 10)) |>
  group_by(forecaster, bin) |>
  summarise(
    mean_forecast = mean(phat),
    mean_outcome = mean(Y),
    n_games = n(),
    .groups = "drop"
  ) |>
  mutate(
    p_bin = pmin(pmax(mean_forecast, 0), 1),
    ci_lo = stats::qbinom(0.025, size = n_games, prob = p_bin) / n_games,
    ci_hi = stats::qbinom(0.975, size = n_games, prob = p_bin) / n_games
  )

ggplot(rel_binned, aes(mean_forecast, mean_outcome)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey45") +
  geom_errorbar(
    aes(x = mean_forecast, ymin = ci_lo, ymax = ci_hi),
    width = 0.02,
    color = "grey55",
    alpha = 0.85,
    linewidth = 0.35
  ) +
  geom_point(aes(size = n_games), alpha = 0.9, color = "darkred") +
  facet_wrap(~forecaster, nrow = 1) +
  coord_fixed(xlim = c(0, 1), ylim = c(0, 1), ratio = 1) +
  labs(
    title = "Binned reliability (calibration) at one grid",
    subtitle = paste0(
      "Grid = ", signif(g_mid, 3),
      "; grey: exact 95% central range for mean(Y) under H0 (Binomial; see text)"
    ),
    x = "Mean forecast in bin",
    y = "Mean outcome (Y) in bin",
    size = "Games"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(face = "bold")
  )

Reliability with shared bins (two standard diagrams)

The facet plot above uses separate decile bins per forecaster, so bins are not aligned across A vs B. Here we bin the same games with equal-width cuts of the midpoint m = (phat_A + phat_B) / 2 (same definition as in the code chunk). For each bin we compute mean(Y) and each forecaster’s mean forecast in that bin, then draw one classical reliability diagram per forecaster (horizontal axis: mean forecast; vertical: mean outcome; both in [0, 1]). Grey vertical segments are the exact Binomial 95% central range for mean(Y) under H₀ at that forecaster’s bin mean forecast (same construction as the first figure). This avoids overlaying two forecasters in one square, which is easy to misread.

# Same grid as above; shared midpoint bins for the next two figures
cal_bins <- slice_t |>
  mutate(mid = (phat_A + phat_B) / 2) |>
  mutate(bin = cut(mid, breaks = seq(0, 1, length.out = 11), include.lowest = TRUE))

rel_joint <- cal_bins |>
  group_by(bin) |>
  summarise(
    mean_forecast_A = mean(phat_A),
    mean_forecast_B = mean(phat_B),
    mean_outcome = mean(Y),
    n_games = n(),
    .groups = "drop"
  ) |>
  filter(!is.na(bin)) |>
  mutate(
    p_A = pmin(pmax(mean_forecast_A, 0), 1),
    p_B = pmin(pmax(mean_forecast_B, 0), 1),
    ci_lo_A = stats::qbinom(0.025, size = n_games, prob = p_A) / n_games,
    ci_hi_A = stats::qbinom(0.975, size = n_games, prob = p_A) / n_games,
    ci_lo_B = stats::qbinom(0.025, size = n_games, prob = p_B) / n_games,
    ci_hi_B = stats::qbinom(0.975, size = n_games, prob = p_B) / n_games
  )

rel_joint_long <- rel_joint |>
  tidyr::pivot_longer(
    c(mean_forecast_A, mean_forecast_B),
    names_to = "which",
    names_prefix = "mean_forecast_",
    values_to = "mean_forecast"
  ) |>
  mutate(
    forecaster = ifelse(which == "A", "Forecaster A", "Forecaster B"),
    ci_lo = ifelse(which == "A", ci_lo_A, ci_lo_B),
    ci_hi = ifelse(which == "A", ci_hi_A, ci_hi_B)
  )

ggplot(rel_joint_long, aes(mean_forecast, mean_outcome)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey45") +
  geom_errorbar(
    aes(x = mean_forecast, ymin = ci_lo, ymax = ci_hi),
    width = 0.02,
    color = "grey55",
    alpha = 0.85,
    linewidth = 0.4
  ) +
  geom_point(aes(color = forecaster, size = n_games), alpha = 0.9) +
  facet_wrap(~forecaster, nrow = 1) +
  coord_fixed(xlim = c(0, 1), ylim = c(0, 1), ratio = 1) +
  scale_color_manual(values = c("Forecaster A" = "steelblue", "Forecaster B" = "orangered")) +
  labs(
    title = "Reliability: shared midpoint bins (one classical diagram per forecaster)",
    subtitle = "Bins from cut((phat_A + phat_B) / 2); grey = exact 95% Binomial H0 range for mean(Y)",
    x = "Mean forecast in bin",
    y = "Mean outcome (Y) in bin",
    color = NULL,
    size = "Games"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(face = "bold"),
    legend.position = "right"
  ) +
  guides(color = "none")

Comparing calibration between forecasters (error curves)

The reliability diagram shows levels on the 45° plot; the plot below shows the same bins but calibration errors mean(Y) − mean forecast vs bin midpoint. Curves near 0 are better calibrated; the dashed line is mean(phat_B) − mean(phat_A) within each bin.

cal_compare <- cal_bins |>
  group_by(bin) |>
  summarise(
    mid_hat = mean(mid),
    cal_err_A = mean(Y) - mean(phat_A),
    cal_err_B = mean(Y) - mean(phat_B),
    n_games = n(),
    .groups = "drop"
  ) |>
  filter(!is.na(bin))

cal_long <- cal_compare |>
  pivot_longer(
    c(cal_err_A, cal_err_B),
    names_to = "which",
    values_to = "cal_err"
  ) |>
  mutate(
    which = ifelse(which == "cal_err_A", "A: mean(Y) - phat_A", "B: mean(Y) - phat_B")
  )

ggplot(cal_long, aes(mid_hat, cal_err, color = which)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey70") +
  geom_line(linewidth = 0.85) +
  geom_point(aes(size = n_games), alpha = 0.85) +
  geom_line(
    data = cal_compare,
    aes(mid_hat, cal_err_B - cal_err_A),
    inherit.aes = FALSE,
    color = "grey35",
    linetype = "dashed",
    linewidth = 0.65
  ) +
  labs(
    title = "Calibration comparison (same midpoint bins)",
    subtitle = "Solid: mean calibration error per forecaster; dashed: mean(phat_B) - mean(phat_A)",
    x = "Mean (phat_A + phat_B) / 2 in bin",
    y = "mean(Y) - mean(phat)",
    color = "Curve",
    size = "Games"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(face = "bold"),
    legend.position = "bottom"
  )

For the full workflow (explicit linear algebra vs. wrappers, larger grids, centered eigenvalues, and the same figures with narrative), open the vignette:

vignette("rtForecastEval-vignette", package = "rtForecastEval")

Roadmap

Optional pkgdown site and CRAN release
Optional Rcpp acceleration for heavy eigenvalue / Monte Carlo loops

References

Yeh, C.-K., Rice, G. & Dubin, J. A. (2022). Evaluating real-time probabilistic forecasts with application to National Basketball Association outcome prediction. The American Statistician, 76, 214–223. DOI 10.1080/00031305.2021.1967781.
Preprint: arXiv:2010.00781 (PDF).

License

GPL-3 — see DESCRIPTION.

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
.github		.github
R		R
README_files/figure-commonmark		README_files/figure-commonmark
build		build
man		man
rtForecastEval.Rcheck		rtForecastEval.Rcheck
tools		tools
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README.md		README.md
README.qmd		README.qmd
rtForecastEval.Rproj		rtForecastEval.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

rtForecastEval: EVALuating Real-Time Probabilistic Forecast

Authors

Overview

Package contents

Mind map and workflow

Mind map (package scope)

Analysis workflow

Installation

Minimal example

Pointwise skill plot and a simple calibration (reliability) view

Reliability with shared bins (two standard diagrams)

Comparing calibration between forecasters (error curves)

Roadmap

References

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

rtForecastEval: EVALuating Real-Time Probabilistic Forecast

Authors

Overview

Package contents

Mind map and workflow

Mind map (package scope)

Analysis workflow

Installation

Minimal example

Pointwise skill plot and a simple calibration (reliability) view

Reliability with shared bins (two standard diagrams)

Comparing calibration between forecasters (error curves)

Roadmap

References

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages