Add "low-rank" variational families (#76)

* add feature complete `MvLocationScaleLowRank` with tests * fix bugs and improve comments in `MvLocationScale` and lowrank * promote families.md into a higher category * add test for `MVLocationScale` with non-Gaussian * tighten compat bound for `Distributions` * fix base distribution standardization bug in `LocationScale` and `LocationScaleLowRank` * fix `LocationScale` interfaces to only allow univariate base dist * fix test comparison operator for families * fix scale lower bound to `1e-4` --------- Co-authored-by: Hong Ge <3279477+yebai@users.noreply.github.com> Co-authored-by: Markus Hauru <markus@mhauru.org>
TuringLang · Sep 13, 2024 · 57c9e58 · 57c9e58
1 parent 9e4c1cc
commit 57c9e58
Show file tree

Hide file tree

Showing 12 changed files with 745 additions and 170 deletions.
diff --git a/.github/workflows/Benchmark.yml b/.github/workflows/Benchmark.yml
@@ -13,6 +13,7 @@ concurrency:
 permissions:
   contents: write
   pull-requests: write
+  issues: write
 
 jobs:
   benchmark:
@@ -47,10 +48,10 @@ jobs:
           name: Benchmark Results
           tool: 'julia'
           output-file-path: bench/benchmark_results.json
-          summary-always: true
+          summary-always: ${{ !github.event.pull_request.head.repo.fork }} # Disable summary for PRs from forks
           github-token: ${{ secrets.GITHUB_TOKEN }}
-          comment-always: true
           alert-threshold: "200%"
           fail-on-alert: true
           benchmark-data-dir-path: benchmarks
+          comment-always: ${{ !github.event.pull_request.head.repo.fork }} # Disable comments for PRs from forks
           auto-push: ${{ !github.event.pull_request.head.repo.fork }} # Disable push for PRs from forks
diff --git a/Project.toml b/Project.toml
@@ -42,7 +42,7 @@ Accessors = "0.1"
 Bijectors = "0.13"
 ChainRulesCore = "1.16"
 DiffResults = "1"
-Distributions = "0.25.87"
+Distributions = "0.25.111"
 DocStringExtensions = "0.8, 0.9"
 Enzyme = "0.12.32"
 FillArrays = "1.3"

diff --git a/docs/make.jl b/docs/make.jl
@@ -16,8 +16,8 @@ makedocs(;
         "ELBO Maximization" => [
             "Overview" => "elbo/overview.md",
             "Reparameterization Gradient Estimator" => "elbo/repgradelbo.md",
-            "Location-Scale Variational Family" => "locscale.md",
         ],
+        "Variational Families" => "families.md",
         "Optimization" => "optimization.md",
     ],
 )

diff --git a/docs/src/families.md b/docs/src/families.md
@@ -0,0 +1,267 @@
+# [Reparameterizable Variational Families](@id families)
+
+The [RepGradELBO](@ref repgradelbo) objective assumes that the members of the variational family have a differentiable sampling path.
+We provide multiple pre-packaged variational families that can be readily used.
+
+## [The `LocationScale` Family](@id locscale)
+
+The [location-scale](https://en.wikipedia.org/wiki/Location%E2%80%93scale_family) variational family is a family of probability distributions, where their sampling process can be represented as
+
+```math
+z \sim  q_{\lambda} \qquad\Leftrightarrow\qquad
+z \stackrel{d}{=} C u + m;\quad u \sim \varphi
+```
+
+where ``C`` is the *scale*, ``m`` is the location, and ``\varphi`` is the *base distribution*.
+``m`` and ``C`` form the variational parameters ``\lambda = (m, C)`` of ``q_{\lambda}``.
+The location-scale family encompases many practical variational families, which can be instantiated by setting the *base distribution* of ``u`` and the structure of ``C``.
+
+The probability density is given by
+
+```math
+  q_{\lambda}(z) = {|C|}^{-1} \varphi(C^{-1}(z - m)),
+```
+
+the covariance is given as
+
+```math
+  \mathrm{Var}\left(q_{\lambda}\right) = C \mathrm{Var}(q_{\lambda}) C^{\top}
+```
+
+and the entropy is given as
+
+```math
+  \mathbb{H}(q_{\lambda}) = \mathbb{H}(\varphi) + \log |C|,
+```
+
+where ``\mathbb{H}(\varphi)`` is the entropy of the base distribution.
+Notice the ``\mathbb{H}(\varphi)`` does not depend on ``\log |C|``.
+The derivative of the entropy with respect to ``\lambda`` is thus independent of the base distribution.
+
+### API
+
+!!! note
+
+    For stable convergence, the initial `scale` needs to be sufficiently large and well-conditioned.
+    Initializing `scale` to have small eigenvalues will often result in initial divergences and numerical instabilities.
+
+```@docs
+MvLocationScale
+```
+
+The following are specialized constructors for convenience:
+
+```@docs
+FullRankGaussian
+MeanFieldGaussian
+```
+
+### Gaussian Variational Families
+
+```julia
+using AdvancedVI, LinearAlgebra, Distributions;
+μ = zeros(2);
+
+L = LowerTriangular(diagm(ones(2)));
+q = FullRankGaussian(μ, L)
+
+L = Diagonal(ones(2));
+q = MeanFieldGaussian(μ, L)
+```
+
+### Student-$$t$$ Variational Families
+
+```julia
+using AdvancedVI, LinearAlgebra, Distributions;
+μ = zeros(2);
+ν = 3;
+
+# Full-Rank 
+L = LowerTriangular(diagm(ones(2)));
+q = MvLocationScale(μ, L, TDist(ν))
+
+# Mean-Field
+L = Diagonal(ones(2));
+q = MvLocationScale(μ, L, TDist(ν))
+```
+
+### Laplace Variational families
+
+```julia
+using AdvancedVI, LinearAlgebra, Distributions;
+μ = zeros(2);
+
+# Full-Rank 
+L = LowerTriangular(diagm(ones(2)));
+q = MvLocationScale(μ, L, Laplace())
+
+# Mean-Field
+L = Diagonal(ones(2));
+q = MvLocationScale(μ, L, Laplace())
+```
+
+## The `LocationScaleLowRank` Family
+
+In practice, `LocationScale` families with full-rank scale matrices are known to converge slowly as they require a small SGD stepsize.
+Low-rank variational families can be an effective alternative[^ONS2018].
+`LocationScaleLowRank` generally represent any ``d``-dimensional distribution which its sampling path can be represented as
+
+```math
+z \sim  q_{\lambda} \qquad\Leftrightarrow\qquad
+z \stackrel{d}{=} D u_1 + U u_2  + m;\quad u_1, u_2 \sim \varphi
+```
+
+where ``D \in \mathbb{R}^{d \times d}`` is a diagonal matrix, ``U \in \mathbb{R}^{d \times r}`` is a dense low-rank matrix for the rank ``r > 0``, ``m \in \mathbb{R}^d`` is the location, and ``\varphi`` is the *base distribution*.
+``m``, ``D``, and ``U`` form the variational parameters ``\lambda = (m, D, U)``.
+
+The covariance of this distribution is given as
+
+```math
+  \mathrm{Var}\left(q_{\lambda}\right) = D \mathrm{Var}(\varphi) D + U \mathrm{Var}(\varphi) U^{\top}
+```
+
+and the entropy is given by the matrix determinant lemma as
+
+```math
+  \mathbb{H}(q_{\lambda}) 
+  = \mathbb{H}(\varphi) + \log |\Sigma|
+  = \mathbb{H}(\varphi) + 2 \log |D| + \log |I + U^{\top} D^{-2} U|,
+```
+
+where ``\mathbb{H}(\varphi)`` is the entropy of the base distribution.
+
+```@setup lowrank
+using ADTypes
+using AdvancedVI
+using Distributions
+using LinearAlgebra
+using LogDensityProblems
+using Optimisers
+using Plots
+using ReverseDiff
+
+struct Target{D}
+    dist::D
+end
+
+function LogDensityProblems.logdensity(model::Target, θ)
+    logpdf(model.dist, θ)
+end
+
+function LogDensityProblems.dimension(model::Target)
+    return length(model.dist)
+end
+
+function LogDensityProblems.capabilities(::Type{<:Target})
+    return LogDensityProblems.LogDensityOrder{0}()
+end
+
+n_dims     = 30
+U_true     = randn(n_dims, 3)
+D_true     = Diagonal(log.(1 .+ exp.(randn(n_dims))))
+Σ_true     = D_true + U_true*U_true'
+Σsqrt_true = sqrt(Σ_true)
+μ_true     = randn(n_dims)
+model      = Target(MvNormal(μ_true, Σ_true));
+
+d  = LogDensityProblems.dimension(model);
+μ  = zeros(d);
+
+L     = Diagonal(ones(d));
+q0_mf = MeanFieldGaussian(μ, L)
+
+L     = LowerTriangular(diagm(ones(d)));
+q0_fr = FullRankGaussian(μ, L)
+
+D     = ones(n_dims)
+U     = zeros(n_dims, 3)
+q0_lr = LowRankGaussian(μ, D, U)
+
+obj = RepGradELBO(1);
+
+max_iter = 10^4
+
+function callback(; params, averaged_params, restructure, stat, kwargs...)
+    q = restructure(averaged_params)
+    μ, Σ = mean(q), cov(q)
+    (dist2 = sum(abs2, μ - μ_true) + tr(Σ + Σ_true - 2*sqrt(Σsqrt_true*Σ*Σsqrt_true)),)
+end
+
+_, _, stats_fr, _ = AdvancedVI.optimize(
+    model,
+    obj,
+    q0_fr,
+    max_iter;
+    show_progress = false,
+    adtype        = AutoReverseDiff(),
+    optimizer     = Adam(0.01),
+    averager      = PolynomialAveraging(),
+    callback      = callback,
+); 
+
+_, _, stats_mf, _ = AdvancedVI.optimize(
+    model,
+    obj,
+    q0_mf,
+    max_iter;
+    show_progress = false,
+    adtype        = AutoReverseDiff(),
+    optimizer     = Adam(0.01),
+    averager      = PolynomialAveraging(),
+    callback      = callback,
+); 
+
+_, _, stats_lr, _ = AdvancedVI.optimize(
+    model,
+    obj,
+    q0_lr,
+    max_iter;
+    show_progress = false,
+    adtype        = AutoReverseDiff(),
+    optimizer     = Adam(0.01),
+    averager      = PolynomialAveraging(),
+    callback      = callback,
+); 
+
+t       = [stat.iteration for stat in stats_fr]
+dist_fr = [sqrt(stat.dist2) for stat in stats_fr]
+dist_mf = [sqrt(stat.dist2) for stat in stats_mf]
+dist_lr = [sqrt(stat.dist2) for stat in stats_lr]
+plot( t, dist_mf , label="Mean-Field Gaussian", xlabel="Iteration", ylabel="Wasserstein-2 Distance")
+plot!(t, dist_fr,  label="Full-Rank Gaussian",  xlabel="Iteration", ylabel="Wasserstein-2 Distance")
+plot!(t, dist_lr,  label="Low-Rank Gaussian",   xlabel="Iteration", ylabel="Wasserstein-2 Distance")
+savefig("lowrank_family_wasserstein.svg")
+nothing
+```
+
+Consider a 30-dimensional Gaussian with a diagonal plus low-rank covariance structure, where the true rank is 3.
+Then, we can compare the convergence speed of `LowRankGaussian` versus `FullRankGaussian`:
+
+![](lowrank_family_wasserstein.svg)
+
+As we can see, `LowRankGaussian` converges faster than `FullRankGaussian`.
+While `FullRankGaussian` can converge to the true solution since it is a more expressive variational family, `LowRankGaussian` gets there faster.
+
+!!! info
+
+    `MvLocationScaleLowRank` tend to work better with the `Optimisers.Adam` optimizer due to non-smoothness.
+    Other optimisers may experience divergences.
+
+### API
+
+```@docs
+MvLocationScaleLowRank
+```
+
+The `logpdf` of  `MvLocationScaleLowRank` has an optional argument `non_differentiable::Bool` (default: `false`).
+If set as `true`, a more efficient ``O\left(r d^2\right)`` implementation is used to evaluate the density.
+This, however, is not differentiable under most AD frameworks due to the use of Cholesky `lowrankupdate`.
+The default value is `false`, which uses a ``O\left(d^3\right)`` implementation, is differentiable and therefore compatible with the `StickingTheLandingEntropy` estimator.
+
+The following is a specialized constructor for convenience:
+
+```@docs
+LowRankGaussian
+```
+
+[^ONS2018]: Ong, V. M. H., Nott, D. J., & Smith, M. S. (2018). Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics, 27(3), 465-478.
diff --git a/docs/src/locscale.md b/docs/src/locscale.md
diff --git a/src/AdvancedVI.jl b/src/AdvancedVI.jl
@@ -180,6 +180,10 @@ export MvLocationScale, MeanFieldGaussian, FullRankGaussian
 
 include("families/location_scale.jl")
 
+export MvLocationScaleLowRank, LowRankGaussian
+
+include("families/location_scale_low_rank.jl")
+
 # Optimization Rules
 
 include("optimization/rules.jl")