batorch.sgmcmc#

batorch.sgmcmc.diffusions#

batorch.sgmcmc.diffusions.Diffusion

Base class for implementing transition kernels of MCMC methods.

batorch.sgmcmc.diffusions.DiffusionSGLD

Implements stochastic gradient Langevin dynamics (SGLD) proposed in the Bayesian learning via stochastic gradient Langevin dynamics paper.

batorch.sgmcmc.diffusions.DiffusionpSGLD

Implements the preconditioned stochastic gradient Langevin dynamics (pSGLD) proposed in the Preconditioned stochastic gradient Langevin dynamics for deep neural networks paper.

batorch.sgmcmc.diffusions.DiffusionSGHMC

Implements the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm of the Stochastic Gradient Hamiltonian Monte Carlo paper but without performing the change of variable.

batorch.sgmcmc.diffusions.DiffusionSGHMCSA

Implements the scale adaptive SGHMC algorithm proposed in the Bayesian optimization with robust Bayesian neural networks paper.

class batorch.sgmcmc.diffusions.Diffusion(*args: Any, **kwargs: Any)[source]#

Base class for implementing transition kernels of MCMC methods.

These methods result from the discretization of stochastic differential equations whose solutions are diffusion processes. The class essentially inherits from torch.optim.optimizer.Optimizer, although it does not perform an optimization.

Parameters
  • params – Generator that yields the network parameters (model.parameters())

  • step_size – Step size of the discretized stochastic differential equation.

  • weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.

init_state(p, state, group)[source]#

Function used to initialize optional state variables.

Parameters
  • p – A parameter (weight or bias).

  • state – State dict.

  • group – Group dict.

Raises

NotImplementedError

update_fn(p, d_p, state, group)[source]#

Function that updates the parameters.

Parameters
  • p – A parameter (weight or bias).

  • d_p – Gradient of the objective function with respect to parameter p.

  • state – State dict.

  • group – Group dict.

Raises

NotImplementedError

step(closure=None)#

Basic step function that loops over the model parameters.

Parameters

closure – Closure function that can be useful if the loss function has to be called several times, defaults to None

Returns

Objective function.

class batorch.sgmcmc.diffusions.DiffusionSGLD(*args: Any, **kwargs: Any)[source]#

Implements stochastic gradient Langevin dynamics (SGLD) proposed in the Bayesian learning via stochastic gradient Langevin dynamics paper.

Parameters
  • params – Generator that yields the network parameters (model.parameters())

  • step_size – Step size of the discretized stochastic differential equation.

  • weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.

init_state(p, state, group)[source]#

Function used to initialize optional state variables.

Parameters
  • p – A parameter (weight or bias).

  • state – State dict.

  • group – Group dict.

update_fn(p, d_p, state, group)[source]#

Perform the Langevin update.

\[\theta^{k+1} = \theta^k - \frac{\epsilon_k}{2}\hat{\nabla}U(\theta^k) + \sqrt{\epsilon_k}\Delta W^{k+1}\]

where \(\theta^k\) denotes the network parameters at the k-th iteration, \(\epsilon_k\) is the step size, \(\hat{\nabla}U(\theta^k)\) is the stochastic gradient of the potential function \(U\), and \(\Delta W^{k+1}\) is a centered, normalized Gaussian random variable. If the sequence \(\{\epsilon_k \}_{k\geq 0}\) and the potential function \(U\) satisfy a few conditions, the stationary distribution of the discrete Markov chain approximates the target distribution given by

\[p_{\Theta}(\theta) = c_0 \exp(-U(\theta))\]

where \(c_0\) is a (possibly unknown) normalization constant.
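The update above can be tried on a toy target. The following is a minimal sketch, not batorch's implementation: it uses only the Python standard library, a scalar parameter, and the potential \(U(\theta) = \theta^2/2\) (a standard Gaussian target). The names sgld_update and sample_standard_gaussian are hypothetical and chosen for illustration.

```python
import math
import random


def sgld_update(theta, grad_u, step_size, rng):
    """One SGLD step: theta - (eps / 2) * grad U(theta) + sqrt(eps) * xi."""
    noise = rng.gauss(0.0, 1.0)  # centered Gaussian with unit variance
    return theta - 0.5 * step_size * grad_u + math.sqrt(step_size) * noise


def sample_standard_gaussian(num_steps=50_000, step_size=0.1, seed=0):
    """Run the chain on U(theta) = theta**2 / 2, whose gradient is theta."""
    rng = random.Random(seed)
    theta, samples = 0.0, []
    for _ in range(num_steps):
        theta = sgld_update(theta, grad_u=theta, step_size=step_size, rng=rng)
        samples.append(theta)
    return samples


samples = sample_standard_gaussian()
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f}, variance={variance:.3f}")
```

With a constant step size the chain only approximates the target: the empirical variance should be close to, but slightly above, 1, and the bias shrinks as the step size decreases.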

Parameters
  • p – A parameter (weight or bias).

  • d_p – Gradient of the objective function with respect to parameter p.

  • state – State dictionary, not used for SGLD.

  • group – Group dictionary that contains at least the key step_size.

class batorch.sgmcmc.diffusions.DiffusionCyclicSGLD(*args: Any, **kwargs: Any)[source]#

Implements the cyclic stochastic gradient Langevin dynamics (SGLD) proposed in the Cyclic Stochastic Gradient MCMC for Bayesian Deep Learning paper.

Parameters
  • params – Generator that yields the network parameters (model.parameters())

  • step_size – Step size of the discretized stochastic differential equation.

  • weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.

  • num_cycles – Number of cycles of the cyclic step size.

  • num_iterations – Total number of sampling iterations.
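The exact schedule is not spelled out here. As a hedged illustration, the sketch below implements the cosine schedule described in the Cyclic Stochastic Gradient MCMC paper; whether batorch uses exactly this formula (and a 0-based iteration counter) is an assumption, and cyclic_step_size is a hypothetical helper name.

```python
import math


def cyclic_step_size(k, step_size, num_cycles, num_iterations):
    """Cosine schedule restarting every ceil(num_iterations / num_cycles) steps."""
    cycle_length = math.ceil(num_iterations / num_cycles)
    position = (k % cycle_length) / cycle_length  # relative position in [0, 1)
    return 0.5 * step_size * (math.cos(math.pi * position) + 1.0)


# Four cycles over 1000 iterations, starting from an initial step size of 1e-3.
schedule = [
    cyclic_step_size(k, 1e-3, num_cycles=4, num_iterations=1000)
    for k in range(1000)
]
print(schedule[0], schedule[125], schedule[249], schedule[250])
```

Each cycle starts at the initial step size (exploration) and decays towards zero (sampling) before restarting.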

init_state(p, state, group)[source]#

Function used to initialize optional state variables.

Parameters
  • p – A parameter (weight or bias).

  • state – State dict.

  • group – Group dict.

update_fn(p, d_p, state, group)[source]#

Perform the Langevin update.

\[\theta^{k+1} = \theta^k - \frac{\epsilon_k}{2}\hat{\nabla}U(\theta^k) + \sqrt{\epsilon_k}\Delta W^{k+1}\]

where \(\theta^k\) denotes the network parameters at the k-th iteration, \(\epsilon_k\) is the step size, \(\hat{\nabla}U(\theta^k)\) is the stochastic gradient of the potential function \(U\), and \(\Delta W^{k+1}\) is a centered, normalized Gaussian random variable. If the sequence \(\{\epsilon_k \}_{k\geq 0}\) and the potential function \(U\) satisfy a few conditions, the stationary distribution of the discrete Markov chain approximates the target distribution given by

\[p_{\Theta}(\theta) = c_0 \exp(-U(\theta))\]

where \(c_0\) is a (possibly unknown) normalization constant.

Parameters
  • p – A parameter (weight or bias).

  • d_p – Gradient of the objective function with respect to parameter p.

  • state – State dictionary, not used for SGLD.

  • group – Group dictionary that contains at least the key step_size.

class batorch.sgmcmc.diffusions.DiffusionpSGLD(*args: Any, **kwargs: Any)[source]#

Implements the preconditioned stochastic gradient Langevin dynamics (pSGLD) proposed in the Preconditioned stochastic gradient Langevin dynamics for deep neural networks paper.

Parameters
  • params – Generator that yields the network parameters (model.parameters())

  • step_size – Step size of the discretized stochastic differential equation.

  • weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.

  • alpha – Momentum factor with values in \([0,1]\).

  • eps – Diagonal perturbation that prevents the preconditioner from degenerating.

init_state(p, state, group)[source]#

Initializes a state variable for storing the following estimation of the squared stochastic gradient:

\[L(\theta^k) = \alpha L(\theta^{k-1}) + (1-\alpha) \hat{\nabla}U(\theta^{k-1}) \circ \hat{\nabla}U(\theta^{k-1})\]

The operator \(\circ\) denotes the element-wise product.

Parameters
  • p – A parameter (weight or bias).

  • state – State dict.

  • group – Group dict.

update_fn(p, d_p, state, group)[source]#

Performs the preconditioned Langevin update.

\[\theta^{k+1} = \theta^k - \frac{\epsilon_k}{2}\mathbf{D}(\theta^k)\hat{\nabla}U(\theta^k) + \sqrt{\epsilon_k\mathbf{D}(\theta^k)}\Delta W^{k+1}\]

where \(\theta^k\) denotes the network parameters at the k-th iteration, \(\epsilon_k\) is the step size, \(\hat{\nabla}U(\theta^k)\) is the stochastic gradient of the potential function \(U\), and \(\Delta W^{k+1}\) is a centered, normalized Gaussian random variable. The matrix \(\mathbf{D}\) is referred to as the preconditioner and takes the form

\[\mathbf{D}(\theta^k) = \mathrm{diag}\left(\lambda \mathbf{I} + \sqrt{L(\theta^k)}\right)^{-1}\]

It should be noted that this algorithm is missing an extra term, \(\Gamma(\theta^k)\), which is discarded for computational efficiency.
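A minimal element-wise sketch of the preconditioned update, written with plain Python lists rather than tensors. Two assumptions are made here and are not confirmed by the source: \(\lambda\) in the preconditioner is taken to be the eps parameter, and the moving average is refreshed with the current gradient before the preconditioner is formed (RMSprop-style), which differs by one step from the index convention used in init_state. All function names are hypothetical.

```python
import math
import random


def update_square_avg(square_avg, grad, alpha):
    """Exponential moving average of the element-wise squared gradient."""
    return [alpha * s + (1.0 - alpha) * g * g for s, g in zip(square_avg, grad)]


def preconditioner(square_avg, eps):
    """Diagonal of D: 1 / (eps + sqrt(L)), one entry per parameter."""
    return [1.0 / (eps + math.sqrt(s)) for s in square_avg]


def psgld_update(theta, grad, square_avg, step_size, alpha, eps, rng):
    """One pSGLD step without the Gamma correction term."""
    square_avg = update_square_avg(square_avg, grad, alpha)
    d = preconditioner(square_avg, eps)
    new_theta = [
        t - 0.5 * step_size * d_i * g
        + math.sqrt(step_size * d_i) * rng.gauss(0.0, 1.0)
        for t, g, d_i in zip(theta, grad, d)
    ]
    return new_theta, square_avg


rng = random.Random(0)
theta, square_avg = [1.0, -2.0], [0.0, 0.0]
theta, square_avg = psgld_update(
    theta, [3.0, 0.5], square_avg, step_size=1e-2, alpha=0.9, eps=1e-5, rng=rng
)
print(theta, square_avg)
```

Coordinates with large recent gradients get a small preconditioner entry, so both the drift and the injected noise are scaled down along those directions.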

Parameters
  • p – A parameter (weight or bias).

  • d_p – Gradient of the objective function with respect to parameter p.

  • state – State dictionary that has the key square_avg.

  • group – Group dictionary that has the keys step_size, alpha, and eps.

class batorch.sgmcmc.diffusions.DiffusionSGHMC(*args: Any, **kwargs: Any)[source]#

Implements the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm of the Stochastic Gradient Hamiltonian Monte Carlo paper but without performing the change of variable.

Parameters
  • params – Generator that yields the network parameters (model.parameters())

  • step_size – Step size of the discretized stochastic differential equation.

  • weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.

  • damping – Damping parameter. Default: 1.0.

init_state(p, state, group)[source]#

Initializes a state variable momentum used by the SGHMC algorithm and denoted by \(v^k\).

Parameters
  • p – A parameter (weight or bias).

  • state – State dict.

  • group – Group dict.

resample_momentum()[source]#

Resamples the momentum state variable.

update_fn(p, d_p, state, group)[source]#

Performs the SGHMC update.

\[ \begin{align}\begin{aligned}& \theta^{k+1} = \theta^k + \epsilon_k \mathbf{M}^{-1} v^k\\& v^{k+1} = v^k - \epsilon_k\hat{\nabla}U(\theta^k) - \epsilon_k \mathbf{C}\mathbf{M}^{-1}v^k + \sqrt{2\mathbf{C}\epsilon_k} \, \Delta W^{k+1}\end{aligned}\end{align} \]

The mass and damping matrices are chosen as \(\mathbf{M} = \mathbf{I}\) and \(\mathbf{C} = f\mathbf{I}\), where \(f\) is the damping parameter.
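As a hedged illustration of the two update equations with \(\mathbf{M} = \mathbf{I}\) and \(\mathbf{C} = f\mathbf{I}\), the sketch below runs them on a scalar standard Gaussian target, \(U(\theta) = \theta^2/2\). It uses only the standard library and is not batorch's code; sghmc_update is a hypothetical name.

```python
import math
import random


def sghmc_update(theta, v, grad_u, step_size, damping, rng):
    """One SGHMC step with M = I and C = damping * I (scalar case)."""
    noise = rng.gauss(0.0, 1.0)
    new_theta = theta + step_size * v
    new_v = (
        v
        - step_size * grad_u            # gradient evaluated at theta^k
        - step_size * damping * v       # friction term
        + math.sqrt(2.0 * damping * step_size) * noise
    )
    return new_theta, new_v


rng = random.Random(0)
theta, v, samples = 0.0, 0.0, []
for _ in range(100_000):
    # U(theta) = theta**2 / 2, so the gradient at theta^k is theta itself.
    theta, v = sghmc_update(
        theta, v, grad_u=theta, step_size=0.05, damping=1.0, rng=rng
    )
    samples.append(theta)

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f}, variance={variance:.3f}")
```

Both equations use the values from iteration k on their right-hand side, so the position and the momentum are updated simultaneously rather than in a leapfrog fashion.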

Parameters
  • p – A parameter (weight or bias).

  • d_p – Gradient of the objective function with respect to parameter p.

  • state – State dictionary that has the key square_avg.

  • group – Group dictionary that has the keys step_size, alpha, and eps.

class batorch.sgmcmc.diffusions.DiffusionCyclicSGHMC(*args: Any, **kwargs: Any)[source]#

Implements the cyclic stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm proposed in the Cyclic Stochastic Gradient MCMC for Bayesian Deep Learning paper.

Parameters
  • params – Generator that yields the network parameters (model.parameters())

  • step_size – Step size of the discretized stochastic differential equation.

  • weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.

  • damping – Damping parameter. Default: 1.0.

  • num_cycles – Number of cycles of the cyclic step size.

  • num_iterations – Total number of sampling iterations.

init_state(p, state, group)[source]#

Initializes a state variable momentum used by the SGHMC algorithm and denoted by \(v^k\).

Parameters
  • p – A parameter (weight or bias).

  • state – State dict.

  • group – Group dict.

resample_momentum()[source]#

Resamples the momentum state variable.

update_fn(p, d_p, state, group)[source]#

Performs the SGHMC update.

\[ \begin{align}\begin{aligned}& \theta^{k+1} = \theta^k + \epsilon_k \mathbf{M}^{-1} v^k\\& v^{k+1} = v^k - \epsilon_k\hat{\nabla}U(\theta^k) - \epsilon_k \mathbf{C}\mathbf{M}^{-1}v^k + \sqrt{2\mathbf{C}\epsilon_k} \, \Delta W^{k+1}\end{aligned}\end{align} \]

The mass and damping matrices are chosen as \(\mathbf{M} = \mathbf{I}\) and \(\mathbf{C} = f\mathbf{I}\), where \(f\) is the damping parameter.

Parameters
  • p – A parameter (weight or bias).

  • d_p – Gradient of the objective function with respect to parameter p.

  • state – State dictionary that has the key square_avg.

  • group – Group dictionary that has the keys step_size, alpha, and eps.

class batorch.sgmcmc.diffusions.DiffusionSGHMCSA(*args: Any, **kwargs: Any)[source]#

Implements the scale adaptive SGHMC algorithm proposed in the Bayesian optimization with robust Bayesian neural networks paper. This algorithm uses the burn-in phase to adapt its hyperparameters (not including the step size).

Parameters
  • params – Generator that yields the network parameters (model.parameters())

  • step_size – Step size of the discretized stochastic differential equation.

  • num_burnin_steps – Number of burn-in steps used to estimate the algorithm hyperparameters.

  • weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.

  • mdecay – Momentum decay per time step. Default: 0.05.

init_state(p, state, group)[source]#

Initializes a momentum and three additional state variables \(\tau, g, \hat{v}\).

Parameters
  • p – A parameter (weight or bias).

  • state – State dict.

  • group – Group dict.

update_fn(p, d_p, state, group)[source]#

Performs the scale adaptive SGHMC update.

\[ \begin{align}\begin{aligned}& \theta^{k+1} = \theta^k + v^k\\& v^{k+1} = -\epsilon^2_k\hat{\mathbf{L}}^{-1/2}\hat{\nabla}U(\theta^k) - \epsilon_k \hat{\mathbf{L}}^{-1/2}\mathbf{C} v^k + \sqrt{2\epsilon_k^3\hat{\mathbf{L}}^{-1/2}\mathbf{C}\hat{\mathbf{L}}^{-1/2}-\epsilon_k^4\mathbf{I}} \, \Delta W^{k+1}\end{aligned}\end{align} \]

where \(\theta^k\) denotes the network parameters at the k-th iteration, \(v^k\) is the momentum, \(\epsilon_k\) is the step size, \(\hat{\nabla}U(\theta^k)\) is the stochastic gradient of the potential function \(U\), and \(\Delta W^{k+1}\) is a centered, normalized Gaussian random variable. The stationary distribution of the discrete Markov chain approximates the target distribution

\[p_{\Theta,V}(\theta,v) = c_0 \exp\left(-U(\theta) - \frac{1}{2}v^T v \right)\]

The matrix \(\hat{\mathbf{L}}\) denotes the second-order moment of the gradient, which is estimated with an exponential moving average during the burn-in phase. The damping matrix \(\mathbf{C}\) is chosen such that \(\epsilon_k \mathbf{C}\hat{\mathbf{L}} = 0.05\mathbf{I}\).

Parameters
  • p – A parameter (weight or bias).

  • d_p – Gradient of the objective function with respect to parameter p.

  • state – State dictionary that has the key square_avg.

  • group – Group dictionary that has the keys step_size, alpha, and eps.

batorch.sgmcmc.samplers#

batorch.sgmcmc.samplers.SamplerFactory

Function that can be used to define samplers based on a chosen Discretization (gradient estimator, integrator), and a chosen type of Diffusion (transition kernel).

batorch.sgmcmc.samplers.SamplerSGLD

SGLD sampler.

batorch.sgmcmc.samplers.SamplerPSGLD

Preconditioned SGLD sampler.

batorch.sgmcmc.samplers.SamplerSGLDCV

SGLD sampler with control variates.

batorch.sgmcmc.samplers.SamplerSGLDSVRG

SGLD sampler with fixed point variance reduction.

batorch.sgmcmc.samplers.SamplerSGHMC

SGHMC sampler with a single leapfrog step.

batorch.sgmcmc.samplers.SamplerSGHMCSA

Scale adaptive SGHMC sampler with a single leapfrog step.

batorch.sgmcmc.samplers.SamplerSGHMCCV

SGHMC sampler with control variates and a single leapfrog step.

batorch.sgmcmc.samplers.SamplerSGHMCSVRG

SGHMC sampler with fixed point variance reduction and a single leapfrog step.

class batorch.sgmcmc.samplers.Sampler(DiffusionType: type, negloglikelihood: torch.nn.Module, neglogprior: Callable, init_params: Union[None, str, OrderedDict], dataloader: torch.utils.data.DataLoader, **kwargs)[source]#

Base class for implementing samplers.

step(x, y)[source]#

This method updates the parameters and returns the loss.

Parameters
  • x – Input data that will be passed to the negative log-likelihood.

  • y – Output tensor that will be compared to the prediction.

Raises

NotImplementedError

get_params()[source]#

Function that flattens the network parameters.

Returns

Flattened network parameters.

get_grads()[source]#

Function that flattens the network gradients.

Returns

Flattened network gradients.

set_params(vec_params)[source]#

Function that loads flattened parameters into the model.

Parameters

vec_params – Flattened network parameters

compute_gradients_log_target(dataloader)[source]#

class batorch.sgmcmc.samplers.EulerExplicit(DiffusionType: type, negloglikelihood: torch.nn.Module, neglogprior: Callable, init_params: Union[None, str, OrderedDict], dataloader: torch.utils.data.DataLoader, **kwargs)[source]#

This class implements an SGMCMC update using a standard stochastic gradient (SG) estimation and an Euler integrator.

Parameters
  • DiffusionType – type of Diffusion.

  • negloglikelihood – torch.nn.Module object corresponding to a negative log-likelihood function.

  • prior – torch.distributions object corresponding to a prior distribution.

  • init_params – Optional path to a torch state_dict for initializing the model parameters.

  • dataloader – An already instantiated dataloader.

  • kwargs – Any optional keyword arguments that will be passed to the Diffusion.

estimate_gradients(x, y)[source]#

Function that estimates the gradient of the log posterior with a standard mini-batched stochastic gradient.

\[\hat{\nabla}U(\theta) = -\frac{N}{|B_k|} \sum_{i\in B_k} \nabla_\theta\log(p(y_i|x_i,\theta)) - \nabla\log(p(\theta))\]

where \(N\) is the number of samples in the whole dataset, \(B_k\) is a mini-batch of indices drawn by the dataloader, \(p(y_i|x_i,\theta)\) denotes the likelihood function, and \(p(\theta)\) is the prior distribution of the weights \(\theta\).
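The estimator can be checked on a toy model. The sketch below is an illustration under stated assumptions, not the library's implementation: observations \(y_i \sim \mathcal{N}(\theta, 1)\) with no inputs \(x_i\), a \(\mathcal{N}(0, 1)\) prior on a scalar \(\theta\), and a hypothetical helper stochastic_gradient. For this model, \(-\nabla_\theta \log p(y_i|\theta) = \theta - y_i\) and \(-\nabla \log p(\theta) = \theta\).

```python
import random


def stochastic_gradient(theta, data, batch_indices):
    """Mini-batch estimate of grad U(theta) for the toy Gaussian model."""
    n = len(data)
    # Sum of the negative log-likelihood gradients over the mini-batch.
    batch_sum = sum(theta - data[i] for i in batch_indices)
    grad_neg_log_prior = theta  # -grad log p(theta) for a N(0, 1) prior
    # The factor N / |B_k| rescales the mini-batch sum to the full dataset.
    return n / len(batch_indices) * batch_sum + grad_neg_log_prior


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
theta = 0.5

full = stochastic_gradient(theta, data, range(len(data)))
batch = rng.sample(range(len(data)), 32)
estimate = stochastic_gradient(theta, data, batch)
print(f"full gradient={full:.1f}, mini-batch estimate={estimate:.1f}")
```

When the mini-batch contains every index, the scaling factor is 1 and the estimate coincides with the exact gradient; for smaller batches it is an unbiased but noisy estimate.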

Parameters
  • x – Input data that will be passed to the negative log-likelihood.

  • y – Output tensor that will be compared to the prediction.

step(x, y)[source]#

Updates the network parameters according to the chosen Diffusion using the standard stochastic gradient approximation. If \(\varphi\) denotes the update function of the underlying Diffusion, then this function performs the following:

  1. Zero-out the gradients

  2. Estimate the stochastic gradient \(\hat{\nabla}U(\theta^k)\) for the current value of the network parameters

  3. Update the parameters:

\[\theta^{k+1} = \varphi(\theta^k, \hat{\nabla}U(\theta^k))\]

This sampler can be used with any Diffusion. Note that if DiffusionSGHMC is used here, then no leapfrogs will be performed. If you want to perform \(L > 1\) leapfrog steps, please see the specific LeapfrogStochasticGradient.

Parameters
  • x – Input data that will be passed to the negative log-likelihood.

  • y – Output tensor that will be compared to the prediction.

class batorch.sgmcmc.samplers.EulerExplicitControlVariates(DiffusionType: type, negloglikelihood: torch.nn.Module, neglogprior: Union[object, torch.distributions.Distribution], init_params: Union[None, str, OrderedDict], init_control_params: Union[str, OrderedDict], dataloader: torch.utils.data.DataLoader, **kwargs)[source]#

This class implements an SGMCMC update using the variance reduction technique proposed in the paper Control variates for stochastic gradient MCMC, and an Euler integrator.

The centering value \(\hat{\theta}\) is provided as a state_dict by the user via the init_control_params parameter.

Parameters
  • init_params – Optional path to a torch state_dict for initializing the model parameters.

  • init_control_params – Control variate.

  • DiffusionType – type of Diffusion.

  • negloglikelihood – torch.nn.Module object corresponding to a negative log-likelihood function.

  • prior – torch.distributions object corresponding to a prior distribution.

  • dataloader – An already instantiated dataloader.

  • kwargs – Any additional keyword arguments that will be passed to the Diffusion.

compute_full_center_gradient(dataloader)[source]#

Method that computes the true gradient of the log likelihood using the centering value \(\hat{\theta}\) chosen by the user.

\[\nabla U(\hat{\theta}) = -\sum_{i=1}^{N} \nabla_{\theta} \log(p(y_i|x_i,\hat{\theta}))\]

where \(N\) denotes the number of samples in the whole dataset and \(p(y_i|x_i,\hat{\theta})\) is the likelihood function.

estimate_gradients(x, y)[source]#

Method that estimates the gradient of the log likelihood with a variance reduction technique (control variates).

\[\hat{\nabla}U(\theta) = -\sum_{i=1}^{N} \nabla_{\theta} \log(p(y_i|x_i,\hat{\theta})) -\frac{N}{|B_k|} \sum_{i \in B_k} \left( \nabla_{\theta}\log(p(y_i|x_i,\theta)) - \nabla_{\theta}\log(p(y_i|x_i,\hat{\theta}))\right) - \nabla\log(p(\theta))\]

where \(N\) is the number of samples in the whole dataset, \(B_k\) is a mini-batch of indices drawn by the dataloader, \(\theta \mapsto p(\cdot|\cdot,\theta)\) denotes the likelihood function, \(p(\theta)\) is the prior distribution of the weights \(\theta\), and \(\hat{\theta}\) denotes the centering value of the control variates.
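A minimal sketch of the control-variate estimator on a toy model, assuming observations \(y_i \sim \mathcal{N}(\theta, 1)\) without inputs \(x_i\) and a \(\mathcal{N}(0, 1)\) prior on a scalar \(\theta\). The function names are hypothetical and the code is independent of batorch.

```python
import random


def full_center_gradient(center, data):
    """Sum over the whole dataset of -grad log p(y_i | center)."""
    return sum(center - y for y in data)


def control_variate_gradient(theta, center, center_grad, data, batch_indices):
    """Full gradient at the center plus a mini-batch correction and the prior term."""
    n = len(data)
    # Mini-batch estimate of the gradient difference between theta and the center.
    correction = sum((theta - data[i]) - (center - data[i]) for i in batch_indices)
    grad_neg_log_prior = theta  # -grad log p(theta) for a N(0, 1) prior
    return center_grad + n / len(batch_indices) * correction + grad_neg_log_prior


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
center = 1.9
center_grad = full_center_gradient(center, data)  # computed once, then reused

theta = 2.3
batch = rng.sample(range(len(data)), 32)
estimate = control_variate_gradient(theta, center, center_grad, data, batch)
exact = sum(theta - y for y in data) + theta
print(abs(estimate - exact))
```

In this toy model the per-sample gradients are linear in \(\theta\), so the correction does not depend on the data and the estimate matches the exact gradient up to rounding for any mini-batch. For a neural network the variance is reduced rather than removed, and the benefit fades as \(\theta\) moves away from \(\hat{\theta}\).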

Parameters
  • x – Input data that will be passed to the negative log-likelihood.

  • y – Output tensor that will be compared to the prediction.

class batorch.sgmcmc.samplers.EulerExplicitFixedPoint(DiffusionType: type, negloglikelihood: torch.nn.Module, neglogprior: Union[object, torch.distributions.Distribution], init_params: None, init_control_params: str, dataloader: torch.utils.data.DataLoader, m_iter: int, **kwargs)[source]#

This class implements an SGMCMC update using the variance reduction technique proposed in the paper Variance reduction in stochastic gradient Langevin dynamics, and an Euler integrator.

It extends the class EulerStochasticGradientCV with an additional method that updates the centering parameter \(\hat{\theta}\) every m_iter iterations.

Parameters
  • init_params – Optional path to a torch state_dict for initializing the model parameters.

  • init_control_params – Initial control variate.

  • DiffusionType – type of Diffusion.

  • negloglikelihood – torch.nn.Module object corresponding to a negative log-likelihood function.

  • prior – torch.distributions object corresponding to a prior distribution.

  • datamodule – An already instantiated DataModule.

  • num_leapfrogs – Number of leapfrog steps to perform at each update.

  • m_iter – Number of iterations after which the centering parameters are replaced by the current values of the network parameters.

  • kwargs – Any additional keyword arguments that will be passed to the Diffusion.

step(x, y)[source]#

Method that estimates the gradient of the log likelihood with a variance reduction technique (control variates). The centering parameter \(\hat{\theta}\) is updated whenever the current iteration count is a multiple of m_iter.
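The periodic refresh of the centering parameter can be sketched as follows. This is an illustration only: it assumes a toy model with \(y_i \sim \mathcal{N}(\theta, 1)\), a \(\mathcal{N}(0, 1)\) prior, an SGLD update, and a refresh performed after the update of every m_iter-th iteration. The exact placement of the refresh in batorch is not confirmed here, and run_chain is a hypothetical name.

```python
import math
import random


def full_gradient(theta, data):
    """Exact gradient of the negative log-likelihood for y_i ~ N(theta, 1)."""
    return sum(theta - y for y in data)


def run_chain(data, num_steps, m_iter, step_size, batch_size, seed=0):
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    center, center_grad = theta, full_gradient(theta, data)
    refresh_iterations = []
    for k in range(1, num_steps + 1):
        batch = rng.sample(range(n), batch_size)
        correction = sum((theta - data[i]) - (center - data[i]) for i in batch)
        grad = center_grad + n / batch_size * correction + theta  # N(0, 1) prior
        theta += -0.5 * step_size * grad + math.sqrt(step_size) * rng.gauss(0.0, 1.0)
        if k % m_iter == 0:
            # Replace the centering value by the current parameters and
            # recompute the full-data gradient at the new center.
            center, center_grad = theta, full_gradient(theta, data)
            refresh_iterations.append(k)
    return theta, refresh_iterations


data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(200)]
theta, refresh_iterations = run_chain(
    data, num_steps=1000, m_iter=100, step_size=1e-3, batch_size=20
)
print(theta, refresh_iterations)
```

A smaller m_iter keeps the center close to the current parameters, at the cost of one pass over the full dataset per refresh.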

Parameters
  • x – Input data that will be passed to the negative log-likelihood.

  • y – Output tensor that will be compared to the prediction.

batorch.sgmcmc.samplers.SamplerFactory(SamplerType, DiffusionType)[source]#

Function that can be used to define samplers based on a chosen Discretization (gradient estimator, integrator), and a chosen type of Diffusion (transition kernel).

Several samplers are already built-in. An example is shown below.

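from batorch.sgmcmc.samplers import EulerExplicit, SamplerFactory
from batorch.sgmcmc.diffusions import DiffusionSGLD

# Combine the Euler discretization with the SGLD transition kernel.
SamplerSGLD = SamplerFactory(EulerExplicit, DiffusionSGLD)

# negloglikelihood, neglogprior, and datamodule are assumed to be defined by the user.
sampler = SamplerSGLD(negloglikelihood=negloglikelihood, neglogprior=neglogprior, dataloader=datamodule.train_dataloader(), init_params=None, step_size=1e-3)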