batorch.sgmcmc#
batorch.sgmcmc.diffusions#
Base class for implementing transition kernels of MCMC methods. |
|
Implements stochastic gradient Langevin dynamics (SGLD) proposed in the Bayesian learning via stochastic gradient Langevin dynamics paper. |
|
Implements the preconditioned stochastic gradient Langevin dynamics (pSGLD) proposed in the Preconditioned stochastic gradient Langevin dynamics for deep neural networks paper. |
|
Implements the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm of the Stochastic Gradient Hamiltonian Monte Carlo paper but without performing the change of variable. |
|
Implements the scale adaptive SGHMC algorithm proposed in the Bayesian optimization with robust Bayesian neural networks paper. |
- class batorch.sgmcmc.diffusions.Diffusion(*args: Any, **kwargs: Any)[source]#
Base class for implementing transition kernels of MCMC methods.
These methods result from the discretization of stochastic differential equations whose solutions are diffusion processes. It essentially inherits from torch.optim.optimizer.Optimizer, although it does not perform an optimization.
- Parameters
params – Generator that yields the network parameters (model.parameters())
step_size – Step size of the discretized stochastic differential equation.
weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.
- init_state(p, state, group)[source]#
Function used to initialize optional state variables.
- Parameters
p – A parameter (weight or bias).
state – State dict.
group – Group dict.
- Raises
NotImplementedError –
- update_fn(p, d_p, state, group)[source]#
Function that updates the parameters.
- Parameters
p – A parameter (weight or bias).
d_p – Gradient of the objective function with respect to parameter p.
state – State dict.
group – Group dict.
- Raises
NotImplementedError –
- step(closure=None)#
Basic step function that loops over the model parameters.
- Parameters
closure – Closure function that can be useful if the loss function has to be called several times, defaults to None
- Returns
Objective function.
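The division of labour between step, init_state and update_fn can be sketched in plain Python as follows. This is a hypothetical illustration, not the batorch source: the names ToyDiffusion and ToyGradientFlow, the one-element lists standing in for tensors, and the way weight_decay is folded into the gradient are all assumptions.

```python
class ToyDiffusion:
    """Hypothetical stand-in for the Diffusion base class (not batorch code)."""

    def __init__(self, params, step_size, weight_decay=0.0):
        # A single parameter group, loosely mirroring the torch.optim layout.
        self.group = {"params": params, "step_size": step_size,
                      "weight_decay": weight_decay}
        self.state = {}  # per-parameter state dicts, keyed by index

    def init_state(self, p, state, group):
        raise NotImplementedError

    def update_fn(self, p, d_p, state, group):
        raise NotImplementedError

    def step(self, grads):
        # Loop over the parameters, lazily create their state, then update.
        for i, (p, d_p) in enumerate(zip(self.group["params"], grads)):
            if i not in self.state:
                self.state[i] = {}
                self.init_state(p, self.state[i], self.group)
            # Assumption: L2 regularization enters as weight_decay * p.
            d_p = d_p + self.group["weight_decay"] * p[0]
            self.update_fn(p, d_p, self.state[i], self.group)


class ToyGradientFlow(ToyDiffusion):
    """Noise-free subclass used only to make the loop observable."""

    def init_state(self, p, state, group):
        state["num_updates"] = 0

    def update_fn(self, p, d_p, state, group):
        p[0] -= group["step_size"] * d_p
        state["num_updates"] += 1


params = [[1.0], [-2.0]]  # each parameter is a mutable one-element list
kernel = ToyGradientFlow(params, step_size=0.1)
kernel.step(grads=[0.5, -0.5])
```

Concrete kernels only need to override init_state and update_fn; calling step on the base class itself raises NotImplementedError, as documented above.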
- class batorch.sgmcmc.diffusions.DiffusionSGLD(*args: Any, **kwargs: Any)[source]#
Implements stochastic gradient Langevin dynamics (SGLD) proposed in the Bayesian learning via stochastic gradient Langevin dynamics paper.
- Parameters
params – Generator that yields the network parameters (model.parameters())
step_size – Step size of the discretized stochastic differential equation.
weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.
- init_state(p, state, group)[source]#
Function used to initialize optional state variables.
- Parameters
p – A parameter (weight or bias).
state – State dict.
group – Group dict.
- update_fn(p, d_p, state, group)[source]#
Perform the Langevin update.
\[\theta^{k+1} = \theta^k - \frac{\epsilon_k}{2}\hat{\nabla}U(\theta^k) + \sqrt{\epsilon_k}\Delta W^{k+1}\]where \(\theta^k\) denotes the network parameters at the k-th iteration, \(\epsilon_k\) is the step size, \(\hat{\nabla}U(\theta^k)\) is the stochastic gradient of the potential function \(U\), and \(\Delta W^{k+1}\) is a centered, normalized Gaussian random variable. If the sequence \(\{\epsilon_k \}_{k\geq 0}\) and the potential function \(U\) satisfy a few conditions, the stationary distribution of the discrete Markov chain approximates the target distribution given by
\[p_{\Theta}(\theta) = c_0 \exp(-U(\theta))\]where \(c_0\) is a (possibly unknown) normalization constant.
- Parameters
p – A parameter (weight or bias).
d_p – Gradient of the objective function with respect to parameter p.
state – State dictionary, not used for SGLD.
group – Group dictionary that contains at least the key step_size.
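The Langevin update above can be tried on a one-dimensional toy target. The sketch below is a plain-Python illustration rather than the batorch implementation; the helper names sgld_update and run_chain are hypothetical, the target is a standard normal (so \(U(\theta) = \theta^2/2\)), and the exact gradient is used in place of a stochastic one.

```python
import math
import random


def sgld_update(theta, grad_u, step_size, rng):
    """One SGLD step for a scalar parameter, following the update above."""
    noise = rng.gauss(0.0, 1.0)  # centered, unit-variance Gaussian
    return theta - 0.5 * step_size * grad_u + math.sqrt(step_size) * noise


def run_chain(num_steps=50000, step_size=0.05, seed=0):
    # Target: standard normal, i.e. U(theta) = theta**2 / 2 and grad U = theta.
    rng = random.Random(seed)
    theta, samples = 3.0, []
    for _ in range(num_steps):
        theta = sgld_update(theta, grad_u=theta, step_size=step_size, rng=rng)
        samples.append(theta)
    return samples[num_steps // 10:]  # drop the first 10% as burn-in


samples = run_chain()
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With a constant step size the chain carries a small discretization bias, so the sample variance is close to, but not exactly, the target variance of 1.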
- class batorch.sgmcmc.diffusions.DiffusionCyclicSGLD(*args: Any, **kwargs: Any)[source]#
Implements the cyclic stochastic gradient Langevin dynamics (SGLD) proposed in the Cyclic Stochastic Gradient MCMC for Bayesian Deep Learning paper.
- Parameters
params – Generator that yields the network parameters (model.parameters())
step_size – Step size of the discretized stochastic differential equation.
weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.
num_cycles – Number of cycles of the cyclic step size.
num_iterations – Total number of sampling iterations.
- init_state(p, state, group)[source]#
Function used to initialize optional state variables.
- Parameters
p – A parameter (weight or bias).
state – State dict.
group – Group dict.
- update_fn(p, d_p, state, group)[source]#
Perform the Langevin update.
\[\theta^{k+1} = \theta^k - \frac{\epsilon_k}{2}\hat{\nabla}U(\theta^k) + \sqrt{\epsilon_k}\Delta W^{k+1}\]where \(\theta^k\) denotes the network parameters at the k-th iteration, \(\epsilon_k\) is the step size, \(\hat{\nabla}U(\theta^k)\) is the stochastic gradient of the potential function \(U\), and \(\Delta W^{k+1}\) is a centered, normalized Gaussian random variable. If the sequence \(\{\epsilon_k \}_{k\geq 0}\) and the potential function \(U\) satisfy a few conditions, the stationary distribution of the discrete Markov chain approximates the target distribution given by
\[p_{\Theta}(\theta) = c_0 \exp(-U(\theta))\]where \(c_0\) is a (possibly unknown) normalization constant.
- Parameters
p – A parameter (weight or bias).
d_p – Gradient of the objective function with respect to parameter p.
state – State dictionary, not used for SGLD.
group – Group dictionary that contains at least the key step_size.
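For the cyclic variant, the step size \(\epsilon_k\) changes with the iteration index. A minimal sketch of the cosine schedule described in the Cyclic Stochastic Gradient MCMC paper is given below; whether batorch uses exactly this formula is an assumption, and the function name cyclic_step_size and the argument initial_step_size are hypothetical.

```python
import math


def cyclic_step_size(k, initial_step_size, num_iterations, num_cycles):
    """Cosine cyclic schedule; k is the 1-based iteration index."""
    cycle_length = math.ceil(num_iterations / num_cycles)
    position = ((k - 1) % cycle_length) / cycle_length  # lies in [0, 1)
    return 0.5 * initial_step_size * (math.cos(math.pi * position) + 1.0)


# 100 iterations split into 4 cycles of 25 iterations each.
schedule = [cyclic_step_size(k, 1e-2, num_iterations=100, num_cycles=4)
            for k in range(1, 101)]
```

Each cycle starts at the initial step size (exploration) and decays towards zero (sampling) before restarting at the next cycle.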
- class batorch.sgmcmc.diffusions.DiffusionpSGLD(*args: Any, **kwargs: Any)[source]#
Implements the preconditioned stochastic gradient Langevin dynamics (pSGLD) proposed in the Preconditioned stochastic gradient Langevin dynamics for deep neural networks paper.
- Parameters
params – Generator that yields the network parameters (model.parameters())
step_size – Step size of the discretized stochastic differential equation.
weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.
alpha – momentum factor with values in \([0,1]\)
eps – diagonal perturbation that prevents the preconditioner from degenerating
- init_state(p, state, group)[source]#
Initializes a state variable for storing the following estimation of the squared stochastic gradient:
\[L(\theta^k) = \alpha L(\theta^{k-1}) + (1-\alpha) \hat{\nabla}U(\theta^{k-1}) \circ \hat{\nabla}U(\theta^{k-1})\]The operator \(\circ\) denotes the element-wise product.
- Parameters
p – A parameter (weight or bias).
state – State dict.
group – Group dict.
- update_fn(p, d_p, state, group)[source]#
Performs the preconditioned Langevin update.
\[\theta^{k+1} = \theta^k - \frac{\epsilon_k}{2}\mathbf{D}(\theta^k)\hat{\nabla}U(\theta^k) + \sqrt{\epsilon_k\mathbf{D}(\theta^k)}\Delta W^{k+1}\]where \(\theta^k\) denotes the network parameters at the k-th iteration, \(\epsilon_k\) is the step size, \(\hat{\nabla}U(\theta^k)\) is the stochastic gradient of the potential function \(U\), and \(\Delta W^{k+1}\) is a centered, normalized Gaussian random variable. The matrix \(\mathbf{D}\) is referred to as the preconditioner and takes the form
\[\mathbf{D}(\theta^k) = \mathrm{diag}\left(\lambda \mathbf{I} + \sqrt{L(\theta^k)}\right)^{-1}\]where \(\lambda\) is the diagonal perturbation (the eps parameter). It should be noted that this algorithm is missing an extra term, \(\Gamma(\theta^k)\), which is discarded for computational efficiency.
- Parameters
p – A parameter (weight or bias).
d_p – Gradient of the objective function with respect to parameter p.
state – State dictionary that has the key square_avg.
group – Group dictionary that has the keys step_size, alpha, and eps.
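The interplay between the running average \(L\) and the preconditioner \(\mathbf{D}\) can be sketched element-wise in plain Python. This is an illustration under stated assumptions, not the batorch code: the function name psgld_update is hypothetical, parameters are plain lists of floats, and the running average is refreshed with the current gradient before the step (RMSprop style), which may differ from the index convention used by the library.

```python
import math
import random


def psgld_update(theta, grad_u, square_avg, step_size, alpha, eps, rng):
    """One pSGLD step on lists of scalars (diagonal preconditioner)."""
    new_theta, new_square_avg = [], []
    for t, g, s in zip(theta, grad_u, square_avg):
        s = alpha * s + (1.0 - alpha) * g * g   # running average L
        d = 1.0 / (eps + math.sqrt(s))          # diagonal entry of D
        noise = rng.gauss(0.0, 1.0)
        t = t - 0.5 * step_size * d * g + math.sqrt(step_size * d) * noise
        new_theta.append(t)
        new_square_avg.append(s)
    return new_theta, new_square_avg


rng = random.Random(0)
theta, square_avg = [1.0, -1.0], [0.0, 0.0]
theta, square_avg = psgld_update(theta, [2.0, -0.5], square_avg,
                                 step_size=1e-2, alpha=0.99, eps=1e-5, rng=rng)
```

Coordinates with large gradient magnitudes get a smaller effective step and less injected noise, which is the purpose of the preconditioner.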
- class batorch.sgmcmc.diffusions.DiffusionSGHMC(*args: Any, **kwargs: Any)[source]#
Implements the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm of the Stochastic Gradient Hamiltonian Monte Carlo paper but without performing the change of variable.
- Parameters
params – Generator that yields the network parameters (model.parameters())
step_size – Step size of the discretized stochastic differential equation.
weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.
damping – Damping parameter. Default: 1.0.
- init_state(p, state, group)[source]#
Initializes the momentum state variable used by the SGHMC algorithm and denoted by \(v^k\).
- Parameters
p – A parameter (weight or bias).
state – State dict.
group – Group dict.
- update_fn(p, d_p, state, group)[source]#
Performs the SGHMC update.
\[ \begin{align}\begin{aligned}& \theta^{k+1} = \theta^k + \epsilon_k \mathbf{M}^{-1} v^k\\& v^{k+1} = v^k - \epsilon_k\hat{\nabla}U(\theta^k) - \epsilon_k \mathbf{C}\mathbf{M}^{-1}v^k + \sqrt{2\mathbf{C}\epsilon_k} \, \Delta W^{k+1}\end{aligned}\end{align} \]The mass and damping matrices are chosen as \(\mathbf{M} = \mathbf{I}\) and \(\mathbf{C} = f\mathbf{I}\), where \(f\) is the damping parameter.
- Parameters
p – A parameter (weight or bias).
d_p – Gradient of the objective function with respect to parameter p.
state – State dictionary that holds the momentum \(v^k\) initialized by init_state.
group – Group dictionary that has at least the keys step_size and damping.
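The two-line SGHMC update above, with \(\mathbf{M} = \mathbf{I}\) and \(\mathbf{C} = f\mathbf{I}\), reduces to the following scalar sketch. It is a plain-Python illustration and not the batorch implementation; the function name sghmc_update is hypothetical, and the exact gradient of a standard normal target replaces the stochastic gradient.

```python
import math
import random


def sghmc_update(theta, v, grad_u, step_size, damping, rng):
    """One SGHMC step for scalars with M = 1 and C = damping."""
    noise = rng.gauss(0.0, 1.0)
    new_theta = theta + step_size * v  # position update uses v^k
    new_v = (v - step_size * grad_u - step_size * damping * v
             + math.sqrt(2.0 * damping * step_size) * noise)
    return new_theta, new_v


rng = random.Random(0)
theta, v = 3.0, 0.0
thetas = []
for _ in range(50000):
    # Target: standard normal, so grad U(theta) = theta.
    theta, v = sghmc_update(theta, v, grad_u=theta, step_size=0.05,
                            damping=1.0, rng=rng)
    thetas.append(theta)
thetas = thetas[5000:]  # discard burn-in
mean = sum(thetas) / len(thetas)
var = sum((t - mean) ** 2 for t in thetas) / len(thetas)
```

Both updates read the old pair \((\theta^k, v^k)\), matching the equations above; the damping term and the injected noise balance each other so that the chain stays close to the target.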
- class batorch.sgmcmc.diffusions.DiffusionCyclicSGHMC(*args: Any, **kwargs: Any)[source]#
Implements the cyclic stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm proposed in the Cyclic Stochastic Gradient MCMC for Bayesian Deep Learning paper.
- Parameters
params – Generator that yields the network parameters (model.parameters())
step_size – Step size of the discretized stochastic differential equation.
weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.
damping – Damping parameter. Default: 1.0.
num_cycles – Number of cycles of the cyclic step size.
num_iterations – Total number of sampling iterations.
- init_state(p, state, group)[source]#
Initializes the momentum state variable used by the SGHMC algorithm and denoted by \(v^k\).
- Parameters
p – A parameter (weight or bias).
state – State dict.
group – Group dict.
- update_fn(p, d_p, state, group)[source]#
Performs the SGHMC update.
\[ \begin{align}\begin{aligned}& \theta^{k+1} = \theta^k + \epsilon_k \mathbf{M}^{-1} v^k\\& v^{k+1} = v^k - \epsilon_k\hat{\nabla}U(\theta^k) - \epsilon_k \mathbf{C}\mathbf{M}^{-1}v^k + \sqrt{2\mathbf{C}\epsilon_k} \, \Delta W^{k+1}\end{aligned}\end{align} \]The mass and damping matrices are chosen as \(\mathbf{M} = \mathbf{I}\) and \(\mathbf{C} = f\mathbf{I}\), where \(f\) is the damping parameter.
- Parameters
p – A parameter (weight or bias).
d_p – Gradient of the objective function with respect to parameter p.
state – State dictionary that holds the momentum \(v^k\) initialized by init_state.
group – Group dictionary that has at least the keys step_size and damping.
- class batorch.sgmcmc.diffusions.DiffusionSGHMCSA(*args: Any, **kwargs: Any)[source]#
Implements the scale adaptive SGHMC algorithm proposed in the Bayesian optimization with robust Bayesian neural networks paper. This algorithm uses the burn-in phase to adapt its hyperparameters (not including the step size).
- Parameters
params – Generator that yields the network parameters (model.parameters())
step_size – Step size of the discretized stochastic differential equation.
num_burnin_steps – Number of burn-in steps used to estimate the algorithm hyperparameters.
weight_decay – Optional L2 regularization that can be seen as placing a Gaussian prior distribution on the network parameters. Default: 0.0.
mdecay – Momentum decay per time step. Default: 0.05.
- init_state(p, state, group)[source]#
Initializes a momentum and three additional state variables \(\tau, g, \hat{v}\).
- Parameters
p – A parameter (weight or bias).
state – State dict.
group – Group dict.
- update_fn(p, d_p, state, group)[source]#
Performs the scale adaptive SGHMC update.
\[ \begin{align}\begin{aligned}& \theta^{k+1} = \theta^k + v^k\\& v^{k+1} = v^k - \epsilon^2_k\hat{\mathbf{L}}^{-1/2}\hat{\nabla}U(\theta^k) - \epsilon_k \hat{\mathbf{L}}^{-1/2}\mathbf{C} v^k + \sqrt{2\epsilon_k^3\hat{\mathbf{L}}^{-1/2}\mathbf{C}\hat{\mathbf{L}}^{-1/2}-\epsilon_k^4\mathbf{I}} \, \Delta W^{k+1}\end{aligned}\end{align} \]where \(\theta^k\) denotes the network parameters at the k-th iteration, \(v^k\) is the momentum, \(\epsilon_k\) is the step size, \(\hat{\nabla}U(\theta^k)\) is the stochastic gradient of the potential function \(U\), and \(\Delta W^{k+1}\) is a centered, normalized Gaussian random variable. The stationary distribution of the discrete Markov chain approximates the target distribution
\[p_{\Theta,V}(\theta,v) = c_0 \exp\left(-U(\theta) - \frac{1}{2}v^T v \right)\]The matrix \(\hat{\mathbf{L}}\) denotes the second-order moment of the gradient, which is estimated with an exponential moving average during the burn-in phase. The damping matrix \(\mathbf{C}\) is chosen such that \(\epsilon_k \mathbf{C}\hat{\mathbf{L}} = 0.05\mathbf{I}\).
- Parameters
p – A parameter (weight or bias).
d_p – Gradient of the objective function with respect to parameter p.
state – State dictionary that holds the momentum and the variables \(\tau, g, \hat{v}\) initialized by init_state.
group – Group dictionary that has at least the key step_size.
batorch.sgmcmc.samplers#
Function that can be used to define samplers based on a chosen Discretization (gradient estimator, integrator), and a chosen type of Diffusion (transition kernel). |
|
|
SGLD sampler. |
|
Preconditioned SGLD sampler. |
|
SGLD sampler with control variates. |
|
SGLD sampler with fixed point variance reduction. |
|
SGHMC sampler with a single leapfrog step. |
|
Scale adaptive SGHMC sampler with a single leapfrog step. |
|
SGHMC sampler with control variates and a single leapfrog step. |
|
SGHMC sampler with fixed point variance reduction and a single leapfrog step. |
- class batorch.sgmcmc.samplers.Sampler(DiffusionType: type, negloglikelihood: torch.nn.Module, neglogprior: Callable, init_params: Union[None, str, OrderedDict], dataloader: torch.utils.data.DataLoader, **kwargs)[source]#
Base class for implementing samplers.
- step(x, y)[source]#
This method updates the parameters and returns the loss.
- Parameters
x – Input data that will be passed to the negative log-likelihood.
y – Output tensor that will be compared to the prediction.
- Raises
NotImplementedError –
- get_params()[source]#
Function that flattens the network parameters.
- Returns
Flattened network parameters.
- get_grads()[source]#
Function that flattens the network gradients.
- Returns
Flattened network gradients.
- class batorch.sgmcmc.samplers.EulerExplicit(DiffusionType: type, negloglikelihood: torch.nn.Module, neglogprior: Callable, init_params: Union[None, str, OrderedDict], dataloader: torch.utils.data.DataLoader, **kwargs)[source]#
This class implements an SGMCMC update using a standard stochastic gradient (SG) estimation and an Euler integrator.
- Parameters
DiffusionType – type of Diffusion.
negloglikelihood – torch.nn.Module object corresponding to a negative log-likelihood function.
prior – torch.distributions object corresponding to a prior distribution.
init_params – Optional path to a torch state_dict for initializing the model parameters.
dataloader – An already instantiated dataloader.
kwargs – Any optional keyword arguments that will be passed to the Diffusion.
- estimate_gradients(x, y)[source]#
Function that estimates the gradient of the negative log posterior with a standard mini-batch stochastic gradient.
\[\hat{\nabla}U(\theta) = -\frac{N}{|B_k|} \sum_{i\in B_k} \nabla_\theta\log(p(y_i|x_i,\theta)) - \nabla\log(p(\theta))\]where \(N\) is the number of samples in the whole dataset, \(B_k\) is a mini-batch of indices drawn by the dataloader, \(p(y_i|x_i,\theta)\) denotes the likelihood function, and \(p(\theta)\) is the prior distribution of the weights \(\theta\).
- Parameters
x – Input data that will be passed to the negative log-likelihood.
y – Output tensor that will be compared to the prediction.
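The rescaling by \(N/|B_k|\) is what makes the mini-batch estimate unbiased for the full-data gradient. The plain-Python sketch below checks this numerically on a toy model; it is not the batorch code, all function names are hypothetical, and the Gaussian likelihood \(y_i \sim \mathcal{N}(\theta, 1)\) and the Gaussian prior are assumptions chosen so that the gradients have a closed form.

```python
import random


def grad_neg_log_lik(theta, y):
    # Gaussian likelihood y ~ N(theta, 1): -d/dtheta log p(y|theta) = theta - y.
    return theta - y


def grad_neg_log_prior(theta, weight_decay=1.0):
    # Gaussian prior N(0, 1/weight_decay): -d/dtheta log p(theta).
    return weight_decay * theta


def minibatch_gradient(theta, data, batch_indices):
    n, b = len(data), len(batch_indices)
    likelihood_term = sum(grad_neg_log_lik(theta, data[i]) for i in batch_indices)
    return (n / b) * likelihood_term + grad_neg_log_prior(theta)


def full_gradient(theta, data):
    # Using every index once recovers the exact gradient of U.
    return minibatch_gradient(theta, data, range(len(data)))


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(200)]
theta = 0.5
estimates = [minibatch_gradient(theta, data, rng.sample(range(len(data)), 20))
             for _ in range(2000)]
average_estimate = sum(estimates) / len(estimates)
exact = full_gradient(theta, data)
```

Individual estimates scatter widely around the exact value, but their average converges to it as more mini-batches are drawn.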
- step(x, y)[source]#
Updates the network parameters according to the chosen Diffusion using the standard stochastic gradient approximation. If \(\varphi\) denotes the update function of the underlying Diffusion, then this function performs the following:
Zero out the gradients.
Estimate the stochastic gradient \(\hat{\nabla}U(\theta^k)\) for the current value of the network parameters.
Update the parameters:
\[\theta^{k+1} = \varphi(\theta^k, \hat{\nabla}U(\theta^k))\]This sampler can be used with any Diffusion. Note that if DiffusionSGHMC is used here, then no leapfrog steps will be performed. If you want to perform \(L > 1\) leapfrog steps, please see the specific LeapfrogStochasticGradient.
- Parameters
x – Input data that will be passed to the negative log-likelihood.
y – Output tensor that will be compared to the prediction.
- class batorch.sgmcmc.samplers.EulerExplicitControlVariates(DiffusionType: type, negloglikelihood: torch.nn.Module, neglogprior: Union[object, torch.distributions.Distribution], init_params: Union[None, str, OrderedDict], init_control_params: Union[str, OrderedDict], dataloader: torch.utils.data.DataLoader, **kwargs)[source]#
This class implements an SGMCMC update using the variance reduction technique proposed in the paper Control variates for stochastic gradient MCMC, and an Euler integrator.
The centering value \(\hat{\theta}\) is provided as a state_dict by the user via the init_control_params parameter.
- Parameters
init_params – Optional path to a torch state_dict for initializing the model parameters.
init_control_params – Control variate.
DiffusionType – type of Diffusion.
negloglikelihood – torch.nn.Module object corresponding to a negative log-likelihood function.
prior – torch.distributions object corresponding to a prior distribution.
dataloader – An already instantiated dataloader.
kwargs – Any additional keyword arguments that will be passed to the Diffusion.
- compute_full_center_gradient(dataloader)[source]#
Method that computes the true gradient of the log likelihood using the centering value \(\hat{\theta}\) chosen by the user.
\[\nabla U(\hat{\theta}) = -\sum_{i=1}^{N} \nabla_{\theta} \log(p(y_i|x_i,\hat{\theta}))\]where \(N\) denotes the number of samples in the whole dataset and \(p(y_i|x_i,\hat{\theta})\) is the likelihood function.
- estimate_gradients(x, y)[source]#
Method that estimates the gradient of the log likelihood with a variance reduction technique (control variates).
\[\hat{\nabla}U(\theta) = -\sum_{i=1}^{N} \nabla_{\theta} \log(p(y_i|x_i,\hat{\theta})) -\frac{N}{|B_k|} \sum_{i \in B_k} \left( \nabla_{\theta}\log(p(y_i|x_i,\theta)) - \nabla_{\theta}\log(p(y_i|x_i,\hat{\theta}))\right) - \nabla\log(p(\theta))\]where \(N\) is the number of samples in the whole dataset, \(B_k\) is a mini-batch of indices drawn by the dataloader, \(\theta \mapsto p(\cdot|\cdot,\theta)\) denotes the likelihood function, \(p(\theta)\) is the prior distribution of the weights \(\theta\), and \(\hat{\theta}\) denotes the control variates.
- Parameters
x – Input data that will be passed to the negative log-likelihood.
y – Output tensor that will be compared to the prediction.
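The benefit of the control-variate estimator is easiest to see when \(\theta\) is close to the centering value \(\hat{\theta}\): the mini-batch correction is then small, so most of the estimate comes from the exact full-data gradient at \(\hat{\theta}\). The sketch below compares the empirical variance of both estimators in plain Python. It is not the batorch code; the function names, the scalar logistic-regression likelihood, and the standard normal prior are assumptions made for illustration.

```python
import math
import random


def grad_neg_log_lik(theta, x, y):
    # Scalar logistic regression with y in {0, 1}:
    # -d/dtheta log p(y|x,theta) = (sigmoid(theta * x) - y) * x.
    return (1.0 / (1.0 + math.exp(-theta * x)) - y) * x


def grad_neg_log_prior(theta):
    return theta  # standard normal prior (assumed for illustration)


def full_center_gradient(theta_hat, data):
    return sum(grad_neg_log_lik(theta_hat, x, y) for x, y in data)


def cv_gradient(theta, theta_hat, center_grad, data, batch_indices):
    n, b = len(data), len(batch_indices)
    correction = sum(grad_neg_log_lik(theta, *data[i])
                     - grad_neg_log_lik(theta_hat, *data[i])
                     for i in batch_indices)
    return center_grad + (n / b) * correction + grad_neg_log_prior(theta)


def plain_gradient(theta, data, batch_indices):
    n, b = len(data), len(batch_indices)
    total = sum(grad_neg_log_lik(theta, *data[i]) for i in batch_indices)
    return (n / b) * total + grad_neg_log_prior(theta)


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), rng.randint(0, 1)) for _ in range(500)]
theta_hat, theta = 0.3, 0.35  # theta close to the centering value
center_grad = full_center_gradient(theta_hat, data)
batches = [rng.sample(range(len(data)), 25) for _ in range(500)]
var_cv = variance([cv_gradient(theta, theta_hat, center_grad, data, b)
                   for b in batches])
var_plain = variance([plain_gradient(theta, data, b) for b in batches])
```

The variance reduction degrades as \(\theta\) moves away from \(\hat{\theta}\), which is why the choice of the centering value matters.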
- class batorch.sgmcmc.samplers.EulerExplicitFixedPoint(DiffusionType: type, negloglikelihood: torch.nn.Module, neglogprior: Union[object, torch.distributions.Distribution], init_params: None, init_control_params: str, dataloader: torch.utils.data.DataLoader, m_iter: int, **kwargs)[source]#
This class implements an SGMCMC update using the variance reduction technique proposed in the paper Variance reduction in stochastic gradient Langevin dynamics, and an Euler integrator.
It extends the class EulerStochasticGradientCV with an additional method that updates the centering parameter \(\hat{\theta}\) every m_iter iterations.
- Parameters
init_params – Optional path to a torch state_dict for initializing the model parameters.
init_control_params – Initial control variate.
DiffusionType – type of Diffusion.
negloglikelihood – torch.nn.Module object corresponding to a negative log-likelihood function.
prior – torch.distributions object corresponding to a prior distribution.
datamodule – An already instantiated DataModule.
num_leapfrogs – Number of leapfrog steps to perform at each update.
m_iter – Number of iterations after which the centering parameters are replaced by the current values of the network parameters.
kwargs – Any additional keyword arguments that will be passed to the Diffusion.
- step(x, y)[source]#
Method that estimates the gradient of the log likelihood with a variance reduction technique (control variates). The centering parameter \(\hat{\theta}\) is updated whenever the current iteration count is a multiple of m_iter.
- Parameters
x – Input data that will be passed to the negative log-likelihood.
y – Output tensor that will be compared to the prediction.
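The periodic refresh of the centering parameter can be sketched as a small loop. This is a hypothetical plain-Python illustration, not the batorch implementation: the helper run_fixed_point, its callable arguments, and the choice to refresh after (rather than before) the parameter update are assumptions, and the toy step is noise-free so that the result is easy to follow.

```python
def run_fixed_point(num_iterations, m_iter, theta0, full_grad, step):
    """Track when the centering value is refreshed (hypothetical helper)."""
    theta, theta_hat = theta0, theta0
    center_grad = full_grad(theta_hat)
    refresh_iterations = []
    for k in range(1, num_iterations + 1):
        theta = step(theta, theta_hat, center_grad)
        if k % m_iter == 0:
            # Replace the center by the current parameters and recompute
            # the full-data gradient at the new center.
            theta_hat = theta
            center_grad = full_grad(theta_hat)
            refresh_iterations.append(k)
    return theta, theta_hat, refresh_iterations


# Toy problem: U(theta) = theta**2 / 2, so grad U(theta) = theta and the
# control-variate estimate g_hat + (theta - theta_hat) equals theta.
theta, theta_hat, refreshed = run_fixed_point(
    num_iterations=10, m_iter=4, theta0=1.0,
    full_grad=lambda t: t,
    step=lambda t, t_hat, g_hat: t - 0.1 * (g_hat + (t - t_hat)),
)
```

With m_iter=4 and 10 iterations the center is refreshed at iterations 4 and 8, so it keeps following the chain instead of staying at its initial value.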
- batorch.sgmcmc.samplers.SamplerFactory(SamplerType, DiffusionType)[source]#
Function that can be used to define samplers based on a chosen Discretization (gradient estimator, integrator), and a chosen type of Diffusion (transition kernel).
Several samplers are already built-in. An example is shown below.
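from batorch.sgmcmc.samplers import EulerExplicit, SamplerFactory
from batorch.sgmcmc.diffusions import DiffusionSGLD

# Combine the Euler discretization with the SGLD transition kernel.
SamplerSGLD = SamplerFactory(EulerExplicit, DiffusionSGLD)
# negloglikelihood, neglogprior and datamodule are assumed to be defined by the user.
sampler = SamplerSGLD(negloglikelihood=negloglikelihood,
                      neglogprior=neglogprior,
                      dataloader=datamodule.train_dataloader(),
                      init_params=None,
                      step_size=1e-3)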