Derivation and Use of Restricted Boltzmann Machines (Nobel Prize 2024) | by Ryan D'Cunha | January, 2025

Investigating the work of Nobel Prize winner Geoffrey Hinton and building an RBM from scratch using PyTorch

Another recipient of the 2024 Nobel Prize in Physics was Geoffrey Hinton, for his contributions to the field of AI and machine learning. Many people know that he worked on neural networks and is called the "Godfather of AI", but few understand his work. In particular, he pioneered Restricted Boltzmann Machines (RBMs) decades ago.
This article will be a tour of RBMs and will hopefully provide some insight into these complex mathematical machines. I'll provide code implementing RBMs from scratch in PyTorch once we've covered the theory.
RBMs are an unsupervised learning technique (only inputs are used for learning – no output labels). This means we can automatically extract meaningful features from the data without relying on outputs. An RBM is a network with two different types of neurons with binary inputs: visible, x, and hidden, h. Visible neurons take the input data, and hidden neurons learn to detect features/patterns.
In more technical terms, an RBM is a bipartite, undirected graphical model with stochastic binary visible and hidden variables. The main goal of an RBM is to minimize the energy of joint configurations E(x, h), often using contrastive divergence learning (discussed later).
The energy function is not related to physical energy; the name comes from physics/statistical mechanics. Think of it as a scoring function. The energy function E assigns low (energy) scores to configurations x that we want our model to prefer, and high scores to configurations we want it to avoid. The energy function is something we get to choose as model designers.
For RBMs, the energy function is as follows (modeled after the Boltzmann distribution):
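Writing c for the visible biases and b for the hidden biases (matching the variable names in the code later in this article), the energy is:

E(x, h) = −∑_{k,j} W_{kj} h_k x_j − ∑_j c_j x_j − ∑_k b_k h_k = −hᵀWx − cᵀx − bᵀh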
The energy function consists of 3 terms. The first is the interaction between the hidden and visible layers with weights, W. The second is the sum of the bias terms of the visible units. The third is the sum of the bias terms of the hidden units.
With the energy function, we can write the joint probability given by the Boltzmann distribution. With this probability function, we can model our units:
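p(x, h) = e^(−E(x, h)) / Z

where the sum in the partition function Z runs over every possible pair of configurations:

Z = ∑_{x, h} e^(−E(x, h))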
Z is the partition function (also known as the normalizing constant). It is the sum of e^(−E) over all possible configurations of the visible and hidden units x and h. The biggest challenge with Z is that it is typically intractable to compute, because you would need to enumerate every possible configuration of x and h. For example, with binary units, if you have m visible units and n hidden units, you need to sum over 2^(m+n) configurations; at m = n = 20, that is already over a trillion terms. Therefore, we need a way to avoid computing Z.
With those functions and distributions defined, there are some derivations to understand before talking about training and implementation. We have already mentioned the intractability of computing Z in the joint probability distribution. To get around this, we can use Gibbs Sampling. Gibbs Sampling is a Markov Chain Monte Carlo algorithm for sampling from multivariate probability distributions when direct sampling from the joint distribution is difficult, but sampling from the conditional distributions is more feasible [2]. Therefore, we need the conditional distributions.
The main difference between a restricted Boltzmann machine and a fully connected Boltzmann machine is that there are no connections within a layer (no visible–visible or hidden–hidden edges). This means that given the visible layer, all hidden units are conditionally independent, and vice versa. Let's see how that simplifies things by deriving p(x|h):
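Using the conditional independence of the visible units, each unit x_j satisfies:

p(x_j = 1 | h) = σ(c_j + ∑_k W_{kj} h_k) = σ(c_j + hᵀW_{·j})

where σ(a) = 1 / (1 + e^(−a)) is the sigmoid function.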
We can see the conditional distribution simplifies down to a sigmoid function, where W_{·j} is the jᵗʰ column of W. There is a neat derivation, included in the appendix, that proves the first line of this result. Reach out if you're interested! Now let's look at the conditional distribution p(h|x):
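By the same argument on the hidden side:

p(h_k = 1 | x) = σ(b_k + ∑_j W_{kj} x_j) = σ(b_k + W_{k·} x)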
This conditional distribution also simplifies to a sigmoid function, where W_{k·} is the kᵗʰ row of W. Because of the restricted connections in the RBM, the conditional distributions lend themselves to simple calculations during Gibbs Sampling. Now that we understand what exactly an RBM is trying to learn, we can implement it in PyTorch.
As in most of deep learning, we train the model by minimizing the negative log likelihood (NLL). For an RBM:
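NLL = (1/T) ∑_{t=1}^{T} −log p(xᵗ)

where the xᵗ are the T training examples.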
Taking the derivative of this yields:
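∂(−log p(xᵗ))/∂θ = 𝔼_h[ ∂E(xᵗ, h)/∂θ | xᵗ ] − 𝔼_{x,h}[ ∂E(x, h)/∂θ ]

where θ stands for any of the parameters W, b, or c.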
The first term on the right-hand side of the equation is called the positive phase because it pushes the model to lower the energy of the real data. This term involves taking an expectation over the hidden units h given real training data x. The positive phase is easy to calculate because we have the actual training data xᵗ and can compute the expectation over h thanks to conditional independence.
The second term is called the negative phase because it raises the energy of what the model currently thinks is likely. This term involves taking an expectation over both x and h under the model's current distribution. It is hard to compute because we would need to sample from the model's full joint distribution P(x, h) (doing this requires running Markov chains to convergence, which is infeasible to do repeatedly during training). Other approaches require computing Z, which we already established is intractable. To solve this problem of computing the negative phase, we use contrastive divergence.
The main idea of contrastive divergence is to use truncated Gibbs Sampling to obtain a point estimate after k iterations. We replace the expectation in the negative phase with this point estimate.
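In symbols, the approximation is:

𝔼_{x,h}[ ∂E(x, h)/∂θ ] ≈ ∂E(x̃, h̃)/∂θ

where x̃ is the negative sample obtained after k steps of Gibbs Sampling initialized at the training point xᵗ, and h̃ is drawn from p(h | x̃).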
Usually k = 1, but the larger k is, the less biased the gradient estimate will be. I will not show the derivation of the partial derivatives for each parameter (the weight/bias updates), but they can be obtained by differentiating E(x, h) with respect to each variable. There is also a concept called persistent contrastive divergence, where instead of initializing the chain at xᵗ, we initialize it at the negative sample from the previous iteration. However, I won't go too deep into that either, as standard contrastive divergence works well enough.
Building an RBM from scratch involves combining all the concepts we've discussed into one class. In the __init__ constructor, we initialize the weights, the bias term for the visible layer, the bias term for the hidden layer, and the number of iterations for contrastive divergence. All we need is the size of the input data, the size of the hidden variable, and k.
We also need to define how to sample from a Bernoulli distribution. The Bernoulli parameter is clamped to prevent exploding values during training. Both of these pieces are used in the forward (contrastive divergence) pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RBM(nn.Module):
    """Restricted Boltzmann Machine."""

    def __init__(self, D: int, F: int, k: int):
        """Creates an instance of an RBM module.

        Args:
            D: Size of the input data.
            F: Size of the hidden variable.
            k: Number of MCMC iterations for negative sampling.

        The function initializes the weight (W) and biases (c & b).
        """
        super().__init__()
        self.W = nn.Parameter(torch.randn(F, D) * 1e-2)  # Initialized from Normal(mean=0.0, variance=1e-4)
        self.c = nn.Parameter(torch.zeros(D))  # Visible bias, initialized as 0.0
        self.b = nn.Parameter(torch.zeros(F))  # Hidden bias, initialized as 0.0
        self.k = k

    def sample(self, p):
        """Sample from a Bernoulli distribution defined by a given parameter."""
        p = torch.clamp(p, 0, 1)  # Clamp probabilities to [0, 1] before sampling
        return torch.bernoulli(p)
The next methods of the RBM class are the conditional distributions. We've seen both of these distributions before:
    def P_h_x(self, x):
        """Conditional distribution p(h = 1 | x): a sigmoid of a linear function of x."""
        return torch.sigmoid(F.linear(x, self.W, self.b))

    def P_x_h(self, h):
        """Conditional distribution p(x = 1 | h): a sigmoid of a linear function of h."""
        return torch.sigmoid(self.c + torch.matmul(h, self.W))
The final methods of the RBM class are the forward pass and the free energy calculation. The free energy represents the effective energy of the visible units after summing out all possible hidden unit configurations, as shown below. The forward function is classic contrastive divergence with Gibbs Sampling: we initialize x_negative, then for k iterations we compute h_k from P_h_x and x_negative, sample h_k from a Bernoulli distribution, compute x_k from P_x_h and h_k, then sample a new x_negative.
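For binary hidden units, summing h out of the joint distribution gives a closed form for the free energy, which is exactly what the free_energy method computes:

F(x) = −∑_j c_j x_j − ∑_k log(1 + e^(b_k + ∑_j W_{kj} x_j))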
    def free_energy(self, x):
        """Numerically stable free energy calculation."""
        visible = torch.sum(x * self.c, dim=1)
        linear = F.linear(x, self.W, self.b)
        hidden = torch.sum(F.softplus(linear), dim=1)  # log(1 + exp(linear)), computed stably
        return -visible - hidden

    def forward(self, x):
        """Contrastive divergence forward pass."""
        x_negative = x.clone()
        for _ in range(self.k):
            h_k = self.P_h_x(x_negative)   # p(h | x_negative)
            h_k = self.sample(h_k)         # Sample hidden units
            x_k = self.P_x_h(h_k)          # p(x | h_k)
            x_negative = self.sample(x_k)  # Sample new negative visible units
        return x_negative, x_k
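To show how the class might be used, here is a minimal training sketch. The contrastive divergence gradient can be obtained through autograd by using the difference of free energies as a surrogate loss; the random dataset, batch size, and hyperparameters below are illustrative assumptions, not values from this article.

import torch
from torch.utils.data import DataLoader

# Illustrative stand-in for a real dataset (e.g., binarized 28x28 images).
data = (torch.rand(512, 784) > 0.5).float()
loader = DataLoader(data, batch_size=64, shuffle=True)

rbm = RBM(D=784, F=128, k=1)  # assumed sizes for illustration
optimizer = torch.optim.SGD(rbm.parameters(), lr=1e-2)

for epoch in range(5):
    for x in loader:
        x_negative, _ = rbm(x)  # CD-k negative samples
        # The gradient of this surrogate loss matches the CD update:
        # positive phase on the real data, negative phase on the samples.
        loss = rbm.free_energy(x).mean() - rbm.free_energy(x_negative.detach()).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()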
Hopefully this provided a theoretical foundation for RBMs as well as a basic coding class that can be used to train an RBM. If you notice any inconsistencies in the code or elsewhere, feel free to reach out!
Appendix: Since there are no hidden–hidden connections, the full conditional p(h|x) is the product of the per-unit conditional distributions:
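p(h | x) = ∏_{k=1}^{F} p(h_k | x)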
[1] Montúfar, Guido. "Restricted Boltzmann Machines: Introduction and Review." arXiv:1806.07066v1 (June 2018).
[2] https://en.wikipedia.org/wiki/Gibbs_sampling
[3] Hinton, Geoffrey. "Training Products of Experts by Minimizing Contrastive Divergence." Neural Computation (2002).