Machine Learning

Exploring New Hyperparameter Dimensions with Laplace Approximated Bayesian Optimization | by Arnaud Capitaine | January 2025

Is it better than grid search?

Towards Data Science
Image by the author, from Canva

When I notice my model is overfitting, I often think, "It's time to regularize". But how do I decide which regularization method to use (L1, L2) and what coefficient to choose? Usually, I perform hyperparameter optimization with a grid search to select the settings. However, what happens when the independent variables have different scales or different levels of influence? Can I design a hyperparameter grid with a different regularization coefficient for each variable? Is this kind of optimization feasible in high-dimensional spaces? And are there alternative ways to design it? Let's explore this with a hypothetical example.

My fictitious example is a binary classification use case with 3 explanatory variables. Each of these variables is categorical and has 6 different categories. My reproducible use case is in this notebook. The function that generates the dataset is the following:

import numpy as np
import pandas as pd
from scipy.special import expit


def get_classification_dataset():
    n_samples = 200
    cats = ["a", "b", "c", "d", "e", "f"]

    # 3 categorical features, each drawn uniformly from the same categories
    X = pd.DataFrame(
        data={
            "col1": np.random.choice(cats, size=n_samples),
            "col2": np.random.choice(cats, size=n_samples),
            "col3": np.random.choice(cats, size=n_samples),
        }
    )
    X_preprocessed = pd.get_dummies(X)

    # true coefficients: each variable has a different prior scale
    # (1e-1 for col1, 1 for col2, 1e1 for col3)
    theta = np.random.multivariate_normal(
        np.zeros(len(cats) * X.shape[1]),
        np.diag(np.array([1e-1] * len(cats) + [1] * len(cats) + [1e1] * len(cats))),
    )

    # binary target drawn from a logistic model
    y = pd.Series(
        data=np.random.binomial(1, expit(np.dot(X_preprocessed.to_numpy(), theta))),
        index=X_preprocessed.index,
    )
    return X_preprocessed, y
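As a quick sanity check, here is a minimal usage sketch of the function above (the shapes follow from the 3 columns and 6 categories):

X, y = get_classification_dataset()
print(X.shape)  # typically (200, 18): 3 columns x 6 one-hot encoded categories
print(y.value_counts())  # rough class balance of the binary target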

For information, I deliberately chose 3 different scales for the diagonal of the theta covariance matrix to show the benefit of the Laplace approximated Bayesian optimization method. If the scales were more or less the same, the interest would be limited.

After a simple baseline model that predicts the mean observed value of the training dataset (used for comparison purposes), I chose to design a slightly more complex model. I decided to one-hot encode the three independent variables and train a logistic regression model on top of this basic preprocessing. For regularization, I chose the L2 penalty and aimed to find the right regularization coefficient using two techniques: grid search and Laplace approximated Bayesian optimization, as you may have guessed by now. Finally, I evaluated the model on a test dataset using two (arbitrarily chosen) metrics: log loss and ROC AUC.
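To make the grid search baseline concrete, here is a minimal sketch of how such a pipeline could look with scikit-learn (the exact grid, cross-validation setup, and train/test split are my assumptions, not necessarily the author's):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = get_classification_dataset()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# baseline: always predict the observed frequency of the training target
baseline = DummyClassifier(strategy="prior").fit(X_train, y_train)

# L2 regularized logistic regression, single coefficient tuned by grid search
grid = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid={"C": np.logspace(-3, 3, 13)},  # C is the inverse of the regularization strength
    scoring="neg_log_loss",
    cv=5,
).fit(X_train, y_train)

for name, model in [("baseline", baseline), ("logistic + grid search", grid)]:
    proba = model.predict_proba(X_test)[:, 1]
    print(name, log_loss(y_test, proba), roc_auc_score(y_test, proba))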

Before presenting the results, let's first take a closer look at the Bayesian model and how we optimize it.

In the Bayesian framework, the parameters are no longer fixed constants but random variables. Instead of maximizing the likelihood to estimate these unknown parameters, we now optimize the posterior distribution of the random parameters, given the observed data. This requires us to choose, often somewhat arbitrarily, the design and the parameters of the prior. However, it is also possible to treat the parameters of the prior as random variables themselves – like in Inception, where layers of uncertainty keep stacking on top of each other…

For this study, I chose the following model:
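A sketch of this hierarchical model, written out from the description below (the exact parameterization, in particular of the Gamma prior on the precisions, is my assumption):

\[
Y_i \mid \theta \sim \mathrm{Bernoulli}\big(\mathrm{expit}(x_i^\top \theta)\big), \qquad
\theta \mid \Sigma \sim \mathcal{N}(0, \Sigma), \qquad
\Sigma_{jj}^{-1} \sim \mathrm{Gamma}(\alpha, \beta),
\]

with \(\Sigma\) a diagonal covariance matrix, so that \(\Sigma_{jj}^{-1} = 1/\sigma_j^2\) is the precision of the j-th coefficient.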

Logically, I chose a Bernoulli distribution for Y_i | θ, a normal distribution centered at zero, which corresponds to the L2 regularization, for θ | Σ, and finally, for Σ_i^{-1}, I chose a Gamma distribution. I chose to model the precision matrix instead of the covariance matrix, as is customary in the literature, like in the scikit-learn user guide for Bayesian linear regression [2].

In addition to this written model, I assumed that the Y_i and Y_j are conditionally independent (given θ), as well as Y_i and Σ.

Likelihood

According to the model, the likelihood can consequently be written as follows:
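A sketch of this likelihood, as I reconstruct it from the model above and the discussion that follows (the identity holds for any value of \(\theta\)):

\[
P(\Sigma \mid Y = y) \;=\; \frac{P(Y = y \mid \theta)\, P(\theta \mid \Sigma)\, P(\Sigma)}{P(\theta \mid Y = y, \Sigma)\, P(Y = y)}.
\]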

To optimize, we need to evaluate nearly all of the terms, except P(Y=y). The terms in the numerator can be evaluated with the chosen model. However, the remaining term in the denominator cannot. This is where the Laplace approximation comes into play.

Laplace approximation

To evaluate the first term of the denominator, we can use the Laplace approximation. We approximate the distribution of θ | Y, Σ with:
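A sketch of the approximation (my reconstruction, consistent with the Hessian computed in the code below):

\[
P(\theta \mid Y = y, \Sigma) \;\approx\; \mathcal{N}\!\big(\theta^*, H^{-1}\big),
\qquad
H = -\nabla^2_\theta \log P(\theta \mid Y = y, \Sigma)\Big|_{\theta = \theta^*},
\]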

with θ* being the mode of the density of θ | Y, Σ.

Even if we do not know the density function, we can evaluate the Hessian thanks to the following decomposition:
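A sketch of this decomposition (my reconstruction; the last term does not depend on θ, so it vanishes when differentiating twice with respect to θ):

\[
\log P(\theta \mid Y = y, \Sigma) = \log P(Y = y \mid \theta) + \log P(\theta \mid \Sigma) - \log P(Y = y \mid \Sigma).
\]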

We only need to know the first two terms of this decomposition to evaluate the Hessian, which is what we do.

For those interested in further explanation, I recommend section 4.4, "The Laplace Approximation", from Pattern Recognition and Machine Learning by Christopher M. Bishop [1]. It helped me a lot to understand the approximation.

Laplace approximated likelihood

Finally, the Laplace approximated likelihood is:
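In log form, a sketch of the resulting objective (my reconstruction, matching the terms summed in the loss function below, with d the number of coefficients):

\[
\log P(\Sigma \mid Y = y) \;\approx\; \log P(Y = y \mid \theta^*) + \log P(\theta^* \mid \Sigma) + \log P(\Sigma) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\lvert H\rvert - \log P(Y = y).
\]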

Once we have approximated the density function of θ | Y, Σ, we could evaluate the likelihood at any θ we want, if the approximation were accurate everywhere. For the sake of simplicity, and because the approximation is accurate only close to the mode, we evaluate the likelihood at θ*.

Here is the function that evaluates this loss for a given (scalar) precision p = 1/σ², in addition to the observations X and y and the prior parameters α and β.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

from module.bayesian_model import BayesianLogisticRegression

def loss(p, X, y, alpha, beta):
    # computation of the loss for given values:
    # - p: precision, i.e. 1/sigma², shared by all coefficients
    # - X: matrix of features
    # - y: vector of observations
    # - alpha: prior Gamma distribution alpha parameter over 1/sigma²
    # - beta: prior Gamma distribution beta parameter over 1/sigma²
    n_feat = X.shape[1]
    m_vec = np.array([0] * n_feat)
    p_vec = np.array([p] * n_feat)

    model = BayesianLogisticRegression()

    # computation of theta*, the mode of theta | Y, Sigma
    res = minimize(
        model._loss,
        np.array([0] * n_feat),
        args=(X, y, m_vec, p_vec),
        method="BFGS",
        jac=model._jac,
    )
    theta_star = res.x

    # computation of the Hessian for the Laplace approximation
    H = model._hess(theta_star, X, y, m_vec, p_vec)

    # loss
    out = 0
    ## first two terms: the log loss and the regularization term
    out += model._loss(theta_star, X, y, m_vec, p_vec)
    ## third term: prior distribution over sigma, written p here
    out -= gamma.logpdf(p, a=alpha, scale=1 / beta)
    ## fourth term: Laplace approximated last term
    out += 0.5 * np.linalg.slogdet(H)[1] - 0.5 * n_feat * np.log(2 * np.pi)

    return out

In my use case, I chose to optimize it using the Adam optimizer, whose code was taken from this repo.

import numpy as np
from scipy.optimize import OptimizeResult

def adam(
    fun,
    x0,
    jac,
    args=(),
    learning_rate=0.001,
    beta1=0.9,
    beta2=0.999,
    eps=1e-8,
    startiter=0,
    maxiter=1000,
    callback=None,
    **kwargs
):
    """``scipy.optimize.minimize`` compatible implementation of ADAM -
    [http://arxiv.org/pdf/1412.6980.pdf].
    Adapted from ``autograd/misc/optimizers.py``.
    """
    x = x0
    m = np.zeros_like(x)
    v = np.zeros_like(x)

    for i in range(startiter, startiter + maxiter):
        g = jac(x, *args)

        if callback and callback(x):
            break

        m = (1 - beta1) * g + beta1 * m  # first moment estimate.
        v = (1 - beta2) * (g**2) + beta2 * v  # second moment estimate.
        mhat = m / (1 - beta1**(i + 1))  # bias correction.
        vhat = v / (1 - beta2**(i + 1))
        x = x - learning_rate * mhat / (np.sqrt(vhat) + eps)

    i += 1
    return OptimizeResult(x=x, fun=fun(x, *args), jac=g, nit=i, nfev=i, success=True)

This optimizer needs the gradient of the previous loss. Since there is no analytical form, I decided to use a numerical approximation of the gradient.
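A minimal sketch of such a numerical gradient, using central finite differences (the helper name and step size are my assumptions):

import numpy as np

def numerical_jac(fun, x, *args, eps=1e-6):
    # central finite differences approximation of the gradient of fun at x
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        grad[j] = (fun(x + step, *args) - fun(x - step, *args)) / (2 * eps)
    return grad

It could then be passed to the adam function above as jac=lambda x, *args: numerical_jac(loss, x, *args).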

Once the model is trained on the training dataset, it is necessary to make predictions on the test dataset to evaluate its performance and compare models. However, it is not possible to compute the exact predictive distribution of a new point directly, since the computation is intractable.

It is possible to approximate the predictive distribution for a new point by using the approximated posterior of θ obtained above.
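A simple plug-in sketch of such an approximation (the exact form is my assumption; it evaluates the predictive probability at the mode θ* instead of integrating over the full posterior):

\[
P(Y_{\mathrm{new}} = 1 \mid x_{\mathrm{new}}, Y = y) \;\approx\; \mathrm{expit}(x_{\mathrm{new}}^\top \theta^*).
\]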

I chose a naive prior over the precision random variable. The dummy baseline model performs poorly, with a log loss of 0.60 and a ROC AUC of 0.50. The second model performs better, with a log loss of 0.44 and a ROC AUC of 0.83, both when the regularization is tuned with grid search and with Bayesian optimization. This shows that the logistic regression model, which uses the explanatory variables, outperforms the dummy model. However, there is no advantage of Bayesian optimization over grid search, so I would stick with grid search for now. Thanks for reading.

… But wait a minute. Why should my parameters all be regularized with the same coefficient? Shouldn't my prior depend on the underlying explanatory variables? Perhaps the parameters for the first explanatory variable could take high values, while those for the second explanatory variable, with its smaller influence, should be close to zero. Let's explore this new dimension.

So far we have considered two techniques, grid search and Bayesian optimization. We can use these same techniques in higher dimensions.

Considering these new dimensions would dramatically increase the number of nodes of my grid. This is why Bayesian optimization makes sense in higher dimensions to find the best regularization coefficients. In the considered use case, I assumed there are 3 regularization parameters, one for each explanatory variable. After one-hot encoding a variable, I assumed the generated new variables all share the same regularization parameter. Hence the total number of regularization parameters is 3, even if there are more than 3 columns as inputs of the logistic regression. A sketch of this column-to-parameter mapping is shown below.
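A minimal sketch of how such a mapping could be built from the one-hot encoded column names produced by pd.get_dummies (the prefix-based rule and the helper name are my assumptions):

def build_col_to_p_id(X_columns):
    # map each one-hot encoded column (e.g. "col1_a") to the index of the
    # regularization parameter of its original variable (e.g. "col1" -> 0)
    original_cols = sorted({col.rsplit("_", 1)[0] for col in X_columns})
    return {col: original_cols.index(col.rsplit("_", 1)[0]) for col in X_columns}

# example: {"col1_a": 0, ..., "col2_a": 1, ..., "col3_f": 2}
col_to_p_id = build_col_to_p_id(X_train.columns)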

I updated the previous loss function with the following code:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

from module.bayesian_model import BayesianLogisticRegression

def loss(p, X, y, alpha, beta, X_columns, col_to_p_id):
    # computation of the loss for given values:
    # - p: vector of precisions, i.e. 1/sigma², one per original variable
    # - X: matrix of features
    # - y: vector of observations
    # - alpha: prior Gamma distribution alpha parameter over 1/sigma²
    # - beta: prior Gamma distribution beta parameter over 1/sigma²
    # - X_columns: list of names of X columns
    # - col_to_p_id: dictionary mapping a column name to a p index,
    #   because many column names can share the same p value
    n_feat = X.shape[1]
    m_vec = np.array([0] * n_feat)
    p_list = []
    for col in X_columns:
        p_list.append(p[col_to_p_id[col]])
    p_vec = np.array(p_list)

    model = BayesianLogisticRegression()

    # computation of theta*, the mode of theta | Y, Sigma
    res = minimize(
        model._loss,
        np.array([0] * n_feat),
        args=(X, y, m_vec, p_vec),
        method="BFGS",
        jac=model._jac,
    )
    theta_star = res.x

    # computation of the Hessian for the Laplace approximation
    H = model._hess(theta_star, X, y, m_vec, p_vec)

    # loss
    out = 0
    ## first two terms: the log loss and the regularization term
    out += model._loss(theta_star, X, y, m_vec, p_vec)
    ## third term: prior distribution over 1/sigma², written p here
    ## there is now a sum as p is a vector
    out -= np.sum(gamma.logpdf(p, a=alpha, scale=1 / beta))
    ## fourth term: Laplace approximated last term
    out += 0.5 * np.linalg.slogdet(H)[1] - 0.5 * n_feat * np.log(2 * np.pi)

    return out
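To tie the pieces together, here is a sketch of how this loss could be minimized with the adam function and the numerical gradient defined above, reusing X_train, y_train, and col_to_p_id from the earlier sketches (the initial values, the Gamma prior parameters, and the optimization over log-precisions to keep them positive are my choices, not necessarily the author's):

import numpy as np

alpha, beta = 1.0, 1.0  # assumed Gamma prior parameters
X_columns = list(X_train.columns)

def loss_log_p(log_p, *args):
    # optimize over log-precisions so that the precisions stay positive
    return loss(np.exp(log_p), *args)

res = adam(
    loss_log_p,
    x0=np.zeros(3),  # one precision per original explanatory variable
    jac=lambda x, *args: numerical_jac(loss_log_p, x, *args),
    args=(X_train, y_train, alpha, beta, X_columns, col_to_p_id),
    maxiter=200,
)
p_best = np.exp(res.x)  # tuned precision (1/sigma²) for each variable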

With this approach, the metrics evaluated on the test dataset are the following: a log loss of 0.39 and a ROC AUC of 0.88, which are better than those of the initial model tuned with grid search, or with the Bayesian approach using a single prior for all the independent variables.

To recap, in my use case the different methods achieve the following metrics (log loss / ROC AUC): 0.60 / 0.50 for the dummy baseline, 0.44 / 0.83 for the logistic regression with a single regularization coefficient, and 0.39 / 0.88 for the version with one regularization coefficient per explanatory variable.

The use case can be reproduced with this notebook.

I created an example to demonstrate the usefulness of the technique. However, I have not been able to find a suitable real-world dataset to fully demonstrate its potential. While working with an actual dataset, I could not get any significant benefit from applying this technique. If you come across one, please let me know, as I would love to see a real-world application of this regularization method.

In conclusion, using Bayesian optimization (with the Laplace approximation if needed) to determine the best regularization parameters may be a good alternative to traditional hyperparameter tuning methods. By leveraging probabilistic models, Bayesian optimization not only reduces the computational cost but also increases the likelihood of finding the right regularization values, especially in high dimensions.

  1. Christopher M. Bishop. (2006). Pattern Recognition and Machine Learning. Springer.
  2. Bayesian Ridge Regression, scikit-learn user guide.
