Bonus 2 Machine Learning “Advent Calendar”: Gradient Descent Variations in Excel

Machine learning models use gradient descent to find the correct values for their weights. Linear regression, logistic regression, neural networks, and large language models all rely on this principle. In previous articles, we used simple gradient descent because it is easy to demonstrate and easy to understand.

The same principle appears at the scale of today's large language models, where training requires adjusting millions or even billions of parameters.

However, actual training rarely uses the basic version, because it is often too slow or too unstable. Modern systems use variants of gradient descent that improve speed, stability, or convergence.

In this bonus article, we focus on these variants. We look at why they exist, what problems they solve, and how they change the update rule. We are not using a dataset here. We use one variable and one function, only to make the behavior visible. The goal is to show the movement, not to train a model.

1. Gradient Descent and Update Mechanism

1.1 Problem setting

To illustrate these ideas, we will not use a dataset here, because datasets introduce noise and make it difficult to observe behavior directly. Instead, we'll use a single function:
f(x) = (x – 2)²

We start at x = 4, and the gradient is:
gradient = 2*(x – 2)

This simple setup eliminates distractions. The goal is not to train the model, but to understand how different optimization rules change the motion to a minimum.
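As a quick sanity check, the function and its gradient translate directly into code. This is a minimal sketch; the function names are my own:

```python
# The toy objective from the article and its analytic gradient.
def f(x):
    return (x - 2) ** 2

def grad(x):
    return 2 * (x - 2)

# At the starting point x = 4, the function value is 4 and the gradient is 4,
# so the first step moves x to the left, toward the minimum at x = 2.
```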

1.2 The structure shared by all variants

Every optimization method that follows in this article is built on the same loop; only the internal update rule becomes more complex.

  • First, we read the current value of x.
  • Then, we calculate the gradient with the expression 2*(x – 2).
  • Finally, we update x according to the specific rule defined by the selected variant.

The destination remains the same and the gradient always points toward it, but the path we take changes from one variant to another. This change in movement is the essence of each variation.

1.3 Gradient descent as a basis

Basic gradient descent uses a direct update based on the current gradient and a fixed learning rate:

x = x – lr * 2*(x – 2)

This is the most intuitive form of the update because the rule is easy to understand and easy to apply. The method is reliable but often slow, and it can diverge if the learning rate is not carefully chosen. It represents the foundation on which all other variants are built.
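The whole loop fits in a few lines. This is a minimal sketch of the same update rule in Python; the learning rate and iteration count are illustrative choices, not values from the spreadsheet:

```python
x = 4.0    # starting point
lr = 0.1   # fixed learning rate
for _ in range(100):
    gradient = 2 * (x - 2)   # gradient of f(x) = (x - 2)^2
    x = x - lr * gradient    # plain gradient descent step
# x has converged very close to the minimum at 2
```

Because each step multiplies the distance to the minimum by (1 − 2·lr) = 0.8, the error shrinks geometrically; with lr larger than 1, the same rule would diverge.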

Gradient descent in Excel – all images by the author

2. Learning Rate Decay

Learning rate decay does not change the update rule itself. It changes the size of the learning rate at every iteration so that the optimization is more stable near the minimum. Larger steps are useful when x is far from the target, but smaller steps are safer when x is close to it. Decay reduces the risk of overshoot and produces a smoother landing.

There is no single decay formula. Several schedules are used in practice:

  • exponential decay
  • inverse decay (the one shown in the spreadsheet)
  • step-based decay
  • linear decay
  • cosine or cyclical schedules

All of these follow the same logic: the learning rate gets smaller over time, but the pattern depends on the chosen schedule.
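For reference, the schedules listed above can be written as simple functions of the iteration number t. This is a sketch with illustrative constants; only the inverse form matches the spreadsheet:

```python
import math

lr0 = 0.1     # initial learning rate (illustrative)
decay = 0.01  # decay strength (illustrative)

def exponential(t):
    return lr0 * math.exp(-decay * t)

def inverse(t):
    # the schedule used in the spreadsheet
    return lr0 / (1 + decay * t)

def step_based(t, drop=0.5, every=50):
    # halve the rate every 50 iterations
    return lr0 * drop ** (t // every)

def linear(t, total=200):
    return lr0 * max(0.0, 1 - t / total)

def cosine(t, total=200):
    return lr0 * 0.5 * (1 + math.cos(math.pi * min(t, total) / total))
```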

In the spreadsheet example, the decay formula is the inverse form:
lr_t = lr / (1 + decay * iteration)

The update rule becomes:
x = x – lr_t * 2*(x – 2)

This schedule starts at the full learning rate on the first iteration, then gradually decreases it. At the beginning of the run, the step size is large enough to move quickly. As x approaches the minimum, the learning rate shrinks, stabilizing the update and avoiding overshoot.
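Put together, the decayed update looks like this in code. This is a sketch matching the formulas above, with illustrative constants:

```python
x = 4.0      # starting point
lr = 0.1     # initial learning rate
decay = 0.01 # decay strength
for t in range(200):
    lr_t = lr / (1 + decay * t)   # inverse decay schedule
    x = x - lr_t * 2 * (x - 2)    # decayed gradient descent step
# x still reaches the minimum at 2, just with ever-smaller steps
```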

In the chart, both curves start at x = 4. The constant learning rate version moves faster at first but approaches the minimum with less stability. The decay version is slower but always controlled. This shows that decay does not change the update direction; it only changes the step size, and that change alone affects the behavior.

3. Momentum Methods

Gradient descent moves in the right direction but can be slow on flat surfaces. Momentum methods address this by adding inertia to the update.

They accumulate direction over time, creating rapid progress where the gradient remains consistent. This family includes standard Momentum, which builds up velocity, and Nesterov Momentum, which anticipates the next position to reduce overshoot.

3.1 Standard Momentum

Standard Momentum introduces the concept of inertia into the learning process. Instead of reacting only to the current gradient, the update keeps a memory of previous gradients in the form of a velocity term:

velocity = 0.9*velocity + 2*(x – 2)
x = x – lr * velocity

This method speeds up learning when the gradient remains constant over multiple iterations, which is especially useful for flat or shallow surfaces.

However, the same inertia that builds velocity can also cause overshooting, which creates oscillations around the target.
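In code, the velocity is one extra state variable carried across iterations. This is a minimal sketch; the 0.9 factor is the momentum coefficient from the formulas above, while lr and the iteration count are illustrative:

```python
x = 4.0
velocity = 0.0
lr, beta = 0.1, 0.9   # beta is the momentum coefficient
for _ in range(200):
    gradient = 2 * (x - 2)
    velocity = beta * velocity + gradient   # accumulate direction over time
    x = x - lr * velocity                   # step along the accumulated velocity
# x overshoots and oscillates around 2 at first, then settles at the minimum
```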

3.2 Nesterov Momentum

Nesterov Momentum is a refinement of the previous method. Instead of evaluating the gradient at the current location, the method first estimates where the next location will be, and then evaluates the gradient at that anticipated location:

velocity = 0.9*velocity + 2*((x – 0.9*velocity) – 2)
x = x – lr * velocity

This look-ahead behavior reduces the overshooting that can occur with standard Momentum, resulting in a smoother path with fewer oscillations. It keeps the speed advantage while adding a measure of control.
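A sketch of the same idea in Python. Note one deliberate difference: since the position update is x − lr·velocity, the look-ahead point below is x − lr·beta·velocity, the common formulation, whereas the spreadsheet formula measures the look-ahead directly in velocity units:

```python
x = 4.0
velocity = 0.0
lr, beta = 0.1, 0.9
for _ in range(200):
    lookahead = x - lr * beta * velocity   # where the next step would roughly land
    gradient = 2 * (lookahead - 2)         # gradient at the anticipated point
    velocity = beta * velocity + gradient
    x = x - lr * velocity
# the look-ahead damps the oscillations of plain momentum
```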

4. Adaptive Gradient Methods

Adaptive Gradient methods adjust the update based on the information gathered during training. Instead of using a constant learning rate or relying only on the current gradient, these methods adapt to the rate and behavior of recent gradients.

The goal is to reduce the step size when the gradients become unstable and to allow normal progress when the surface is predictable. This is useful in deep networks or on irregular loss surfaces, where the gradient magnitude can change from one step to the next.

4.1 RMSProp (Root Mean Square Propagation)

RMSProp stands for Root Mean Square Propagation. It keeps a running mean of squared gradients in a cache, and this value influences how the update is applied:

cache = 0.9*cache + 0.1*(2*(x – 2))²
x = x – lr / sqrt(cache) * 2*(x – 2)

The cache grows when the gradients are large, which reduces the update size. When the gradients are small, the cache shrinks, and the update stays close to a regular step. This makes RMSProp perform well in cases where the gradient scale is inconsistent, which is common in deep learning models.
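A sketch of the rule in Python, using the standard exponential moving average with a (1 − 0.9) weight on the new squared gradient, plus a tiny epsilon to avoid division by zero. One caveat for this noiseless toy: near the minimum, the scaling makes the step size nearly constant at about lr, so x circles close to 2 rather than converging exactly:

```python
x = 4.0
cache = 0.0
lr, beta, eps = 0.1, 0.9, 1e-8
for _ in range(200):
    gradient = 2 * (x - 2)
    cache = beta * cache + (1 - beta) * gradient ** 2  # running mean of squared gradients
    x = x - lr / (cache ** 0.5 + eps) * gradient       # normalize the step by gradient magnitude
# x ends up close to 2, oscillating within roughly one lr of the minimum
```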

4.2 Adam (Adaptive Moment Estimation)

Adam stands for Adaptive Moment Estimation. It combines the concept of Momentum with the adaptive behavior of RMSProp. It stores a moving average of gradients to capture direction, and a moving average of squared gradients to capture scale:

m = 0.9*m + 0.1*(2*(x – 2))
v = 0.999*v + 0.001*(2*(x – 2))²
x = x – lr * m / sqrt(v)

The variable m behaves like the velocity in Momentum, and the variable v behaves like the cache in RMSProp. Adam updates both values at every iteration, allowing it to speed up when progress is clear and slow down when the gradient becomes unstable. This balance between speed and control is what makes Adam a common choice for neural network training.
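A sketch of Adam in Python. The standard algorithm also divides m and v by bias-correction terms (1 − β^t) so that the first iterations are not dragged toward zero; those terms do not appear in the spreadsheet formulas above, so they are flagged in the comments:

```python
x = 4.0
m, v = 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    gradient = 2 * (x - 2)
    m = b1 * m + (1 - b1) * gradient        # moving average of gradients (direction)
    v = b2 * v + (1 - b2) * gradient ** 2   # moving average of squared gradients (scale)
    m_hat = m / (1 - b1 ** t)               # bias correction (standard Adam, not in the spreadsheet)
    v_hat = v / (1 - b2 ** t)
    x = x - lr * m_hat / (v_hat ** 0.5 + eps)
# x settles near the minimum at 2
```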

4.3 Other Adaptations

Adam and RMSProp are the most common adaptive methods, but they are not the only ones. Several related methods exist, each with a specific purpose:

  • AdaGrad adapts the learning rate based on the full history of squared gradients, but the rate can shrink too quickly.
  • AdaDelta corrects AdaGrad by limiting how much of the gradient history affects the update.
  • Adamax replaces the squared-gradient average with an infinity norm and can be more stable with very large gradients.
  • Nadam adds a Nesterov-style look-ahead to Adam.
  • RAdam stabilizes Adam during the first phase of training.
  • AdamW separates weight decay from the gradient update and is recommended in many modern frameworks.

These methods follow the same idea as RMSProp and Adam: adjust the update based on the behavior of the gradients. They are refinements or extensions of the concepts presented above, and belong to the same broad family of adaptive optimization algorithms.

Conclusion

All the methods in this article aim at the same goal: moving x to the minimum. The difference is how. Gradient Descent provides the basic rule. Momentum adds speed, and Nesterov improves control. RMSProp adapts the step to the gradient scale. Adam combines these ideas, and Learning Rate Decay adjusts the step size over time.

Each method solves a particular limitation of its predecessor. None of them replaces the foundation; they extend it. In practice, optimization is not a single rule but a set of methods that work together.

The goal remains the same. The movement becomes more efficient.
