Machine Learning

RoPE, Clearly Explained | Towards Data Science

There are many good resources online explaining the transformer architecture; however, Rotary Position Embedding (RoPE) is often misunderstood or skipped altogether.

RoPE was first introduced in the paper RoFormer: Enhanced Transformer with Rotary Position Embedding, and although the mathematical operations involved are relatively straightforward – primarily vector rotation and matrix multiplication – the real challenge lies in understanding how it works. I will try to provide a way to visualize what it does to vectors and explain why this method is so effective.

Throughout this post, I assume you have a basic understanding of transformers and how attention works.

RoPE Intuition

Since transformers lack an inherent understanding of token order and distance, researchers developed positional embeddings. Here's what positional embeddings need to accomplish:

  • Tokens that are close to each other should receive high attention weights, while distant tokens should receive low weights.
  • The absolute position within the sequence should not matter: if two tokens are close to each other, they should receive high attention weights regardless of whether they appear at the beginning or at the end of a long sequence.
  • To achieve these goals, relative position embeddings are more useful than absolute position embeddings.

Key insight: LLMs should focus on the relative position between two tokens; that is what really matters for attention.

If you understand these concepts, you're already halfway there.

Before RoPE

The original positional embedding from the seminal paper Attention Is All You Need was described by a closed-form equation and then added to the semantic (token) embedding. Mixing positional and semantic signals in the same hidden state was not a great idea: studies found that LLMs were memorizing (overfitting) positions rather than generalizing them, resulting in rapid degradation once the sequence length exceeded that of the training data. Still, using a closed-form formula makes sense, since it can be extended indefinitely, and RoPE keeps that property.

Another successful strategy in deep learning is: if you are not sure how to hand-craft a useful feature for a neural network, let the network learn it itself! That's what models like GPT-3 do – they learn their position embeddings. However, giving the model this much freedom increases the risk of overfitting and, in this case, imposes a hard limit on the context window (you cannot extend it beyond the trained context length).

More recent approaches focus on adjusting the attention pattern directly, so that nearby tokens receive high attention weights while distant tokens receive low ones. By moving positional information into the attention pathway, the hidden state is preserved for semantics. These strategies mainly try to modify Q and K in a clever way so that their dot products reflect proximity. Many papers have tried different approaches, but RoPE is the one that solved the problem best.

Rotation Intuition

RoPE modifies Q and K by applying rotations to them. One of the best properties of rotation is that it preserves a vector's norm (magnitude), which can carry semantic information.

Let q be the query projection of one token and k the key projection of another. For tokens that are close in the text, a small relative rotation is applied, while tokens that are far apart receive a large relative rotation.

Think of two parallel vectors – any relative rotation pushes them further apart. That's exactly what we want.

Image by author: RoPE Rotation Animation

Now, here's a potentially confusing situation: if two vectors are already far apart, a rotation can bring them closer together. That's not what we want! They are rotated by a large angle because they are far apart in the text, so they should not receive high attention weights. Why does this still work?

  • In 2D, there is only one plane of rotation (xy). You can only rotate clockwise or counterclockwise.
  • In 3D, there are many possible planes of rotation, making it much less likely that a rotation brings two vectors closer together.
  • Modern models work in very high-dimensional spaces (10k+ dimensions), making this even less likely.

Remember: in deep learning, probabilities are what matter! It is acceptable to be wrong occasionally, as long as the odds are small. The small experiment below illustrates this.
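To make the probability argument concrete, here is a small NumPy experiment (my own sketch, not from the post or the RoFormer paper): apply a random, large pairwise rotation – the kind a distant token would get – to one of two random vectors and count how often they accidentally end up looking very similar. The dimensions and the 0.9 similarity threshold are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_rotate(x, angles):
    """Rotate each consecutive (even, odd) dimension pair of x by its own angle."""
    out = np.empty_like(x)
    cos, sin = np.cos(angles), np.sin(angles)
    out[0::2] = cos * x[0::2] - sin * x[1::2]
    out[1::2] = sin * x[0::2] + cos * x[1::2]
    return out

def spurious_closeness_rate(d, trials=5_000, threshold=0.9):
    """Fraction of trials where a large random rotation makes two random vectors look very similar."""
    hits = 0
    for _ in range(trials):
        q = rng.standard_normal(d)
        k = rng.standard_normal(d)
        angles = rng.uniform(0.0, 2.0 * np.pi, size=d // 2)  # "far apart in the text"
        q_rot = pairwise_rotate(q, angles)
        # Rotation preserves the norm of q, so we can reuse it in the denominator.
        sim = (q_rot @ k) / (np.linalg.norm(q) * np.linalg.norm(k))
        hits += sim > threshold
    return hits / trials

for d in (2, 8, 128):
    print(f"d={d}: spurious high similarity in {spurious_closeness_rate(d):.2%} of trials")
# The rate drops quickly as the dimensionality grows.
```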

Angle of Rotation

The rotation angle depends on two factors: the absolute token position m and the dimension index i. Let's examine each one.

Absolute Token Position m

The rotation angle grows as the absolute position m of the token increases.

I know what you're thinking: “m is an absolute position, but didn't you say that relative positions are what matter?”

Here's the trick: consider a 2D plane where you rotate one vector by α and another by β. The angular difference between them changes by α - β. The absolute values of α and β do not matter; only their difference does. So for two tokens at positions m and n, rotation changes the angle between them in proportion to m - n.

Image by author: Relative distance after rotation

For simplicity, we can assume that we only rotate q (this is mathematically equivalent, since we care about the relative angle rather than the absolute coordinates).
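Here is a tiny NumPy check of both points (a sketch of my own; the vectors and the θ value are arbitrary): rotating q by mθ and k by nθ gives the same dot product as rotating only q by (m - n)θ, and the result depends only on the difference m - n.

```python
import numpy as np

def rotate2d(v, angle):
    """Rotate a 2D vector counter-clockwise by the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

theta = 0.1
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

for m, n in [(3, 1), (103, 101), (1003, 1001)]:       # same relative distance m - n = 2
    both = rotate2d(q, m * theta) @ rotate2d(k, n * theta)
    only_q = rotate2d(q, (m - n) * theta) @ k          # rotate only q by the relative angle
    print(m, n, round(both, 6), round(only_q, 6))      # identical values for all three pairs
```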

Hidden State Index i

Instead of applying the same rotation to all hidden-state dimensions, RoPE processes two dimensions at a time, using a different rotation angle for each pair. In other words, it splits a long vector into multiple 2D pairs that can each be rotated by a different angle.

We rotate the dimensions of the hidden state differently – the rotation is large when i is low (start of the vector) and small when i is high (end of the vector).
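To make the mechanics concrete, here is a short NumPy sketch (my own illustration; the per-pair angles are made-up values, and the exact formula appears later in the post): the hidden-state vector is split into 2D pairs, and pair i is rotated by m times its own angle, large for early pairs and small for later ones.

```python
import numpy as np

def rope_rotate(x, m, thetas):
    """Rotate each (2i, 2i+1) pair of x by m * thetas[i]; len(thetas) == len(x) // 2."""
    angles = m * thetas
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = cos * x[0::2] - sin * x[1::2]
    out[1::2] = sin * x[0::2] + cos * x[1::2]
    return out

d = 8
# Illustrative per-pair angles only: large for early pairs (low i), tiny for later pairs (high i).
thetas = np.array([1.0, 0.3, 0.05, 0.001])

x = np.ones(d)
print(rope_rotate(x, m=5, thetas=thetas))   # early pairs have moved a lot, later pairs barely
```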

Understanding the mechanism is straightforward, but understanding why we need it requires a little more explanation:

  • It lets the model choose which parts of the embedding have short-range or long-range effects.
  • Consider vectors in 3D (x, y, z).
  • The x and y axes represent the early dimensions (low i), which rotate quickly. Tokens that encode their information mainly in x and y must be very close to attend to each other strongly.
  • The z axis, where i is high, rotates only slightly. Tokens that encode their information mainly in z can attend to each other even when they are far apart.
Image by author: We apply a rotation in the xy plane. Two vectors that encode information mainly in z stay close despite the rotation (these tokens attend to each other despite the long distance!).
Image by author: Two vectors encoding information mainly in x and y end up far apart (these tokens are too far apart in the text and shouldn't attend to each other).

This mechanism captures the complex nuances of human language – pretty cool, right?

Again, I know what you're thinking: “after enough rotation, the vectors wrap around and get close again”.

That's fine, but here's why it still works:

  1. We visualize it in 3D, but this actually happens in much higher dimensions.
  2. While some dimension pairs are getting closer, others that rotate more slowly keep getting further apart. Hence the importance of rotating different dimensions by different angles.
  3. RoPE is not perfect – due to its rotational nature, occasional bumps in similarity at long distances are possible. See the theoretical curve from the original authors:
Source: Su et al., 2021. Theoretical curve provided by the authors of the RoFormer paper.

The theoretical curve has some crazy bumps, but in practice I found it to be very well behaved:

Image by author: Distances from 0 to 500.

One idea that occurred to me was clipping the rotation angle so that the similarity decreases strictly with increasing distance. I've seen clipping used in other techniques, but not in RoPE.

Keep in mind that the cosine similarity tends to rise again (albeit slowly) once the distance grows beyond our base value (later you will see exactly where this base appears in the formula). The simple solution here is to increase the base, or to let techniques like local or windowed attention take care of it for you.

Image by author: Extended to the 50k range.
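As a rough proxy for these plots (my own sketch, not the author's code): for a vector whose dimension pairs all have equal norm, the cosine similarity between it and its RoPE-rotated copy at relative distance m reduces to the average of cos(m·θᵢ) over the pairs. The d_model of 128 and the larger base of 500,000 are arbitrary values chosen only to compare the decay behaviour.

```python
import numpy as np

def decay_proxy(distances, d_model=128, base=10_000):
    """Average of cos(m * theta_i) over dimension pairs, for each relative distance m.

    This equals the cosine similarity between a vector with equal-norm pairs and
    its RoPE-rotated copy at relative distance m.
    """
    i = np.arange(1, d_model // 2 + 1)
    theta = base ** (-2.0 * (i - 1) / d_model)
    return np.array([np.mean(np.cos(m * theta)) for m in distances])

distances = np.arange(0, 50_001, 1_000)
for base in (10_000, 500_000):
    curve = decay_proxy(distances, base=base)
    # With the larger base, the similarity decays more slowly over the same distances.
    print(base, np.round(curve[[0, 1, 10, 50]], 3))
```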

Key point: the LLM learns to route long-range and short-range semantic influence into different dimensions of q and k.

Here are some practical examples of long-range and short-range dependencies:

  • An LLM processes Python code where the first transformation is applied to the data frame df. This important information should carry over a long distance and influence the embeddings of the downstream df tokens.
  • Adjectives usually modify nearby nouns. In “A beautiful mountain stretches beyond the valley”, the adjective beautiful directly describes mountain, not valley, so it should primarily affect the mountain embedding.

The Angle Formula

Now that you understand the concepts and have a solid intuition, here is the math. The rotation angle is defined as:

\[\text{angle} = m \times \theta\]
\[\theta = 10{,}000^{-2(i-1)/d_{\text{model}}}\]

  • m is the absolute position of the token
  • i ∈ {1, 2, …, d/2} indexes the dimension pairs of the hidden state; since we process two dimensions at a time, it runs up to d/2 rather than d
  • d_model is the hidden-state size (e.g., 4,096)

Note that if:

\[i = 1 \Rightarrow \theta = 1 \quad \text{(high rotation)}\]
\[i = d/2 \Rightarrow \theta \approx 1/10{,}000 \quad \text{(low rotation)}\]
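Putting the formula together, here is a minimal NumPy sketch (my own illustration of single-vector RoPE; real implementations precompute the angles and operate on batched tensors): it computes θ for each pair, rotates q and k at their positions, and checks that the attention score depends only on the relative distance.

```python
import numpy as np

def rope_thetas(d_model, base=10_000):
    """theta_i = base^(-2(i-1)/d_model) for i = 1 .. d_model/2."""
    i = np.arange(1, d_model // 2 + 1)
    return base ** (-2.0 * (i - 1) / d_model)

def apply_rope(x, m, thetas):
    """Rotate each (even, odd) dimension pair of x by m * theta_i."""
    angles = m * thetas
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = cos * x[0::2] - sin * x[1::2]
    out[1::2] = sin * x[0::2] + cos * x[1::2]
    return out

d_model = 64
thetas = rope_thetas(d_model)
print(thetas[0], thetas[-1])   # ~1.0 (fast rotation) and ~1e-4 (slow rotation)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(d_model), rng.standard_normal(d_model)

# Same relative distance (7), different absolute positions: the scores match.
score_a = apply_rope(q, 10, thetas) @ apply_rope(k, 3, thetas)
score_b = apply_rope(q, 1010, thetas) @ apply_rope(k, 1003, thetas)
print(np.isclose(score_a, score_b))   # True
```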

Conclusion

  • We have to find smarter ways to introduce knowledge to LLMs rather than letting them learn everything independently.
  • We do this by providing the appropriate functional forms that the neural network needs to process the data – attention and rotation are good examples.
  • Closed-form equations can be extended indefinitely, since you don't need to learn an embedding for each position.
  • This is why RoPE offers excellent sequence length flexibility.
  • A very important feature: attention weights decrease as relative distances increase.
  • This follows the same intuition as the earlier approaches that encode position by modifying the attention pattern directly.
