Views of Transformer Attention: patterns, messages, the residual stream … and LSTMs

This post is about a different way of looking at how attention passes information between tokens in a transformer. Here, I will dig deeper into this idea, show how it connects to ideas from LSTMs, and describe how it opens new doors for understanding.
To recap: transformer attention relies on a series of learned matrices, including the query (Q), key (K), value (V), and output projection (O) matrices. Traditionally, each head computes attention independently, the results are concatenated, and the concatenation is projected by O. But it is mathematically equivalent, and more illuminating, to see the final projection O as being applied per head (compared with the naive view of concatenating the heads and then projecting). This subtle change means the heads stay independent and separate all the way to the end.
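To make the per-head-O claim concrete, here is a minimal numpy sketch (shapes and variable names are my own, purely illustrative) showing that concatenating head outputs and projecting once with a big W_O is identical to giving each head its own slice of W_O and summing the results:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads

# one output per head, shape (n_heads, n_tokens, d_head)
head_outputs = rng.standard_normal((n_heads, n_tokens, d_head))
W_O = rng.standard_normal((n_heads * d_head, d_model))

# traditional view: concatenate the heads, then project once with W_O
concat = np.concatenate([head_outputs[h] for h in range(n_heads)], axis=-1)
out_concat = concat @ W_O

# per-head view: each head uses its own slice of W_O, and the results are summed
out_sum = sum(
    head_outputs[h] @ W_O[h * d_head:(h + 1) * d_head]
    for h in range(n_heads)
)

assert np.allclose(out_concat, out_sum)  # the two views give the same answer
```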
Patterns and messages
A short analogy for Q, K and V: each matrix is a linear projection of the embeddings E. The Q vectors of the tokens can be thought of as asking a question, the K vectors as keys (like the keys of a hash map), and the actual information as the content stored in V.
In effect, Q and K determine relevance, and V holds content. Together they tell each token where it should look, and how much. Now let's see how treating the heads as independent leads to a view in which each head computes two independent things: patterns and messages.
The steps of attention (a runnable sketch follows the equation below):
- Multiply the embedding matrix E with W_Q to get the query vectors Q. Similarly, get the key vectors K and value vectors V by multiplying with W_K and W_V.
- Multiply Q with Kᵀ. From the traditional viewpoint, this is seen as determining which other context tokens are most relevant to the current token.
- Apply softmax. This ensures the relevance scores computed in the previous step sum to 1, giving a weighted importance of the other tokens to the current token.
- Multiply with V. This gives an enriched representation of the token that incorporates information from the context tokens relevant to it.
- Finally, the result is projected back into the model's embedding space using O.
Putting it together, the per-head attention equation is: softmax(QKᵀ) V O
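The list maps directly onto a few lines of numpy. This is a minimal single-head sketch under assumed shapes; for clarity it follows the equation above and omits the usual 1/√d scaling and causal masking that real implementations add:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 8

E = rng.standard_normal((n_tokens, d_model))   # token embeddings
W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))
W_O = rng.standard_normal((d_head, d_model))

Q, K, V = E @ W_Q, E @ W_K, E @ W_V   # step 1: project the embeddings
scores = Q @ K.T                      # step 2: relevance of every token to every other token
weights = softmax(scores)             # step 3: each row sums to 1
mixed = weights @ V                   # step 4: weighted mix of value vectors
out = mixed @ W_O                     # step 5: project back to model space
print(out.shape)                      # (5, 16)
```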
Now, instead of reading this as ((softmax(QKᵀ)) V) O, the mathematically equivalent way to read it is (softmax(QKᵀ)) (V O), where softmax(QKᵀ) creates a pattern and V O creates a message. Why? Because it cleanly separates two concepts:
Messages (VO): determine what to transfer (content).
Patterns (QKᵀ): determine where to look (relevance).
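The regrouping is nothing more than associativity of matrix multiplication, so the two readings are numerically identical. A tiny check, with placeholder matrices standing in for the pattern, the values, and the output projection:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))   # stands in for softmax(QKᵀ), the pattern
V = rng.standard_normal((5, 8))   # value vectors
O = rng.standard_normal((8, 16))  # output projection

traditional = (A @ V) @ O   # mix the values first, then project
regrouped = A @ (V @ O)     # form the message VO first, then apply the pattern

assert np.allclose(traditional, regrouped)  # associativity: same result either way
```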
Digging deeper, remember that Q and K are themselves products of the embedding matrix E. Therefore, we can also write QKᵀ as:
(E W_Q)(E W_K)ᵀ = E (W_Q W_Kᵀ) Eᵀ
The equivalent interpretation treats W_Q W_Kᵀ as a single pattern matrix. Here, E W_Q W_Kᵀ can be thought of as producing the patterns that are compared against the embeddings in the other E, yielding the scores that feed the softmax. Essentially, this reframes the attention calculation as pattern matching and gives us a direct relationship between the embeddings themselves.
Similarly, VO can be seen as E (W_V W_O): each head's view of the values, taken from the embeddings and projected straight back into the model space. This too gives a direct relationship between the embeddings and the output, instead of treating attention as a multi-step pipeline. Another difference: while the traditional view says the information contained in V is pulled out using the queries in Q, the new view lets us think of the information as being broadcast by the tokens themselves, and merely weighted by the patterns.
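Both foldings can be checked directly. In the sketch below (shapes assumed, names my own), W_Q W_Kᵀ plays the role of the pattern matrix and W_V W_O the role of the message matrix, so everything is expressed as embeddings acting on embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 8

E = rng.standard_normal((n_tokens, d_model))
W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))
W_O = rng.standard_normal((d_head, d_model))

# pattern matrix: a single d_model x d_model map comparing embeddings to embeddings
pattern_matrix = W_Q @ W_K.T
assert np.allclose((E @ W_Q) @ (E @ W_K).T, E @ pattern_matrix @ E.T)

# message matrix: a single d_model x d_model map from an embedding to the message it broadcasts
message_matrix = W_V @ W_O
assert np.allclose((E @ W_V) @ W_O, E @ message_matrix)
```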
Finally, to summarize in the patterns-and-messages terminology: each token uses the patterns it matches to determine which messages to deliver to the tokens that follow.

What this makes possible: the residual stream
In my previous article we saw a more accurate way to think about the heads, and this time, by expressing attention directly in terms of the embeddings, we can view each operation as additive to, rather than a transformation of, the original embedding. The residual connections in transformers can then be reframed, in a mathematically equivalent way, as a residual stream carrying the embeddings forward, with components such as the attention heads and MLPs reading from it, doing something, and adding their result back to the embeddings. This lets us treat each operation as an update to a persistent memory, not a rewrite of it. The view is much easier to reason about, while still keeping full mathematical equivalence. More on this here.
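A minimal sketch of the residual-stream reading, with purely illustrative stand-ins for the heads and the MLP: each component reads the current stream, computes something, and adds its contribution back, so the embedding is never overwritten, only updated:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 16

def make_block(d_model, rng, scale=0.1):
    # stand-in for an attention head or MLP: read the stream, return an update to add
    W = scale * rng.standard_normal((d_model, d_model))
    return lambda x: x @ W

heads = [make_block(d_model, rng) for _ in range(4)]
mlp = make_block(d_model, rng)

stream = rng.standard_normal((n_tokens, d_model))  # initial token embeddings

for head in heads:
    stream = stream + head(stream)   # each head reads the stream and adds an update
stream = stream + mlp(stream)        # the MLP reads the stream and adds another update

# 'stream' is still the same persistent memory, accumulated upon rather than replaced
```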

How does this relate to LSTMs?

A refresher: LSTMs, or long short-term memory networks, are a type of RNN designed to handle the vanishing-gradient problem of plain RNNs by retaining information over longer spans. An LSTM cell (seen above) has two states: the cell state c, which serves as long-term memory, and the hidden state h, which serves as short-term memory.
It also has gates (forget, input, and output) that control the flow of information into and out of the cell. Intuitively, the forget gate acts as a lever deciding how much of the long-term memory to keep or discard; the input gate acts as a lever deciding how much of the current input (combined with the hidden state) to add to long-term memory; and the output gate acts as a lever deciding how much of the transformed long-term memory to send forward into the hidden state of the next step.
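For reference, here is one LSTM step written out in plain numpy, in a standard formulation with assumed weight names, with the three gates called out explicitly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold stacked parameters for the
    forget (f), input (i), candidate (g) and output (o) computations."""
    z = x @ W + h_prev @ U + b
    f, i, g, o = np.split(z, 4, axis=-1)
    f = sigmoid(f)            # forget gate: how much long-term memory to keep
    i = sigmoid(i)            # input gate: how much of the new candidate to write
    o = sigmoid(o)            # output gate: how much of the transformed memory to expose
    g = np.tanh(g)            # candidate content derived from the current input
    c = f * c_prev + i * g    # update the long-term memory (cell state)
    h = o * np.tanh(c)        # expose part of it as short-term memory (hidden state)
    return h, c

# tiny usage example with assumed sizes
rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16
W = rng.standard_normal((d_in, 4 * d_hidden))
U = rng.standard_normal((d_hidden, 4 * d_hidden))
b = np.zeros(4 * d_hidden)
h, c = np.zeros(d_hidden), np.zeros(d_hidden)
x = rng.standard_normal(d_in)
h, c = lstm_step(x, h, c, W, U, b)
```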
The basic difference between an LSTM and a transformer is that the LSTM works sequentially, operating on a single token at a time, while the transformer operates on the whole sequence in parallel. But they are similar in that both are methods of updating a persistent state, especially when the transformer is viewed through this lens. The analogy, made concrete in the sketch after the list, is:
- The cell state is analogous to the residual stream; both act as the long-term memory that everything reads from and writes to.
- The input gate performs the same function as the patterns: a relevance or similarity score determines what information is appropriate to write for the current token; the only difference is that in a transformer this happens across all the tokens in the sequence at once.
- The output gate is analogous to the messages, and determines what content is emitted and how strongly.
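Putting the two update rules side by side makes the analogy concrete. The sketch below (my own notation, a rough correspondence rather than an exact equivalence) computes one head's additive write to the residual stream using the combined pattern and message matrices, and contrasts it, in a comment, with the LSTM cell update from the code above:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 8
E = rng.standard_normal((n_tokens, d_model))
W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))
W_O = rng.standard_normal((d_head, d_model))

# Transformer head as a state update on the residual stream:
# the pattern (analogous to the input gate) decides how much of each message
# (analogous to the gated output content) gets written into persistent memory.
pattern = softmax(E @ (W_Q @ W_K.T) @ E.T)   # who writes to whom, and how much
message = E @ (W_V @ W_O)                    # what each token would write
stream = E + pattern @ message               # persistent memory, updated additively

# LSTM counterpart for one step (see the cell code above):
#   c_t = f * c_prev + i * g    -- a gated, additive write into the cell state
```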
Looking at attention as patterns (QKᵀ) and messages (VO), and at the residual connections as a persistent residual stream, provides a powerful mental model of transformers. It not only sharpens your understanding of attention, but also situates transformers within a broader family of state-update data-processing paradigms, bringing them a step closer to the gated updates of LSTMs.



