Machine Learning

A simple implementation of the attention mechanism from scratch

The attention mechanism is often associated with the transformer architecture, but it was already used in RNNs, for example in machine translation (MT) tasks.

I won't go into the details of RNNs here, but attention helped these models mitigate the vanishing gradient problem and capture more long-range dependencies among words.

At a certain point, we understood that the only important thing was the attention mechanism itself, and that the entire RNN architecture was overkill. Hence, Attention is All You Need!

Self-Attention in Transformers

Classical attention indicates where words in the output sequence should focus attention in relation to the words in the input sequence. This is important in sequence-to-sequence tasks like MT.

Self-attention is a specific type of attention. It operates between any two elements in the same sequence. It provides information on how "related" the words are within the same sentence.

For a given token (or word) in a sequence, self-attention generates a list of attention weights corresponding to all the other tokens in the sequence. This process is applied to each token in the sentence, obtaining a matrix of attention weights (as in the picture).

This is the general idea; in practice things are a bit more complicated, because we want to add many learnable parameters to our neural network. Let's see how.

K, V, Q Representations

Our model input is a sentence like "my name is Marcello Politi". With the process of tokenization, the sentence is converted into a list of numbers such as [2, 6, 8, 3, 1].

Before feeding the sentence into the transformer, we need to create a dense representation for each token.

How do we create this representation? We multiply each token by an embedding matrix. The matrix is learned during training.

Let's add some complexity now.

For each token, we create 3 vectors instead of one; we call these vectors: key, value and query. (We will see later how we create these 3 vectors.)

Conceptually these 3 vectors have a particular meaning:

  • The key vector represents the core information captured by the token
  • The value vector captures the full information of the token
  • The query vector is a question about the relevance of the token for the current task

So the idea is the following: we focus on a particular token i, and we want to ask what the importance of the other tokens in the sentence is with respect to token i.

This means that we take the vector q_i (we ask a question regarding token i), and we do some mathematical operations with all the other tokens k_j (j != i). This is like asking at a first glance which other tokens in the sequence are really important to understand the meaning of token i.

What is this magical mathematical operation?

We need to multiply (dot product) the query vector with the key vectors. We do this with each token k_j.

In this way, we obtain a score for each pair (q_i, k_j). We turn this list into a probability distribution by applying a softmax operation on it. Great, now we have obtained the attention weights!

With the attention weights, we know what the importance of each token k_j is for understanding token i. So now we multiply the value vector v_j associated with each token by its attention weight and sum up the resulting vectors. In this way we obtain the final context-aware vector for token i.

If we are computing the contextual dense vector of token_1, we compute:

z_1 = a_11 * v_1 + a_12 * v_2 + … + a_15 * v_5

Where a_1j are the computed attention weights, and v_j are the value vectors.
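
To make this concrete, here is a minimal sketch of that weighted sum in PyTorch. The q, k and v vectors below are random placeholders (in a real model they come from learned projections, as discussed next); the shapes follow the 5-token, 16-dimensional example used in this post.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 5, 16               # 5 tokens, 16 numbers per vector (illustrative sizes)
q = torch.randn(seq_len, d)      # query vectors (random stand-ins)
k = torch.randn(seq_len, d)      # key vectors
v = torch.randn(seq_len, d)      # value vectors

scores_1 = q[0] @ k.T                 # dot products (q_1, k_j) for every token j
a_1 = F.softmax(scores_1, dim=-1)     # attention weights a_1j, they sum to 1
z_1 = a_1 @ v                         # z_1 = a_11*v_1 + ... + a_15*v_5
print(a_1.shape, z_1.shape)           # torch.Size([5]) torch.Size([16])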

Done! Almost…

I didn't cover how we obtain the vectors k, v and q for each token. We need to define some matrices W_k, W_v and W_q so that when we multiply:

  • token * W_k -> k
  • token * W_q -> q
  • token * W_v -> v

These 3 matrices are initialized randomly and learned during training; this is why we have so many parameters in modern models such as LLMs.
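
As a quick sketch of this idea (sizes and initialization are purely illustrative, not the exact setup used later in the hands-on part):

import torch

torch.manual_seed(0)
d = 16
token = torch.randn(d)    # dense representation of one token

# randomly initialized, learnable projection matrices
W_k = torch.randn(d, d, requires_grad=True)
W_q = torch.randn(d, d, requires_grad=True)
W_v = torch.randn(d, d, requires_grad=True)

k = token @ W_k    # key vector
q = token @ W_q    # query vector
v = token @ W_v    # value vector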

Multi-Head Self-Attention in Transformers (MHSA)

Are we sure that the previous mechanism is able to capture all the important relationships among tokens (words) and create dense, context-aware vectors for those tokens?

It might actually not always work. What if, to mitigate the error, we re-ran the whole thing 2 times with new W_q, W_k and W_v matrices and somehow merged the 2 resulting dense vectors? In this way, maybe one self-attention head managed to capture one particular relationship while the other head captured other relationships.

Well, this is exactly what happens in MHSA. The case we just discussed contains two heads, because it has two sets of W_q, W_k and W_v matrices. We can have even more heads: 4, 8, 16, etc.

The only complicated thing is that all these heads are managed in parallel; we process them all in the same computation using tensors.

The way we merge the dense vectors of each head is simple: we concatenate them (hence the dimension of each vector must be smaller, so that after concatenation we get back the original dimension we wanted), and we pass the obtained vector through another learnable matrix W_output.

Hands-on

Suppose you have a sentence. After tokenization, each token (a word for simplicity) corresponds to an index (number):
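
A toy example of this mapping (the vocabulary below is made up just to reproduce the indices mentioned earlier; a real tokenizer is of course more involved):

sentence = "my name is marcello politi"

# toy word-to-index vocabulary, chosen only to match the example indices
vocab = {"marcello": 1, "my": 2, "politi": 3, "name": 6, "is": 8}
tokenized_sentence = [vocab[word] for word in sentence.split()]
print(tokenized_sentence)   # [2, 6, 8, 3, 1]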

Before feeding the sentence into the transformer, we need to create a dense representation for each token.

How do we create these representations? We multiply each token by an embedding matrix. This matrix is learned during training.

Let's build this embedding matrix.

If we multiply our tokenized sentence by the embedding matrix, we obtain a dense representation of dimension 16 for each token.
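
Here is a minimal version of this step, assuming a vocabulary size of 10 and an embedding dimension of 16 (the vocabulary size is an arbitrary choice for this example):

import torch

torch.manual_seed(0)
vocab_size, d = 10, 16
embed = torch.nn.Embedding(vocab_size, d)        # embedding matrix, learned during training

token_ids = torch.tensor([2, 6, 8, 3, 1])        # the tokenized sentence
sentence_embed = embed(token_ids).detach()       # shape [5, 16]: one dense vector per token
print(sentence_embed.shape)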

To use the attention mechanism we need to create 3 new matrices: w_q, w_k and w_v. When we multiply a token by w_q we obtain the query vector q for that token. The same holds for w_k and w_v.
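
A sketch of these matrices, continuing from the sentence_embed tensor above (in a real model they would be learnable parameters rather than fixed random tensors):

torch.manual_seed(0)
d = 16
w_q = torch.randn(d, d)
w_k = torch.randn(d, d)
w_v = torch.randn(d, d)

query_1 = sentence_embed[0] @ w_q   # query vector of token 1, shape [16]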

Attention Weights

Now let's compute the attention weights for the first input token of the sentence.

We need to multiply the query vector of token 1 (query_1) with all the key vectors.

So we first need to compute all the key vectors (key_1, key_2, …, key_5). But wait, we can compute them all at the same time by multiplying the whole embedded sentence by the w_k matrix.

Let's do the same thing with the values.
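
Continuing from the previous snippet, something like:

keys   = sentence_embed @ w_k    # [5, 16], one key vector per token
values = sentence_embed @ w_v    # [5, 16], one value vector per token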

Let's code the first part of the formula.

import torch.nn.functional as F
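
A possible way to write it, continuing with the query_1 and keys tensors from the snippets above (the scaling by the square root of the key dimension is the standard trick also discussed later in the multi-head section):

import math

d_k = keys.shape[-1]
scores_1 = (query_1 @ keys.T) / math.sqrt(d_k)      # dot products of query_1 with every key
attention_weights_1 = F.softmax(scores_1, dim=-1)   # shape [5], sums to 1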

With the attention weights we know what the importance of each token is. So now we multiply the value vector associated with each token by its attention weight.

And we sum the results to obtain the final context-aware vector of token_1.
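
Which, with the tensors defined so far, boils down to a single line:

# weighted sum of the value vectors: z_1 = a_11*v_1 + ... + a_15*v_5
z_1 = attention_weights_1 @ values   # context-aware vector of token 1, shape [16]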

In the same way, we can compute the context-aware dense vectors of all the other tokens. We are always using the same matrices w_k, w_q and w_v; in this case we say that we are using one head.
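
For completeness, here is the same single-head computation for all tokens at once, still building on the tensors defined above:

import math

queries = sentence_embed @ w_q                              # [5, 16]
scores = (queries @ keys.T) / math.sqrt(keys.shape[-1])     # [5, 5] score matrix
attention_matrix = F.softmax(scores, dim=-1)                # attention weights for every token
z = attention_matrix @ values                               # [5, 16] context-aware vectors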

But we could have many triplets of matrices, i.e., many heads. That is why it is called multi-head self-attention.

The dense vectors produced for the input tokens by each head are then concatenated and linearly transformed to obtain the final dense vectors.

Implementing Multi-Head Self-Attention

Same steps as before…

We will define multi-head self-attention with h heads (say 4 heads for this example). Each head will have its own w_q, w_k and w_v matrices, and the outputs of the heads will be concatenated and passed through a linear layer.

Since the head outputs will be concatenated, and we want a final dimension of d, the dimension of each head needs to be d/h. Additionally, the concatenated vector will go through a linear transformation, so we need another matrix w_output, as you can see in the formula.

Since we have 4 heads, we want 4 copies of each matrix. Instead of keeping separate copies, we add a dimension, which is the same thing but lets us do everything in a single operation. (Imagine stacking the matrices on top of each other: it is the same thing.)
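
Concretely, a sketch of these stacked matrices could look like this (random tensors standing in for learnable parameters):

import torch

torch.manual_seed(0)
d, h = 16, 4
d_k = d // h            # dimension of each head: 16 / 4 = 4

# one stacked tensor per projection instead of 4 separate matrices per head
w_query = torch.randn(h, d, d_k)
w_key   = torch.randn(h, d, d_k)
w_value = torch.randn(h, d, d_k)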

To keep the code simple I am using torch.einsum. If you are not familiar with it, check out my blog post about it.

The einsum call torch.einsum('sd,hde->hse', sentence_embed, w_query) in PyTorch uses letters to describe how to multiply and rearrange the numbers. Here is what each part means:

  1. Input tensors:
    • sentence_embed with the notation 'sd':
      • s represents the number of words (sequence length), which is 5.
      • d represents the number of numbers per word (embedding size), which is 16.
      • The shape of this tensor is [5, 16].
    • w_query with the notation 'hde':
      • h represents the number of heads, which is 4.
      • d represents the embedding size, which again is 16.
      • e represents the new number size per head (d_k), which is 4.
      • The shape of this tensor is [4, 16, 4].
  2. Output tensor:
    • The output has the notation 'hse':
      • h represents 4 heads.
      • s represents 5 words.
      • e represents 4 numbers per head.
      • The shape of the output is [4, 5, 4].
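
Applying the same einsum pattern to the three stacked projection matrices (continuing from the sentence_embed and w_* tensors defined above):

queries = torch.einsum('sd,hde->hse', sentence_embed, w_query)   # [4, 5, 4]
keys    = torch.einsum('sd,hde->hse', sentence_embed, w_key)     # [4, 5, 4]
values  = torch.einsum('sd,hde->hse', sentence_embed, w_value)   # [4, 5, 4]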

The next einsum equation performs a dot product between the queries (hse) and the transposed keys (hek) to obtain scores of shape [h, seq_len, seq_len], where:

  • h -> the number of heads.
  • s and k -> the sequence length (number of tokens).
  • e -> the size of each head (d_k).

Dividing by the square root of d_k scales the scores down to keep the gradients stable. Softmax is then applied to obtain the attention weights:
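
Put together, this step might look like the following (the exact einsum indices are my reconstruction of the operation described above):

import math

# dot product between queries and transposed keys, per head -> [4, 5, 5]
scores = torch.einsum('hse,hek->hsk', queries, keys.transpose(1, 2))
attention_weights = torch.nn.functional.softmax(scores / math.sqrt(d_k), dim=-1)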

We now concatenate the outputs of all the heads for token 1.
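
One way to do it, continuing from the attention_weights and values tensors above:

# weighted sum of the value vectors, per head -> [4, 5, 4]
context = torch.einsum('hsk,hke->hse', attention_weights, values)

# take token 1 (index 0) from each head and concatenate the 4 pieces: [4, 4] -> [16]
token_1_concat = context[:, 0, :].reshape(-1)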

Finally, let's apply the last projection matrix w_output, as in the formula above.
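
Again with a random tensor standing in for the learnable w_output matrix:

w_output = torch.randn(d, d)
token_1_final = token_1_concat @ w_output   # final context-aware vector of token 1, shape [16]
print(token_1_final.shape)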

Final Thoughts

In this blog post I implemented a simple version of the attention mechanism. This is not how it is really implemented in modern frameworks, but my goal is to provide some intuition that allows anyone to understand how it works. In future articles I will go through the entire implementation of a transformer architecture.

Follow me on TDS if you like this article! 😁

💼 LinkedIn | 🐦 X (Twitter) | 💻 Website


Unless otherwise noted, all images are by the author
