Machine Learning

Transformers (and attention) are just additive machinery

Mechanistic interpretability is a new field in AI focused on understanding how neural networks work by reverse-engineering their internals, with the aim of translating them into human-understandable algorithms. This differs from traditional explainability techniques such as SHAP and LIME.

SHAP stands for SHapley Additive exPlanations. It computes the contribution of each feature to a model's prediction, both locally and globally, i.e., for one example or for the whole dataset. This lets the practitioner determine how important a feature is for a given use case. LIME (Local Interpretable Model-agnostic Explanations), by contrast, operates on a single example: it perturbs that example and fits a simple surrogate model to measure the consequences, treating the model itself as a black box. Both techniques therefore work at the feature level, giving us a description of how much each input feature affects the prediction.
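
As a minimal sketch of both techniques, here is a hypothetical toy example (the dataset and model below are stand-ins, not from any real use case), using the shap and lime packages:

```python
# Minimal sketch of feature-level attribution with SHAP and LIME.
# The data and model are hypothetical stand-ins.
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                     # 200 examples, 4 features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # outcome driven by features 0 and 1
model = RandomForestClassifier(random_state=0).fit(X, y)

# SHAP: additive per-feature contributions, local (per example) and global.
explainer = shap.Explainer(model.predict, X)
shap_values = explainer(X[:50])
print(np.abs(shap_values.values).mean(axis=0))    # global feature importance

# LIME: perturb one example and fit a simple local surrogate around it.
lime_explainer = LimeTabularExplainer(X, mode="classification")
explanation = lime_explainer.explain_instance(X[0], model.predict_proba)
print(explanation.as_list())                      # per-feature weights, this example
```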

Mechanistic interpretability, on the other hand, aims for understanding at a deeper level: it seeks to show how a specific feature is learned by the neural network, and how that feature emerges from and propagates through the network. This makes it capable of tracing a feature through the network and seeing how it affects the output.

SHAP and LIME answer the question "Which features drive the outcome?", while mechanistic interpretability answers "Which features do neurons encode, how does a feature arise, and how does it affect the network's output?"

Since general explainability is difficult for deep networks, this sub-field pairs well with deep models such as transformers. There are a few places where the mechanistic-interpretability view of a transformer differs from the conventional one; one of them is multi-head attention. As we will see, this difference is a reformulation that is mathematically equivalent to the concatenation described in "Attention Is All You Need", yet it opens up new possibilities.

But first, a refresher on the transformer architecture.

Transformer Architecture

Image by the author: Transformer architecture

These are the sizes we will work with:

  • batch_size b = 1;
  • sequence length s = 20;
  • vocab_size v = 50,000;
  • hidden_dim d = 512;
  • heads h = 8

This means that the dimensionality l of the per-head Q, K, V vectors is 512/8 = 64. (In case you don't remember the Query, Key, and Value metaphor: the token at a given position issues a query, which is matched against the keys of every position, and the resulting scores weight how much of each position's value gets mixed in.)
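
In code, these sizes look like this (a small setup sketch that the later snippets reuse):

```python
# The running example sizes from this article.
b, s, v, d, h = 1, 20, 50_000, 512, 8
l = d // h                  # per-head Q/K/V dimensionality: 512 / 8 = 64
```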

These are the steps that lead up to the attention output of the transformer. (A concrete example is walked through for better understanding. In the per-head rows, the h axis is shown as 1 because the operation is repeated for each of the h = 8 heads.)

Step | Operation | Input 1 (shape) | Input 2 (shape) | Output (shape)
1 | Input (one-hot tokens) | b×s×v (1 × 20 × 50,000) | N/A | b×s×v (1 × 20 × 50,000)
2 | Get embeddings | b×s×v (1 × 20 × 50,000) | v×d (50,000 × 512) | b×s×d (1 × 20 × 512)
3 | Add positional encodings | b×s×d (1 × 20 × 512) | N/A | b×s×d (1 × 20 × 512)
4 | Copy the embeddings to Q, K, V | b×s×d (1 × 20 × 512) | N/A | b×s×d (1 × 20 × 512)
5 | Linear projection, per head (h = 8) | b×s×d (1 × 20 × 512) | d×l (512 × 64) | b×h×s×l (1 × 1 × 20 × 64)
6 | Scaled dot product Q @ Kᵀ, per head | b×h×s×l (1 × 1 × 20 × 64) | l×s×h×b (64 × 20 × 1 × 1) | b×h×s×s (1 × 1 × 20 × 20)
7 | Softmax + dot product with V (computing attention), per head | b×h×s×s (1 × 1 × 20 × 20) | b×h×s×l (1 × 1 × 20 × 64) | b×h×s×l (1 × 1 × 20 × 64)
8 | Concat across all heads (h = 8) | b×h×s×l (1 × 1 × 20 × 64) | N/A | b×s×d (1 × 20 × 512)
9 | Linear projection (Wₒ) | b×s×d (1 × 20 × 512) | d×d (512 × 512) | b×s×d (1 × 20 × 512)
Table: shape transformations through the multi-head attention block of the transformer

Walking through the table in detail (a runnable sketch follows this list):

  1. We start with a single input sentence of 20 tokens, one-hot encoded to mark each word's position in the vocabulary. Shape (b×s×v): (1 × 20 × 50,000).
  2. We multiply this input by the embedding matrix Wₑ of shape (v×d) to get the embeddings. Shape (b×s×d): (1 × 20 × 512).
  3. Next, the positional-encoding matrix of the same shape is added to the embeddings.
  4. The resulting embeddings are copied into the matrices Q, K, and V. Shape (b×s×d): (1 × 20 × 512).
  5. Q, K, and V are each linearly transformed per head with their own learned weight matrices of shape (d×l): Wq, Wₖ, and Wᵥ, respectively (one copy for each of the h = 8 heads). Shape (b×h×s×l): (1 × 1 × 20 × 64), where h = 1 because the table shows the computation for each head.
  6. Next, we compute the scaled dot product of Q and Kᵀ per head. Shape (b×h×s×l) × (l×s×h×b) → (b×h×s×s): (1 × 1 × 20 × 20).
  7. This next step is the crucial one for understanding the mechanistic-interpretability way of looking at MHA (multi-head attention). After the softmax, we multiply QKᵀ with V per head. Shape (b×h×s×s) × (b×h×s×l) → (b×h×s×l): (1 × 1 × 20 × 64).
  8. Concat: here we concatenate the attention results from all heads along the last dimension. Shape (b×s×d): (1 × 20 × 512).
  9. This result is then linearly projected using another learned weight matrix Wₒ of shape (d×d). The final output has shape (b×s×d): (1 × 20 × 512).
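
Here is a minimal runnable NumPy sketch of steps 1 through 9 (random, untrained weights, so it illustrates the shapes and operations only; it computes all h = 8 heads at once, so the head axis shows 8 where the table's per-head view shows 1):

```python
# NumPy sketch of steps 1-9 with the running sizes (random weights, untrained).
import numpy as np

rng = np.random.default_rng(0)
b, s, v, d, h = 1, 20, 50_000, 512, 8
l = d // h                                        # 64 dims per head

# Steps 1-2: one-hot input times W_e (v x d); done as an equivalent embedding
# lookup. W_e is ~200 MB here; shrink v if memory is tight.
tokens = rng.integers(0, v, size=(b, s))
W_e = rng.standard_normal((v, d))
emb = W_e[tokens]                                 # (1, 20, 512)

# Step 3: add positional encodings of the same shape (random stand-ins here).
emb = emb + rng.standard_normal((s, d))

# Step 4: copy the embeddings into Q, K, V.
Q = K = V = emb

# Step 5: per-head linear projections with (d x l) matrices, one per head.
W_q, W_k, W_v = (rng.standard_normal((h, d, l)) for _ in range(3))
q = np.einsum("bsd,hdl->bhsl", Q, W_q)            # (1, 8, 20, 64)
k = np.einsum("bsd,hdl->bhsl", K, W_k)
v_heads = np.einsum("bsd,hdl->bhsl", V, W_v)

# Step 6: scaled dot product Q @ K^T per head -> (1, 8, 20, 20).
scores = np.einsum("bhsl,bhtl->bhst", q, k) / np.sqrt(l)

# Step 7: softmax over the key axis, then multiply with V per head.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
attn = np.einsum("bhst,bhtl->bhsl", weights, v_heads)   # (1, 8, 20, 64)

# Step 8: concatenate the heads along the last dimension -> (1, 20, 512).
concat = attn.transpose(0, 2, 1, 3).reshape(b, s, d)

# Step 9: output projection with W_o (d x d) -> (1, 20, 512).
W_o = rng.standard_normal((d, d))
out = concat @ W_o
print(out.shape)                                  # (1, 20, 512)
```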

Multi-Head Attention

Image by the author: Multi-head attention

Now, let's see how the mechanistic interpretability field views this, and why the two formulations are equivalent. On the right in the image above, you can see the rewritten multi-head attention module.

Instead of concatenating the attention outputs, we keep the heads separate: Wₒ is now split into per-head matrices of shape (l×d), each multiplied with its head's QKᵀV result of shape (b×h×s×l) to produce an output of shape (b×s×h×d): (1 × 20 × 1 × 512). Then we sum over the h dimension and end up with shape (b×s×d): (1 × 20 × 512).

From the table above, the last two steps are what change:

Step | Operation | Input 1 (shape) | Input 2 (shape) | Output (shape)
8 | Matrix multiplication, per head (h = 8) | b×h×s×l (1 × 1 × 20 × 64) | l×d (64 × 512) | b×s×h×d (1 × 20 × 1 × 512)
9 | Sum over heads (h dimension) | b×s×h×d (1 × 20 × 1 × 512) | N/A | b×s×d (1 × 20 × 512)
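
Here is a minimal NumPy sketch of these two rewritten steps (random values, and all h = 8 heads computed at once, whereas the table shows the per-head view):

```python
# Rewritten steps 8-9: per-head projection with W_o blocks, then a sum over heads.
import numpy as np

rng = np.random.default_rng(0)
b, s, h, l, d = 1, 20, 8, 64, 512

attn = rng.standard_normal((b, h, s, l))          # step 7 output (QK'V per head)
W_o_heads = rng.standard_normal((h, l, d))        # W_o split into (l x d) per-head blocks

# Step 8: multiply each head's output with its own (l x d) block -> (b, s, h, d).
per_head = np.einsum("bhsl,hld->bshd", attn, W_o_heads)
print(per_head.shape)                             # (1, 20, 8, 512)

# Step 9: sum over the heads (h) dimension -> (b, s, d).
out = per_head.sum(axis=2)
print(out.shape)                                  # (1, 20, 512)
```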

Side note: this "summing over" is reminiscent of how channels are handled in CNNs. In a CNN, each filter operates on the input, and then the results are summed across channels. The same happens here: each head can be seen as a channel, and the model learns a separate weight matrix for each head before summing at the end.

But why is project + sum mathematically equivalent to concat + project? In short, because the projection matrix Wₒ can simply be cut into per-head blocks (split along the d dimension to match each head).

Image by the author: Why the reformulation is equivalent

Let's focus on the head outputs H and their multiplication with Wₒ. From the picture above, each head now multiplies a vector of 64 dimensions with a weight matrix of shape (64 × 512). Let's denote the result by R and the head output by H.

For R₁,₁ we have this equation:

R₁,₁ = H₁,₁ × Wₒ₁,₁ + H₁,₂ × Wₒ₂,₁ + … + H₁,₆₄ × Wₒ₆₄,₁

Now suppose we had instead concatenated the heads to get H of shape (1 × 512), with the weight matrix of shape (512 × 512). Then the equation would be:

R₁,₁ = H₁,₁ × Wₒ₁,₁ + H₁,₂ × Wₒ₂,₁ + … + H₁,₅₁₂ × Wₒ₅₁₂,₁

So the part H₁,₆₅ × Wₒ₆₅,₁ + … + H₁,₅₁₂ × Wₒ₅₁₂,₁ would be added on top. But this part is exactly the contribution of the other heads, laid out in modulo-64 fashion. Put another way, without the concatenation, Wₒ₆₅,₁ plays the role of Wₒ₁,₁ for the second head, Wₒ₁₂₉,₁ plays the role of Wₒ₁,₁ for the third head, and so on, if we imagine each head's block of Wₒ stacked one after another. Therefore, with or without concatenation, summing over the heads yields the same result.
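
We can check this equivalence numerically (a sketch with random values; note that W_o.reshape(h, l, d) is exactly the "modulo-64" splitting described above):

```python
# Verify: concat-then-project equals per-head-project-then-sum.
import numpy as np

rng = np.random.default_rng(0)
b, s, h, l, d = 1, 20, 8, 64, 512

heads = rng.standard_normal((b, h, s, l))         # per-head attention outputs
W_o = rng.standard_normal((h * l, d))             # the usual (d x d) projection, d = h*l

# Standard view: concatenate heads along the last dim, then project once.
concat = heads.transpose(0, 2, 1, 3).reshape(b, s, h * l)
out_concat = concat @ W_o

# Rewritten view: split W_o into per-head (l x d) blocks, project, sum over heads.
W_o_heads = W_o.reshape(h, l, d)                  # rows 0-63 -> head 1, 64-127 -> head 2, ...
out_sum = np.einsum("bhsl,hld->bsd", heads, W_o_heads)

print(np.allclose(out_concat, out_sum))           # True
```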

In conclusion, this understanding lays the foundation for viewing transformers as additive models, in which the contribution of every component to the output can be traced back to the input. This idea opens up new possibilities, such as circuit tracing, as mechanistic interpretability calls it, which I will show in my next articles.


We have shown that this rewritten view of multi-head attention — keeping the per-head Q, K, V projections separate and summing the per-head outputs — is mathematically equivalent to the standard concatenation formulation. You can learn more on the blog here, along with the original paper introducing these ideas here.
