Static and Dynamic Attention: Implications for Graph Neural Networks | by Hunjae Timothy Lee | January, 2025

Graph Attention Network (GAT)
The Graph Attention Network (GAT), as introduced in [1], closely follows earlier work [3] in formulating its attention mechanism. The GAT design also bears many similarities to the now famous Transformer paper [4]; the two papers were published only months apart.
Attention in graphs is used to rank, or measure, the relative importance of all neighboring nodes (keys) with respect to each source node (query). These attention scores are computed for every node in the graph with respect to its corresponding neighbors. Node features, denoted h_i, undergo a linear transformation with a shared weight matrix W before the attention mechanism is applied. For the linearly transformed node features, raw attention scores are computed as shown in Equation (1). To obtain normalized attention scores, the softmax function is applied across each neighborhood as shown in Equation (2), which is similar to the attention computation in [4].
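Equations (1) and (2) appear as images in the original post; written out, the formulation from the GAT paper [1] that they refer to is (with N_i denoting the neighborhood of node i):

```latex
% Equation (1): raw attention score between query node i and neighbor (key) j
e_{ij} = a\!\left(\mathbf{W}h_i,\; \mathbf{W}h_j\right)

% Equation (2): softmax normalization over the neighborhood of node i
\alpha_{ij} = \mathrm{softmax}_j\!\left(e_{ij}\right)
            = \frac{\exp\!\left(e_{ij}\right)}{\sum_{k \in \mathcal{N}_i} \exp\!\left(e_{ik}\right)}
```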
In the GAT paper, the attention mechanism a(·) is a single-layer feedforward neural network parameterized by a learnable weight vector a and followed by a LeakyReLU non-linearity, as shown in Equation (3). The || symbol denotes concatenation. Note: the multi-head attention construct is intentionally omitted in this article because it does not concern the attention mechanism itself; both GAT and GATv2 make use of multi-head attention.
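Putting Equations (1)–(3) together, below is a minimal NumPy sketch of how the single-head GAT attention scores for one query node could be computed; the function and variable names are my own illustration, not code from [1].

```python
import numpy as np

def leaky_relu(x, negative_slope=0.2):
    return np.where(x > 0, x, negative_slope * x)

def gat_attention(h_query, h_neighbors, W, a):
    """Single-head GAT attention scores for one query node (Equations (1)-(3)).

    h_query:     (F,)    feature vector of the query node
    h_neighbors: (N, F)  feature vectors of its N neighbors (keys)
    W:           (F', F) shared linear transformation
    a:           (2F',)  learnable attention vector
    """
    Wh_q = W @ h_query                # transformed query, shape (F',)
    Wh_k = h_neighbors @ W.T          # transformed keys,  shape (N, F')
    # Equation (3): e_j = LeakyReLU(a^T [W h_query || W h_j])
    concat = np.concatenate([np.tile(Wh_q, (len(Wh_k), 1)), Wh_k], axis=1)
    e = leaky_relu(concat @ a)        # raw attention scores, Equation (1)
    # Equation (2): softmax over the neighborhood
    exp_e = np.exp(e - e.max())
    return exp_e / exp_e.sum()

# Example usage with random data (dimensions chosen arbitrarily)
rng = np.random.default_rng(0)
alpha = gat_attention(rng.normal(size=4), rng.normal(size=(5, 4)),
                      W=rng.normal(size=(8, 4)), a=rng.normal(size=16))
print(alpha, alpha.sum())  # five attention weights summing to 1
```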
As can be seen, the learnable attention parameter a is applied as a linear combination over the transformed node features Wh. As explained in the following sections, this setup, known as static attention, is also a key limiting factor for GAT, although for reasons that are not immediately apparent.
Static Attention
Consider the graph below, where node h1 is the query node with neighbors (keys) {h2, h3, h4, h5}.
Calculating the raw attention score between the query node h1 and the neighbor h2 following the formulation of GAT gives Equation (4).
As mentioned before, the learnable attention parameter a is linearly combined with the concatenated query and key nodes. This means that the contributions of a with respect to Wh1 and Wh2 can be separated by writing a = [a1 || a2]. Using a1 and a2, Equation (4) can be rearranged as shown in Equation (5).
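Equations (4) and (5) also appear as images in the original; written out, the decomposition described above is:

```latex
% Equation (4): raw attention score between the query h_1 and the key h_2
e(h_1, h_2) = \mathrm{LeakyReLU}\!\left(\mathbf{a}^{\top}\left[\mathbf{W}h_1 \,\|\, \mathbf{W}h_2\right]\right)

% Equation (5): splitting a = [a_1 || a_2] separates the query term from the key term
e(h_1, h_2) = \mathrm{LeakyReLU}\!\left(\mathbf{a}_1^{\top}\mathbf{W}h_1 + \mathbf{a}_2^{\top}\mathbf{W}h_2\right)
```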
When the raw attention scores of every neighbor are calculated with respect to the query node h1, as collected in Equation (6), a pattern begins to emerge.
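Equation (6) is shown as an image in the original post; following Equation (5), the scores it collects for this example graph are:

```latex
% Equation (6): raw attention scores for every neighbor of the query node h_1
e(h_1, h_2) = \mathrm{LeakyReLU}\!\left(\mathbf{a}_1^{\top}\mathbf{W}h_1 + \mathbf{a}_2^{\top}\mathbf{W}h_2\right)
e(h_1, h_3) = \mathrm{LeakyReLU}\!\left(\mathbf{a}_1^{\top}\mathbf{W}h_1 + \mathbf{a}_2^{\top}\mathbf{W}h_3\right)
e(h_1, h_4) = \mathrm{LeakyReLU}\!\left(\mathbf{a}_1^{\top}\mathbf{W}h_1 + \mathbf{a}_2^{\top}\mathbf{W}h_4\right)
e(h_1, h_5) = \mathrm{LeakyReLU}\!\left(\mathbf{a}_1^{\top}\mathbf{W}h_1 + \mathbf{a}_2^{\top}\mathbf{W}h_5\right)
```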
From Equation (6), it can be seen that the query term a1ᵀWh1 is repeated in every attention score e. This means that although the query term is technically included in the attention computation, it contributes the same amount to every neighbor's score and therefore does not affect their relative ordering. Only the key terms a2ᵀWhj determine the relative ordering of the attention scores with respect to each other.
This type of attention is termed static attention in [2]. This design means that the importance ranking of the neighbors is determined globally for all nodes, irrespective of the specific query node. This limitation prevents GAT from capturing nuanced relationships in settings where different nodes may prioritize different subsets of neighbors. As stated in [2], "[static attention] cannot model situations where different keys have different relevance to different queries".
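This can also be checked numerically. The small sketch below (hypothetical dimensions, randomly generated weights and features) scores the same set of neighbors against two different query nodes using the decomposition in Equation (5); the raw scores differ, but the ordering of the neighbors is identical because the query only adds a constant to every score:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N = 8, 5                                       # feature dimension, number of neighbors

W = rng.normal(size=(F, F))                       # shared linear transformation
a1, a2 = rng.normal(size=F), rng.normal(size=F)   # a = [a1 || a2]
neighbors = rng.normal(size=(N, F))               # shared set of keys

def raw_scores(h_query):
    # Equation (5): e(h_q, h_j) = LeakyReLU(a1^T W h_q + a2^T W h_j)
    query_term = a1 @ (W @ h_query)               # identical for every neighbor j
    key_terms = (neighbors @ W.T) @ a2            # one value per neighbor
    e = query_term + key_terms
    return np.where(e > 0, e, 0.2 * e)            # LeakyReLU

scores_q1 = raw_scores(rng.normal(size=F))        # two different query nodes...
scores_q2 = raw_scores(rng.normal(size=F))

# ...produce the same neighbor ranking: the ordering depends only on the keys.
print(np.argsort(scores_q1), np.argsort(scores_q2))
```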