Sparse Graph Navigation: A Distributed Q-Learning Method

You may have heard about the Small-World Experiment, conducted by Stanley Milgram in the 1960s. He designed an experiment in which a letter was given to a volunteer in the United States, with the instruction to forward it to a personal contact who might know another person – the target – in the same country. The recipient of the letter was then asked to forward it again, and so on, until the target was reached. Although most of the letters never reached the target, among those that did (survival bias at play here!), the average number of hops was about 6. “Six degrees of separation” has since become a cultural reference to how closely connected people are.
I'm still amazed at the thought that a person's ~10² contacts are enough to reach a random person in a network of ~10⁸ people, through only a few hops.
How is that possible? Heuristics.
Let's imagine that I am asked to send a letter to a target person in Finland¹.
Unfortunately, I have no contacts in Finland. On the other hand, I know someone who lived in Sweden for many years. Maybe they know Finnish people. If not, they may still have contacts in Sweden, which is a neighboring country. They are my best bet to get closer to my target. The point is, even though I don't know the topology of the social network beyond my own contacts, I can use rules of thumb to send the letter in the right direction.
If we take the point of view of the network nodes (the people involved in the experiment), their available actions are to transmit a message (the letter) along one of their outgoing edges (their personal contacts). This message-passing problem offers an opportunity to have fun with machine learning!
Nodes do not see the entire network topology. We can set up an environment that rewards them for sending a message along a short known path, while encouraging them to explore promising candidate paths. That sounds like a good use case for reinforcement learning, don't you think?
For those interested in using the code, you can access the repo here.
The problem
We will be given a directed graph whose edges are sparse, which means that the average number of edges coming out of a node is much smaller than the number of nodes. In addition, the edges will have associated costs. This additional feature generalizes the original Small-World Experiment setting, where each hop counts as a cost of 1.
The problem we will consider is to design a reinforcement learning algorithm that finds a path from an arbitrary start node to an arbitrary target node in the directed graph, with the lowest possible cost, if such a path exists. Deterministic solutions to this problem exist. For example, Dijkstra's algorithm finds the lowest-cost path from a start node to every other node in a directed graph. This will be useful to validate the results of our reinforcement learning algorithm, which is not guaranteed to find the optimal solution.
Q-Learning
Q-Learning is a reinforcement learning method in which an agent maintains a table of state-action pairs mapping to the expected discounted cumulative reward – the quality, hence the Q in Q-Learning. Through iterative trials, the table is updated until a stopping criterion is met. After training, the agent can choose the action (column of the Q-matrix) with the highest quality for a given state (row of the Q-matrix).
The update rule, given that the chosen action a_j leads from state s_i to state s_k with reward r, and given an estimate of the quality of the next state s_k, is:
\[ Q(i, j) \leftarrow (1 - \alpha)\, Q(i, j) + \alpha \left( r + \gamma \max_{l} Q(k, l) \right) \]
Equation 1: Q-Learning update rule.
In equation 1:
- α is the learning rate, which controls how quickly new estimates overwrite old quality values.
- γ is the discount factor, which controls how much weight future rewards get relative to immediate rewards.
- Q is the quality matrix. The row index i is the index of the originating state, and the column index j is the index of the chosen action.
In short, equation 1 says that the quality of the (state, action) pair is partially updated with a new quality estimate, made up of the sum of the immediate reward and the discounted maximum quality of the next state over its possible actions.
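To make equation 1 concrete, here is a minimal numerical sketch of a single update, with made-up values for the reward, the learning rate, and the discount factor:

import numpy as np

alpha, gamma = 0.1, 0.9   # learning rate and discount factor (arbitrary illustrative values)
Q = np.zeros((5, 3))      # toy quality matrix: 5 states, 3 actions, optimistic start at zero

i, j = 2, 1               # current state and chosen action
k, r = 4, -0.7            # resulting next state and (negative) reward

# Equation 1: blend the old estimate with the immediate reward plus the
# discounted best quality achievable from the next state.
Q[i, j] = (1 - alpha) * Q[i, j] + alpha * (r + gamma * Q[k, :].max())
print(Q[i, j])            # -0.07 after a single update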
In our problem setting, a natural choice for the state would be the pair (current node, target node), and the set of actions would be the set of nodes. The state set would then contain N² values, and the action set would contain N values, where N is the number of nodes. However, because the graph is sparse, a given node has only a small set of nodes as outgoing edges. This construction would lead to a Q-matrix where the majority of the N³ entries are never visited, consuming memory unnecessarily.
Distributed agents
A better use of resources is to distribute the agents. Each node can be thought of as an agent. The agent's state is the target node, so its Q-matrix has N rows and Nout columns (the number of outgoing edges for this particular node). With N agents, the total number of matrix entries is N²·Nout, lower than N³.
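As a quick sanity check on the savings, here is a back-of-the-envelope comparison, assuming an illustrative graph of 1,000 nodes with 3 outgoing edges per node (these numbers are not from the experiment below):

N, N_out = 1_000, 3                              # illustrative values
centralized_entries = N**2 * N                   # one big Q-matrix: (current, target) states x N actions
distributed_entries = N * N * N_out              # N agents, each with an [N x N_out] Q-matrix
print(centralized_entries, distributed_entries)  # 1_000_000_000 vs 3_000_000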
To summarize:
- We will train N agents, one for each node in the graph.
- Each agent will learn a Q-matrix of dimensions [N x Nout]. The number of outgoing edges (Nout) may vary from one node to another. For a sparsely connected graph, Nout << N.
- The row index of the Q-matrix corresponds to the agent's state, i.e., the target node.
- The column index of the Q-matrix corresponds to the outgoing edge chosen by the agent to route the message toward the target. Q[i, j] represents a node's estimate of the quality of forwarding the message along its jth outgoing edge, given target node i.
- When a node receives the message, its only input is the target node. Since the identity of the previous sender is not needed to decide where to route the message next, it is not included in the state.
The code
The main class, representing a node, is named QNode.
import random
import numpy as np

class QNode:
    def __init__(self, number_of_nodes=0, connectivity_average=0, connectivity_std_dev=0, Q_arr=None, neighbor_nodes=None,
                 state_dict=None):
        if state_dict is not None:
            self.Q = state_dict['Q']
            self.number_of_nodes = state_dict['number_of_nodes']
            self.neighbor_nodes = state_dict['neighbor_nodes']
        else:  # state_dict is None
            if Q_arr is None:
                self.number_of_nodes = number_of_nodes
                number_of_neighbors = connectivity_average + connectivity_std_dev * np.random.randn()
                number_of_neighbors = round(number_of_neighbors)
                number_of_neighbors = max(number_of_neighbors, 2)  # At least two out-connections
                number_of_neighbors = min(number_of_neighbors, self.number_of_nodes)  # Not more than N connections
                self.neighbor_nodes = random.sample(range(self.number_of_nodes), number_of_neighbors)  # [1, 4, 5, ...]
                self.Q = np.zeros((self.number_of_nodes, number_of_neighbors))  # Optimistic initialization: all rewards will be negative
                # q = self.Q[state, action]: state = target node; action = chosen neighbor node (converted to column index) to route the message to
            else:  # state_dict is None and Q_arr is not None
                self.Q = Q_arr
                self.number_of_nodes = self.Q.shape[0]
                self.neighbor_nodes = neighbor_nodes
The QNode class has three fields: the number of nodes in the graph, the list of outgoing edges, and the Q-matrix. The Q-matrix is initialized to zero. The rewards returned by the environment will be the negatives of the edge costs; therefore, the quality values will all be negative. For this reason, starting at zero is an optimistic initialization.
When the message reaches a QNode object, it selects one of its outgoing edges using the epsilon-greedy algorithm. When ε is small, the epsilon-greedy algorithm most of the time chooses the outgoing edge with the highest Q-value. Occasionally, it chooses an outgoing edge at random:
def epsilon_greedy(self, target_node, epsilon):
    rdm_nbr = random.random()
    if rdm_nbr < epsilon:  # Random choice
        random_choice = random.choice(self.neighbor_nodes)
        return random_choice
    else:  # Greedy choice
        neighbor_columns = np.where(self.Q[target_node, :] == self.Q[target_node, :].max())[0]  # [1, 4, 5]
        neighbor_column = random.choice(neighbor_columns)
        neighbor_node = self.neighbor_node(neighbor_column)
        return neighbor_node
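The epsilon_greedy() method (and update_Q() further below) relies on two small helpers that convert between a column index of the Q-matrix and the corresponding neighbor node identity. They are not shown in the excerpts above; a plausible sketch, based on how self.neighbor_nodes is built in __init__, could look like this:

def neighbor_node(self, neighbor_column):
    # Map a column index of the Q-matrix to the corresponding neighbor node id
    return self.neighbor_nodes[neighbor_column]

def neighbor_column(self, neighbor_node):
    # Map a neighbor node id back to its column index in the Q-matrix
    return self.neighbor_nodes.index(neighbor_node)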
The other class represents the graph itself and is called QGraph.
class QGraph:
    def __init__(self, number_of_nodes=10, connectivity_average=3, connectivity_std_dev=0, cost_range=[0.0, 1.0],
                 maximum_hops=100, maximum_hops_penalty=1.0):
        self.number_of_nodes = number_of_nodes
        self.connectivity_average = connectivity_average
        self.connectivity_std_dev = connectivity_std_dev
        self.cost_range = cost_range
        self.maximum_hops = maximum_hops
        self.maximum_hops_penalty = maximum_hops_penalty
        self.QNodes = []
        for node in range(self.number_of_nodes):
            self.QNodes.append(QNode(self.number_of_nodes, self.connectivity_average, self.connectivity_std_dev))
        self.cost_arr = cost_range[0] + (cost_range[1] - cost_range[0]) * np.random.random((self.number_of_nodes, self.number_of_nodes))
Its main fields are the list of nodes and an array of edge costs. Note that the actual edges are part of the QNode class, as a list of outgoing neighbor nodes.
When we want to generate a path from a start node to a target node, we call the QGraph.trajectory() method, which in turn calls the QNode.epsilon_greedy() method:
def trajectory(self, start_node, target_node, epsilon):
    visited_nodes = [start_node]
    costs = []
    if start_node == target_node:
        return visited_nodes, costs
    current_node = start_node
    while len(visited_nodes) < self.maximum_hops + 1:
        next_node = self.QNodes[current_node].epsilon_greedy(target_node, epsilon)
        cost = float(self.cost_arr[current_node, next_node])
        visited_nodes.append(next_node)
        costs.append(cost)
        current_node = next_node
        if current_node == target_node:
            return visited_nodes, costs
    # We reached the maximum number of hops
    return visited_nodes, costs
The trajectory() method returns the list of nodes visited along the path and the list of costs associated with the traversed edges.
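A minimal usage sketch (the node indices and epsilon value are arbitrary; on an untrained graph the path is essentially a random walk):

qgraph = QGraph(number_of_nodes=12, connectivity_average=3)
visited_nodes, costs = qgraph.trajectory(start_node=4, target_node=11, epsilon=0.1)
print(visited_nodes)  # sequence of visited nodes, ending at the target or at the hop limit
print(sum(costs))     # total undiscounted cost of the traversed edges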
The last missing piece is the update rule of equation 1, implemented by the QGraph.update_Q() method:
def update_Q(self, start_node, neighbor_node, alpha, gamma, target_node):
    cost = self.cost_arr[start_node, neighbor_node]
    reward = -cost
    # Q_orig(target, dest) <- (1 - alpha) Q_orig(target, dest) + alpha * ( r + gamma * max_neigh' Q_dest(target, neigh') )
    Q_orig_target_dest = self.QNodes[start_node].Q[target_node, self.QNodes[start_node].neighbor_column(neighbor_node)]
    max_neigh_Q_dest_target_neigh = np.max(self.QNodes[neighbor_node].Q[target_node, :])
    updated_Q = (1 - alpha) * Q_orig_target_dest + alpha * (reward + gamma * max_neigh_Q_dest_target_neigh)
    self.QNodes[start_node].Q[target_node, self.QNodes[start_node].neighbor_column(neighbor_node)] = updated_Q
To train our agents, we loop over pairs of (start_node, target_node), with an inner loop over the neighbor nodes of start_node, and we call update_Q().
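The training loop itself is not shown in the excerpts above. A simplified sketch of what it could look like, following the description in the previous sentence (the number of sweeps and the values of alpha and gamma are illustrative assumptions, not taken from the repo):

qgraph = QGraph(number_of_nodes=12, connectivity_average=3)
alpha, gamma = 0.1, 0.9  # illustrative hyperparameters

for sweep in range(1000):  # number of sweeps chosen arbitrarily
    for start_node in range(qgraph.number_of_nodes):
        for target_node in range(qgraph.number_of_nodes):
            if target_node == start_node:
                continue  # no need to route a message to itself
            # Inner loop over the outgoing edges of start_node
            for neighbor_node in qgraph.QNodes[start_node].neighbor_nodes:
                qgraph.update_Q(start_node, neighbor_node, alpha, gamma, target_node)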
Tests and results
Let's start with a simple graph of 12 nodes, with directed weighted edges.

We can see from Figure 1 that the only incoming edge to Node-1 comes from Node-7, and the only incoming edge to Node-7 comes from Node-1. Therefore, no node other than these two can reach Node-1 or Node-7. When another node has the task of sending a message to Node-1 or Node-7, the message will hop around the graph until the maximum number of hops is reached. We expect to see large negative Q-values in these cases.
When we train our graph, we get statistics about the cost and the number of hops as a function of time, as in Figure 2:

After training, this is the Q-matrix for Node-4:

The trajectory from Node-4 to, say, Node-11, can be found by calling the trajectory() method with epsilon=0 to force purely greedy decisions: [4, 3, 5, 11], with a total undiscounted cost of 0.9 + 0.9 + 0.3 = 2.1. Dijkstra's algorithm returns the same path.
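To reproduce the comparison with Dijkstra's algorithm, one option is to rebuild the same weighted directed graph with networkx (the repo may compute the reference path differently; this is just a sketch, assuming a trained QGraph instance named qgraph):

import networkx as nx

# Rebuild the weighted directed graph from the edges stored in each QNode
G = nx.DiGraph()
for node, qnode in enumerate(qgraph.QNodes):
    for neighbor in qnode.neighbor_nodes:
        G.add_edge(node, neighbor, weight=float(qgraph.cost_arr[node, neighbor]))

print(nx.dijkstra_path(G, source=4, target=11, weight='weight'))         # reference lowest-cost path
print(nx.dijkstra_path_length(G, source=4, target=11, weight='weight'))  # reference total cost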
In some rare cases, the optimal path was not found. For example, from Node-6 to Node-9, a particular instance of the trained Q-graph returned [6, 11, 0, 4, 10, 2, 9] with a total undiscounted cost of 3.5, while Dijkstra's algorithm returned [6, 0, 4, 10, 2, 9] with a total undiscounted cost of 3.4. As we mentioned earlier, this is expected from an algorithm that is not guaranteed to find the optimal solution.
Conclusion
We formulated the Small-World Experiment as the problem of finding the least-cost path between any pair of nodes in a sparse directed graph with weighted edges. We modeled the nodes as Q-Learning agents, which learn through the update rule to take the least expensive action, given the target node.
With a simple graph, we showed that the training converges to a solution close to the optimal one.
Thanks for your time, and feel free to try the code. If you have ideas for fun applications for Q-Learning, please let me know!
¹ Okay, I'm departing from the original Small-World Experiment, which stayed within the United States.
References
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.



