Playing Connect Four with Deep Q-Learning

In the previous post, we explored how to extend Reinforcement Learning (RL) beyond the tabular setting using function approximation. Although this allowed us to generalize across states, our experiments also revealed an important limitation: in simple environments like GridWorld, function approximation struggles to match the stability and efficiency of tabular methods. The main reason is that learning a good representation is itself a hard problem, and its cost does not outweigh the benefits of generalization when the state space is small.
To truly unlock the power of function approximation, we therefore need to move to problems where tabular methods no longer work. This leads us naturally to multiplayer games, where the state space grows combinatorially and generalization becomes essential. It also fits perfectly with this series of posts, since so far we have not been able to learn any meaningful behavior in complex multiplayer environments. In this post, we take a step further by considering the classic game of Connect Four and investigating how to learn robust policies using Deep Q-Learning.
From Sarsa to Deep Q-Learning
To tackle this task, we extend our framework in several important ways.
First, we move from online updates to a batched training setup. In our earlier use of Sarsa, we updated the model after every single transition. While faithful to the original algorithm [1], this approach is computationally inefficient: each update step incurs a non-trivial fixed cost, and modern hardware, especially GPUs, is designed to process batches with minimal overhead.
To address this, we introduce a replay buffer. Instead of updating immediately, we store transitions as they are encountered, either up to a fixed capacity or, in our case, until the end of one or more games, and then perform a batched update. This not only improves efficiency but also stabilizes learning by reducing the variance of individual updates.
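To make this concrete, here is a minimal sketch of such a buffer. The Transition fields and the class interface are illustrative assumptions, not the exact structures used in the repository:

import random
from collections import deque, namedtuple

# Field names are illustrative; the actual transition layout may differ.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity=50_000):
        # Oldest transitions are discarded once the capacity is reached.
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive transitions before a batched update.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)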
At this point, an important conceptual shift occurs. By sampling from past experience rather than strictly following the current policy, we depart from Sarsa, an on-policy method, and move to Q-learning, which is off-policy. Although we have not formally introduced Q-learning in the function approximation setting here, the extension from the tabular case is straightforward. This combination of replay buffers and Q-learning forms the basis of Deep Q-Networks (DQNs), popularized by DeepMind in their seminal work on Atari games [2].
Finally, we turn to scalability. Reinforcement learning is inherently data hungry, so scaling up matters. To do this, we use a vectorized environment wrapper that allows us to simulate multiple Connect Four games in parallel. Specifically, one call to step(a) now processes a batch of actions and advances all environments at once.
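Conceptually, the wrapper behaves like a single environment whose step method accepts one action per game. The sketch below illustrates the idea under simplified assumptions (each sub-environment returns a (state, reward, done) tuple); it is not the actual PettingZoo-based wrapper from the repository:

import numpy as np

class VectorizedEnv:
    """Sketch: advance several sub-environments with a single step call."""

    def __init__(self, env_fns):
        # Each factory function creates one independent Connect Four game.
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return np.stack([env.reset() for env in self.envs])

    def step(self, actions):
        states, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            state, reward, done = env.step(action)
            if done:
                # Restart finished games immediately so the batch stays full.
                state = env.reset()
            states.append(state)
            rewards.append(reward)
            dones.append(done)
        return np.stack(states), np.array(rewards), np.array(dones)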
In practice, however, achieving true parallelism in Python is not trivial. The Global Interpreter Lock (GIL) ensures that only one thread executes Python bytecode at a time, which limits the usefulness of naive threading for CPU-bound workloads such as environment simulation. We also tried multiprocessing, but found that the additional overhead (e.g., inter-process communication) largely negated any gains in our setup. For the interested reader, I recommend my previous post on this topic.
Despite these limitations, the combination of batched updates and vectorization yields a significant improvement in throughput, raising performance to roughly 50-100 games per second.
Implementation
In this post, I'm deliberately avoiding going into too much detail about the environment vectorization and instead focusing on the RL aspects. In part, this is because vectorization itself is "just" an implementation detail, but also because, in all honesty, our current setup is not ideal. Much of this is due to the limitations imposed by the PettingZoo environment we use.
In future posts, we will explore different environments and revisit this topic with more emphasis on scalability, an important aspect of modern reinforcement learning. For a detailed discussion of how we build multiplayer environments, manage agents, and maintain a pool of opponents, I refer to my previous post on multiplayer RL. The vectorized setup used here is simply an extension of that framework to many games running in parallel. As always, the full implementation is available on GitHub.
Revisiting Q-Learning
Let's revisit Q-learning briefly and connect it to our implementation.
The main update rule, in the standard tabular form from [1], is given by:
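$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \Big]$$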
In contrast to Sarsa, which uses the action actually taken in the next state, Q-learning takes the maximum over all possible next actions. This makes it off-policy, since the update does not depend on the behavior policy used to generate the data. In practice, this often leads to faster propagation of value information, especially in adversarial domains such as board games.
When combined with neural networks, this approach is commonly called Deep Q-Learning. Instead of storing a table of values, we train a neural network $Q_\theta$ to approximate the action-value function. The update is then cast as a regression problem, minimizing the difference between the current estimate and the bootstrapped target:
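$$y_t = R_{t+1} + \gamma \max_a Q_\theta(S_{t+1}, a), \qquad \mathcal{L}(\theta) = \big(y_t - Q_\theta(S_t, A_t)\big)^2$$

Here the target $y_t$ is treated as a fixed regression label even though it depends on the same network $Q_\theta$. In practice we minimize a Huber loss rather than the raw squared error, as shown in the code below.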

In our implementation, this corresponds directly to the batch_update function. Given a batch of transitions, we start by computing the predicted Q-values for the actions actually taken:
q = self.q(batch.states, ...)
q_sa = q.gather(1, batch.actions.unsqueeze(1)).squeeze(1)
Next, we construct the target using the maximum Q-value of the next state. Since not all actions are valid in Connect Four, we apply a mask to ensure that only legal moves are considered:
q_next = self.q(batch.next_states, ...)
q_next_masked = q_next.masked_fill(~legal, float("-inf"))
max_next = q_next_masked.max(dim=1).values
Finally, we combine the reward with the discounted value of the next state, taking care to handle terminal states correctly:
target = batch.rewards + gamma * (~batch.dones).float() * max_next
The network is then trained by minimizing the Huber loss (a robust variant of the mean squared error):
loss = F.smooth_l1_loss(q_sa, target)
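Putting these pieces together, a minimal sketch of the full update step (as a method on the agent) might look as follows. The Batch container, the next_legal_moves mask, and the optimizer attribute are simplifying assumptions about the surrounding code rather than the exact implementation:

import torch
import torch.nn.functional as F

def batch_update(self, batch, gamma=0.99):
    # Predicted Q-values for the actions that were actually taken.
    q = self.q(batch.states)
    q_sa = q.gather(1, batch.actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Bootstrapped target: best Q-value of the next state over legal moves only.
        q_next = self.q(batch.next_states)
        legal = batch.next_legal_moves  # boolean mask, True for playable columns
        q_next_masked = q_next.masked_fill(~legal, float("-inf"))
        max_next = q_next_masked.max(dim=1).values
        # Zero out terminal states so no -inf leaks into the target.
        max_next = torch.where(batch.dones, torch.zeros_like(max_next), max_next)
        target = batch.rewards + gamma * (~batch.dones).float() * max_next

    # Huber loss between prediction and target, followed by one gradient step.
    loss = F.smooth_l1_loss(q_sa, target)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    return loss.item()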
This batch-based setup allows us to efficiently reuse experience gathered from many parallel games, which is essential for scaling to more complex environments. At the same time, it highlights the main challenge of Deep Q-Learning: the targets themselves depend on the current network, which can cause instability during training.
For further reference, the official PyTorch tutorial on Deep Q-Learning provides a helpful overview.
Results
With that, let's turn to the results. To set the stage, we first recall how tabular methods performed on this task. After 100,000 steps, most policies were still close together in terms of win rate. In particular, even the random policy achieved roughly a 50% win rate, showing that none of the policies studied managed to beat chance in a meaningful way.

In the following experiments, we focus on two agents: our DQN and a random baseline. Due to the previously introduced "zoo" setup, the DQN is not a single fixed policy but a dynamic pool of agents: we keep adding new versions and pruning weak ones, gradually increasing the strength of the opponent pool.
This has important implications for interpreting metrics:
the win rate of "DQN vs. DQN" naturally fluctuates around 50%, as agents of similar strength compete against each other. The more informative signal is therefore the performance of the random policy: as the DQN improves, the random agent should win less often.
With that in mind, let's look at the performance curve:

We see several interesting results. Most notably, the win rate of the random policy drops much more quickly than in the tabular setting, clear evidence that the DQN is actually learning the game. However, after about a million steps progress plateaus, and the random policy still wins almost 20% of games.
To better understand what this means in practice, we can test the learned policy against a human player. In the following example, I assume the role of the red player who goes first:

The result is quite revealing. The agent has clearly learned to play aggressively, pursuing its own four-in-a-row threats. However, it struggles with defensive play, failing to anticipate and block simple threats from its opponent.
This is perhaps disappointing, but we will come back to it. In future posts we'll learn how to scale better, learn faster, and beat humans (at some things). Writing this series of posts about Sutton's great book has been an amazing journey (although a few gaps remain), and we have now worked through the general outline we started with, covering the algorithms presented in Sutton's book, including both tabular methods and approximate solution methods. Specialization is therefore the way forward, and in future posts we will do just that: develop more efficient methods, tailored to specific problems.
Conclusion
In this post, we moved from tabular Sarsa to Deep Q-Learning, introducing replay buffers, batched updates, and function approximation. We applied this to Connect Four, a multiplayer game we previously failed to crack with tabular methods, with a clear result: our agent is no longer stuck at chance level; it learns, improves, and consistently beats a random policy.
But just as importantly, we also saw the limits.
Even after extensive training, the agent plateaus and still shows clear weaknesses, especially in defensive play. This is not just a matter of "more training." In multiplayer settings, the problem itself becomes harder: the opponents change, the environment is non-stationary, and the learning target keeps shifting.
This is where the real challenge begins.
So far, our framework, loosely following [1], has prioritized generality and clarity. But moving forward, that is no longer enough. Performance requires specialization.
In the next post, we will stop following [1] so closely and focus on exactly that: building fast, stable, and highly scalable systems, moving beyond a simple baseline toward truly competitive agents.
References
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
[2] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529-533, 2015.



